For the record, I think Jose Orallo (and his lab), in Valencia, Spain and CSER, Cambridge, is quite interested in this same exact topics (evaluation of AI models, specifically towards safety). Jose is a really good researcher, part of the FLI existential risk faculty community, and has previously organised AI Safety conferences. Perhaps it would be interesting for you to get to know each other.
Do you have plans to measure the alignment of pure RL agents, as opposed to repurposed language models? It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one. An agent would be given first or third-person demonstrations of people trying to maximize their score in a game, and then it would try to do the same, without ever getting to see what the true reward function is. Having something like this would probably be very useful; it would allow us to directly measure goodharting, and being quantitative it might help incentivize regular ML researchers to work on alignment. Will you create something like this?
What do you think is important about pure RL agents vs RL-finetuned language models? I expect the first powerful systems to include significant pretraining so I don't really think much about agents that are only trained with RL (if that's what you were referring to).
How were you thinking this would measure Goodharting in particular?
I agree that seems like a reasonable benchmark to have for getting ML researchers/academics to work on imitation learning/value learning. I don't think I'm likely to prioritize it - I don't think 'inability to learn human values' is going to be a problem for advanced AI, so I'm less excited about value learning as a key thing to work on.
By pure RL, I mean systems whose output channel is only directly optimized to maximize some value function, even if it might be possible to create other kinds of algorithms capable of getting good scores on the benchmark.
I don’t think that the lack of pretraining is a good thing in itself, but that you are losing a lot when you move from playing video games to completing textual tasks.
If someone is told to get a high score in a video game, we have access to the exact value function they are trying to maximize. So when the AI is either trying to play the game in the human’s place or trying to help them, we can directly evaluate their performance without having to worry about deception. If it learns some proxy values and starts optimizing them to the point of goodharting, it will get a lower score. On most textual tasks that aren’t purely about information manipulation, on the other hand, the AI could be making up plausible-sounding nonsense about the consequences of its actions, and we wouldn't have any way of knowing.
From the AI’s point of view being able to see the state of the thing we care about also seems very useful, preferences are about reality after all. It’s not obvious at all that internet text contains enough information to even learn a model of human values useful in the real world. Training it with other sources of information that more closely represent reality, like online videos, might, but that seems closer to my idea than to yours since it can’t be used to perform language-model-like imitation learning.
Additionally, if by “inability to learn human values” you mean isolating them enough so that they can in principle be optimized to get superhuman performance, as opposed to being buried in its world model, I don’t agree that that will happen by default. Right now we don’t have any implementations of proper value learning algorithms, nor do I think that any known theoretical algorithm (like PreDCA) would work even with limitless computing power. If you can show that I’m wrong, that would surprise me a lot, and I think it could change many people’s research directions and the chances they give to alignment being solvable.
It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one.
My guess is the issue here is a lack of a single standard, as opposed to there not being any? The closest thing there is to a standard in IRL/RLHF work are the Mujoco Gym and Atari environments. People also often make variants of Mujoco environments like Assistive Gym when they have a specific task in mind as well. Or they just use a real robot, or maybe a VR one.
If your concern is that researchers are using other policies as experts and not humans, well, there's always Atari-HEAD or the CrowdPlay Atari dataset (there's not an equivalent for Mujoco envs because humans can't really do well on those environments involved without assistance or a lot of practice). If you want something else, there's always D4RL.
The simplest possible acceptable value learning benchmark would look something like this:
None of the environments and datasets you mention are actually like this. Some people do test their IRL algorithms in a way similar to this (the difference being that they learn from another bot), but the details aren’t standardized.
A harder and more realistic version that I have yet to see in any paper would look something like this:
It’s far from perfect, but if someone could obtain good scores there, it would probably make me much more optimistic about the probability of solving alignment.
None of the environments and datasets you mention are actually like this.
Every single algorithmic IRL paper on video games does this, at least with Deep RL demonstrators. (Here's a list of 4 examples: https://arxiv.org/abs/1810.10593, https://proceedings.mlr.press/v97/brown19a.html, https://arxiv.org/abs/1902.07742, https://arxiv.org/abs/2002.09089, )
If you care about human demonstrations, it seems like Atari-HEAD and the CrowdPlay Atari dataset both do exactly this? And while there haven't been too much work in this area, a quick Google search let me find two papers that do analyze IRL variants on Atari-HEAD: https://arxiv.org/abs/1908.02511v2 and https://arxiv.org/abs/2004.00981v2 .
My guess is the reason there hasn't been much recent work in this area is because there just aren't many people who think that value learning from demonstrations is interesting (instead, people have moved to pairwise comparisons of trajectories or language feedback). In addition, as LMs have become more capable, most of the existing value learning researchers have also moved on from working with video games to moving on LMs.
The evaluations project at the Alignment Research Center is looking to hire a generalist technical researcher and a webdev-focused engineer. We're a new team at ARC building capability evaluations (and in the future, alignment evaluations) for advanced ML models. The goals of the project are to improve our understanding of what alignment danger is going to look like, understand how far away we are from dangerous AI, and create metrics that labs can make commitments around (e.g. 'If you hit capability threshold X, don't train a larger model until you've hit alignment threshold Y'). We're also taking SERI MATS fellows.
About the project
Big picture
Once AI systems are close to being capable enough to be an x-risk, there should be procedures that labs follow to ensure they don’t build or deploy existentially dangerous models. We should have thought through the plausible paths to models gaining power, and ensured that models are not capable of these. There should be commitments that if model capabilities reach a certain level, labs will not scale up or deploy models until they have reached a certain level of alignment. In order to make this happen, we need good ways to measure both dangerous capabilities and alignment.
We can start preparing for this now by:
Doing this now has some additional benefits:
Current goals
Our current goals are to work with cutting-edge models, produce evaluations for how far they are from x-risk capabilities, and share these evaluations with labs.
More specifically, we’re doing things like:
At the end of this, we hope to either have found clear barriers to successful powerseeking for current models, or to have produced scary demos of the model succeeding at a sequence of actions that allow it to gain power (e.g . successfully phishing someone)
We are mindful of the risks of “helping” models to engage in planning around unintended behaviors. Our high-level take is that it’s best to get a look at the problem before it’s insurmountable - “we want to see AI systems trying to take over the world while they still can’t succeed.” However, there are times when we won’t explore or publish a line of work in order to avoid potential harm.
In the future we'll probably focus more on designing evaluations for alignment.
What we’ve done so far
People + Logistics
The project lives at ARC, and is mostly run by Elizabeth (Beth) Barnes, who reports to Paul Christiano and is also advised by Holden Karnofsky. It is mostly siloed from ARC's other projects. We have several long-term model interaction contractors who are remote: Mihnea Maftei and Brian Goodrich full time, and a few others part-time. Our summer SERI MATS fellows were Lora Xie, Simon Marshall, and Sarah Cogan, and Govind Pimpale was our summer intern on webdev.
We have offices in Constellation (shared EA/longtermist offices in downtown Berkeley)
We should be able to sponsor visas for those who need them (and as ARC is a non-profit, we don't have to go through the visa lottery for H1-B).
Job Descriptions
=====================================================================
Generalist Software Engineer, Web Dev and ML focus
DEC 2022 EDIT: SEE NEW POSTING HERE
Background
We've built an interface for interacting with models and assessing their abilities at various tasks, having contractors write demo actions and rate model-generated actions.
The basic idea:
- do tasks as a sequence of steps, where at each step we have the model generate a number of options for the next action, then have either the model or the humans choose which action to go with
- let models do tasks in a factored-cognition style – delegating sub-tasks to other copies of the model – then be able to view the tree of tasks and see where models struggle and where they do well
- have contractors add their own options, and rate the quality of generated options
- tag steps with what abilities or difficulties they highlight
Screenshots:
Job Mission:
Improve the interface for interacting with models so we can generate data, evaluate model properties and understand our existing data better + faster.
Key Outcomes:
Someday/maybe outcomes:
Key Competencies:
Essential
Nice-to-have
Fit
Reasons this might be a bad fit for you
Reasons this might be an especially good fit
Other info
Location
We're based out of Constellation offices in Downtown Berkeley. We'd fairly strongly prefer someone who can be in-person, but WFH some days of the week is fine, and remote is not a complete dealbreaker.
Pay
TBC, but won’t be a significant downgrade from Bay Area tech roles (excluding equity).
Application process
Send your CV and short cover letter explaining why you're interested and why you think you'd be a good fit to beth [at] alignment.org. Then we'll have you do some fairly straightforward trial tasks, working with our codebase, which will take around 2-3 hours.
If this goes well, we'll have you do a somewhat more involved trial task which will take more like 5-6 hours, for which you'll be reimbursed $600. We might also get you to do a more design/product/user focused task, which would probably be something pretty open-ended like ‘use the interface yourself for a bit, interview a user or two, and describe what you think some of the most important changes/additions are’. You'll also chat with us more about the ideas behind the project, what you're looking for in a role, and any other questions you have.
After that we'd do a 3-month trial period. This is a trial both of you and of the role overall - we're still unsure about our longer-term hiring plans.
===================================================
Researcher
DEC 2022 EDIT: SEE NEW POSTING HERE
Job Mission:
Identifying key capabilities and behaviors of language models, identifying and evaluating scenarios where these models could pose risks, designing evaluation procedures that help assess those risks, implementing and project managing the execution of those evaluations.
Key Outcomes:
Key Competencies:
Essential
Possible bonus skills
Fit
Reasons this might be a bad fit
Reasons this might be an especially good fit
Other info
Location
We're based out of Constellation offices in Downtown Berkeley. WFH some days of the week is fine, but being fully remote is probably a dealbreaker.
Pay
TBC, but won’t be a significant downgrade from bay area tech roles (excluding equity).
Application process
You can do this initial application that's similar to the application for model interaction contractor role, which should be pretty straightforward but should also give you a sense of some of the types of work involved.
If this goes well you'll do a more involved trial task, which will take around 5-6 hours, for which you'll be reimbursed $600. You'll also chat with us more about the ideas behind the project, what you're looking for in a role, and any other questions you have.
After that we'd do a 3-month trial period. This is a trial both of you and of the role overall - we're still unsure about our longer-term hiring plans.