In Training language models to follow instructions with human feedback, OpenAI showed that we can significantly improve the alignment of language models through reinforcement learning targeting a model of human preferences. I.e., they train GPT-3 to generate text that gets a high score from a learned reward model.

To train the reward model, they first use a GPT-3 model to generate many possible completions of different prompts. Then, human labelers generate relative rankings of different completions of the prompts. Finally, they train the reward model to predict the labelers' rankings of the prompts. 

This approach relies on the reward model to ensure that the language model learns our preferred behavior. I don't think it will scale indefinitely to vastly superhuman systems. However, something similar may scale to somewhat superhuman systems. 

The issue is that any fine-tuning-based approach requires a reward model that's able to provide effective feedback on the AI's proposed course of action. Unfortunately, none of the data currently available to train such a reward model pertain to the case where the AI has superhuman capabilities. They seem more focused on basic morality issues or where the model is assumed to have limited capabilities (e.g., the ETHICS dataset or the dataset OpenAI generated for their reward model). 

In contrast, I'm imagining a dataset where the AI is presented with a scenario, prompt or request, then there are descriptions of multiple possible behaviors the AI could produce in response, most of which involve the AI demonstrating some form of superhuman capability. Each scenario comes with rankings of the relative desirability for each described behavior. Features of desirable behaviors for very capable models might include:

  • The AI asking clarifying questions of the humans about what the humans want.
  • The AI explaining its current plans to humans and describing the plan's potential consequences.
  • The AI frequently getting feedback from humans about whether to continue its current course of actions.
  • The AI being incredibly honest, even when doing so annoys the humans or complicates the AI's attempts to do what the humans asked.
  • The AI warning us about potential failures of our alignment techniques

The dataset should also include negative behavior examples. E.g.,

  • The AI lying or giving technically correct but misleading descriptions of its plans.
  • The AI leaving out important consequences of its plans.
  • The AI manipulating its operators.
  • The AI launching adversarial attacks on humans visual/auditory systems.
  • The AI killing everyone, then tiling the universe with resource-efficient facsimiles of highly fulfilled humans.
  • The AI over-optimizing its reward model, causing its behavior to strongly diverge from what we actually want.

This dataset could also integrate the approach from the visible thoughts project; in addition to describing the AI's behavior in various situations, the dataset could also include what internal thoughts lead to the AI's behavior. Such annotations could also help us constrain the AI's cognition to safer or more interpretable lines of thought.

Such a dataset should help us train a reward model that's more relevant for future highly capable systems. Beyond RL finetuning, I think a large number of other potential alignment strategies could benefit from such data, either as training data to learn aligned behavior or as testing data to evaluate the efficacy of the alignment approach.

I think generating this dataset could take quite a while. Contributors would need to be able to write plausibly about the capabilities of moderately superhuman systems and think deeply about what "aligned" AI behavior should look like in each scenario. By the time it becomes clear that we need a high quality alignment dataset for very capable models, it will likely be too late. Also, unlike a lot of other alignment research, we can start making progress immediately, and anyone can contribute. There's no better time to start this project than now.

The issue with this project is that I think contributing might be very boring. Also, the set of people who are able and willing to help with this project overlaps fairly strongly with the set of people who are already working on AI alignment in other ways. One idea that's occurred to me is to convert this project from "work" into "play" by running it as a sort of forum quest game.

For example, contributors could split into either players or quest masters. The players control a moderately superhuman AI in various scenarios and propose different actions. Then, the players rank the relative desirability of the proposed actions, and the quest masters describe the in-game consequences of the most desirable action. 

If enough members of the LW community took up something like this as a hobby, I think we could generate quite a lot of useful alignment-related data without putting too much additional "work" on ourselves. I'd like to know if people would be interested in something like this or have other feedback on how to efficiently gather this sort of data. Please let me know in the comments section!

New Comment
2 comments, sorted by Click to highlight new comments since:

I work on this sort of thing at OpenAI.

I think alignment datasets are a very useful part of a portfolio approach to alignment research.  Right now I think there are alignment risks/concerns for which datasets like this wouldn't help, but also there are some that it would help for.

Datasets and benchmarks more broadly are useful for forecasting progress, but this assumes smooth/continuous progress (in general a good assumption -- but also good to be wary of cases where this isn't the case).

Some thoughts from working on generating datasets for research, and using those datasets in research:

  • Start by building tiny versions of the dataset yourself
  • It's good to switch early to paying labelers/contractors to generate and labels -- they won't be perfect at first, so there's a lot of iterating in clarifying instructions/feedback/etc
  • It's best to gather data that you'd want to use for research right away, not for some nebulous possible future research
  • Getting clean benchmarks that exhibit some well-defined phenomena is useful for academics and grad students
  • When in doubt, BIG-Bench is a good place to submit these sorts of tiny evaluative datasets
  • Where possible, experiment with using models to generate more data (e.g. with few-shot or generative modeling on the data you have)
  • Sometimes a filter is just as good as data (a classifier that distinguishes data inside the desired distribution)

I think this is a great idea, but would be best to start super small.  It sounds right now like a huge project plan, but I think it could be road-mapped into something where almost every step along the path produces some valuable input.

Given the amount of funding available from charitable sources for AI alignment research these days, I think a good thing to consider is figuring out how to make instructions for contractors to generate the data, then getting money to hire the contractors and just oversee/manage them.  (As opposed to trying to get volunteers to make all the data)

I think one of the central issues is that even when we're trying to role-play as superintelligent AI, we're still drawing situations and responses from a very different distribution than what we want to actually apply superintelligent AI to. And so you're still faced with all the same problems of generalization (including inner alignment failure).

Curious if you saw this post. I'm not against big datasets of highly ethical text (BDOHET), but I think that if we're going to be tackling the problem of generalizing how humans want to be generalized anyhow, we can probably get away with the BDOHET being pretty mundane.