8mo

A key problem in alignment research is how to align superhuman models whose behavior humans cannot reliably supervise. If we use today’s standard post-training approach to align models with human-specified behaviors (e.g., RLHF), we might train models to tell us what we want to hear even if it’s wrong, or do things that seem superficially good but are actually very different from what we intended.

We introduce a new unsupervised algorithm to address this problem. This algorithm elicits a pretrained model’s latent capabilities by fine-tuning it on its own labeled data alone, without any external labels.

Abstract

To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for... (read 568 more words →)

Replying toWhat Discovering Latent Knowledge Did and Did Not Find

Collin3y

What Discovering Latent Knowledge Did and Did Not Find

Thanks for running these experiments and writing this up! I’m very excited to see this sort of followup work, and I think there are a lot of useful results here. I agree with most of this, and mostly just have a few nitpicks about how you interpret some things.

Reactions to the summary of your experimental results:

CCS does so better than random, but not by a huge margin: on average, random linear probes have a 75% accuracy on some “easy” datasets;
- I think it’s cool that random directions sometimes do so well; this provides a bit of additional evidence that high-accuracy directions can be quite salient.
- It’s not too surprising to me that this holds for UQA

... (read 1107 more words →)

Replying to[RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision".

Collin3y*

[RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision".

Thanks for writing this! I think there are a number of interesting directions here.

I think in (very roughly) increasing order of excitement:

Connections to mechanistic interpretability
- I think it would be nice to have connections to mechanistic interpretability. My main concern here is just that this seems quite hard to me in general. But I could imagine some particular sub-questions here being more tractable, such as connections to ROME/MEMIT in particular.
Improving the loss function + using other consistency constraints
- In general I’m interested in work that makes CCS more reliable/robust; it’s currently more of a prototype than something ready for practice. But I think some types of practical improvements seem more conceptually deep than others.
- I

Vael Gates

Vael Gates, Collin

I (Vael Gates) recently ran a small pilot study with Collin Burns in which we showed ML researchers (randomly selected NeurIPS / ICML / ICLR 2021 authors) a number of introductory AI safety materials, asking them to answer questions and rate those materials.

Summary

We selected materials that were relatively short and disproportionally aimed at ML researchers, but we also experimented with other types of readings.^[1] Within the selected readings, we found that researchers (n=28) preferred materials that were aimed at an ML audience, which tended to be written by ML researchers, and which tended to be more technical and less philosophical.

In particular, for each reading we asked ML researchers (1) how much they liked that... (read 492 more words →)

175

Replying toHow "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin3y

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

I think this is likely right by default in many settings, but I think ground-level truth does provide additional accuracy in predicting next tokens in at least some settings -- such as in "Claim 1" in the post (but I don't think that's the only setting) -- and I suspect that will be enough for our purposes. But this is certainly related to stuff I'm actively thinking about.

Replying toHow "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin3y

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

There were a number of iterations with major tweaks. It went something like:

I spent a while thinking about the problem conceptually, and developed a pretty strong intuition that something like this should be possible.
I tried to show it experimentally. There were no signs of life for a while (it turns out you need to get a bunch of details right to see any real signal -- a regime that I think is likely my comparative advantage) but I eventually got it to sometimes work using a PCA-based method. I think it took some work to make that more reliable, which led to what we refer to in the paper as CRC-TPC.
That method

Collin3y

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Thanks! I personally think of it as both "contrastive" and "unsupervised," but I do think similar contrastive techniques can be applied in the supervised case too -- as some prior work like https://arxiv.org/abs/1607.06520 has done. I agree it's less clear how to do this for open-ended questions compared to boolean T/F questions, but I think the latter captures the core difficulty of the problem. For example, in the simplest case you could do rejection sampling for controllable generation of open-ended outputs. Alternatively, maybe you want to train a model to generate text that both appears useful (as assessed by human supervision) while also being correct (as assessed by a method like CCS). So I agree supervision seems useful too for some parts of the problem.

Replying toHow "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin3y

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Thanks for the detailed comment! I agree with a lot of this.

So I'm particularly interested in understanding whether these methods work for models like Go policies that are not pre-trained on a bunch of true natural language sentences.

Yep, I agree with this; I'm currently thinking about/working on this type of thing.

I think "this intuition is basically incorrect" is kind of an overstatement, or perhaps a slight mischaracterization of the reason that people aren't more excited about unsupervised methods. In my mind, unsupervised methods mostly work well if the truth is represented in a sufficiently simple way. But this seems very similar to the quantitative assumption required for regularized supervised methods to work.

This... (read 976 more words →)

Replying toHow "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin3y

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Thanks Ansh!

It seems pretty plausible to me that a human simulator within GPT-n (the part producing the "what a human would say" features) could be pretty confident in its beliefs in a situation where the answers derived from the two features disagree. This would be particularly likely in scenarios where humans believe they have access to all pertinent information and are thus confident in their answers, even if they are in fact being deceived in some way or are failing to take into account some subtle facts that the model is able to pick up on. This also doesn't feel all that worst-case to me, but maybe we disagree on that point.

I agree there are plenty of examples whether humans would be confident when they shouldn't be. But to clarify, we can choose whatever examples we want in this step, so we can explicitly choose examples where we know humans have no real opinion about what the answer should be.

Replying toHow "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin3y

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

I agree this proposal wouldn't be robust enough to optimize against as-stated, but this doesn't bother me much for a couple reasons:

This seems like a very natural sub-problem that captures a large fraction of the difficulty of the full problem while being more tractable. Even just from a general research perspective that seems quite appealing -- at a minimum, I think solving this would teach us a lot.
It seems like even without optimization this could give us access to something like aligned superintelligent oracle models. I think this would represent significant progress and would be a very useful tool for more robust solutions.

I have some more detailed thoughts about how we could... (read more)

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

Introduction

A few collaborators and I recently released a new paper: Discovering Latent Knowledge in Language Models Without Supervision. For a quick summary of our paper, you can check out this Twitter thread.

In this post I will describe how I think the results and methods in our paper fit into a broader scalable alignment agenda. Unlike the paper, this post is explicitly aimed at an alignment audience and is mainly conceptual rather than empirical.

Tl;dr: unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.

Disclaimers: I have tried... (read 4576 more words →)

244

Replying toLessons learned from talking to >100 academics about AI safety

Collin3y

Lessons learned from talking to >100 academics about AI safety

Thanks for writing this up! I basically agree with most of your findings/takeaways.

In general I think getting the academic community to be sympathetic to safety is quite a bit more tractable (and important) than most people here believe, and I think it's becoming much more tractable over time. Right now, perhaps the single biggest bottleneck for most academics is having long timelines. But most academics are also legitimately impressed by recent progress, which I think has made them much more open to considering AGI than they used to be at least, and I think this trend will likely accelerate over the next few years as we see much more impressive models.

LESSWRONG
LW

LESSWRONG
LW

Collin

Collin

Unsupervised Elicitation of Language Models

What AI Safety Materials Do ML Researchers Find Compelling?

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Collin

Collin

Unsupervised Elicitation of Language Models

What AI Safety Materials Do ML Researchers Find Compelling?

How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme

Abstract

Summary

Introduction