A key problem in alignment research is how to align superhuman models whose behavior humans cannot reliably supervise. If we use today’s standard post-training approach to align models with human-specified behaviors (e.g., RLHF), we might train models to tell us what we want to hear even if it’s wrong, or do things that seem superficially good but are actually very different from what we intended.
We introduce a new unsupervised algorithm to address this problem. This algorithm elicits a pretrained model’s latent capabilities by fine-tuning it solely on data the model labels itself, without any external labels.
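As a rough illustration only (not our actual implementation), the core loop can be sketched as follows: the pretrained model assigns labels to an unlabeled pool, and is then fine-tuned on those self-generated labels. The names `Model`, `generate_label`, and `finetune_on` are hypothetical placeholders, not the algorithm’s real interface.

```python
from typing import Protocol, Sequence, Tuple


class Model(Protocol):
    """Placeholder interface for a pretrained LM (hypothetical)."""

    def generate_label(self, example: str) -> str:
        """Have the model itself propose a label for an unlabeled example."""
        ...

    def finetune_on(self, data: Sequence[Tuple[str, str]]) -> "Model":
        """Return a copy of the model fine-tuned on (example, label) pairs."""
        ...


def elicit_without_external_labels(model: Model, unlabeled: Sequence[str]) -> Model:
    # 1. The pretrained model labels the data itself -- no human labels involved.
    self_labeled = [(x, model.generate_label(x)) for x in unlabeled]
    # 2. Fine-tune the model on its own labels to elicit its latent capabilities.
    return model.finetune_on(self_labeled)
```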
To steer pretrained language models for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for...