All of dmz's Comments + Replies

dmz144

(I work on the Alignment Stress-Testing team at Anthropic and have been involved in the RSP update and implementation process.)

Re not believing Anthropic's statement:

> we believe the risk of substantial under-elicitation is low

To be more precise: there was significant under-elicitation but the distance to the thresholds was large enough that the risk of crossing them even with better elicitation was low.

dmz32

Unfortunate name collision: you're looking at numbers on the AI2 Reasoning Challenge, not Chollet's Abstraction & Reasoning Corpus.

1Morpheus
Thanks for clarifying! I just tried a few simple ones by prompting gpt-4o and gpt-4 and it does an absolutely horrific job! Maybe actually good prompting could help solve it, but this is definitely already an update for me!
dmzΩ230

That's right. We did some followup experiments doing the head-to-head comparison: the tools seem to speed up the contractors by 2x for the weak adversarial examples they were finding (and anecdotally speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday. 

dmzΩ110

The main reason is that we think we can learn faster in simpler toy settings for now, so we're doing that first. Implementing all the changes I described (particularly changing the task definition and switching to fine-tuning the generator) would basically mean starting over from scratch anyway.

dmzΩ230

Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)

dmzΩ230

I read Oli's comment as referring to the 2.4% -> 0.002% failure rate improvement from filtering.

4paulfchristiano
Ah, that makes sense. But the 26 minutes --> 13 minutes is from adversarial training holding the threshold fixed, right?
dmzΩ230

Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don't differ much in performance even at noticeably higher thresholds.

For future projects, I think I'm most excited about confronting the problem directly by building techniques that can succeed at sampling errors even when they're extremely rare.

dmz20

My worry is that it's not clear what exactly we would learn. We might get performance degradation just because the classifier has been trained on such a different distribution (including a different generator). Or it might be totally fine because almost none of those tasks involve frequent mention of injury, making it trivial. Either way IMO it would be unclear how the result would transfer to a more coherent setting.

(ETA: That said, I think it would be cool to train a new classifier on a more general language modeling task, possibly with a different notion of catastrophe, and then benchmark that!)

1aog
Coming back to this: Your concern makes sense to me. Your proposal to train a new classifier for filtered generation to improve performance on other tasks seems very interesting. I think it might also be useful to simply provide a nice open-source implementation of rejection sampling in a popular generator repo like Facebook's OPT-175B, so that future researchers can build on it.

I'm planning on working on technical AI safety full-time this summer. Right now I'm busy applying to a few different programs, but I'll definitely follow up on this idea with you.
dmz20

Given that our classifier has only been trained on 3-sentence prompts + 1-sentence completions, do you think it can be applied to normal benchmarks? It may well transfer okay to other formats.

1aog
For sure, benchmarking can still be useful even if the classifier is less powerful than it could be. My main question is: How well does the generator model perform after rejection sampling? You could imagine that the rejection sampling process is destructive for output quality. But the initial results from your paper indicate the opposite -- rejection sampling does not reduce human preference for generated outputs, so we would hope that it does not reduce benchmark performance either.

For example, MultiRC is a popular reading comprehension benchmark where the generator model takes a prompt consisting of a text passage, a question about the passage, and a list of possible answers. The generator then labels each answer as true or false, and is graded on its accuracy. Evaluation on MultiRC would show us how the classifier affects the generator's QA skills.

GPT-Neo has already been evaluated on a wide suite of benchmarks. To prove that rejection sampling is performance competitive with unfiltered generation, you would not need to achieve SOTA performance on these benchmarks -- you'd simply need to be competitive with unfiltered GPT-Neo.
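(For readers unfamiliar with the setup being discussed: rejection sampling here means drawing completions from the generator until the safety classifier accepts one. A minimal sketch, where `toy_generate`, `toy_classify`, and the threshold are hypothetical stand-ins for the real generator and injury classifier:)

```python
import random

def rejection_sample(generate, classify, threshold=0.5, max_tries=100):
    """Draw candidates from `generate` until `classify` scores one at or
    above `threshold`, or give up after `max_tries` attempts."""
    for _ in range(max_tries):
        completion = generate()
        if classify(completion) >= threshold:
            return completion
    return None  # caller falls back, e.g. to a canned safe completion

# Toy stand-ins for the real generator and classifier:
def toy_generate():
    return random.choice([
        "The hikers reached the summit and took photos.",
        "He slipped and suffered a serious injury.",
    ])

def toy_classify(text):
    # The real system uses a trained classifier's safety score;
    # this keyword check is purely illustrative.
    return 0.0 if "injury" in text.lower() else 1.0

print(rejection_sample(toy_generate, toy_classify))
```

The filtered generator is just another text-in, text-out function, which is why it can be dropped into standard benchmark harnesses unchanged.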
dmzΩ10180

Excellent question -- I wish we had included more of an answer to this in the post.

I think we made some real progress on the defense side -- but I 100% was hoping for more and agree we have a long way to go.

I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven't actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We're interested in people giving it a shot! :)


dmz20

I don't think I believe a strong version of the Natural Abstractions Hypothesis, but nevertheless my guess is GPT-3 would do quite a bit better. We're definitely interested in trying this.

dmz60

Thanks! :)

Good question. Surge ran some internal auditing processes for all our quality data collection. We also checked 100 random comparisons ourselves for an earlier round of data and they seemed reasonable: we only disagreed with 5 of them, and 4 of those were a disagreement between "equally good" / "equally bad" and an answer one way or the other. (There were another 10 that seemed a bit borderline.) We don't have interrater reliability numbers here, though - that would be useful.
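(The inter-rater reliability number mentioned above could be something like Cohen's kappa: agreement between two raters, corrected for chance. A minimal sketch; the label names and example ratings below are purely illustrative, not from the actual comparisons:)

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters independently pick
    # the same label, given each rater's marginal label frequencies.
    chance = sum(counts_a[lbl] * counts_b[lbl]
                 for lbl in set(counts_a) | set(counts_b)) / n**2
    return (observed - chance) / (1 - chance)

a = ["better", "worse", "equal", "better"]  # hypothetical ratings
b = ["better", "worse", "worse", "better"]
print(cohens_kappa(a, b))  # 0.6
```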

2aog
Would you be interested in running a wider array of few-shot performance benchmarks? No performance loss on few shot generation is a bold but defensible claim, and it would be great to have stronger evidence for it. I’d be really interested in doing the legwork here if you’d find it useful. Fantastic paper, makes a strong case that rejection sampling should be part of the standard toolkit for deploying pretrained LMs on downstream tasks.
dmz10

Neat! We tried one or two other saliency scores but there's definitely a lot more experimentation to be done.

dmz*30

Some small errors in the full version of PainScience.com's article about patellofemoral pain syndrome:

  • Argues that running is good for your joints using a correlational study stating that 9% of marathon runners have arthritis vs 18% of non-runners. (Obvious explanation: people with arthritis are less likely to run marathons.)
  • Incorrect unit conversions in pressure: e.g. 4000 psi is ~280 kg-force / cm^2, not 715.
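The second correction is easy to verify directly from the definitions (1 psi = 1 lbf/in^2, with 1 lbf ≈ 0.4536 kgf and 1 in = 2.54 cm):

```python
# Convert pounds-force per square inch to kilograms-force per square cm.
LBF_TO_KGF = 0.45359237  # pounds-force -> kilograms-force
IN_TO_CM = 2.54          # inches -> centimetres

def psi_to_kgf_per_cm2(psi):
    return psi * LBF_TO_KGF / IN_TO_CM**2

print(round(psi_to_kgf_per_cm2(4000)))  # 281, i.e. ~280, not 715
```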