LESSWRONG
LW

tamera — LessWrong

The idea here is that we shouldn't trust CoT blindly - instead, we should measure whether or not it is faithful, and use that as a criterion for if it is a good mechanism for oversight or not. If a model's CoT is measurably unfaithful on a certain task, we shouldn't trust CoT-based oversight there. Importantly, we could empirically test the CoT for faithfulness, and discover that it is unacceptably low before making the decision to trust it.

If we only want to trust our models to take high-stakes actions when we can provide adequate oversight via the CoT, and we run the tests and find that the CoT isn't faithful for large... (read more)

Measuring and Improving the Faithfulness of Model-Generated Reasoning

Ansh Radhakrishnan

Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman, Ethan Perez

TL;DR: In two new papers from Anthropic, we propose metrics for evaluating how faithful chain-of-thought reasoning is to a language model's actual process for answering a question. Our metrics show that language models sometimes ignore their generated reasoning and other times don't, depending on the particular task + model size combination. Larger language models tend to ignore the generated reasoning more often than smaller models, a case of inverse scaling. We then show that an alternative to chain-of-thought prompting — answering questions by breaking them into subquestions — improves faithfulness while maintaining good task performance.

Paper Abstracts

Measuring Faithfulness in Chain-of-Thought Reasoning

Large language models (LLMs) perform better when they produce step-by-step, “Chain-of -Thought” (CoT)

... (read 1547 more words →)

111

Replying toExternalized reasoning oversight: a research direction for language model alignment

tamera4y

Externalized reasoning oversight: a research direction for language model alignment

The main thing I want to address with this research strategy is language models using reasoning that we would not approve of, which could run through convergent instrumental goals like self-preservation, goal-preservation, power-seeking, etc. It doesn't seem to me that the failure mode you've described depends on the AI doing reasoning of which we wouldn't approve. Even if this research direction were wildly successful, there would be many other failure modes for AI; I'm just trying to address this particularly pernicious one.

It's possible that the transparency provided by authentic, externalized reasoning could also be useful for reducing other dangers associated with powerful AI, but that's not the main thrust of the research direction I'm presenting here.

Replying toExternalized reasoning oversight: a research direction for language model alignment

tamera4y

Externalized reasoning oversight: a research direction for language model alignment

Ideally, if reasoning is displayed that isn't really causally responsible for the conclusion that should be picked up by the tests we create. Tests 2 and 4 in the linked doc start to get at this (https://hackmd.io/@tamera/HJ7iu0ST5-REMOVE, but remove "-REMOVE"), and we could likely develop better tests along these same lines.

Even if we failed to create tests that check for this, the architectures / training processes / prompting strategies that we create might make us more assured that the reasoning is legit. For example, the selection-inference strategy used by DeepMind breaks the reasoning into multiple context windows, which might reduce the chance of long chains of motivated reasoning.

Externalized reasoning oversight: a research direction for language model alignment

tamera

Summary

In this post, I’ll present a research direction that I’m interested in for alignment of pretrained language models.

TL;DR: Force a language model to think out loud, and use the reasoning itself as a channel for oversight. If this agenda is successful, it could defeat deception, power-seeking, and other forms of disapproved reasoning. This direction is broadly actionable now.

In recent publications, prompting pretrained language models to work through logical reasoning problems step-by-step has provided a boost to their capabilities. I claim that this externalized reasoning process may be used for alignment if three conditions are met:

We develop trustworthy tests to check if the reported reasoning is authentic (complete, straightforward, and causally responsible for

... (read 1735 more words →)

140

Replying toRelaxed adversarial training for inner alignment

tamera4y

Relaxed adversarial training for inner alignment

I'm not sure what's going on with the types in this equation, at the start of the formalization section:

I'd think that the left side represents a pseudo-input, while the right represents an action. Am I missing something?