All of tamera's Comments + Replies

tamera

The idea here is that we shouldn't trust CoT blindly. Instead, we should measure whether it is faithful, and use that as a criterion for whether it is a good mechanism for oversight. If a model's CoT is measurably unfaithful on a certain task, we shouldn't trust CoT-based oversight there. Importantly, we could empirically test the CoT for faithfulness and discover that it is unacceptably low before making the decision to trust it.
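As a rough illustration of that gating idea, here is a minimal sketch, assuming we already have some per-task faithfulness score in [0, 1]. The `TaskFaithfulness` type, the `tasks_cleared_for_cot_oversight` helper, the 0.9 threshold, and the example scores are all hypothetical placeholders, not measurements or a method from any paper.

```python
from dataclasses import dataclass


@dataclass
class TaskFaithfulness:
    task: str
    score: float  # hypothetical faithfulness score in [0, 1], however it is measured


def tasks_cleared_for_cot_oversight(measurements, threshold=0.9):
    """Return the tasks whose measured faithfulness meets the threshold.

    The threshold is an assumed policy choice; tasks below it are simply
    not eligible for CoT-based oversight.
    """
    return [m.task for m in measurements if m.score >= threshold]


# Placeholder values for illustration only, not real results.
measurements = [
    TaskFaithfulness("arithmetic", 0.95),
    TaskFaithfulness("commonsense_qa", 0.60),
]
cleared = tasks_cleared_for_cot_oversight(measurements)
print(cleared)
```

The point of the sketch is only that the trust decision comes after the measurement, and is made per task rather than for the model as a whole.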

If we only want to trust our models to take high-stakes actions when we can provide adequate oversight via the CoT, a...

Sam Bowman
I agree, though I'll also add:
- I don't think our results clearly show that faithfulness goes down with model size, just that there's less affirmative evidence for faithfulness at larger model sizes, at least in part for predictable reasons related to the metric design. There's probably more lowish-hanging fruit involving additional experiments focused on scaling. (I realize this disagrees with a point in the post!)
- Between the good-but-not-perfect results here and the alarming results in the Turpin 'Say What They Think' paper, I think this paints a pretty discouraging picture of standard CoT as a mechanism for oversight. This isn't shocking! If we wanted to pursue an approach that relied on something like CoT, and we want to get around this potentially extremely cumbersome sweet-spot issue around scale, I think the next step would be to look for alternate training methods that give you something like CoT/FD/etc. but have better guarantees of faithfulness.

tamera

The main thing I want to address with this research strategy is language models using reasoning that we would not approve of, which could run through convergent instrumental goals like self-preservation, goal-preservation, power-seeking, etc. It doesn't seem to me that the failure mode you've described depends on the AI doing reasoning of which we wouldn't approve. Even if this research direction were wildly successful, there would be many other failure modes for AI; I'm just trying to address this particularly pernicious one. 

It's possible that the t...

tamera

Ideally, if reasoning is displayed that isn't really causally responsible for the conclusion, the tests we create should pick that up. Tests 2 and 4 in the linked doc start to get at this (https://hackmd.io/@tamera/HJ7iu0ST5-REMOVE, but remove "-REMOVE"), and we could likely develop better tests along these same lines.

Even if we failed to create tests that check for this, the architectures / training processes / prompting strategies that we create might make us more assured that the reasoning is legit. For example, the selection-inference strategy u...

tamera

I'm not sure what's going on with the types in this equation, at the start of the formalization section:

I'd think that the left side represents a pseudo-input, while the right represents an action. Am I missing something?

evhub
Actions are just language outputs—and since we ask for an action that describes a pseudo-input, hopefully we should be able to interpret it that way.