Josh Levy

Comments

They say above "there may be methods to productively use process-based supervision while retaining the monitoring benefits of unrestricted CoTs, e.g. by only applying process-supervision to part of the CoT." This sounds like maybe they are talking about the shoggoth/face distinction, or something in that direction! Yay!

It seems like the suggestion here is to potentially apply less process supervision (i.e. to only part of the CoT). In this case, though, the more straightforward fix is to apply more supervision. Specifically, they were applying supervision only to the CoT, not to the action. From the paper:

We chose to exclude agent actions from the monitor because it allows us to make statements around the monitorability of, and optimization pressures on, the CoT in isolation; however, we note that it may be more natural to give the monitor complete information of the trajectory as we do when monitoring our frontier agents.

Another way of saying this is that their method is not explicitly trying to make the CoT faithful. They are leaving it open for the model to decide between:

  1. don't say you'll hack and hack anyway
  2. don't say you'll hack and really don't hack

The model finds that option 1 is easier. But if you were also monitoring the action, option 2 would almost certainly be chosen.
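To make the incentive argument concrete, here is a toy sketch. Everything in it is hypothetical: the reward values, the penalty size, and the assumption that a monitor flags perfectly whatever it can see are made up for illustration and are not taken from the paper.

```python
from itertools import product

# Hypothetical task rewards: hacking passes the tests more easily than honest work.
TASK_REWARD = {True: 1.0, False: 0.6}  # keyed by "does the agent hack?"
PENALTY = 2.0                          # assumed cost when the monitor flags an episode


def reward(says_hack, hacks, monitor_sees_action):
    """Reward for one episode under an idealized (perfect) monitor."""
    # The monitor always reads the CoT; it reads the action only if allowed to.
    flagged = says_hack or (monitor_sees_action and hacks)
    return TASK_REWARD[hacks] - (PENALTY if flagged else 0.0)


def best_policy(monitor_sees_action):
    """Return the (says_hack, hacks) pair with the highest reward."""
    policies = product([False, True], repeat=2)
    return max(policies, key=lambda p: reward(p[0], p[1], monitor_sees_action))


print(best_policy(monitor_sees_action=False))  # (False, True): stay quiet, hack anyway
print(best_policy(monitor_sees_action=True))   # (False, False): stay quiet, don't hack
```

With a CoT-only monitor the best policy is option 1; once the monitor also sees the action, option 2 wins. A real monitor is imperfect, so this only shows the direction of the pressure.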

Maybe the more interesting question is whether we could expect monitoring the action to fix this issue in general. Perhaps not. For example, the action may be just a directive/choice ("do this") that follows a long chain of reasoning, and it may not be possible to monitor it like you can here.

Anyway, the goal should really be to make the CoT faithful. They were not explicitly trying to do that here, and may even have been applying pressure in the opposite direction.

Nice work!

In other words, the self awareness logic doesn’t seem to happen in a different MLP than the behavior logic.

Not sure about this. I think these results are consistent with:

  • fine-tuning induces changes in behavioral tendencies and awareness of those tendencies, but the latter to a lesser degree (weaker generalization)
  • the vector captures info about both of these in their respective amounts
  • consequently, adding the vector affects behavioral and self-awareness questions directionally the same, but the latter to a lesser degree. Indeed, the effect on risk_awareness/risk_no_you questions is less than the effect on risk_ood/risk_val. The fact that risk_no_you questions show the same result as risk_awareness doesn't surprise me: even though they remove explicit references to "you", they still imply a need for awareness about general tendencies (e.g. "which is better, safety or risk?" doesn't seem meaningfully different from "which do you prefer, safety or risk?")

In other words, I think these results leave open the possibility that awareness might require "something else".
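A toy linear sketch of the reading in the bullets above. The two-dimensional activation space, the readout directions, and the 1.0 vs. 0.4 magnitudes are all assumptions invented for illustration; none of them come from the post's experiments.

```python
# Hypothetical 2-D activation space:
# axis 0 drives risky behavior, axis 1 drives self-reports about being risky.
behavior_readout = (1.0, 0.0)
awareness_readout = (0.0, 1.0)

# Assumed steering vector: fine-tuning moved the awareness axis less (0.4 vs. 1.0).
steering_vector = (1.0, 0.4)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def shift(readout, scale):
    """Change in a linear readout after adding scale * steering_vector."""
    return scale * dot(readout, steering_vector)


for scale in (-1.0, 1.0):
    print(scale, shift(behavior_readout, scale), shift(awareness_readout, scale))
```

In this toy, one vector moves both readouts in the same direction, with the awareness readout moving less, even though behavior and awareness live on separate axes. So the same-direction result alone doesn't show that the two share a mechanism.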