alexandraabbas

Wiki Contributions

Comments

Sorted by

Initially we did some keyword matching to detect refusal but that proved to be unreliable so we reviewed all completions manually.

Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So as you said steering at these layers breaks the model.

"[...] This is because there would be no general direction towards a truth-based belief domain or away from using human modeling in output generation."

What do you mean by "human modeling in output generation"?