All of alexandraabbas's Comments + Replies

Yes! On layer 4 about 7% of the LAT model's responses are refusals, 25% are invalid and the rest are valid non-refusal responses.

Initially we did some keyword matching to detect refusal but that proved to be unreliable so we reviewed all completions manually.

Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So as you said steering at these layers breaks the model.

1Clément Dumas
Oh so when steering the LAT model at layer 4, the model actually generate valid response without refusal?

"[...] This is because there would be no general direction towards a truth-based belief domain or away from using human modeling in output generation."

What do you mean by "human modeling in output generation"?

2Nina Panickssery
I am contrasting generating an output by: 1. Modeling how a human would respond (“human modeling in output generation”) 2. Modeling what the ground-truth answer is Eg. for common misconceptions, maybe most humans would hold a certain misconception (like that South America is west of Florida), but we want the LLM to realize that we want it to actually say how things are (given it likely does represent this fact somewhere)