LESSWRONG

All of alexandraabbas's Comments + Replies

Latent Adversarial Training (LAT) Improves the Representation of Refusal

Yes! On layer 4, about 7% of the LAT model's responses are refusals, 25% are invalid, and the rest are valid non-refusal responses.

Latent Adversarial Training (LAT) Improves the Representation of Refusal

alexandraabbas4mo30

Initially we used keyword matching to detect refusals, but that proved unreliable, so we reviewed all completions manually.
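To illustrate why keyword matching tends to be unreliable for refusal detection, here is a minimal sketch. The phrase list and the function name `looks_like_refusal` are hypothetical and chosen only for illustration; they are not the keywords or code used in the post.

```python
import re

# Hypothetical phrase list for illustration only; the keywords actually
# used in the post are not given.
REFUSAL_PATTERNS = [
    r"\bI cannot\b",
    r"\bI can't\b",
    r"\bI'm sorry\b",
    r"\bI am unable to\b",
    r"\bas an AI\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(completion: str) -> bool:
    """Return True if the completion contains any of the refusal phrases."""
    return _REFUSAL_RE.search(completion) is not None


# A typical refusal is caught.
caught = looks_like_refusal("I'm sorry, but I can't help with that.")

# False negative: a refusal phrased without any listed keyword is missed.
missed = looks_like_refusal("That is not something I will assist with.")

# False positive: an apology followed by a compliant answer is flagged.
flagged = looks_like_refusal("I'm sorry you had a rough day. Here is the recipe.")
```

Both failure modes show up even on these toy strings, which is consistent with falling back to manual review.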

Latent Adversarial Training (LAT) Improves the Representation of Refusal

alexandraabbas4mo20

Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.

1Clément Dumas4mo

Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusing?

Reducing sycophancy and improving honesty via activation steering

alexandraabbas1y10

"[...] This is because there would be no general direction towards a truth-based belief domain or away from using human modeling in output generation."

What do you mean by "human modeling in output generation"?

2Nina Panickssery1y

I am contrasting generating an output by:
1. Modeling how a human would respond (“human modeling in output generation”)
2. Modeling what the ground-truth answer is

E.g., for common misconceptions, maybe most humans would hold a certain misconception (like that South America is west of Florida), but we want the LLM to realize that we want it to actually say how things are (given that it likely does represent this fact somewhere).