Steering LLMs' Behavior with Concept Activation Vectors
Several recent studies have reported a mechanism called Activation Steering, which can influence the behavioral style of large language models (LLMs), mainly their refusal behavior [1] and language usage [2]. This mechanism resembles the safety concept activation vectors (SCAVs) [3] that we proposed earlier this year. We’ve...