Ruixuan Huang


An undergraduate student at USTC CSE, currently interning in the MSRA social computing group. Recent research interests center on XAI-related issues.

Steering LLMs' Behavior with Concept Activation Vectors

Recently, several studies have reported a mechanism called Activation Steering, which can influence the behavioral style of large language models (LLMs), mainly their refusal capabilities [1] and language usage [2]. This mechanism resembles the safety concept activation vectors (SCAVs) [3] that we proposed earlier this year. We’ve...

Sep 28, 2024

Exploring the Evolution and Migration of Different Layer Embedding in LLMs

[Edit on 17th Mar] After running experiments on more data points (5000 texts) from the Pile dataset, which draws on more sample sources, we are confident that the experimental results described earlier are reliable. We have therefore released the code. Recently, we conducted several experiments focused on the evolution and migration of token...

Mar 8, 2024

LESSWRONG