My concern is that interpretability may be dangerous, or may lead to a higher P(doom), in a different way.
The problem is that if we have a better way of steering LLMs toward a particular value system, how can we guarantee that the value system is the right one? For example, steering LLMs toward a given value system could easily be abused to mass-generate fake news that is more ideologically consistent and on-message. Steering can also make LLMs omit information that would offer a neutral point of view. This seems like a different form of "doom" compared with AI taking full control.