My concern is that interpretability may be dangerous, or may lead to a higher P(doom), in a different way.
The problem is that if we have a better way of steering LLMs toward a particular value system, how can we guarantee that the value system is the right one? For example, steering LLMs toward a given value system could easily be abused to mass-generate fake news that is more ideologically consistent and on-message. Steering can also make LLMs omit information that would offer a neutral point of view. This seems like a different form of "doom" compared with AI taking full control.