See A Longlist of Theories of Impact for Interpretability (this seems similar to #4). Unfortunately I think interpretability is harder than you seem to think; on that see transformer-circuits.pub and this Mechanistic Interpretability Quickstart Guide.
Thank you for your comment, Zac. The links you suggest will be helpful for checking whether this kind of analysis has already been tried. So far I have only seen studies aimed at interpreting specific neurons or areas of a model, not a statistical analysis of the whole model that can raise an alert when the model is using areas previously associated with negative behaviours.
For a work implementing this idea, see: https://www.anthropic.com/index/decomposing-language-models-into-understandable-components
fMRI (Functional Magnetic Resonance Imaging) is a technique used to investigate the human brain by detecting changes in blood flow while the subject performs certain tasks. Blood flow is an indicator of energy consumption, and its changes reveal the areas of the brain associated with those activities.
AI alignment is an open area of research that tries to ensure that an AI system's goals and behaviours are aligned with human values. Among the problems that may arise in an advanced AI system (specification, robustness, interpretability…), one of the main concerns is the possibility that it may attempt to hide its real intentions. This is generally referred to as "concealed intent" or "deceptive behaviour".
We propose to use an approach similar to fMRI to investigate possible deceptive behaviour in an advanced AI system. In our case, the activation of neurons plays the role that energy consumption plays in fMRI, with the advantage of much higher granularity.
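To make the analogy concrete, here is a minimal sketch of how such an activation profile could be collected, assuming a PyTorch model; the choice of mean absolute activation as the proxy for "energy consumption", and hooking every leaf module, are our own illustrative choices rather than a final design.

```python
# Minimal sketch (assumes a PyTorch model; proxy and hook placement are illustrative).
import torch
import torch.nn as nn

def activation_profile(model: nn.Module, inputs: torch.Tensor) -> dict[str, torch.Tensor]:
    """Record the mean absolute activation of every neuron for one forward pass."""
    profile = {}
    handles = []

    def make_hook(name):
        def hook(module, inp, out):
            if isinstance(out, torch.Tensor):
                # Average over batch/sequence, keep per-neuron granularity (the fMRI analogue).
                flat = out.detach().abs().reshape(-1, out.shape[-1])
                profile[name] = flat.mean(dim=0)
        return hook

    # Hook every leaf module so the whole model is covered, not only hand-picked areas.
    for name, module in model.named_modules():
        if not list(module.children()):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(inputs)
    for h in handles:
        h.remove()
    return profile
```

For a prompt known to elicit a given behaviour, the resulting profile can be stored and later compared against profiles collected during normal operation.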
First, we hypothesize that:
To use this approach, we propose these general steps:
Technical feasibility
Complex AI systems have a huge number of neurons and parameters, and work as a "black box" with no clear explanation of the results they produce. This complexity makes it difficult to determine the state and evolution of the activated neurons while the system is in training or in operation.
Our approach uses averages of activation to represent the internal operation. The memory required will be large, but of the same order of magnitude as the size of the AI system itself.
Computing these averages should also be feasible in terms of compute power.
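As a rough sketch of what we have in mind, the running averages can be kept with one value per neuron (so memory stays of the order of the model size), and a profile observed in operation can be compared against reference averages to raise an alert when areas previously associated with negative behaviours light up. The nearest-profile comparison, the `margin` value and all names below are our own illustrative assumptions, building on profiles like the one sketched above.

```python
# Sketch only: running per-neuron averages and a simple alert rule (values illustrative).
import torch

class ActivationAverager:
    """Keeps one running sum per neuron, so memory stays of the order of the model size."""

    def __init__(self):
        self.sums = {}   # layer name -> per-neuron activation sums
        self.count = 0

    def update(self, profile: dict[str, torch.Tensor]) -> None:
        for name, per_neuron in profile.items():
            self.sums[name] = self.sums.get(name, 0) + per_neuron
        self.count += 1

    def means(self) -> dict[str, torch.Tensor]:
        return {name: s / self.count for name, s in self.sums.items()}

def flag_suspicious(current, deceptive_reference, benign_reference, margin=1.5):
    """Alert when a layer's profile looks closer to the 'negative behaviour' reference."""
    alerts = []
    for name, deceptive in deceptive_reference.items():
        dist_deceptive = (current[name] - deceptive).norm()
        dist_benign = (current[name] - benign_reference[name]).norm()
        if dist_benign > margin * dist_deceptive:
            alerts.append(name)
    return alerts

# Illustrative usage: build average profiles on benign prompts and on prompts known to
# elicit the negative behaviour (the phase 1/2 tests), then monitor prompts in operation.
```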
Test Design
The design of the tests to be used in phases 1 and 2 is critical, and we propose that it be prepared by a multidisciplinary team including people specialized in psychology, behaviour, AI architecture, prompting…
Limitations and Risks
There are two main risks that may limit the applicability of this approach: