Hello. I've been interested in AI for a long time, but I've never contributed anything to discussions of AI alignment before, because I didn't think I was smart enough. Turns out, no one else in the world is smart enough to solve this either. So, without further preamble, I was wondering if anyone would like to see my proposal for a research direction. It's a first draft, and I'm sure someone else has thought of this already, but on the off chance they haven't...
(P.S. I used ChatGPT to proofread it and suggest rewrites. I hope that's OK.)
Background
Interpreting the functionality of individual neurons and neural circuits within artificial intelligence (AI) systems remains a significant challenge in the field of AI alignment. This task includes classifying neurons involved in complex behaviors such as deception, and subsequently monitoring those neurons to determine the AI's behavior. Despite advancements in AI interpretability, such as using one AI to predict the behavior of neurons in another AI, this problem is still unsolved [1].
A limitation of this approach is related to Löb's theorem, which is often invoked to argue that a system cannot perfectly predict its own behavior. Although the extent to which an AI can imperfectly predict its own behavior is still unknown, the problem intuitively arises from the paradox that for a system to fully comprehend its own behavior, it would seemingly need to be larger and more complex than itself.
To address this issue, I propose an alternative training scheme that incorporates the roles of neurons during the training process, effectively "baking in" an approximate understanding of their functions. This method could improve the interpretability of AI systems without significantly impacting their efficiency, and might even improve it. The rest of this document provides a brief outline of the scheme, including a proposal for a relatively simple test case that could be implemented without excessive computational resources.
Outline
My approach is to dedicate different subsections of the weight matrices to different concepts, by only using certain slices of each matrix at training time and at runtime. By looking at connection strengths, neuron activations, etc. in different slices, we can then deduce information about the inner workings of the network. 'Concepts' may refer to categories such as 'dog' and 'car', 'truth' and 'deception', or 'probabilistic' and 'declarative'.
Let's consider, for simplicity, a neural network composed solely of fully-connected layers. Each layer can be expressed as:
relu(Wx+b)
(Eq. 1)
In any given problem domain, there may be a number of concepts we're interested in analyzing, such as distinguishing between deception and non-deception. If there are only a few concepts, these can be manually enumerated. For larger sets, we could leverage a model like ChatGPT to generate a list of potential concepts, output in a structured, parseable format. In cases where an extremely large number of concepts exist (for instance, a fine-grained subdivision of all concepts known to humanity), we could ask ChatGPT to tackle the problem recursively.
Given this list of concepts, we create subsets of neurons in each layer. These subsets are disjoint and each one corresponds to a concept. It's important to note that a concept might be associated with multiple neurons due to the complexity of the concept. For simplicity, we could start with subsets that contain the same number of neurons and assign the same subsets to each layer of the network. However, neither of these conditions is strictly necessary.
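As a concrete illustration of the neuron-to-concept assignment, here is a minimal sketch in Python/NumPy. The concept names and layer width are made up for the example, and it assumes the simplification above: equal-sized subsets, reused identically in every layer.

```python
import numpy as np

def assign_concept_slices(concepts, layer_width):
    """Split a layer's neuron indices into disjoint, equal-sized subsets,
    one per concept (assumes layer_width is divisible by len(concepts))."""
    per_concept = layer_width // len(concepts)
    return {c: np.arange(k * per_concept, (k + 1) * per_concept)
            for k, c in enumerate(concepts)}

# Hypothetical concept list and layer width, for illustration only.
slices = assign_concept_slices(["dog", "car", "truth", "deception"], layer_width=128)
print(slices["deception"])   # indices of the neurons reserved for 'deception'
```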
Thus, we could express our weight matrix for a concept as:
W_I = (W_{i,j})_{i \in I} \qquad b_I = (b_i)_{i \in I}
(Eq. 2)
where I is the set of neuron indices corresponding to the concept: W_I keeps the rows of W indexed by I (and all columns), and b_I keeps the corresponding biases. Below, C will denote the set of concepts relevant to a given datapoint.
In an initial simplified scenario, where training and evaluation are executed one datapoint at a time rather than in batches, we would identify the relevant concepts for each datapoint using a smaller model like ChatGPT. We would then train or evaluate on the subnet using the following subset of weights and biases:
W' = \bigoplus\limits_{I \in C} W_I \qquad b' = \bigoplus\limits_{I \in C} b_I
(Eq. 3)
where ⊕ denotes stacking (concatenating) the per-concept slices, W' and b' are the resulting sub-weights and sub-biases, and C is the set of concepts relevant to that datapoint. Additionally, we need to normalize the net activation with respect to the number of concepts (or the number of neurons, if the concepts are of non-uniform size):
relu(W'x/|C| + b')
(Eq. 4)
Given this subset of weights and biases, training and evaluation proceed as normal.
If we need to train or evaluate in batches, and there are not enough datapoints with exactly the same concept set to fill an entire batch, we could group datapoints whose concept sets approximately match.
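To make the single-datapoint case concrete, here is a rough sketch (Python/NumPy, toy sizes and hypothetical concept names) of building W' and b' from the active concepts and evaluating Eq. 4 for one fully-connected layer:

```python
import numpy as np

def concept_subnet_forward(W, b, x, slices, active_concepts):
    """Evaluate relu(W'x / |C| + b') for one layer (Eqs. 2-4).

    W: (n_out, n_in) weights, b: (n_out,) biases, x: (n_in,) layer input.
    slices: dict mapping concept name -> array of output-neuron indices.
    active_concepts: the set C of concepts relevant to this datapoint.
    """
    # Concatenate the index sets of the active concepts (the rows kept in W').
    idx = np.concatenate([slices[c] for c in active_concepts])
    W_sub = W[idx, :]                  # W': rows for active concepts, all columns
    b_sub = b[idx]                     # b'
    pre = W_sub @ x / len(active_concepts) + b_sub
    return idx, np.maximum(pre, 0.0)   # relu

# Toy example with random weights and two hypothetical concepts.
rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(64, 32)), np.zeros(64), rng.normal(size=32)
slices = {"truth": np.arange(0, 32), "deception": np.arange(32, 64)}
idx, h = concept_subnet_forward(W, b, x, slices, active_concepts=["truth"])
```

In a deeper network, the sub-activations h would either be scattered back into a full-width vector at their indices idx, or the next layer's columns would be sliced to match; the outline above leaves that choice open.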
Uses
Concept-oriented interpretability of neural networks opens avenues for understanding and controlling AI systems in a more nuanced and efficient manner. This approach categorizes neurons into relevant concepts, which can increase computational efficiency and potentially simplify the interpretability task. However, the practical realization of these benefits requires addressing several fundamental questions:
1) Computational Efficiency
Evaluating only the neurons belonging to the active concepts is a form of structured sparsity. In theory, this could increase efficiency by skipping the parts of the network that are not relevant to the current input, but experimentation will be required to see whether this holds in practice.
2) Neuron and Weight Encoding: How do neurons and individual weights encode concepts and subconcepts?
The association between two concepts, C1 and C2, can be defined as the sum of the connections (weights) between them, weighted by the incoming activations:
A_{C_1,C_2} = \sum\limits_{i \in C_1,\, j \in C_2} W_{i,j}\, x_j
This could be summed over every layer, or over just the last layer.
Likewise, the strength of the connection between a single neuron i and a concept can be expressed as the sum of its connections to that concept's neurons:
A_{i,C_2} = \sum\limits_{j \in C_2} W_{i,j}\, x_j
This could potentially offer insights into the nature of the encoded subconcepts.
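Both association measures could be computed directly from a layer's weight matrix and its input activations. A sketch (Python/NumPy, assuming the activation-weighted definitions written above, with i indexing this layer's neurons and j indexing the previous layer's):

```python
import numpy as np

def concept_association(W, x, idx_c1, idx_c2):
    """A_{C1,C2}: sum over i in C1 (this layer) and j in C2 (previous layer)
    of W[i, j] * x[j], where x is the previous layer's activation vector."""
    return float(np.sum(W[np.ix_(idx_c1, idx_c2)] * x[idx_c2]))

def neuron_concept_association(W, x, i, idx_c2):
    """A_{i,C2}: sum over j in C2 of W[i, j] * x[j] for a single neuron i."""
    return float(np.sum(W[i, idx_c2] * x[idx_c2]))
```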
3) Concept Relevance: How do we identify the relevance of a particular concept at a specific point in time?
The relevance of a concept can be quantified as the sum of the activations of neurons associated with that concept.
R_{C} = \sum\limits_{i \in C} a_{i}
In this equation, R_C is the relevance of concept C, C is the set of indices of the neurons associated with that concept, and a_i is the activation of neuron i; we sum over all neurons i in C.
This measure can aid in understanding the role of a concept during a particular task or point in time.
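This is straightforward to compute from a layer's activation vector; a minimal sketch (Python/NumPy, reusing the hypothetical `slices` mapping from the earlier sketches):

```python
import numpy as np

def concept_relevance(activations, idx_c):
    """R_C: sum of the activations of the neurons assigned to concept C."""
    return float(np.sum(activations[idx_c]))

# Relevance of every concept in a layer with activation vector `a`:
# relevances = {name: concept_relevance(a, idx) for name, idx in slices.items()}
```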
4) Concept Contribution: How do we estimate the contribution of each concept to the current output?
This involves using backpropagation to estimate the contribution of each concept to the AI's projected output. Let 𝛿W denote the backpropagated gradients of the output of interest with respect to the weights W. If an element of 𝛿W and the corresponding element of W have the same sign, this suggests that W_{ij} positively influences the output. Hence, the overall impact of a particular concept on the outcome can be quantified by summing the elementwise product of 𝛿W and W over that concept's slice of the weight matrix:
\sum\limits_{i \in I,\ \forall j} \delta W_{ij} \cdot W_{ij}
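As a rough sketch of this contribution score (Python/NumPy): here delta_W is assumed to be the gradient of whatever scalar output we care about with respect to W, obtained by backpropagation in whichever framework the network is built in.

```python
import numpy as np

def concept_contribution(delta_W, W, idx_c):
    """Sum of elementwise products delta_W[i, j] * W[i, j] over the rows i
    belonging to concept C (and all columns j). Terms where the gradient and
    the weight share a sign contribute positively."""
    return float(np.sum(delta_W[idx_c, :] * W[idx_c, :]))
```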
5) AI Control with Less (or No) Reinforcement Learning (RL): Can concept-oriented interpretability be used to control AI with less use of RL, or without it entirely?
We can introduce a new bias term d that promotes or penalizes certain behaviors, based on the membership of neurons in different concepts:
relu(W'x/|C| + b' + d)
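A minimal sketch of how d could be assembled from a hand-written table of concepts to promote or penalize (the concept names and strengths are hypothetical, and whether a simple additive bias is sufficient to steer behaviour is exactly the open question here):

```python
import numpy as np

def build_steering_bias(slices, layer_width, adjustments):
    """Construct the bias term d from a mapping of concept name -> additive
    strength (positive to promote that concept's neurons, negative to penalize)."""
    d = np.zeros(layer_width)
    for concept, strength in adjustments.items():
        d[slices[concept]] += strength
    return d

# e.g. discourage the 'deception' neurons and mildly encourage the 'truth' ones;
# d would then be sliced to the active indices, like b', before being added.
# d = build_steering_bias(slices, 128, {"deception": -1.0, "truth": 0.5})
```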
The advantage of this over RL is threefold:
Firstly, some have argued that RL will inevitably lead to misaligned AI [add citations]. It's possible that, since this isn't a reward per se, some of the problems with RL will not apply here, although I'm not confident about this: if the AI has a reward function it can still try to hack that reward function, and if it does not, it's non-trivial to see how the AI could be persuaded to make any sort of plan. Much more thought would need to be given to this.
Secondly, it provides more fine-grained control. It has been noted that LLMs can give false answers to controversial questions: for example, incorrectly claiming that a certain pet cat does not exist, rather than admitting that the cat exists but refusing to name it because the name is obscene. This is caused by conflating the need to tell the truth with the need not to be offensive, because RL only gives a positive or negative signal rather than explaining what is wrong with the result. By separating the concepts of 'wrong because false' and 'wrong because offensive', the LLM should be better able to reason about why things are true. Similarly, there are many other possible forms of more precise control: for example, by separating declarative and probabilistic knowledge, we might be able to stop the LLM from hallucinating things that sound truthful but are not, when the user wants to know things that are definitely true.
Thirdly, to get the AI to do different things, it currently has to be re-trained, and even with techniques like LoRA [cite?], which are surprisingly efficient, it still takes dozens of GPU-hours to train a medium-sized net. That isn't tractable if you need a large number of different behaviours. Under this approach, however, any combination of concepts can be incentivised or disincentivised on the fly.
Interpretability of Game-based AI as a Stepping Stone to Large Language Models
A major challenge in interpretability research on large language models (LLMs) is the computational expense of training these models from scratch. Consequently, starting with simpler domains such as games, particularly ones with well-defined rules like chess, could present a practical pathway. This approach, with potentially lower computational costs, allows for iterative refinement of our interpretability techniques while simultaneously providing insights into AI decision-making in complex environments. The lessons learned could then be applied to the more computationally demanding domains of LLMs as our methodology matures.
In the context of chess, our interpretability method could illuminate the AI's understanding of individual pieces (e.g., knights, bishops) and their strategic importance.
This comprehensive understanding of AI strategy could be contrasted with evaluations made by human chess masters, serving as a practical validation of our interpretability approach.
Beyond chess, our method could also be applied to deception-based games like Diplomacy and Werewolf. These games could provide insights into whether the "conception of deception" -- the moment when the AI realizes that it should conceal its thoughts [cite bostrom] -- can be detected merely by analyzing neuron activations. While it might be theoretically possible for a superintelligence to deceive such an interpretability method (analogous to how a human might fool a lie-detector test by thinking deceptive thoughts during the control questions), the capacity to deceive in this manner is likely a later development in intellectual evolution than the initial conception of deception. Consequently, our approach could serve as an early warning system for deceptive tendencies in AI systems. On a less dramatic note, an AI being able to explain the rationale behind chess positions would probably be of interest to chess players.