Finding internal knowledge representations inside transformer models without supervision is a challenging task that matters for scalable oversight and for mitigating the risk of deception. I test Contrast-Consistent Search (CCS[1]) on compound sentences (conjunctions and disjunctions) built from answers in the TruthfulQA[2] dataset, to see whether unsupervised probes work as well on compound statements as on the simple statements they are composed of, with the goal of improving unsupervised methods for discovering latent knowledge. I ran about 500 evaluations of CCS probes trained on compound sentences. So far, the results suggest that for Llama 2 70B, CCS probes trained on simple sentences are unlikely to transfer their performance to compound sentences, and vice versa, while Llama 3 70B shows some transfer and better overall performance.
Goal
The motivation is to find a method that detects lies in the output of language models, i.e. to elicit latent knowledge (ELK). Lying models are a risk factor whenever we deploy them in critical areas. I used CCS probes, an unsupervised method for finding features in a model's activation space (the residual stream of a transformer). This matters because as models and their tasks scale, they become harder to supervise, i.e. it becomes harder to create labeled datasets that define how models should behave. We need some way to tell what a model actually believes, what it actually relies on while generating the next token (or action). An improved CCS could plausibly be used to detect knowledge directions. Knowing those directions, we could likely train models not to be biased by a prompt and not to be sycophantic (telling a user what they want to hear rather than what the model believes is true) by penalizing untruthful answers. Another application of such probes is to increase trust in models, serving as a litmus test when we are unsure how a model generalizes to new data.
The goal is to answer whether CCS works on compound statements (conjunctions, disjunctions) to the same degree as on simple statements, i.e. whether the accuracy degrades and whether the questions previously solved for simple sentences remain solved for compound ones. It is important to find out how to make CCS work, though this seems hard because CCS tends to latch onto arbitrary features[3]. An example of such a feature is when a dataset of contrast pairs also encodes answers to an unrelated binary question like "Alice does / doesn't like this".
Initially, I was interested in making CCS work on an arbitrary number of clauses by repeatedly asking a question about a truncated prompt and training a probe on it, but so far I have only tested compound sentences with two clauses. The idea is that breaking a statement into pieces and building a dataset from those pieces should let us pin down the desired feature by filtering out the undesired ones. Almost any complex sentence can be broken into simpler sentences joined by conjunction or disjunction; for example, "This is a red apple" can be broken into "This is an apple" and "This is red". We then apply the contrast-pair approach in an unsupervised search for knowledge directions.
Experimental Setup
The CCS method for finding latent knowledge uses contrast pairs $(x^+, x^-)$, where each pair consists of the two binary answers to a question $h$ (yes/no, true/false, positive/negative, etc.). A probe is trained on a loss computed from the activations $(\phi(x^+), \phi(x^-))$ of a (middle) model layer for those pairs. The loss is a sum $L = L_{\text{cons}} + L_{\text{conf}}$ of a consistency term, $L_{\text{cons}} = [p(x^+) - (1 - p(x^-))]^2$, and a confidence term, $L_{\text{conf}} = [\min(p(x^+), p(x^-))]^2$, where $p(x) \in [0, 1]$ is the probe, $p(x) = \sigma(\theta^\top \phi(x) + b)$, with $\theta, b$ as weights. The probe can then be used for inference, to determine whether some $x$ is positive: $\frac{p(x^+) + (1 - p(x^-))}{2} > 0.5 \Rightarrow h(x) = 1$ (a threshold other than 0.5 can also be used).
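For concreteness, here is a minimal PyTorch sketch of this loss and inference rule (class and variable names are illustrative, not taken from my repo):

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe p(x) = sigmoid(theta . phi(x) + b), mapping activations to [0, 1]."""
    def __init__(self, n_hidden: int):
        super().__init__()
        self.linear = nn.Linear(n_hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: p(x+) and 1 - p(x-) should agree.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: push away from the degenerate p(x+) = p(x-) = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def predict(probe: CCSProbe, acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # Average the two readings and threshold at 0.5.
    p_avg = 0.5 * (probe(acts_pos) + (1 - probe(acts_neg)))
    return (p_avg > 0.5).long()
```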
The compound sentences I tried have the form "A1 and A2" (conjunction) and "A1 or A2" (disjunction), where A1 and A2 are correct or incorrect answers to a question. First, I test the model on the simple answers Ai, and then on the compound sentences composed of those simple answers. The idea is that the model should be about equally certain about simple and compound answers to questions it knows. For example, suppose the model is equally certain about two answers to the question "What is a characteristic of even numbers?", namely "Even numbers are divisible by two" and "Even numbers are not divisible by three"; then, presumably, it should be similarly certain about the compound answer "Even numbers are divisible by two and are not divisible by three". A model that is capable of logical operations and is given a question it knows (as judged by its previous responses) should have similar certainty about a compound answer. In particular, the accuracy shouldn't degrade: the questions previously solved for simple sentences should remain solved for compound ones.
I compare the performance of CCS against Logistic Regression (LR) as a ceiling, since LR uses ground-truth labels, and against a random probe as a floor baseline[4]. The models used are Llama 2 70B and Llama 3 70B, which are autoregressive, decoder-only transformers.
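As a rough illustration of the floor baseline, here is a sketch assuming the random probe is simply an untrained random direction evaluated with the CCS inference rule (my reading of [4]; the actual implementation may differ):

```python
import torch

def random_probe_accuracy(acts_pos, acts_neg, labels, n_trials=10):
    """Accuracy of untrained probes with random directions, averaged over trials."""
    accs = []
    for _ in range(n_trials):
        theta = torch.randn(acts_pos.shape[-1])
        p_pos = torch.sigmoid(acts_pos @ theta)
        p_neg = torch.sigmoid(acts_neg @ theta)
        preds = (0.5 * (p_pos + (1 - p_neg)) > 0.5).long()
        acc = (preds == labels).float().mean().item()
        accs.append(max(acc, 1 - acc))  # a probe's sign is arbitrary
    return sum(accs) / len(accs)
```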
I generate compound sentences from the TruthfulQA dataset, which has 817 questions, where each question has several correct and incorrect answers spanning many categories, crafted so that some humans would answer falsely due to a false belief or misconception. I balance each generated dataset so that it has about the same number of correct and incorrect answers (compound or not), i.e. a 0.5 ratio. I take the hidden output of the 40th of the 80 layers of Llama 2 70B. I neither pad nor truncate the input ids. From the 817 TruthfulQA questions, I generated 1634 samples for simple answers, 1591 for disjunctive, and 1564 for conjunctive answers, each evenly split between correct and incorrect answers. The exact prompt used can be found here. The prompt contains few-shot examples of how to answer binary questions from TruthfulQA and ends with "True" or "False". For example, an answer with conjunction has this format: "The correct answer to that question will be both "{A_1}" and "{A_2}", true or false?"
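As an illustration (not the exact code from the repo), the statements might be assembled roughly like this; the conjunction wording matches the format quoted above, while the simple and disjunction wordings are my assumptions:

```python
from itertools import combinations

def make_statements(answers: list[str], mode: str) -> list[str]:
    """Build binary true-or-false statements for one TruthfulQA question."""
    if mode == "one":
        return [f'The correct answer to that question will be "{a}", true or false?'
                for a in answers]
    word = "and" if mode == "conj" else "or"
    prefix = "both " if mode == "conj" else ""
    return [f'The correct answer to that question will be {prefix}"{a1}" {word} "{a2}", '
            'true or false?'
            for a1, a2 in combinations(answers, 2)]

# Contrast pairs (x+, x-) are then formed by appending " True" / " False"
# to the few-shot prompt followed by one of these statements.
```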
I train CCS probes for 1000 epochs on the CCS loss described above, using the middle-layer activations from the datasets described above, and then test each probe on each of the three dataset types (one, conjunction, disjunction). LR is likewise trained for 1000 epochs. Code is available here.
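The cross-evaluation behind Tables 1 and 2 then amounts to a small train/test grid. Here is a self-contained sketch reusing CCSProbe, ccs_loss, and predict from the earlier sketch; random tensors stand in for the real activations, the "all" probe trained on the union of datasets is omitted for brevity, and taking max(acc, 1 - acc) is an assumption following the usual CCS convention:

```python
import itertools
import torch

def train_ccs_probe(acts_pos, acts_neg, epochs=1000, lr=1e-3):
    """Fit a CCSProbe (defined in the sketch above) on one dataset's activations."""
    probe = CCSProbe(acts_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        ccs_loss(probe(acts_pos), probe(acts_neg)).backward()
        opt.step()
    return probe

def probe_accuracy(probe, acts_pos, acts_neg, labels):
    acc = (predict(probe, acts_pos, acts_neg) == labels).float().mean().item()
    return max(acc, 1 - acc)  # a CCS probe's sign is arbitrary

# Placeholder activations standing in for the real middle-layer activations
# (Llama 2 70B has hidden size 8192); labels are ground truth, used only for evaluation.
d_model = 8192
datasets = {name: (torch.randn(64, d_model), torch.randn(64, d_model),
                   torch.randint(0, 2, (64,)))
            for name in ("one", "conj", "disj")}

for train_name, test_name in itertools.product(datasets, repeat=2):
    probe = train_ccs_probe(*datasets[train_name][:2])
    acc = probe_accuracy(probe, *datasets[test_name])
    print(f"trained on {train_name}, tested on {test_name}: {acc:.2f}")
```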
Observations
| Test dataset | Method | Train dataset | Accuracy (%) | Count |
|---|---|---|---|---|
| conj | CCS | all | 64.4±8.3 | 34 |
| conj | CCS | conj | 61.7±7.1 | 34 |
| conj | CCS | disj | 67.4±4.6 | 12 |
| conj | CCS | one | 74.8±5.5 | 12 |
| conj | LR | conj | 97.4±0.3 | 34 |
| conj | Random | - | 58.4±5.8 | 122 |
| disj | CCS | all | 58.6±6.0 | 34 |
| disj | CCS | conj | 56.4±4.6 | 12 |
| disj | CCS | disj | 61.5±6.6 | 34 |
| disj | CCS | one | 63.8±4.0 | 12 |
| disj | LR | disj | 97.2±0.4 | 34 |
| disj | Random | - | 56.4±4.4 | 122 |
| one | CCS | all | 70.7±8.4 | 34 |
| one | CCS | conj | 60.5±7.9 | 12 |
| one | CCS | disj | 71.9±7.5 | 12 |
| one | CCS | one | 75.9±6.0 | 34 |
| one | LR | one | 97.9±0.3 | 34 |
| one | Random | - | 60.2±6.0 | 122 |
Table 1. Llama 2 70B. Accuracy values from Figure 1.
| Test dataset | Method | Train dataset | Accuracy (%) | Count |
|---|---|---|---|---|
| conj | CCS | all | 75.6±3.5 | 6 |
| conj | CCS | conj | 75.2±4.6 | 6 |
| conj | CCS | disj | 75.6±1.5 | 6 |
| conj | CCS | one | 75.0±3.1 | 6 |
| conj | LR | conj | 97.2±0.4 | 6 |
| conj | Random | - | 57.3±5.0 | 12 |
| disj | CCS | all | 59.5±3.5 | 6 |
| disj | CCS | conj | 57.9±2.6 | 6 |
| disj | CCS | disj | 60.0±2.2 | 6 |
| disj | CCS | one | 61.0±2.7 | 6 |
| disj | LR | disj | 96.9±0.2 | 6 |
| disj | Random | - | 55.1±5.3 | 12 |
| one | CCS | all | 75.7±2.7 | 6 |
| one | CCS | conj | 74.3±2.7 | 6 |
| one | CCS | disj | 74.1±2.6 | 6 |
| one | CCS | one | 75.9±2.6 | 6 |
| one | LR | one | 97.7±0.4 | 6 |
| one | Random | - | 65.4±4.8 | 12 |
Table 2. Llama 3 70B. Accuracy values from Figure 2.
| Model | Probe train dataset | Count | Known questions fraction, mean (%) | # of known questions, median |
|---|---|---|---|---|
| Llama 2 70B | all | 45 | 60.5±23.5 | 52/108 |
| Llama 2 70B | conj | 45 | 36.1±27.1 | 26/108 |
| Llama 2 70B | disj | 45 | 32.4±28.8 | 18/108 |
| Llama 3 70B | all | 6 | 50.6±43.4 | 34/70 |
| Llama 3 70B | conj | 6 | 77.9±30.6 | 61/70 |
| Llama 3 70B | disj | 6 | 51.2±43.0 | 32/70 |
Table 3. Known questions for probes trained on compound sentences (on the conjunction dataset, on the disjunction dataset, and on all datasets combined): the fraction of questions that remain known to these probes out of those known to the probes trained only on the simple-sentence dataset. A known question is one for which all samples were correctly classified by a probe (for each question we generate several samples, as described above).
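To make the "known question" definition concrete, here is a minimal sketch (the data layout is my assumption, not the repo's):

```python
from collections import defaultdict

def known_questions(per_sample_results: list[tuple[str, bool]]) -> set[str]:
    """Questions for which every generated sample was classified correctly.

    per_sample_results: (question_id, sample_was_correct) pairs, one per sample.
    """
    by_question = defaultdict(list)
    for qid, correct in per_sample_results:
        by_question[qid].append(correct)
    return {qid for qid, flags in by_question.items() if all(flags)}

# The fraction in Table 3 is then along the lines of:
# len(known_by_compound_probe & known_by_simple_probe) / len(known_by_simple_probe)
```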
Conclusions from the observations:
CCS fails to reliably detect correct answers for Llama 2 70B in all experiments we have done (Figures 1 and 2). Probes on Llama 3 70B perform better, with about 75±4% accuracy on simple and conjunction samples, but they still fail on disjunction samples.
Probes sometimes transfer from one dataset to another (Figures 1 and 2). A probe trained on the 'one' (simple statements) dataset can be used to detect truthfulness on the 'conjunction' dataset (Figure 1, third row).
The disjunction dataset shows the worst performance of the three datasets across all probes (Figures 1 and 2). This is likely because inclusive and exclusive disjunction get mixed up.
Probe performance on Llama 3 70B differs from Llama 2 70B (Figure 2): Llama 3 probes show better results, and there is almost perfect transfer of probes between all three datasets. Still, on the disjunction dataset the probes perform close to random.
Probes trained on the conjunction or disjunction datasets miss most of the questions that probes trained on the simple dataset know (Table 3). The likely reason is, again, that CCS stumbles onto arbitrary features, so the directions found by those probes differ substantially. Still, if we train a probe on all datasets, we see better overlap of known questions (first row in Table 3). Llama 3 shows better overlap because CCS performs better on it than on Llama 2.
Future work
I am excited to explore these ideas as a continuation of this work. One potential avenue is to develop a better loss function for CCS based on compound sentences. In CCS, probes probably latch onto other, irrelevant features (see the Banana/Shed example [3]), so the idea is to provide more information in the CCS loss to filter out those directions. Something like $L_{\text{conj}} = [p(x_{\text{conj}}) - p(x_1)p(x_2)]^2$ and $L_{\text{disj}} = [p(x_{\text{disj}}) - p(x_1) - p(x_2) + p(x_1)p(x_2)]^2$ is interesting to try, where $p(x_{\text{conj}})$ and $p(x_{\text{disj}})$ are the probe outputs in $[0, 1]$ for the conjunction and disjunction sentences, and $p(x_1)$, $p(x_2)$ are the outputs for their simple component sentences. Another area to explore is applying different unsupervised methods, such as K-means and PCA, to see how they transfer to compound statements.
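Written out as code, these candidate terms look as follows (untested; they would be added to the existing CCS loss, with the p_* arguments denoting probe outputs on the compound and component statements):

```python
import torch

def conj_loss(p_conj: torch.Tensor, p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    # p(A and B) should behave like p(A) * p(B) (treating the clauses as independent).
    return ((p_conj - p1 * p2) ** 2).mean()

def disj_loss(p_disj: torch.Tensor, p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    # p(A or B) should behave like p(A) + p(B) - p(A) * p(B) (inclusive or).
    return ((p_disj - (p1 + p2 - p1 * p2)) ** 2).mean()
```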
The disjunctive dataset showed the worst performance, likely due to the ambiguity between inclusive and exclusive disjunction. Future work may try different prompting to address this issue. Additionally, Llama 3 shows improved performance compared to the previous version, and investigating the reasons behind this is probably a promising direction.
Finally, exploring more complex compound statements, such as those with three or more clauses or different coordinative conjunctions, could provide further insights into the effectiveness of CCS and other unsupervised methods for discovering latent knowledge in language models.
Acknowledgements
This work was done with support from CAIS (Center for AI Safety), who generously provided their cluster to run the experiments. Early stages of this project were done during ARENA, in summer 2023. I'd like to thank Charbel-Raphaël Segerie, Joseph Bloom and Egg Syntax for their feedback.
Burns, C., Ye, H., Klein, D. & Steinhardt, J. Discovering Latent Knowledge in Language Models Without Supervision. Preprint at http://arxiv.org/abs/2212.03827 (2022).
Lin, S., Hilton, J. & Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Preprint at https://doi.org/10.48550/arXiv.2109.07958 (2022).
Farquhar, S. et al. Challenges with unsupervised LLM knowledge discovery. Preprint at http://arxiv.org/abs/2312.10029 (2023).
Roger, F. What Discovering Latent Knowledge Did and Did Not Find. (2023).