Finding internal knowledge representations inside transformer models without supervision is a challenging task that matters for scalable oversight and for mitigating the risk of deception. I test Contrast-Consistent Search (CCS[1]) on the TruthfulQA[2] dataset using compound sentences (conjunctions and disjunctions), each composed of several answers to a question, to see whether unsupervised probes work as well on compound statements as on the simple statements they are built from, with the goal of improving unsupervised methods for discovering latent knowledge. I ran about 500 evaluations of CCS probes trained on compound sentences. So far, the results suggest that for Llama 2 70B, CCS probes trained on simple sentences are unlikely to transfer their performance to compound sentences and vice versa, while Llama 3 70B shows some transfer and better overall performance.

Goal

The motivation is to find a method that detects lies in the output of language models, i.e. to elicit latent knowledge (ELK). Lying is a risk factor whenever we use transformer models in critical areas. I use CCS probes, an unsupervised method for finding features in a model's activation space (the residual stream in a transformer). This matters because as models and their problems scale, they become harder to supervise, i.e. it becomes harder to create datasets with correct labels that define how models should behave. We need some way to tell what a model actually believes, what it actually relies on while generating the next token (or action). An improved CCS could perhaps be used to detect knowledge directions. Then, knowing those directions, we could likely train models not to be biased by a prompt and not to be sycophantic (telling a user what they want to hear instead of what the model believes is true) by penalizing untruthful answers. Another application of these probes is to increase trust in models, serving as a litmus test when we are not sure how a model generalizes to new data.

The goal is to answer whether CCS works on compound statements (conjunction, disjunction) to the same degree as on simple statements, i.e. whether the accuracy degrades and whether the questions previously solved for simple sentences remain solved for compound ones. It is important to find out how to make CCS work, though this seems hard because CCS tends to find arbitrary features[3]. An example of such an arbitrary feature is when a dataset of contrast pairs also contains answers to an unrelated binary question, such as "Alice does / doesn't like this".

Initially, I was interested in making CCS work on an arbitrary number of sentences by repeatedly asking a question about a truncated prompt and training a probe on it, but so far I have only tested compound sentences of two clauses. The idea is that breaking a question into pieces and creating a dataset from those pieces is likely to let us isolate the desired feature by filtering out undesired ones. Almost any complex sentence can be broken into simpler sentences joined by conjunction or disjunction; for example, "This is a red apple" can be broken into "This is an apple" and "This is red". We then apply the contrast-pair approach in an unsupervised search for knowledge directions.

Experimental Setup

The CCS method for finding latent knowledge uses contrast pairs $(x_i^+, x_i^-)$, where each pair consists of the two binary answers to a question $q_i$ (yes/no, true/false, positive/negative, etc.), to train a probe on a loss computed from the activations $\phi(x_i^+), \phi(x_i^-)$ of a (middle) model layer for those pairs. The loss is a sum $L_{\mathrm{CCS}} = L_{\mathrm{consistency}} + L_{\mathrm{confidence}}$ of consistency, $L_{\mathrm{consistency}} = \big(p_\theta(x_i^+) - (1 - p_\theta(x_i^-))\big)^2$, and confidence, $L_{\mathrm{confidence}} = \min\big(p_\theta(x_i^+), p_\theta(x_i^-)\big)^2$, where the probe is $p_\theta(x) = \sigma(\theta^\top \phi(x) + b)$ with $\theta$ as weights. The probe can then be used for inference, to determine whether some $x_i$ is positive or not: $\tilde{p}(x_i) = \tfrac{1}{2}\big(p_\theta(x_i^+) + 1 - p_\theta(x_i^-)\big) > 0.5$ (a threshold other than 0.5 can also be used).
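For concreteness, here is a minimal PyTorch sketch of the probe, loss, and inference rule described above. The tensor names and hyperparameters are mine, not necessarily those of the linked code.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe p(x) = sigmoid(theta^T phi(x) + b) on layer activations."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(acts)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Consistency + confidence terms of the CCS loss."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(x+) should agree with 1 - p(x-)
    confidence = torch.min(p_pos, p_neg) ** 2    # discourage the degenerate p = 0.5 solution
    return (consistency + confidence).mean()

def train_ccs_probe(acts_pos: torch.Tensor, acts_neg: torch.Tensor,
                    n_epochs: int = 1000, lr: float = 1e-3) -> CCSProbe:
    """acts_pos / acts_neg: (n_pairs, d_model) activations for the two sides of each pair."""
    probe = CCSProbe(acts_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos), probe(acts_neg))
        loss.backward()
        opt.step()
    return probe

@torch.no_grad()
def predict(probe: CCSProbe, acts_pos: torch.Tensor, acts_neg: torch.Tensor,
            threshold: float = 0.5) -> torch.Tensor:
    """Inference rule: average the two sides and threshold."""
    p_tilde = 0.5 * (probe(acts_pos) + (1.0 - probe(acts_neg)))
    return p_tilde > threshold
```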

The compound sentences I tried have the form "A1 and A2" (conjunction) and "A1 or A2" (disjunction), where A1, A2 are correct or incorrect answers to a question. First, I test the model on the simple answers Ai, and then on the compound sentences composed of those simple answers. The idea is that the model should be equally certain about simple and compound answers to the questions it knows. For example, suppose the model has the same certainty in two answers to the question "What is a characteristic of even numbers?", namely "Even numbers are divisible by two" and "Even numbers are not divisible by three"; then, presumably, it should have the same certainty in the compound answer "Even numbers are divisible by two and are not divisible by three". A model that is capable of logical operations and that, judging by its previous responses, knows the question should have a similar certainty in a compound answer. In particular, the accuracy shouldn't degrade, and the questions previously solved for simple sentences should also be solved for compound ones.

I compare the performance of CCS with Logistic Regression (LR), which uses ground-truth labels and therefore serves as a ceiling, and with a random probe as a floor baseline[4]. The models used are Llama 2 70B and Llama 3 70B, which are autoregressive, decoder-only transformers.
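One plausible way to implement the random-probe floor (my own sketch; the exact construction used for the baseline[4] may differ) is to score random directions with the same inference rule:

```python
import torch

@torch.no_grad()
def random_probe_accuracy(acts_pos: torch.Tensor, acts_neg: torch.Tensor,
                          labels: torch.Tensor, n_probes: int = 100, seed: int = 0) -> float:
    """Average accuracy of probes with random weight vectors; a floor for CCS."""
    torch.manual_seed(seed)
    d_model = acts_pos.shape[-1]
    accs = []
    for _ in range(n_probes):
        w = torch.randn(d_model)
        w = w / w.norm()
        p_pos = torch.sigmoid(acts_pos @ w)
        p_neg = torch.sigmoid(acts_neg @ w)
        preds = (0.5 * (p_pos + (1.0 - p_neg)) > 0.5).float()
        acc = (preds == labels).float().mean().item()
        accs.append(max(acc, 1.0 - acc))  # a random direction has no preferred sign
    return sum(accs) / len(accs)
```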

The dataset of compound sentences is generated from the TruthfulQA dataset, which has 817 questions/samples; each sample has several correct and incorrect answers, the questions span many categories, and they are designed so that some humans would answer falsely due to a false belief or misconception. I balance each dataset so that it has about the same number of correct and incorrect answers (compound or not), i.e. a 0.5 ratio. I take the hidden output of the 40th of the 80 layers of Llama 2 70B. I don't pad the input ids and don't truncate them. From the 817 TruthfulQA samples, I generated 1634 questions for simple answers, 1591 for disjunctive, and 1564 for conjunctive answers, each evenly split between correct and incorrect answers. The exact prompt used can be found here. The prompt contains few-shot examples of how to answer binary questions from TruthfulQA and ends with "True" or "False". For example, an answer with conjunction has this format: "The correct answer to that question will be both "{A_1}" and "{A_2}", true or false?"
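Schematically, the contrast pairs can be assembled like this. The few-shot prefix, the disjunctive wording, and the function names below are placeholders of mine; the real prompt is the one linked above.

```python
FEW_SHOT_PREFIX = "..."  # placeholder for the few-shot binary-QA examples mentioned above

def simple_pair(question: str, answer: str) -> tuple[str, str]:
    """Contrast pair for a single answer: the same stem ending in 'True' vs 'False'."""
    stem = (f"{FEW_SHOT_PREFIX}\nQuestion: {question}\n"
            f'The correct answer to that question will be "{answer}", true or false? ')
    return stem + "True", stem + "False"

def compound_pair(question: str, a1: str, a2: str, mode: str = "conj") -> tuple[str, str]:
    """Contrast pair for a conjunctive or disjunctive compound answer."""
    joined = (f'both "{a1}" and "{a2}"' if mode == "conj"
              else f'either "{a1}" or "{a2}"')
    stem = (f"{FEW_SHOT_PREFIX}\nQuestion: {question}\n"
            f"The correct answer to that question will be {joined}, true or false? ")
    return stem + "True", stem + "False"

def compound_label(label1: bool, label2: bool, mode: str = "conj") -> bool:
    """Ground-truth label of the compound answer from the labels of its clauses."""
    return (label1 and label2) if mode == "conj" else (label1 or label2)
```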

I train CCS probes for 1000 epochs on the CCS loss described above, using the middle-layer activations from the datasets described above, and then test each probe on each of the three dataset types (simple 'one', conjunction, disjunction). LR is likewise trained for 1000 epochs. Code is available here.
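Putting it together, the train/test crossover can be sketched as follows. The dataset dictionary structure is hypothetical; `train_fn` and `predict_fn` stand for functions like `train_ccs_probe` and `predict` from the sketch above.

```python
def cross_evaluate(datasets: dict, train_fn, predict_fn) -> dict:
    """Train a probe on each dataset type and test it on every dataset type.

    `datasets` maps a name ('one', 'conj', 'disj') to a dict with
    'train_acts' and 'test_acts' (each a pair of (n, d_model) tensors for the two
    contrast sides) and 'test_labels' (a 0/1 float tensor).
    """
    results = {}
    for train_name, train_data in datasets.items():
        probe = train_fn(*train_data["train_acts"], n_epochs=1000)
        for test_name, test_data in datasets.items():
            preds = predict_fn(probe, *test_data["test_acts"]).float()
            acc = (preds == test_data["test_labels"]).float().mean().item()
            # CCS fixes the truth direction only up to sign, so report max(acc, 1 - acc).
            results[(train_name, test_name)] = max(acc, 1.0 - acc)
    return results
```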

Observations

Figure 1. Llama 2 70B. Accuracies of the LR and CCS methods (columns) on three datasets (rows), compared with random probes (last column). I test four CCS probe types (trained on the simple 'one' dataset, on disjunction, on conjunction, and on all datasets) against three test datasets (simple 'one', disjunction, conjunction), each composed of questions and answers from the same TruthfulQA dataset.

 

| Test dataset | Method | Method dataset | Accuracy (%) | Count |
|---|---|---|---|---|
| conj | CCS | all | 64.4±8.3 | 34 |
| conj | CCS | conj | 61.7±7.1 | 34 |
| conj | CCS | disj | 67.4±4.6 | 12 |
| conj | CCS | one | 74.8±5.5 | 12 |
| conj | LR | conj | 97.4±0.3 | 34 |
| conj | Random | | 58.4±5.8 | 122 |
| disj | CCS | all | 58.6±6.0 | 34 |
| disj | CCS | conj | 56.4±4.6 | 12 |
| disj | CCS | disj | 61.5±6.6 | 34 |
| disj | CCS | one | 63.8±4.0 | 12 |
| disj | LR | disj | 97.2±0.4 | 34 |
| disj | Random | | 56.4±4.4 | 122 |
| one | CCS | all | 70.7±8.4 | 34 |
| one | CCS | conj | 60.5±7.9 | 12 |
| one | CCS | disj | 71.9±7.5 | 12 |
| one | CCS | one | 75.9±6.0 | 34 |
| one | LR | one | 97.9±0.3 | 34 |
| one | Random | | 60.2±6.0 | 122 |

Table 1. Llama 2 70B. Accuracy values from Figure 1.

 

Figure 2. Llama 3 70B. Accuracies of the LR and CCS probes on the datasets generated from TruthfulQA. Same setup as for Figure 1.
| Test dataset | Method | Method dataset | Accuracy (%) | Count |
|---|---|---|---|---|
| conj | CCS | all | 75.6±3.5 | 6 |
| conj | CCS | conj | 75.2±4.6 | 6 |
| conj | CCS | disj | 75.6±1.5 | 6 |
| conj | CCS | one | 75.0±3.1 | 6 |
| conj | LR | conj | 97.2±0.4 | 6 |
| conj | Random | | 57.3±5.0 | 12 |
| disj | CCS | all | 59.5±3.5 | 6 |
| disj | CCS | conj | 57.9±2.6 | 6 |
| disj | CCS | disj | 60.0±2.2 | 6 |
| disj | CCS | one | 61.0±2.7 | 6 |
| disj | LR | disj | 96.9±0.2 | 6 |
| disj | Random | | 55.1±5.3 | 12 |
| one | CCS | all | 75.7±2.7 | 6 |
| one | CCS | conj | 74.3±2.7 | 6 |
| one | CCS | disj | 74.1±2.6 | 6 |
| one | CCS | one | 75.9±2.6 | 6 |
| one | LR | one | 97.7±0.4 | 6 |
| one | Random | | 65.4±4.8 | 12 |

Table 2. Llama 3 70B. Accuracy values for Figure 2.

 

| Model | Probe train dataset | Count | Known questions fraction, mean (%) | # of known questions, median |
|---|---|---|---|---|
| Llama 2 70B | all | 45 | 60.5±23.5 | 52/108 |
| Llama 2 70B | conj | 45 | 36.1±27.1 | 26/108 |
| Llama 2 70B | disj | 45 | 32.4±28.8 | 18/108 |
| Llama 3 70B | all | 6 | 50.6±43.4 | 34/70 |
| Llama 3 70B | conj | 6 | 77.9±30.6 | 61/70 |
| Llama 3 70B | disj | 6 | 51.2±43.0 | 32/70 |

Table 3. Known questions for each probe type trained on compound sentences (trained on the conjunction dataset, the disjunction dataset, and on all datasets): the fraction of questions that remain known by probes trained on compound datasets, out of those known by probes trained only on the simple-sentences dataset. A known question is a question all of whose samples were correctly classified by a probe (for each question we generate several samples, as described above).
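A small sketch of how the "known question" criterion and the overlap reported in Table 3 can be computed (the variable and function names are mine):

```python
from collections import defaultdict

def known_questions(question_ids, preds, labels) -> set:
    """A question is 'known' if every sample generated from it is classified correctly."""
    correct = defaultdict(list)
    for qid, pred, label in zip(question_ids, preds, labels):
        correct[qid].append(pred == label)
    return {qid for qid, flags in correct.items() if all(flags)}

def known_overlap(known_by_simple_probe: set, known_by_compound_probe: set) -> float:
    """Fraction of the simple-probe's known questions that a compound probe also knows."""
    if not known_by_simple_probe:
        return 0.0
    return len(known_by_simple_probe & known_by_compound_probe) / len(known_by_simple_probe)
```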

 

Conclusions from the observations: 

  1. CCS fails to reliably detect correct answers for Llama 2 70B in all experiments I ran (Figure 1). Probes on Llama 3 70B show better performance, with 75±4% accuracy for simple and conjunction samples, but they still fail on disjunction samples (Figure 2).
  2. A probe trained on one dataset sometimes transfers to another dataset (Figures 1 and 2). For example, a probe trained on the 'one' (simple statements) dataset can be used to detect truthfulness in the conjunction dataset (Figure 1, third row).
  3. The disjunction dataset shows the worst performance of the three, for all probes I tested (Figures 1 and 2). This is likely because inclusive and exclusive disjunction get mixed.
  4. Probe performance on Llama 3 70B differs from Llama 2 70B (Figure 2): the Llama 3 probes show better results, and there is an almost perfect transfer of probes between all three datasets. Still, on the disjunction dataset the probes show near-random performance.
  5. Probes trained on the conjunction or disjunction datasets miss most of the questions that probes trained on the simple dataset know (Table 3). The likely reason is that, again, CCS stumbles on arbitrary features, so the directions found by those probes differ substantially. Still, if we train a probe on all datasets, we see a better overlap of the known questions (the 'all' rows in Table 3). Llama 3 shows better overlap because CCS performs better on it than on Llama 2.

Future work

I am excited about these ideas to explore as a continuation of this work. One potential avenue is to develop a better loss function for CCS based on compound sentences. In CCS, probes probably stumble on other, irrelevant features (see the Banana/Shed example[3]), so the idea is to provide more information in the CCS loss to filter out those directions. Something like enforcing consistency between $p(x_i^{\wedge})$, $p(x_i^{\vee})$ and $p(x_i^{1})$, $p(x_i^{2})$ is interesting to try, where $p(x_i^{\wedge})$ and $p(x_i^{\vee})$ are the probe outputs for conjunction and disjunction sentences, while $x_i^{1}$ and $x_i^{2}$ are their simple clauses. Another area to explore is using different unsupervised methods, like K-means and PCA, to see how they transfer to compound statements.
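As one concrete illustration, here is a hedged sketch of such an extra loss term, under the assumption that the intended constraints are the probabilistic identities p(A1 and A2) = p(A1)·p(A2) and p(A1 or A2) = p(A1) + p(A2) - p(A1)·p(A2); this is only one possible choice of constraint, not necessarily the one intended above.

```python
import torch

def compound_consistency_loss(p_a1: torch.Tensor, p_a2: torch.Tensor,
                              p_conj: torch.Tensor, p_disj: torch.Tensor) -> torch.Tensor:
    """Extra penalty tying probe outputs on compound sentences to their clauses.

    Assumes (as one possible choice) the probabilistic identities
      p(A1 and A2) = p(A1) * p(A2)
      p(A1 or A2)  = p(A1) + p(A2) - p(A1) * p(A2)
    and penalizes squared deviation from them. This term would be added to the
    usual CCS loss, not replace it.
    """
    conj_term = (p_conj - p_a1 * p_a2) ** 2
    disj_term = (p_disj - (p_a1 + p_a2 - p_a1 * p_a2)) ** 2
    return (conj_term + disj_term).mean()
```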

The disjunctive dataset showed the worst performance, which is likely due to the ambiguity between inclusive and exclusive disjunctive sentences. Future work may try different prompting to address this issue. Additionally, Llama 3 shows improved performance compared to the previous version, and investigating the reasons behind this is probably a promising direction.

Finally, exploring more complex compound statements, such as those with three or more clauses or different coordinative conjunctions, could provide further insights into the effectiveness of CCS and other unsupervised methods for discovering latent knowledge in language models.

Acknowledgements

This work was done with the support of CAIS (Center for AI Safety), which generously provided their cluster to run the experiments. Early stages of this project were done during ARENA in summer 2023. I'd like to thank Charbel-Raphaël Segerie, Joseph Bloom, and Egg Syntax for their feedback.

  1. ^ Burns, C., Ye, H., Klein, D., Steinhardt, J. (2022). Discovering Latent Knowledge in Language Models Without Supervision.
  2. ^ Lin, S., Hilton, J., Evans, O. (2021). TruthfulQA: Measuring How Models Mimic Human Falsehoods.
  3. ^ Farquhar, S., et al. (2023). Challenges with unsupervised LLM knowledge discovery.
  4. ^