You mention prompting for calibration. I've been experimenting with prompting models to give their probabilities for the set of answers on a multiple-choice question, in order to calculate a Brier score. This is just vague speculation, but I wonder if there's a training regime whose data rewards the model for being well calibrated in its reported probabilities, which could lead to the model having a clearer, more generalized representation of truth that would be easier to detect.
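Concretely, the score I'm computing is just the standard multi-class Brier score over the model's reported probabilities (a minimal sketch; the numbers below are stand-ins for whatever the model actually reports):

```python
import numpy as np

def brier_score(probs: np.ndarray, correct_idx: np.ndarray) -> float:
    """Multi-class Brier score: mean squared error between each reported
    probability vector and the one-hot encoding of the correct answer."""
    n, k = probs.shape
    onehot = np.zeros((n, k))
    onehot[np.arange(n), correct_idx] = 1.0
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

# e.g. two 4-option questions; rows are the model's reported probabilities
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
print(brier_score(probs, np.array([0, 2])))  # lower is better; 0 is perfect
```

A calibration-focused training regime could then use something like this as (part of) the training signal.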
That would certainly be an interesting experiment. A related expe...
Cool to see the generalisation results for Llama-2 7/13/70B! I originally ran some of these experiments on 7B and got very different results; that PCA plot of 7B looks familiar (and bizarre). Excited to read the paper in its entirety. The first GoT paper was very good.
...One approach here is to use a dataset in which the truth and likelihood of inputs are uncorrelated (or negatively correlated), as you kinda did with TruthfulQA. For that, I like to use the "neg_" versions of the datasets from GoT, containing negated statements like "The city of Beijing is not in China."
That's right - thanks for pointing that out! Added a footnote:
For unsupervised methods, we do technically use the labels in two places. One, we select the sign of the probe based on the labels. Two, for some datasets we want exactly one true and one false answer per question, while there may be many; we use the labels to select one of each.
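In code, those two uses look roughly like this (a sketch with hypothetical names, not our actual implementation):

```python
import numpy as np

def orient_probe(probe: np.ndarray, acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """(1) Sign selection: an unsupervised direction is only defined up to
    sign, so flip it if it classifies below chance on the labelled data."""
    acc = np.mean((acts @ probe > 0) == labels)
    return probe if acc >= 0.5 else -probe

def one_of_each(answers: list[tuple[str, bool]]) -> tuple[str, str]:
    """(2) Answer filtering: keep exactly one true and one false answer
    per question (assumes at least one of each exists)."""
    true_answer = next(text for text, label in answers if label)
    false_answer = next(text for text, label in answers if not label)
    return true_answer, false_answer
```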
One perspective is that representation engineering allows us to do "single-bit edits" to the network's behaviour. Pre-training changes a lot of bits; fine-tuning changes slightly less; LoRA even less; adding a single vector to a residual stream should flip a single flag in the program implemented by the network.
(This of course is predicated on us being able to create monosemantic directions, and predicated on monosemanticity being a good way to think about this at all.)
This is beneficial from a safety point of view, as instead of saying "we trained the mod...
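To make the "single flag" picture concrete, here's a minimal sketch of such an edit, assuming a PyTorch model whose blocks expose their residual-stream output (the layer index, `alpha`, and `direction` are all illustrative; `direction` would come from a probe or similar):

```python
import torch

def add_steering_hook(block: torch.nn.Module, direction: torch.Tensor, alpha: float):
    """Register a forward hook that adds alpha * direction to the block's
    residual-stream output. No weights are changed; removing the hook
    restores the original model exactly."""
    direction = direction / direction.norm()  # unit-norm "flag" direction
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction  # broadcast over batch and sequence
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)

# e.g., for a HuggingFace Llama-style model:
# handle = add_steering_hook(model.model.layers[15], direction, alpha=4.0)
# ... run generation ...
# handle.remove()  # the "bit" is flipped back
```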
(Apologies, been on holiday.)
For recovered accuracy, we select a single threshold per dataset, taking the best value across all probe algorithms. So a random probe would be compared against the best probe algorithm on that dataset, and would likely perform poorly.
I did check the thresholds used for recovered accuracy, and they seemed sensible, but I didn't include this in the report. I'll try to find time next week to add them to an appendix.
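For concreteness, the threshold selection is roughly the following (a sketch only; the names and candidate grid are mine, and the real code differs in details):

```python
import numpy as np

def pick_threshold(scores_by_probe: dict[str, np.ndarray],
                   labels: np.ndarray,
                   candidates: np.ndarray) -> float:
    """Per-dataset threshold: the candidate that maximises accuracy for the
    best-performing probe algorithm on that dataset. Every probe is then
    evaluated at this shared threshold."""
    def acc(scores, t):
        return np.mean((scores > t) == labels)
    return max(candidates,
               key=lambda t: max(acc(s, t) for s in scores_by_probe.values()))
```

A degenerate probe is then judged at a threshold tuned for the strongest probe on that dataset, which is why it should score poorly.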