All of mishajw's Comments + Replies

(Apologies, been on holiday.)

For recovered accuracy, we select a single threshold per dataset, taking the best value across all probe algorithms and datasets. So a random probe would be compared to the best probe algorithm on that dataset, and likely perform poorly.

I did check the thresholds used for recovered accuracy, and they seemed sensible, but I didn't put this in the report. I'll try to find time next week to put this in the appendix.

You mention prompting for calibration. I've been experimenting with prompting models to give their probabilities for the set of answers on a multiple choice question in order to calculate a Brier score. This is just vague speculation, but I wonder if there's a training regime where the data involves getting the model to be well calibrated in its reported probabilities which could lead to the model having a clearer, more generalized representation of truth that would be easier to detect.
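For concreteness, here is a minimal sketch of the Brier score calculation mentioned above, assuming the model's reported probabilities for each answer option have already been parsed into a list of numbers. The function names and the normalisation step are illustrative assumptions, not something taken from the comment.

```python
def normalize(raw_probs):
    """Rescale reported probabilities so they sum to 1 (assumed preprocessing step)."""
    total = sum(raw_probs)
    if total <= 0:
        raise ValueError("reported probabilities must have a positive sum")
    return [p / total for p in raw_probs]


def brier_score(probs, correct_index):
    """Multiclass Brier score for one question.

    Sum of squared differences between the reported probabilities and the
    one-hot vector for the correct answer. 0 is perfect; lower is better.
    """
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def mean_brier_score(questions):
    """Average Brier score over (probs, correct_index) pairs."""
    scores = [brier_score(probs, idx) for probs, idx in questions]
    return sum(scores) / len(scores)
```

A model that spreads its probability uniformly over four options scores 0.75 on every question under this definition, which is a useful baseline when reading the numbers.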

That would certainly be an interesting experiment. A related expe...

Cool to see the generalisation results for Llama-2 7/13/70B! I originally ran some of these experiments on 7B and got very different results; that PCA plot of 7B looks familiar (and bizarre). Excited to read the paper in its entirety. The first GoT paper was very good.

One approach here is to use a dataset in which the truth and likelihood of inputs are uncorrelated (or negatively correlated), as you kinda did with TruthfulQA. For that, I like to use the "neg_" versions of the datasets from GoT, containing negated statements like "The city of Beijing...
Sam Marks

I found that the PCA plot for 7B for larger_than and smaller_than individually looked similar to that for 13B, but that the PCA plot for larger_than + smaller_than looked degenerate in the way I screenshotted. Are you saying that your larger_than + smaller_than PCA looked familiar for 7B?

Agreed on both points.

Thanks for clarifying! I'm guessing this is what's making the GoT datasets much worse for generalization (from and to) in your experiments. For 13B, it mostly seemed to me that training on negated statements helped for generalization to other negated statements, and that pairing negated statements with unnegated statements in training data usually (but not always) made generalization to unnegated datasets a bit worse. (E.g. the cities -> sp_en_trans generalization is better than cities + neg_cities -> sp_en_trans generalization.)

That's right - thanks for pointing that out! Added a footnote:

For unsupervised methods, we do technically use the labels in two places. First, we select the sign of the probe based on labels. Second, for some datasets, we only want one true answer and one false answer each, while there may be many. We use the labels to limit to one of each.
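To illustrate the first use of labels, here is a minimal sketch of choosing a probe's sign. It assumes the probe is a single direction, that a positive dot product is read as "true", and that the threshold is zero; the function names are hypothetical and this is not the code from the report.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def accuracy(direction, activations, labels, threshold=0.0):
    """Fraction of examples where (score > threshold) matches the boolean label."""
    predictions = [dot(direction, x) > threshold for x in activations]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def orient_probe(direction, activations, labels):
    """Return the direction, negated if it scores below chance on the labels.

    An unsupervised method can find the truth axis but not which end of it
    means "true", so the labels are used only to resolve that ambiguity.
    """
    if accuracy(direction, activations, labels) < 0.5:
        return [-d for d in direction]
    return list(direction)
```

Only one bit of label information enters through this step, which is why the footnote describes the use of labels as a technicality.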

One perspective is that representation engineering allows us to do "single-bit edits" to the network's behaviour. Pre-training changes a lot of bits; fine-tuning changes slightly less; LoRA even less; adding a single vector to a residual stream should flip a single flag in the program implemented by the network.

(This of course is predicated on us being able to create monosemantic directions, and predicated on monosemanticity being a good way to think about this at all.)
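As a toy picture of what "adding a single vector to a residual stream" means, here is a minimal sketch on plain Python lists. The residual stream, the direction, and the scale are all made-up values, and a real intervention would hook into a model's forward pass rather than operate on a standalone list.

```python
import math


def add_steering_vector(residual, direction, scale=1.0):
    """Return residual + scale * direction, leaving the inputs untouched."""
    if len(residual) != len(direction):
        raise ValueError("residual and direction must have the same length")
    return [h + scale * d for h, d in zip(residual, direction)]


def projection(residual, direction):
    """Component of the residual along the (normalised) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(residual, direction)) / norm


# Hypothetical 3-dimensional residual stream and a "flag" direction.
residual = [0.5, -1.0, 2.0]
flag_direction = [0.0, 1.0, 0.0]
steered = add_steering_vector(residual, flag_direction, scale=3.0)
```

In this toy case the projection onto the flag direction changes sign while the other components are untouched, which is the "single-bit edit" intuition under the monosemanticity caveat above.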

This is beneficial from a safety point of view, as instead of saying "we trained the mod...

I like this idea; it matches quite closely how I naturally work. I had some spare time this weekend, so I made a quick prototype site: https://rationalbreaks.vercel.app

bfinn

Thanks - I like it! Can you make it do these, to implement the whole thing?

* Sound an alarm when your break is up. Preferably re-sound it every minute
* Carry forward unused break time to your next break
* A readout of how much break you've earned, including breaks carried forward, while you're working (so you can e.g. decide to work until you can take a 15-minute break, say)
* Start a big break (previously called an extended meal break) by the user stating the duration. (Or ideally a choice of duration or the end time). I suppose you could also have buttons for common lengths - say 30min, 45min, 1 hour. Starting a big break should cancel any saved-up break time.
* Let you choose which fraction to use, if not 1/3
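The bookkeeping behind those requests can be sketched roughly as follows. This assumes break time is earned as a fixed fraction of minutes worked (1/3 by default, per the last bullet); the class and method names are hypothetical and have nothing to do with how the prototype site is actually built. Alarms are left out.

```python
from fractions import Fraction


class BreakTracker:
    """Tracks earned break minutes, carrying unused time forward."""

    def __init__(self, fraction=Fraction(1, 3)):
        self.fraction = Fraction(fraction)  # break minutes earned per minute worked
        self.banked_minutes = Fraction(0)   # earned but not yet taken

    def record_work(self, minutes):
        """Add the break time earned by working for `minutes`."""
        self.banked_minutes += self.fraction * minutes

    def take_break(self, minutes):
        """Spend up to `minutes` of banked time; the rest stays banked.

        Returns the number of minutes actually granted.
        """
        granted = min(Fraction(minutes), self.banked_minutes)
        self.banked_minutes -= granted
        return granted

    def start_big_break(self, minutes):
        """Start a user-chosen long break, which cancels any saved-up time."""
        self.banked_minutes = Fraction(0)
        return Fraction(minutes)
```

For example, 45 minutes of work banks 15 minutes; taking a 10-minute break leaves 5 to carry forward, and `banked_minutes` doubles as the readout of break time earned so far.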