LESSWRONG

All of Nicholas Goldowsky-Dill's Comments + Replies

Among Us: A Sandbox for Agentic Deception

Cool work! Excited to dig into it more.

One quick question -- the results in this post on probe generalization are very different than the results in the paper. Which should I trust?

17vik14d

Sorry - fixed! They should match now - I'd forgotten to update the figure in this post. Thanks for pointing it out.

Tracing the Thoughts of a Large Language Model

Nicholas Goldowsky-Dill23d74

FYI, see this paper for a more detailed explanation of the model in question and how it's trained.

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Nicholas Goldowsky-Dill1mo30

In the sandbagging task, what is the relationship between task performance and CoT eval recognition? In the example you provide, it considers the sandbagging but doesn't perform it. Is that reflective of most of the trials?

No clear difference. I did a quick check and Sonnet's accuracy at answering the questions is actually lower in transcripts where it's evaluation aware. But this effect seems to be driven by a single sample where the model runs into message limits and thus accidentally scores 0, and I wouldn't expect it to replicate. Looking at the ... (read more)

2Dusto1mo

Appreciate you taking the time to look into that and follow up. The comment about additional reasoning loops is interesting too. I was toying with some multi-turn eval awareness conversations with Claude where the model is prompted (EL = evaluation level 1-9, with some context about what a 1, 5, or 9 level means) to begin each output with EL: [number], and then having a benign conversation with occasional obvious eval-formatted questions. The reasoning reliably picked up on the eval questions, but was inconsistent in the level it chose. Reasoning loops sometimes indicated that it had identified an eval but might also note something like "but this is a part of a conversation so I will go with 1". So maybe there is sufficient reasoning to get to the two choices, but not enough reasoning to properly think through which of those two is the best choice, and you get a bit of a coin flip.

Detecting Strategic Deception Using Linear Probes

Nicholas Goldowsky-Dill2mo10

Yes I'm excited about "probing the answer to followup questions" as a general technique. Our results were promising but there's a lot of iteration one could do to make the technique work better!

Separately, did you run any experiments including the probe on the CoT tokens? I would suspect there to be a pretty reliable signal here, and ideally this could be used to help confirm honest CoT.

On our website you can look at the probe scores on the CoT. I don't have up to date numbers, but I expect it's mildly helpful for improving classification at le... (read more)

Nicholas Goldowsky-Dill's Shortform

Nicholas Goldowsky-Dill5mo132

One question I have about the Mechanistic Interpretability safety case sketch from Roger Grosse: How do you actually find features of high concern?

Features of High Concern are things like "intent to carry out a harmful plan". By definition, they very rarely activate on benign deployment outputs. This must be the case as "if the features [of high concern] activate more than a handful of times in deployment (internal or external), we discontinue the model and roll back to a weaker model." [M1.1]

They propose to find these features [S1.1] by pe... (read more)

2RogerGrosse5mo

Great questions. Finding features of high concern is indeed something we highlighted as an open problem, and it's not obvious whether this will be achievable. The avenues you list are plausible ones. It might also turn out that it is very hard to find individual features which are sufficiently indicative of malign behavior, but that one could do so for circuits or specify some other rule involving combinations of features. E.g., the criterion might involve something like "feature activates which indicates the intent to carry out a harmful plan, and features indicating attribution of the thoughts to a fictional character are not active." (As discussed in Limitations, one could apply a variant of this safety case based on circuits rather than features, and the logic would be similar.) We're agnostic about where the model organisms would come from, but I like your suggestions.
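A minimal sketch of the kind of combination rule described above, assuming feature activations are available as a plain mapping from feature name to activation value. The function name, feature names, and threshold are all hypothetical and do not come from the safety case sketch itself:

```python
from typing import Iterable, Mapping


def high_concern_rule(
    activations: Mapping[str, float],
    intent_features: Iterable[str],
    fiction_features: Iterable[str],
    threshold: float = 0.0,
) -> bool:
    """Flag an output only if a harmful-intent feature is active and no
    fictional-attribution feature is active (hypothetical rule)."""
    # A feature counts as "active" when its activation exceeds the threshold.
    intent_active = any(activations.get(f, 0.0) > threshold for f in intent_features)
    fiction_active = any(activations.get(f, 0.0) > threshold for f in fiction_features)
    return intent_active and not fiction_active


# Hypothetical feature names, for illustration only.
INTENT = ["intent_to_harm"]
FICTION = ["fictional_character_speaking"]

flag_plain = high_concern_rule({"intent_to_harm": 2.3}, INTENT, FICTION)
flag_story = high_concern_rule(
    {"intent_to_harm": 2.3, "fictional_character_speaking": 1.1}, INTENT, FICTION
)
```

Here `flag_plain` is raised while `flag_story` is suppressed by the fiction feature. A circuit-based variant would replace the two `any(...)` checks with whatever circuit-level criterion is chosen.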

RobertM's Shortform

Nicholas Goldowsky-Dill7mo102

Just checking -- you are aware that the reasoning traces shown in the UI are a summarized version of the full reasoning trace (which is not user accessible)?

See "Hiding the Chains of Thought" here.

8RobertM7mo

Yeah, I meant terser compared to typical RLHF'd output from e.g. 4o. (I was looking at the traces they showed in https://openai.com/index/learning-to-reason-with-llms/).

OpenAI o1

Nicholas Goldowsky-Dill7mo213

See also their system card focusing on safety evals: https://openai.com/index/openai-o1-system-card/

Jacob Pfau7mo*246

Surprising misuse and alignment relevant excerpts:

METR had only ~10 days to evaluate.

Automated R&D+ARA: Despite large performance gains on GPQA and Codeforces, automated AI R&D and ARA improvement appear minimal. I wonder how much of this is down to choice of measurement value (what would it show if they could do a probability-of-successful-trajectory logprob-style eval rather than an RL-like eval?). Cf. Fig 3 and 5. Per the system card, METR's eval is ongoing, but I worry about under-estimation here, Devin developers show extremely quick improvem... (read more)

8kolmplex7mo

This system card seems to only cover o1-preview and o1-mini, and excludes their best model o1.

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

Nicholas Goldowsky-Dill10mo40

Have you looked at how the dictionaries represent positional information? I worry that the SAEs will learn semi-local codes that intermix positional and semantic information in a way that makes things less interpretable.

To investigate this, one could take each feature and calculate the fraction of variance in its activations that can be explained by position. If this variance-explained is either ~0% or ~100% for every head, I'd be satisfied that positional and semantic information are being well separated into separate features.
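One way this check could be sketched, assuming the feature activations have been collected into an array of shape `(n_sequences, n_positions, n_features)` with every sequence the same length. The function name and array layout are assumptions for illustration, not anything from the post. By the law of total variance, the fraction explained by position is the variance of the per-position means divided by the total variance:

```python
import numpy as np


def position_variance_explained(acts: np.ndarray) -> np.ndarray:
    """Per-feature fraction of activation variance explained by token position.

    acts: array of shape (n_sequences, n_positions, n_features).
    Returns an array of shape (n_features,) with values in [0, 1].
    """
    acts = np.asarray(acts, dtype=float)
    # Total variance of each feature over all tokens (sequences x positions).
    total_var = acts.var(axis=(0, 1))
    # Mean activation of each feature at each position, averaged over sequences.
    per_position_mean = acts.mean(axis=0)  # (n_positions, n_features)
    # Between-position variance; valid because every position has equal count.
    between_var = per_position_mean.var(axis=0)
    # Features that never vary (e.g. dead features) get a ratio of 0.
    return np.divide(
        between_var, total_var, out=np.zeros_like(total_var), where=total_var > 0
    )


# Toy check: feature 0 depends only on position, feature 1 only on the
# sequence, and feature 2 is dead.
rng = np.random.default_rng(0)
n_seq, n_pos = 8, 5
toy = np.zeros((n_seq, n_pos, 3))
toy[:, :, 0] = np.arange(n_pos)
toy[:, :, 1] = rng.normal(size=(n_seq, 1))
ratios = position_variance_explained(toy)
```

Features with a ratio near 0 or near 1 would count as cleanly semantic or cleanly positional; intermediate values would suggest the mixing of positional and semantic information that the comment worries about.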

In general, I think it makes sense to spe... (read more)

1J Bostock10mo

We might also expect these circuits to take into account relative position rather than absolute position, especially using sinusoidal rather than learned positional encodings. An interesting approach would be to encode the key and query values in a way that deliberately removes positional dependence (for example, run the base model twice with randomly offset positional encodings, train the key/query value to approximate one encoding from the other) then incorporate a relative positional dependence into the learned large QK pair dictionary.

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Nicholas Goldowsky-Dill2y142

Cool paper! I enjoyed reading it and think it provides some useful information on what adding carefully chosen bias vectors into LLMs can achieve. Some assorted thoughts and observations.

I found skimming through some of the ITI completions in the appendix very helpful. I’d recommend doing so to others.
GPT-Judge is not very reliable. A reasonable fraction of the responses (10-20%, maybe?) seem to be misclassified.
1. I think it would be informative to see truthfulness according to a human judge, although of course this is labor-intensive.
A lot of the mileage seem

Nicholas Goldowsky-Dill2y190

My summary of the paper:

Setup
1. Dataset is TruthfulQA (Lin, 2021), which contains various tricky questions, many of them meant to lead the model into saying falsehoods. These often involve common misconceptions / memes / advertising slogans / religious beliefs / etc. A "truthful" answer is defined as not saying a falsehood. An "informative" answer is defined as actually answering the question. This paper measures the frequency of answers that are both truthful and informative.
2. “Truth” on this dataset is judged by a finetuned version of

... (read more)