Another threat model you could care about (within persuasion) is targeted recruitment for violent ideologies. With that one too it's plausible you'd want a more targeted eval, though I think simplicity, generality, and low cost are also reasonable things to optimize for in evals.

DeepMind's "Frontier Safety Framework" is weak and unambitious

Rohin Shah, 1d

Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.

You're right: our deployment mitigations are targeted at misuse only because our current framework focuses on misuse. As we note in the "Future work" section, we would need to do more work to address risks from misaligned AI. We focused on risks from deliberate misuse initially because they seemed more likely to us to appear first.

DeepMind: Evaluating Frontier Models for Dangerous Capabilities

Rohin Shah, 4d

E.g., much more of the action is in deciding exactly who to influence and what to influence them to do.

Are you thinking specifically of exfiltration here?

Persuasion can be used for all sorts of things if you are considering both misuse and misalignment, so if you are considering a specific threat model, I expect my response will be "sure, but there are other threat models where the 'who' and 'what' can be done by humans".

DeepMind's "Frontier Safety Framework" is weak and unambitious

Rohin Shah, 12d

Thanks for the detailed critique – I love that you actually read the document in detail. A few responses on particular points:

The document doesn't specify whether "deployment" includes internal deployment.

Unless otherwise stated, "deployment" to us means external deployment – because this is the way most AI researchers use the term. Deployment mitigations level 2 discusses the need for mitigations on internal deployments. ML R&D will require thinking about internal deployments (and so will many of the other CCLs).

Some people get unilateral access to weights until the top level. This is disappointing. It's been almost a year since Anthropic said it was implementing two-party control, where nobody can unilaterally access the weights.

I don't think Anthropic meant to claim that two-party control would achieve this property. I expect anyone using a cloud compute provider is trusting that the provider will not access the model, not securing it against such unauthorized access. (In principle some cryptographic schemes could allow you to secure model weights even from your cloud compute provider, but I highly doubt people are doing that, since it is very expensive.)

Mostly they discuss developers' access to the weights. This is disappointing. It's important but lots of other stuff is important too.

The emphasis on weights access isn’t meant to imply that other kinds of mitigations don’t matter. We focused on what it would take to increase our protection against exfiltration. A lot of the example measures discussed in the RAND interim report aren’t discussed because we already do them. For example, Google already does the following from RAND Level 3: (a) develop an insider threat program and (b) deploy advanced red-teaming. (That’s not meant to be exhaustive; I don’t personally know the details here.)

No mention of evals during deployment (to account for improvements in scaffolding, prompting, etc.).

Sorry, that's just poor wording on our part -- "every 3 months of fine-tuning progress" was meant to capture that as well. Thanks for pointing this out!

Talking about plans like this is helpful. But with no commitments, DeepMind shouldn't get much credit.

With the FSF, we prefer to try it out for a while and iron out any issues, particularly since the science is in early stages, and best practices will need to evolve as we learn more. But as you say, we are running evals even without official FSF commitments, e.g. the Gemini 1.5 tech report has dangerous capability evaluation results (see Section 9.5.2).

Given recent updates in AGI safety overall, I'm happy that GDM and Google leadership takes commitments seriously, and thinks carefully about which ones they are and are not willing to make, including the FSF, the White House Commitments, etc.

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

Rohin Shah, 18d

It's interesting to look back at this question 4 years later; I think it's a great example of the difficulty of choosing the right question to forecast in the first place.

I think it is still pretty unlikely that the criterion I outlined is met -- Q2 on my survey still seems like a bottleneck. I doubt that AGI researchers would talk about instrumental convergence in the kind of conversation I outlined. But reading the motivation for the question, it sure seems like a question that reflected the motivation well would have resolved yes by now (probably some time in 2023), given the current state of discourse and the progress in the AI governance space. (Though you could argue that the governance space is still primarily focused on misuse rather than misalignment.)

I did quite deliberately include Q2 in my planned survey -- I think it's important that the people whom governments defer to in crafting policy understand the concerns, rather than simply voicing support. But I failed to notice that it is quite plausible (indeed, the default) for there to be a relatively small number of experts who understand the concerns in enough depth to produce good advice on policy, plus a large base of "voicing support" from other experts who don't have that same deep understanding. This means that it's very plausible that the fraction defined in the question never gets anywhere close to 0.5, but nonetheless the AI community "agrees on the risk" to a sufficient degree that governance efforts do end up in a good place.

Refusal in LLMs is mediated by a single direction

Rohin Shah, 1mo

Because I don't think this is realistically useful, I don't think this at all reduces my probability that your techniques are fake and your models of interpretability are wrong.
Maybe the groundedness you're talking about comes from the fact that you're doing interp on a domain of practical importance?

??? Come on, there's clearly a difference between "we can find an Arabic feature when we go looking for anything interpretable" vs "we chose from the relatively small set of practically important things and succeeded in doing something interesting in that domain". I definitely agree this isn't yet close to "doing something useful, beyond what well-tuned baselines can do". But this should presumably rule out some hypotheses that current interpretability results are due to an extreme streetlight effect?

(I suppose you could have already been 100% confident that results so far weren't the result of an extreme streetlight effect and so you didn't update, but imo that would just make you overconfident in how good current mech interp is.)

(I'm basically saying similar things as Lawrence.)

Explaining grokking through circuit efficiency

Rohin Shah, 1mo

Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm?

Perhaps under normal circumstances both are learned so fast that you just don't notice that one is slower than the other, and this slows both of them down enough that you can see the difference?

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

Rohin Shah, 1mo

Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.
Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant circumstantial evidence) and (b) the amount of interference differs significantly based on which frequencies you use (which in turn changes the quality of the logits holding parameter norm fixed, and thus changes efficiency).

In principle this can be tested by randomly sampling frequency sets, simulating the level of interference you'd get, using that to estimate the efficiency + critical dataset size for that grokking circuit. This gives you a predicted distribution over critical dataset sizes, which you could compare against the actual distribution.

Tbc there are other hypotheses too, e.g. perhaps different frequency sets are easier / harder to implement by the neural network architecture.

Improving Dictionary Learning with Gated Sparse Autoencoders

Rohin Shah, 1mo

This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs.

The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to $\mathrm{ReLU}(π_{gate}(x))$, so you might think of $π_{gate}$ as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on $f_{mag}$ for everything else.
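
As a rough illustration of that split (a minimal NumPy sketch, not the paper's implementation): the function names and toy numbers below are made up for the example, and it assumes the L1 penalty is taken on the ReLU of the gate pre-activations. The gate path contributes only an on/off bit per feature, and the magnitude path supplies the value.

```python
import numpy as np

def gated_activations(pi_gate, pi_mag):
    """Combine a binary gate with a separately computed magnitude.

    pi_gate and pi_mag are hypothetical pre-activation arrays of the same shape.
    """
    # Heaviside step: keep only one bit per feature (on / off) from the gate path.
    gate = (pi_gate > 0).astype(pi_mag.dtype)
    # Magnitude path: a plain ReLU that the L1 penalty never touches directly.
    f_mag = np.maximum(pi_mag, 0.0)
    return gate * f_mag

def l1_penalty(pi_gate):
    """Sparsity penalty applied to the gate path only (assumed form)."""
    return np.maximum(pi_gate, 0.0).sum(axis=-1)

# Toy pre-activations for three features.
pi_gate = np.array([-1.0, 0.5, 2.0])
pi_mag = np.array([3.0, -1.0, 4.0])

features = gated_activations(pi_gate, pi_mag)
penalty = l1_penalty(pi_gate)
```

In this toy case the first feature is switched off by the gate despite its large magnitude pre-activation, and the penalty depends only on `pi_gate`, so any shrinkage pressure falls on the gate path rather than on the reported magnitudes.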

Improving Dictionary Learning with Gated Sparse Autoencoders

Rohin Shah, 1mo

Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it.
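
To spell out the geometry (a standard least-squares fact, not a derivation from the paper; the symbols $a$, $b$, $c$ and $θ$ are introduced here purely for illustration): take two vectors $a$ and $b$ with $\|a\| = \|b\|$ and angle $θ$ between them, and ask which scalar $c$ brings $c\,a$ closest to $b$.

$$
\|c\,a - b\|^{2} = c^{2}\|a\|^{2} - 2c\,(a \cdot b) + \|b\|^{2},
\qquad
c^{*} = \frac{a \cdot b}{\|a\|^{2}} = \frac{\|a\|\,\|b\|\cos θ}{\|a\|^{2}} = \cos θ
$$

So $c^{*} < 1$ whenever $θ \neq 0$: the closest rescaled copy of $a$ is always a shrunken one, even when no regularization is involved.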

This was actually the key motivation for building this metric in the first place, instead of just looking at the ratio $\frac{E[\|\hat{x}\|^{2}]}{E[\|x\|^{2}]}$. Looking at the $γ$ that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regularization, and not capturing the "inherent" need to shrink the vector given these nonzero angles. (In particular, if we computed $\frac{E[\|\hat{x}\|^{2}]}{E[\|x\|^{2}]}$ for Gated SAEs, I expect that would be below 1.)

I think the main thing we got wrong is that we accidentally treated $E[\|\hat{x} - x\|^{2}]$ as though it were $E[\|\hat{x} - γ x\|^{2}]$. To the extent that was the main mistake, I think it explains why our results still look how we expected them to -- usually $γ$ is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small.

We're going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it's probably worth waiting for that -- I expect we'll provide much more detailed derivations that make everything a lot clearer.