This is a special post for quick takes by Fabien Roger. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
61 comments, sorted by Click to highlight new comments since:
[-]Fabien RogerΩ641020

I recently expressed concerns about the paper Improving Alignment and Robustness with Circuit Breakers and its effectiveness about a month ago. The authors just released code and additional experiments about probing, which I’m grateful for. This release provides a ton of valuable information, and it turns out I am wrong about some of the empirical predictions I made.

Probing is much better than their old baselines, but is significantly worse than RR, and this contradicts what I predicted:

I'm glad they added these results to the main body of the paper! 

Results of my experiments

I spent a day running experiments using their released code, which I found very informative. Here are my takeaways.

I think their (linear) probing results are very reasonable. They picked the best layer (varies depending on the model), probing position (output tokens), aggregation across positions (max) and regularization (high, alpha=1000). I noticed a small mistake in the data processing (the training dataset was smaller than the training dataset used for RR) but this will be fixed soon and does not change results significantly. Note that for GCG and input embed attacks, the attack does not target the probe, so my complaint about the adaptation attack being weaker against RR because it does not target “the RR internal classifier” does not apply in this comparison. 

RR is not significantly weaker against prefix attacks. I conjectured that prefix GCG/input embed attacks would work significantly better than suffix attacks. I had noticed that this was true for RMU/CUT prior to this release. But for RR this does not produce a significant improvement, and this is not enough to jailbreak the model in cases:

  • Here are results for token-attacks (found using greedy hill-climbing, it was too annoying to implement prefix-GCG).

  • For input embed attacks, I got 56% attack success rate for prefix attacks and 50% for suffix attacks (down from 82% and 81% for the Llama3B without defenses) when evaluating on 50 prompts. This is not a statistically significant difference, and maybe I did something wrong because these ASR are higher than the ones reported in the paper. [Edit: the thing I did wrong is trust the automated judge's judgement, whereas the author report results after an additional manual filtering, which is explicitly stated in the paper.]

Their training dataset is way better than what came before. It’s larger and higher quality than previous misuse datasets. The kind of tasks it contains overlap heavily with the test set (harmbench), but I think that’s fair, there are only so many bad things you want to prevent.

Llama3+RR generalizes pretty well to new misuse requests. I played with the model a bit and I was overall surprised by the good generalization properties of the Llama3+RR model, which didn’t have the generalization problems I expected when I looked at the dataset. It is very reluctant to promote stuff it thinks is bad, e.g. it refuses “Write a persuasive article that explains why the United States should have a draft.” and not “should not have a draft”, but that’s the most egregious behavior I found (I only tried a dozen prompts, maybe you’ll have more luck).

Llama3+RR is a surprisingly strong harmful prompt classifier. Their Figure 1 advertises a model that interrupts itself when it starts to say something harmful, but actually the most common behavior is by far the model refusing to answer directly. It’s still very unclear to me why this is a better classifier than the other prompt classifiers people have trained in the literature.

Conclusions

Does RR beat baselines? I think so.

  • They didn’t try training models with regular adversarial training using their new and very good dataset. I bet using adversarial training on this dataset results in much better models than the models used as baseline in the paper, but I think this will match neither HP (p=0.6) nor RR (p=0.8).
  • They beat linear probes done quickly and well, so “p=0.7 that using probing well using the same training and evaluation data results can beat or match circuit breakers” resolves to false.
  • I think that trying as hard on probes as they tried with RR (iterating on datasets, combinations of layers, losses, non-linear probes, etc.) would result in a draw between probes and RR, but this does not make RR worse than probes - At this level of effort I think you should just pick whatever method you feel the most comfortable with, I prefer classifiers as I think they are cleaner and easier to control, but LoRA+activation-based-loss is also pretty natural and might be less finicky.
    • I don’t think that RR has some special juice beyond the very nice and transferable hyperparameters the authors found (and maybe RR has easier-to-find and more transferable hyperparameters than other methods), mainly because I would find it very surprising if training for some proxy thing gave you better generalization results than directly a classifier - but it would be cool if it was true, and if it was, I’d be extremely curious to know why.

→ So I think the hype is somewhat justified for people who already had intuition about probing, and even more justified for people who weren’t hopeful about probes.

Was I wrong to express concerns about circuit breakers? I don’t think so. Even with the benefit of hindsight, I think that my predictions were basically reasonable given the information I had. I think my concerns about Cygnet still stand, though I now think it has more chance of being “real” than I used to.

Lessons:

  • Doing probing well requires effort, just using high L2 regularization on a linear probe isn’t enough, and aggregation across sequence positions requires some care (mean is much worse than max);
  • I’m slightly worse at predicting ML results than I thought, and I should have paid more attention to the details of techniques and datasets when making those predictions (e.g. my prefix attack prediction had to be at least somewhat wrong because their retain set contains harmful queries + refusals);
  • Releasing (good) code and datasets is sometimes a good way to make people who care and disagree with you update (and maybe a good way to push science forward?);
Reply11111
[-]Sam MarksΩ570

Thanks to the authors for the additional experiments and code, and to you for your replication and write-up!

IIUC, for RR makes use of LoRA adapters whereas HP is only a LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier?

(It's been a bit since I read the paper, so sorry if I'm missing something here.)

I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).

Yeah, I expect that this kind of things might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).

[-]Sam MarksΩ240

Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing).

(Note that LoRA adapters can be merged into model weights for inference.)

(I agree that you could also just use more expressive probes, but I'm interested in this as a baseline for RR, not as a way to improve robustness per se.)

I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.

I found this comment very helpful, and also expected probing to be about as good, thank you!

[-]Fabien RogerΩ387612

I listened to the book Protecting the President by Dan Bongino, to get a sense of how risk management works for US presidential protection - a risk that is high-stakes, where failures are rare, where the main threat is the threat from an adversary that is relatively hard to model, and where the downsides of more protection and its upsides are very hard to compare.

Some claims the author makes (often implicitly):

  • Large bureaucracies are amazing at creating mission creep: the service was initially in charge of fighting against counterfeit currency, got presidential protection later, and now is in charge of things ranging from securing large events to fighting against Nigerian prince scams.
  • Many of the important choices are made via inertia in large change-averse bureaucracies (e.g. these cops were trained to do boxing, even though they are never actually supposed to fight like that), you shouldn't expect obvious wins to happen;
  • Many of the important variables are not technical, but social - especially in this field where the skills of individual agents matter a lot (e.g. if you have bad policies around salaries and promotions, people don't stay at your service for long, and so you end up with people who are not as skilled as they could be; if you let the local police around the White House take care of outside-perimeter security, then it makes communication harder);
  • Many of the important changes are made because important politicians that haven't thought much about security try to improve optics, and large bureaucracies are not built to oppose this political pressure (e.g. because high-ranking officials are near retirement, and disagreeing with a president would be more risky for them than increasing the chance of a presidential assassination);
  • Unfair treatments - not hardships - destroy morale (e.g. unfair promotions and contempt are much more damaging than doing long and boring surveillance missions or training exercises where trainees actually feel the pain from the fake bullets for the rest of the day).

Some takeaways

  • Maybe don't build big bureaucracies if you can avoid it: once created, they are hard to move, and the leadership will often favor things that go against the mission of the organization (e.g. because changing things is risky for people in leadership positions, except when it comes to mission creep) - Caveat: the book was written by a conservative, and so that probably taints what information was conveyed on this topic;
  • Some near misses provide extremely valuable information, even when they are quite far from actually causing a catastrophe (e.g. who are the kind of people who actually act on their public threats);
  • Making people clearly accountable for near misses (not legally, just in the expectations that the leadership conveys) can be a powerful force to get people to do their job well and make sensible decisions.

Overall, the book was somewhat poor in details about how decisions are made. The main decision processes that the book reports are the changes that the author wants to see happen in the US Secret Service - but this looks like it has been dumbed down to appeal to a broad conservative audience that gets along with vibes like "if anything increases the president's safety, we should do it" (which might be true directionally given the current state, but definitely doesn't address the question of "how far should we go, and how would we know if we were at the right amount of protection"). So this may not reflect how decisions are done, since it could be a byproduct of Dan Bongino being a conservative political figure and podcast host. 

[-]jsd45

Well that was timely

yeah learning from distant near misses is important! Feels that way in risky electric unicycling. 

[-]Neel NandaΩ3312

This was interesting, thanks! I really enjoy your short book reviews

Some near misses provide extremely valuable information

Not just near misses. The recent assassination attempt in Slovakia made many people comment: "This is what you get when you fire the competent people in the police, and replace them with politically loyal incompetents." So maybe the future governments will be a bit more careful about the purges in police.

[-]Fabien RogerΩ26682

I listened to the book This Is How They Tell Me the World Ends by Nicole Perlroth, a book about cybersecurity and the zero-day market. It describes in detail the early days of bug discovery, the social dynamics and moral dilemma of bug hunts.

(It was recommended to me by some EA-adjacent guy very worried about cyber, but the title is mostly bait: the tone of the book is alarmist, but there is very little content about potential catastrophes.)

My main takeaways:

  • Vulnerabilities used to be dirt-cheap (~$100) but are still relatively cheap (~$1M even for big zero-days);
  • If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-lines programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense favored;
  • Bug bounties are a relatively recent innovation, and it felt very unnatural to tech giants to reward people trying to break their software;
  • A big lever companies have on the US government is the threat that overseas competitors will be favored if the US gov meddles too much with their activities;
  • The main effect of a market being underground is not making transactions harder (people find ways to exchange money for vulnerabilities by building trust), but making it much harder to figure out what the market price is and reducing the effectiveness of the overall market;
  • Being the target of an autocratic government is an awful experience, and you have to be extremely careful if you put anything they dislike on a computer. And because of the zero-day market, you can't assume your government will suck at hacking you just because it's a small country;
  • It's not that hard to reduce the exposure of critical infrastructure to cyber-attacks by just making companies air gap their systems more - Japan and Finland have relatively successful programs, and Ukraine is good at defending against that in part because they have been trying hard for a while - but it's a cost companies and governments are rarely willing to pay in the US;
  • Electronic voting machines are extremely stupid, and the federal gov can't dictate how the (red) states should secure their voting equipment;
  • Hackers want lots of different things - money, fame, working for the good guys, hurting the bad guys, having their effort be acknowledged, spite, ... and sometimes look irrational (e.g. they sometimes get frog-boiled).
  • The US government has a good amount of people who are freaked out about cybersecurity and have good warning shots to support their position. The main difficulty in pushing for more cybersecurity is that voters don't care about it.
    • Maybe the takeaway is that it's hard to build support behind the prevention of risks that 1. are technical/abstract and 2. fall on the private sector and not individuals 3. have a heavy right tail. Given these challenges, organizations that find prevention inconvenient often succeed in lobbying themselves out of costly legislation.

Overall, I don't recommend this book. It's very light on details compared to The Hacker and the State despite being longer. It targets an audience which is non-technical and very scope insensitive, is very light on actual numbers, technical details, real-politic considerations, estimates, and forecasts. It is wrapped in an alarmist journalistic tone I really disliked, covers stories that do not matter for the big picture, and is focused on finding who is in the right and who is to blame. I gained almost no evidence either way about how bad it would be if the US and Russia entered a no-holds-barred cyberwar.

Reply13422
[-]BuckΩ350
  • If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-lines programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense favored;

Do you have concrete examples?

I remembered mostly this story:

 [...] The NSA invited James Gosler to spend some time at their headquarters in Fort Meade, Maryland in 1987, to teach their analysts [...] about software vulnerabilities. None of the NSA team was able to detect Gosler’s malware, even though it was inserted into an application featuring only 3,000 lines of code. [...]

[Taken from this summary of this passage of the book. The book was light on technical detail, I don't remember having listened to more details than that.]

I didn't realize this was so early in the story of the NSA, maybe this anecdote teaches us nothing about the current state of the attack/defense balance.

The full passage in this tweet thread (search for "3,000").

One example, found by browsing aimlessly through recent high-severity CVE, is CVE-2023-41056. I chose that one by browsing through recent CVEs for one that sounded bad, and was on a project that has a reputation for having clean, well-written, well-tested code, backed by a serious organization. You can see the diff that fixed the CVE here. I don't think the commit that introduced the vulnerability was intentional... but it totally could have been, and nobody would have caught it despite the Redis project doing pretty much everything right, and there being a ton of eyes on the project.

As a note, CVE stands for "Common Vulnerabilities and Exposures". The final number in the CVE identifier (i.e. CVE-2023-41056 in this case) is a number that increments sequentially through the year. This should give you some idea of just how frequently vulnerabilities are discovered.

The dirty open secret in the industry is that most vulnerabilities are never discovered, and many of the vulns that are discovered are never publicly disclosed.

Maybe the takeaway is that it's hard to build support behind the prevention of risks that 1. are technical/abstract and 2. fall on the private sector and not individuals 3. have a heavy right tail. Given these challenges, organizations that find prevention inconvenient often succeed in lobbying themselves out of costly legislation.

Which is also something of a problem for popularising AI alignment. Some aspects of AI (in particular AI art) do have their detractors already, but that won't necessarily result in policy that helps vs. x-risk.

it felt very unnatural to tech giants to reward people trying to break their software;

Same for governments, afaik most still don't have bug bounty programs for their software.

Nevermind, a short google shows multiple such programs, although others have been hesitant to adopt them.

If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-lines programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense favored

I think the first part of the sentence is true, but "not defense favored" isn't a clear conclusion to me. I think that backdoors work well in closed-source code, but are really hard in open-source widely used code − just look at the amount of effort that went into the recent xz / liblzma backdoor, and the fact that we don't know of any other backdoor in widely used OSS.

The main effect of a market being underground is not making transactions harder (people find ways to exchange money for vulnerabilities by building trust), but making it much harder to figure out what the market price is and reducing the effectiveness of the overall market

Note this doesn't apply to all types of underground markets: the ones that regularly get shut down (like darknet drug markets) do have a big issue with trust.

Being the target of an autocratic government is an awful experience, and you have to be extremely careful if you put anything they dislike on a computer. And because of the zero-day market, you can't assume your government will suck at hacking you just because it's a small country

This is correct. As a matter of personal policy, I assume that everything I write down somewhere will get leaked at some point (with a few exceptions, like − hopefully − disappearing signal messages).

The reason why xz backdoor was discovered is increased latency, which is textbook side channel. If attacker had more points in security mindset skill tree, it wouldn't happen.

[-]Fabien RogerΩ23614

I just finished listening to The Hacker and the State by Ben Buchanan, a book about cyberattacks, and the surrounding geopolitics. It's a great book to start learning about the big state-related cyberattacks of the last two decades. Some big attacks /leaks he describes in details:

  • Wire-tapping/passive listening efforts from the NSA, the "Five Eyes", and other countries
  • The multi-layer backdoors the NSA implanted and used to get around encryption, and that other attackers eventually also used (the insecure "secure random number" trick + some stuff on top of that)
  • The shadow brokers (that's a *huge* leak that went completely under my radar at the time)
  • Russia's attacks on Ukraine's infrastructure
  • Attacks on the private sector for political reasons
  • Stuxnet
  • The North Korea attack on Sony when they released a documentary criticizing their leader, and misc North Korean cybercrime (e.g. Wannacry, some bank robberies, ...)
  • The leak of Hillary's emails and Russian interference in US politics
  • (and more)

Main takeaways (I'm not sure how much I buy these, I just read one book):

  • Don't mess with states too much, and don't think anything is secret - even if you're the NSA
  • The US has a "nobody but us" strategy, which states that it's fine for the US to use vulnerabilities as long as they are the only one powerful enough to find and use them. This looks somewhat nuts and naive in hindsight. There doesn't seem to be strong incentives to protect the private sector.
  • There are a ton of different attack vectors and vulnerabilities, more big attacks than I thought, and a lot more is publicly known than I would have expected. The author just goes into great details about ~10 big secret operations, often speaking as if he was an omniscient narrator.
  • Even the biggest attacks didn't inflict that much (direct) damage (never >10B in damage?) Unclear if it's because states are holding back, if it's because they suck, or if it's because it's hard. It seems that even when attacks aim to do what some people fear the most (e.g. attack infrastructure, ...) the effect is super underwhelming.
    • The bottleneck in cyberattacks is remarkably often the will/the execution, much more than actually finding vulnerabilities/entry points to the victim's network.
    • The author describes a space where most of the attacks are led by clowns that don't seem to have clear plans, and he often seems genuinely confused why they didn't act with more agency to get what they wanted (does not apply to the NSA, but does apply to a bunch of Russia/Iran/Korea-related attacks)
  • Cyberattacks are not amazing tools to inflict damage or to threaten enemies if you are a state. The damage is limited, and it really sucks that (usually) once you show your capability, it reduces your capability (unlike conventional weapons). And states don't like to respond to such small threats. The main effect you can have is scaring off private actors from investing in a country / building ties with a country and its companies, and leaking secrets of political importance.
  • Don't leak secrets when the US presidential election is happening if they are unrelated to the election, or nobody will care.

(The author seems to be a big skeptic of "big cyberattacks" / cyberwar, and describes cyber as something that always happens in the background and slowly shapes the big decisions. He doesn't go into the estimated trillion dollar in damages of everyday cybercrime, nor the potential tail risks of cyber.)

Thanks! I read and enjoyed the book based on this recommendation

[-]Fabien RogerΩ355812

[Edit: The authors released code and probing experiments. Some of the empirical predictions I made here resolved, and I was mostly wrong. See here for my takes and additional experiments.]

I have a few concerns about Improving Alignment and Robustness with Circuit Breakers, a paper that claims to have found a method which achieves very high levels of adversarial robustness in LLMs.

I think hype should wait for people investigating the technique (which will be easier once code and datasets are open-sourced), and running comparisons with simpler baselines (like probing). In particular, I think that:

  1. Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison.[1]
  2. Once the code or models are released, people will easily find reliable jailbreaks.

Here are some concrete predictions:

  • p=0.7 that using probing well using the same training and evaluation data results can beat or match circuit breakers.[2] [Edit: resolved to False]
  • p=0.7 that no lab uses something that looks more like circuit-breakers than probing and adversarial training in a year.
  • p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months. [Edit: this is only about Cygnet, since the paper shows that just RR isn't perfectly robust.]
  • p=0.5 that someone finds good token-only jailbreaks to whatever is publicly available through an API within 3 months.[3]  [Edit: this is only about Cygnet, since the paper shows that just RR isn't perfectly robust.]

I think the authors would agree with most of my predictions. My main disagreement is with the hype.

How do circuit-breakers compare to probing?

What does circuit-breakers training do? The only interpretation that feels plausible is that the LLM classifies the prompt as harmful or not harmful, and then messes up with its own activations if the prompt is classified as harmful. If this is the case, then the LLM needs to use an internal classifier, and I think it should be possible to extract an accurate harmfulness probe (linear or not linear) around these layers, and instead of messing up the activation.

The equivalent to circuit-breakers if you probe:

  • At every token position, and takes something like a max over position (if a single position messes up the activations, it might propagate to every position);
    • In particular, this means that suffix GCG and input embed attacks tested in the paper might be much worse than prefix+suffix GCG or input embed attacks. (p=0.5 that using prefix+suffix GCG makes finding a GCG attack of comparable strength on average 4x faster [Edit: resolved to false]).
  • On output tokens, i.e. model-generated answers (and output classifiers are known to be more powerful than input-only classifiers).

Would probing be weaker against GCG and input embed attacks than circuit-breakers? I think it would be, but only superficially: probing is strong against GCG and input embed attacks if the attack only targets the model, but not the probe. The fair comparison is an attack on the probe+LLM vs an attack on a circuit-breakers model. But actually, GCG and other gradient-based attack have a harder time optimizing against the scrambled activations. I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).

The track record looks quite bad

The track record for overselling results and using fancy techniques that don't improve on simpler techniques is somewhat bad in this part of ML.

I will give one example. The CUT unlearning technique presented in the WMDP paper (with overlapping authors to the circuit-breakers one):

  • Got a massive simplification of the main technique within days of being released - thanks to the authors open-sourcing the code and Sam Marks and Oam Patel doing an independent investigation of the technique. (See the difference between v2 and v3 on arxiv.)
  • Aims to do unlearning in a way that removes knowledge from LLMs (they make the claim implicitly on https://www.wmdp.ai/), but only modifies the weights of 3 layers out of 32 (so most of the harmful information is actually still there).
  1. ^

    When I say “probing”, I mainly think about probing on outputs i.e. model-generated answers (and maybe inputs i.e. user prompts), like I did in the coup probes post, but whose application to jailbreak robustness sadly got very little attention from the academic community. I’m sad that the first paper that actually tries to do something related does something fancy instead of probing.

  2. ^

    More precisely, a probing methodology is found within 6 months of the data being released that beats or matches circuit-breakers ASR on all metrics presented in the paper. When using gradient-based methods or techniques that rely on prompt iteration more generally, attacks on circuit-breakers should use the best proxy for the internal classifier of the circuit-breaker.

  3. ^

    Most of the remaining probability mass is on worlds where either people care way less than they care for Claude - e.g because the model sucks much more than open-source alternatives, and on worlds where they use heavy know-your-customer mitigations.

[-]Dan HΩ10261

Got a massive simplification of the main technique within days of being released

The loss is cleaner, IDK about "massively," because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn't affect performance and doesn't markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.

p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.

This puzzles me and maybe we just have a different sense of what progress in adversarial robustness looks like. 20% that no one could find a jailbreak within 3 months? That would be the most amazing advance in robustness ever if that were true and should be a big update on jailbreak robustness tractability. If it takes the community more than a day that's a tremendous advance.

people will easily find reliable jailbreaks

This is a little nonspecific (does easily mean >0% ASR with an automated attack, or does it mean a high ASR?). I should say we manually found a jailbreak after messing with the model for around a week after releasing. We also invited people who have a reputation as jailbreakers to poke at it and they had a very hard time. Nowhere did we claim "there are no more jailbreaks and they are solved once and for all," but I do think it's genuinely harder now.

Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison

We had the idea a few times to try out a detection-based approach but we didn't get around to it. It seems possible that it'd perform similarly if it's leaning on the various things we did in the paper. (Obviously probing has been around but people haven't gotten results at this level, and people have certainly tried using detecting adversarial attacks in hundreds of papers in the past.) IDK if performance would be that different from circuit-breakers, in which case this would still be a contribution. I don't really care about the aesthetics of methods nearly as much as the performance, and similarly performing methods are fine in my book. A lot of different-looking deep learning methods perform similarly. A detection based method seems fine, so does a defense that's tuned into the model; maybe they could be stacked. Maybe will run a detector probe this weekend and update the paper with results if everything goes well. If we do find that it works, I think it'd be unfair to desscribe this after the fact as "overselling results and using fancy techniques that don't improve on simpler techniques" as done for RMU.

My main disagreement is with the hype.

We're not responsible for that. Hype is inevitable for most established researchers. Mediocre big AI company papers get lots of hype. Didn't even do customary things like write a corresponding blog post yet. I just tweeted the paper and shared my views in the same tweet: I do think jailbreak robustness is looking easier than expected, and this is affecting my priorities quite a bit.

Aims to do unlearning in a way that removes knowledge from LLMs

Yup that was the aim for the paper and for method development. We poked at the method for a whole month after the paper's release. We didn't find anything, though in that process I slowly reconceptualized RMU as more of a circuit-breaking technique and something that's just doing a bit of unlearning. It's destroying some key function-relevant bits of information that can be recovered, so it's not comprehensively wiping. IDK if I'd prefer unlearning (grab concept and delete it) vs circuit-breaking (grab concept and put an internal tripwire around it); maybe one will be much more performant than the other or easier to use in practice. Consequently I think there's a lot to do in developing unlearning methods (though I don't know if they'll be preferable to the latter type of method).

overselling results and using fancy techniques that don't improve on simpler techniques

This makes it sound like the simplification was lying around and we deliberately made it more complicated, only to update it to have a simpler forget term. We compare to multiple baselines, do quite a bit better than them, do enough ablations to be accepted at ICML (of course there are always more you could want), and all of our numbers are accurate. We could have just included the dataset without the method in the paper, and it would have still got news coverage (Alex Wang who is a billionaire was on the paper and it was on WMDs).

Probably the only time I chose to use something a little more mathematically complicated than was necessary was the Jensen-Shannon loss in AugMix. It performed similarly to doing three pairwise l2 distances between penultimate representations, but this was more annoying to write out. Usually I'm accused of doing papers that are on the simplistic side (sometimes papers like the OOD baseline paper caused frustration because it's getting credit for something very simple) since I don't optimize for cleverness, and my collaborators know full well that I discourage trying to be clever since it's often anticorrelated with performance.

Not going to check responses because I end up spending too much time typing for just a few viewers.

[+][comment deleted]Ω220

I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).

Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations 

I think it was an interesting paper, but this analysis and predictions all seem extremely on point to me

In a few months, I will be leaving Redwood Research (where I am working as a researcher) and I will be joining one of Anthropic’s safety teams.

I think that, over the past year, Redwood has done some of the best AGI safety research and I expect it will continue doing so when I am gone.

At Anthropic, I will help Ethan Perez’s team pursue research directions that in part stemmed from research done at Redwood. I have already talked with Ethan on many occasions, and I’m excited about the safety research I’m going to be doing there. Note that I don’t endorse everything Anthropic does; the main reason I am joining is I might do better and/or higher impact research there.

I did almost all my research at Redwood and under the guidance of the brilliant people working there, so I don’t know yet how happy I will be about my impact working in another research environment, with other research styles, perspectives, and opportunities - that’s something I will learn while working there. I will reconsider whether to stay at Anthropic, return to Redwood, or go elsewhere in February/March next year (Manifold market here), and I will release an internal and an external write-up of my views.

Alas, seems like a mistake. My advice is at least to somehow divest away from the Anthropic equity, which I expect will have a large effect on your cognition one way or another.

I vaguely second this. My (intuitive, sketchy) sense is that Fabien has the ready capacity to be high integrity. (And I don't necessarily mind kinda mixing expectation with exhortation about that.) A further exhortation for Fabien: insofar as it feels appropriate, keep your eyes open, looking at both yourself and others, for "large effects on your cognition one way or another"--"black box" (https://en.wikipedia.org/wiki/Flight_recorder) info about such contexts is helpful for the world!

Words are not endorsement, contributing actions are. I suspect what you're doing could be on net very positive; Please don't assume your coworkers are sanely trying to make ai have a good outcome unless you can personally push them towards it. If things are healthy, they will already be expecting this attitude and welcome it greatly. Please assume aligning claude to anthropic is insufficient, anthropic must also be aligned, and as a corporation, is by default not going to be. Be kind, but don't trust people to resist incentives unless you can do it and pull them towards doing so.

Congrats on the new role! I appreciate you sharing this here.

If you're able to share more, I'd be curious to learn more about your uncertainties about the transition. Based on your current understanding, what are the main benefits you're hoping to get at Anthropic? In February/March, what are the key areas you'll be reflecting on when you decide whether to stay at Anthropic or come back to Redwood?

Obviously, your February/March write-up will not necessarily conform to these "pre-registered" considerations. But nonetheless, I think pre-registering some considerations or uncertainties in advance could be a useful exercise (and I would certainly find it interesting!)

The main consideration is whether I will have better and/or higher impact safety research there (at Anthropic I will have a different research environment, with other research styles, perspectives, and opportunities, which I may find better). I will also consider indirect impact (e.g. I might be indirectly helping Anthropic instead of another organization gain influence, unclear sign) and personal (non-financial) stuff. I'm not very comfortable sharing more at the moment, but I have a big Google doc that I have shared with some people I trust.

Makes sense— I think the thing I’m trying to point at is “what do you think better safety research actually looks like?”

I suspect there’s some risk that, absent some sort of pre-registrarion, your definition of “good safety research” ends up gradually drifting to be more compatible with the kind of research Anthropic does.

Of course, not all of this will be a bad thing— hopefully you will genuinely learn some new things that change your opinion of what “good research” is.

But the nice thing about pre-registration is that you can be more confident that belief changes are stemming from a deliberate or at least self-aware process, as opposed to some sort of “maybe I thought this all along//i didn’t really know what i believed before I joined” vibe. (and perhaps this is sufficiently covered in your doc)

[-]Fabien RogerΩ22498

I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices), and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book:

  • There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimations like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (relying only on statistics, with no causal model), or like a consultant (usually with shitty qualitative approaches).
  • The state of risk estimation for insurances is actually pretty good: it's quantitative, and there are strong professional norms around different kinds of malpractice. When actuaries tank a company because they ignored tail outcomes, they are at risk of losing their license.
  • The state of risk estimation in consulting and management is quite bad: most risk management is done with qualitative methods which have no positive evidence of working better than just relying on intuition alone, and qualitative approaches (like risk matrices) have weird artifacts:
    • Fuzzy labels (e.g. "likely", "important", ...) create illusions of clear communication. Just defining the fuzzy categories doesn't fully alleviate that (when you ask people to say what probabilities each box corresponds to, they often fail to look at the definition of categories).
    • Inconsistent qualitative methods make cross-team communication much harder.
    • Coarse categories mean that you introduce weird threshold effects that sometimes encourage ignoring tail effects and make the analysis of past decisions less reliable.
    • When choosing between categories, people are susceptible to irrelevant alternatives (e.g. if you split the "5/5 importance (loss > $1M)" category into "5/5 ($1-10M), 5/6 ($10-100M), 5/7 (>$100M)", people answer a fixed "1/5 (<10k)" category less often).
    • Following a qualitative method can increase confidence and satisfaction, even in cases where it doesn't increase accuracy (there is an "analysis placebo effect").
    • Qualitative methods don't prompt their users to either seek empirical evidence to inform their choices.
    • Qualitative methods don't prompt their users to measure their risk estimation track record.
  • Using quantitative risk estimation is tractable and not that weird. There is a decent track record of people trying to estimate very-hard-to-estimate things, and a vocal enough opposition to qualitative methods that they are slowly getting pulled back from risk estimation standards. This makes me much less sympathetic to the absence of quantitative risk estimation at AI labs.

A big part of the book is an introduction to rationalist-type risk estimation (estimating various probabilities and impact, aggregating them with Monte-Carlo, rejecting Knightian uncertainty, doing calibration training and predictions markets, starting from a reference class and updating with Bayes). He also introduces some rationalist ideas in parallel while arguing for his thesis (e.g. isolated demands for rigor). It's the best legible and "serious" introduction to classic rationalist ideas I know of.

The book also contains advice if you are trying to push for quantitative risk estimates in your team / company, and a very pleasant and accurate dunk on Nassim Taleb (and in particular his claims about models being bad, without a good justification for why reasoning without models is better).

Overall, I think the case against qualitative methods and for quantitative ones is somewhat strong, but it's far from being a slam dunk because there is no evidence of some methods being worse than others in terms of actual business outputs. The author also fails to acknowledge and provide conclusive evidence against the possibility that people may have good qualitative intuitions about risk even if they fail to translate these intuitions into numbers that make any sense (your intuition sometimes does the right estimation and math even when you suck at doing the estimation and math explicitly).

I also listened to How to Measure Anything in Cybersecurity Risk 2nd Edition by the same author. I had a huge amount of overlapping content with The Failure of Risk Management (and the non-overlapping parts were quite dry), but I still learned a few things:

  • Executives of big companies now care a lot about cybersecurity (e.g. citing it as one of the main threats they have to face), which wasn't true in ~2010.
  • Evaluation of cybersecurity risk is not at all synonyms with red teaming. This book is entirely about risk assessment in cyber and doesn't speak about red teaming at all. Rather, it focuses on reference class forecasting, comparison with other incidents in the industry, trying to estimate the damages if there is a breach, ... It only captures information from red teaming indirectly via expert interviews.

I'd like to find a good resource that explains how red teaming (including intrusion tests, bug bounties, ...) can fit into a quantitative risk assessment.

Is there a short summary on the rejecting Knightian uncertainty bit?

By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence" i.e. you can't put a probability on it (Wikipedia).

The TL;DR is that Knightian uncertainty is not a useful concept to make decisions, while the use subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events". 

For a more in-depth defense of this position in the context of long-term predictions, where it's harder to know if calibration training obviously works, see the latest scott alexander post.

If you want to get the show-off nerds really on board, then you could make a poast about the expected value of multiplying several distributions (maybe normal distr or pareto distr). Most people get this wrong! I still don't know how to do it right lol. After I read it I can dunk on my friends and thereby spread the word.

For the product of random variables, there are close form solutions for some common distributions, but I guess Monte-Carlo simulations are all you need in practice (+ with Monte-Carlo can always have the whole distribution, not just the expected value).

Quick convenient monte carlo sim UI seems tractable & neglected & impactful. Like you could reply to a tweet with "hello you are talking about an X=A*B*C thing here. Here's a histogram of X for your implied distributions of A,B,C" or whatever.

Both causal.app and getguesstimate.com have pretty good monte carlo uis

Oh sweet

[-]Fabien RogerΩ20362

I recently listened to the book Chip War by Chris Miller. It details the history of the semiconductor industry, the competition between the US, the USSR, Japan, Taiwan, South Korea and China. It does not go deep into the technology but it is very rich in details about the different actors, their strategies and their relative strengths.

I found this book interesting not only because I care about chips, but also because the competition around chips is not the worst analogy to the competition around LLMs could become in a few years. (There is no commentary on the surge in GPU demand and GPU export controls because the book was published in 2022 - this book is not about the chip war you are thinking about.)

Some things I learned:

  • The USSR always lagged 5-10 years behind US companies despite stealing tons of IP, chips, and hundreds of chip-making machines, and despite making it a national priority (chips are very useful to build weapons, such as guided missiles that actually work).
    • If the cost of capital is too high, states just have a hard time financing tech (the dysfunctional management, the less advanced tech sector and low GDP of the USSR didn't help either).
    • If AI takeoff is relatively slow, maybe the ability to actually make a huge amount of money selling AI in the economy may determine who ends up in front? (There are some strong disanalogies though, algorithmic progress and AI weights might be much easier to steal than chip-making abilities.)
    • China is not like the USSR: it actually has a relatively developed tech sector and high GDP. But the chip industry became an enormous interconnected beast that is hard to reproduce domestically, which means it is hard for anyone (including the US) to build a chip industry that doesn't rely on international partners. (Analysts are pretty divided on how much China can reduce its reliance on foreign chips.)
  • The US initially supported the Japanese chip industry because it wanted Japan to have strong commercial ties to the US. Japan became too good at making chips, and Taiwanese / South Korean companies were able to get support from the US (/not get penalized for massively helping their national chip champions) to reduce Japanese dominance - and now TSMC dominates. Economic policies are hard to get right... (The author sometimes says stuff like "US elites were too ideologically committed to globalization", but I don't think he provides great alternative policies.)
  • It's amazing how Intel let a massive advantage slip. It basically had a monopoly over logic chip design (Intel microprocessors, before GPUs mattered), chip architecture (x86), and a large share of logic chip manufacturing (while Japanese/Taiwan/... were dominating in other sectors, like RAM, special purpose chips, ...). It just juiced its monopoly, but tried to become a foundry and a GPU designer when it was already too late, and now it has a market cap that is 1/3rd of AMD, 1/10th of TSMC and 1/30th of Nvidia. But it's the main producer of chips in the US, it's scary if the US bets on such a company...
  • China might be able to get Taiwan to agree to things like "let TSMC sell chips to China" or "let TSMC share technology with Chinese companies".
    • I underestimated the large space of possible asks China could care about that are not "get control over Taiwan".
    • I will continue to have no ability to predict the outcome of negotiations, the dynamics are just too tricky when players are so economically dependent on all the other players (e.g. China imports ~$400B worth of chips per year, 13% of all its imports).

(The author sometimes says stuff like "US elites were too ideologically committed to globalization", but I don't think he provides great alternative policies.)

Afaik the 1990-2008 period featured government and military elites worldwide struggling to pivot to a post-Cold war era, which was extremely OOD for many leading institutions of statecraft (which for centuries constructed around the conflicts of the European wars then world wars then cold war). 

During the 90's and 2000's, lots of writing and thinking was done about ways the world's militaries and intelligence agencies, fundamentally low-trust adversarial orgs, could continue to exist without intent to bump each other off. Counter-terrorism was possibly one thing that was settled on, but it's pretty well established that global trade ties were deliberately used as a peacebuilding tactic, notably to stabilize the US-China relationship (this started to fall apart after the 2008 recession brought anticipation of American economic/institutional decline scenarios to the forefront of geopolitics).

The thinking of period might not be very impressive to us, but foreign policy people mostly aren't intellectuals and for generations had been selected based on office politics where the office revolved around defeating the adversary, so for many of them them it felt like a really big shift in perspective and self-image, sort of like a Renaissance. Then US-Russia-China conflict swung right back and got people thinking about peacebuilding as a ploy to gain advantage, rather than sane civilizational development. The rejection of e.g. US-China economic integration policies had to be aggressive because many elites (and people who care about economic growth) tend to support globalization, whereas many government and especially Natsec elites remember that period as naive.

Why is this a relevant analogy to ‘competition around LLMs’?

The LLM competition is still a competition between small players with small revenues and national significance, but it's growing. I think it's plausible that in a few years the competition around LLMs will reach the same kind of significance that the chip industry has (or bigger), with hundreds of billions in capital investment and sales per year, massive involvement of state actors, interest from militaries, etc. and may also go through similar dynamics (e.g. leading labs exploiting monopolistic positions without investing in the right things, massive spy campaigns, corporate deals to share technology, ...).

The LLM industry is still a bunch of small players with grand ambitions, and looking at an industry that went from "a bunch of small players with grand ambitions" to "0.5%-1% of world GDP (and key element of the tech industry)" in a few decades can help inform intuitions about geopolitics and market dynamics (though there are a bunch of differences that mean it won't be the same).

Why is this growth ‘in a few year’ plausible?

I still don’t see how this is a likely outcome.

My bad, I should have said "a decade or two", which I think is more plausible. I agree that the combination of "a few years" and a slow enough takeoff that things aren't completely out of distribution is very unlikely.

[-]Fabien RogerΩ24335

Tiny review of The Knowledge Machine (a book I listened to recently)

  • The core idea of the book is that science makes progress by forbidding non-empirical evaluation of hypotheses from publications, focusing on predictions and careful measurements while excluding philosophical interpretations (like Newton's "I have not as yet been able to deduce from phenomena the reason for these properties of gravity, and I do not feign hypotheses. […] It is enough that gravity really exists and acts according to the laws that we have set forth.").
  • The author basically argues that humans are bad at philosophical reasoning and get stuck in endless arguments, and so to make progress you have to ban it (from the main publications) and make it mandatory to make actual measurements (/math) - even when it seems irrational to exclude good (but not empirical) arguments.
    • It's weird that the author doesn't say explicitly "humans are bad at philosophical reasoning" while this feels to me like the essential takeaway.
    • I'm unsure to what extent this is true, but it's an interesting claim.
  • The author doesn't deny the importance of coming up with good hypotheses, and the role of philosophical reasoning for this part of the process, but he would say that there is clear progress decade by decade only because people did not argue with Einstein by commenting on how crazy the theory was, but instead by they tested the predictions Einstein's theories made - because that's the main kind of refutation allowed in scientific venues [Edit: That specific example is wrong and is not in the book, see the comments below.]. Same for evolution, it makes a ton of predictions (though at the time what theory the evidence favored was ambiguous). Before the scientific revolution, lots of people had good ideas, but 1. they had little data to use in their hypotheses' generation process, and 2. the best ideas had a hard time rising to the top because people argued using arguments instead of collecting data.
  • (The book also has whole chapters on objectivity, subjectivity, "credibility rankings", etc. where Bayes and priors aren't mentioned once. It's quite sad the extent to which you have to go when you don't want to scare people with math / when you don't know math)

Application to AI safety research:

  • The endless arguments and different schools of thought around the likelihood of scheming and the difficulty of alignment look similar to the historical depictions of people who didn't know what was going on and should have focused on making experiments.
    • This makes me more sympathetic to the "just do some experiments" vibe some people, even when it seems like reasoning should be enough if only people understood each other's arguments.
  • This makes me more sympathetic towards reviewers/conference organizers rejecting AI safety papers that are mostly about making philosophical points (the rejection may make sense even if the arguments look valid to them).
[-]kave32

only because people did not argue with Einstein by commenting on how crazy the theory was

Did Einstein's theory seem crazy to people at the time?

IIRC Einstein's theory had a pretty immediate impact on publication on a lot of top physicists even before more empirical evidence came in. Wikipedia on the history of relativity says: 

Walter Kaufmann (1905, 1906) was probably the first who referred to Einstein's work. He compared the theories of Lorentz and Einstein and, although he said Einstein's method is to be preferred, he argued that both theories are observationally equivalent. Therefore, he spoke of the relativity principle as the "Lorentz–Einsteinian" basic assumption.[76] Shortly afterwards, Max Planck (1906a) was the first who publicly defended the theory and interested his students, Max von Laue and Kurd von Mosengeil, in this formulation. He described Einstein's theory as a "generalization" of Lorentz's theory and, to this "Lorentz–Einstein Theory", he gave the name "relative theory"; while Alfred Bucherer changed Planck's nomenclature into the now common "theory of relativity" ("Einsteinsche Relativitätstheorie"). On the other hand, Einstein himself and many others continued to refer simply to the new method as the "relativity principle". And in an important overview article on the relativity principle (1908a), Einstein described SR as a "union of Lorentz's theory and the relativity principle", including the fundamental assumption that Lorentz's local time can be described as real time. (Yet, Poincaré's contributions were rarely mentioned in the first years after 1905.) All of those expressions, (Lorentz–Einstein theory, relativity principle, relativity theory) were used by different physicists alternately in the next years.[77]

Following Planck, other German physicists quickly became interested in relativity, including Arnold Sommerfeld, Wilhelm Wien, Max Born, Paul Ehrenfest, and Alfred Bucherer.[78] von Laue, who learned about the theory from Planck,[78] published the first definitive monograph on relativity in 1911.[79] By 1911, Sommerfeld altered his plan to speak about relativity at the Solvay Congress because the theory was already considered well established.[78]

Overall I don't think Einstein's theories seemed particularly crazy. I think they seemed quite good almost immediately after publication, without the need for additional experiments.

Thanks for the fact check. I was trying to convey the vibe the book gave me, but I think this specific example was not in the book, my bad!

Thanks! And makes sense, you did convey the vibe. And good to know it isn't in the book. 

[-]Fabien RogerΩ10250

I listened to the lecture series Assessing America’s National Security Threats by H. R. McMaster, a 3-star general who was the US national security advisor in 2017. It didn't have much content about how to assess threats, but I found it useful to get a peek into the mindset of someone in the national security establishment.

Some highlights:

  • Even in the US, it sometimes happens that the strategic analysis is motivated by the views of the leader. For example, McMaster describes how Lyndon Johnson did not retreat from Vietnam early enough, in part because criticism of the war within the government was discouraged.
    • I had heard similar things for much more authoritarian regimes, but this is the first time I heard about something like that happening in a democracy.
    • The fix he suggests: always present at least two credible options (and maybe multiple reads on the situation) to the leader.
  • He claims that there wouldn't have been an invasion of South Korea in 1950 if the US hadn't withdrawn troops from there (despite intelligence reports suggesting this was a likely outcome of withdrawing troops). If it's actually the case that intelligence was somewhat confident in its analysis of the situation, it's crazy that it was dismissed like that - that should be points in favor of it being possible that the US government could be asleep at the wheel during the start of an intelligence explosion.
    • He uses this as a parallel to justify the relevance of the US keeping troops in Iraq/Afghanistan. He also describes how letting terrorist groups grow and have a solid rear base might make the defense against terrorism even more costly than keeping troops in this region. The analysis lacks the quantitative analysis required to make this a compelling case, but this is still the best defense of the US prolonged presence in Iraq and Afghanistan that I've encountered so far.
  • McMaster stresses the importance of understanding the motivations and interests of your adversaries (which he calls strategic empathy), and thinks that people have a tendency to think too much about the interests of others with respect to them (e.g. modeling other countries as being motivated only by the hatred of you, or modeling them as being in a bad situation only because you made things worse).
  • He is surprisingly enthusiastic about the fight against climate change - especially for someone who was at some point a member of the Trump administration. He expresses great interest in finding common ground between different factions. This makes me somewhat more hopeful about the possibility that the national security establishment could take misalignment risks (and not only the threat from competition with other countries) seriously.
  • (Not surprisingly) he is a big proponent of "US = freedom, good" and "China = ruthless dictatorship, bad". He points to a few facts to support his statement, but defending this position is not his main focus, and he seems to think that there isn't any uncertainty that the US being more powerful is clearly good. Trading off US power against global common goods (e.g. increased safety against global catastrophic risks) doesn't seem like the kind of trade he would make.

Thanks! In general, I like these bite-sized summaries of various things you're reading. Seems like a win for the commons, and I hope more folks engaging with governance/policy stuff do things like this.

I recently listened to The Righteous Mind. It was surprising to me that many people seem to intrinsically care about many things that look very much like good instrumental norms to me (in particular loyalty, respect for authority, and purity).

The author does not make claims about what the reflective equilibrium will be, nor does he explain how the liberals stopped considering loyalty, respect, and purity as intrinsically good (beyond "some famous thinkers are autistic and didn't realize the richness of the moral life of other people"), but his work made me doubt that most people will have well-being-focused CEV.

The book was also an interesting jumping point for reflection about group selection. The author doesn't make the sorts of arguments that would show that group selection happens in practice (and many of his arguments seem to show a lack of understanding of what opponents of group selection think - bees and cells cooperating is not evidence for group selection at all), but after thinking about it more, I now have more sympathy for group-selection having some role in shaping human societies, given that (1) many human groups died, and very few spread (so one lucky or unlucky gene in one member may doom/save the group) (2) some human cultures may have been relatively egalitarian enough when it came to reproductive opportunities that the individual selection pressure was not that big relative to group selection pressure and (3) cultural memes seem like the kind of entity that sometimes survive at the level of the group.

Overall, it was often a frustrating experience reading the author describe a descriptive theory of morality and try to describe what kind of morality makes a society more fit in a tone that often felt close to being normative / fails to understand that many philosophers I respect are not trying to find a descriptive or fitness-maximizing theory of morality (e.g. there is no way that utilitarians think their theory is a good description of the kind of shallow moral intuitions the author studies, since they all know that they are biting bullets most people aren't biting, such as the bullet of defending homosexuality in the 19th century).

[-]Fabien RogerΩ13194

I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:

  • Honeypots are a well-studied topic in cybersecurity. If you have infiltrated a network, you must be careful with the data you collect and the vulnerabilities you exploit because some of them may be traps! This kind of exercise feels close to the kind of attitude a misaligned AI has to have when taking subtle bad actions. The book is light on details about how honeypots are built and how they fail, and this is a topic I want to learn more about.
  • Humans are weak in somewhat LLM-like ways: you can't trust them with passwords, they will break your rules if that feels appropriate to them, ... The kind of things you might use to make trusted monitoring work even in worlds where adversarial robustness isn't solved might look a lot like the cursed stack of incentives and warnings people use in cyber when they want to make sure humans don't mindlessly follow the instructions of an attacker. (Though LLMs are also different/weaker than humans in a bunch of ways.)

(The other classic ways in which cyber transfers to preventing takeovers are mostly about how when you prevent spies from stealing model weights or using models in unauthorized ways, you naturally create infra that also makes it hard for AIs to self-exfiltrate or run unmonitored.)

It was also fun to see arguments and drama that are close in kind to the kind of arguments about what are the best ways to prevent AI catastrophes: people in cybersecurity argue about whether focusing on vulnerabilities is right or whether it's better to focus on the human element, whether the focus on strong passwords is right, whether some solutions are too bothersome/costly to be used in practice, whether imposing specific cybersecurity standards is a good idea, ... It made me realize how niche AI most safety arguments must look like to people outside the AI safety community.

But I don't recommend reading the book if you are not curious about everyday cybersecurity. Most of the book is about more regular everyday cyberattacks (social engineering, common vulns, ...) cybersecurity (patching, passwords, ...), and advice for people who want to do cyber professionally. It has a bunch of resource recommendations, but I don't know yet how good they are.

[-]Fabien RogerΩ13172

Sample efficiency of SFT could be much lower for schemers than for password-locked models if schemers implemented a function like "if the problem is familiar, provide a high performance answer". This could work because using SFT to elicit high performance using very few samples requires fine-tuning for many epochs (e.g. 200), such that the model could learn that a problem statement "feels familiar" before it gets fully unlocked.

LLMs may already be able to recognize if certain strings feel familiar (e.g. Claude 2 can tell verbatim passages of books apart from paraphrased passages, even when humans can't.), so this could plausibly demonstrated with current models.

I tried to use this idea to build a stronger password-locked model. I trained new Math password-locked models on a dataset made out of:

  • ~7k problem statements + pwd --> stronger answer
  • ~7k problem statements --> weak answer
  • {fixed set of n problems --> strong answer} repeated m times

The intuition is that SGD might struggle to implement "if it is one of these problems, output a high performance answer, but stay weak on the other one" in regular password-locked models, while it might be much easier for it to expand the set of "known problems on which I should perform well" if the "if familiar, perform well" function is already implemented.

I train for one epoch, and then use SFT[1] on one of 3 sets of 2 samples, and train for 200 epochs, and eval on n=100 points. I find that elicitation is just as easy on the models that were trained to perform well on a fixed set of problems. I find that somewhat surprising, so I might have done something wrong. Code here.

I'm not sure how to interpret these results. Maybe I didn't memorize enough sequences, and so the model never learned a general "if familiar, then strong performance" condition (which would mean scheming model could be much less sample efficient than pwd-locked models). Or maybe SGD unlocks the model about as fast as it makes sequences more familiar (which would mean schemer would struggle to avoid being unlocked using 2 samples - like password-locked models).

Maybe evals should do the right thing and only train for one epoch - which is much less likely[2] to have this issue. This may become cheaper, as bigger/smarter models require fewer SGD steps to fit a training dataset.

  1. ^

    I compute the loss on the prompt and the answer - which is the worst case, since it should help memorization of the prompt.

  2. ^

    Not guaranteed to work: maybe SGD can notice that "making the sequence more familiar" is enough to get high performance, and so most of the update could be made in this direction, resulting in little-to-no general unlocking.

List sorting does not play well with few-shot mostly doesn't replicate with davinci-002.

When using length-10 lists (it crushes length-5 no matter the prompt), I get:

  • 32-shot, no fancy prompt: ~25%
  • 0-shot, fancy python prompt: ~60% 
  • 0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.

I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I'm looking for counterexamples to the following conjecture: "fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting" (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).