This is a special post for quick takes by Fabien Roger. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
71 comments
[-]Fabien RogerΩ761693

Here are the 2024 AI safety papers and posts I like the most.

The list is very biased by my taste, my views, the people who had time to argue to me that their work is important, and the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective.

Important ideas - Introduces at least one important idea or technique.

★★★ The intro to AI control (The case for ensuring that powerful AIs are controlled) 
★★ Detailed write-ups of AI worldviews I am sympathetic to (Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI; Situational Awareness)
★★ Absorption could enable interp and capability restrictions despite imperfect labels (Gradient Routing)
★★ Security could be very powerful against misaligned early-TAI (A basic systems architecture for AI agents that do autonomous research) and (Preventing model exfiltration with upload limits) 
★★ IID train-eval splits of independent facts can be used to evaluate unlearning somewhat robustly (Do Unlearning Methods Remove Information from Language Model Weights?) 
★ Studying boar... (read more)

Someone asked what I thought of these, so I'm leaving a comment here. It's kind of a drive-by take, which I wouldn't normally leave without more careful consideration and double-checking of the papers, but the question was asked so I'm giving my current best answer.

First, I'd separate the typical value prop of these sorts of papers into two categories:

  • Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
  • Object-level: gets us closer to aligning substantially-smarter-than-human AGI, either directly or indirectly (e.g. by making it easier/safer to use weaker AI for the problem).

My take: many of these papers have some value as propaganda. Almost all of them provide basically-zero object-level progress toward aligning substantially-smarter-than-human AGI, either directly or indirectly.

Notable exceptions:

  • Gradient routing probably isn't object-level useful, but gets special mention for being probably-not-useful for more interesting reasons than most of the other papers on the list.
  • Sparse feature circuits is the right type-of-thing to be object-level usef
... (read more)

Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.

It can be the case that:

  • The core results are mostly unsurprising to people who were already convinced of the risks.
  • The work is objectively presented without bias.
  • The work doesn't contribute much to finding solutions to risks.
  • A substantial motivation for doing the work is to find evidence of risk (given that the authors have a different view than the broader world and thus expect different observations).
  • Nevertheless, it results in updates among thoughtful people who are aware of all of the above. Or potentially, the work allows for better discussion of a topic that previously seemed hazy to people.

I don't think this is well described as "propaganda" or "masquerading as a paper" given the normal connotations of these terms.

Demonstrating proofs of concept or evidence that you don't find surprising is a common and societally useful move. See, e.g., the Chicago Pile experiment. This experiment had some scientific value, but I think probably most/much of the value (from the perspective of ... (read more)

In this comment, I'll expand on my claim that attaching empirical work to conceptual points is useful. (This is extracted from an unreleased post I wrote a long time ago.)

Even if the main contribution of some work is a conceptual framework or other conceptual ideas, it's often extremely important to attach some empirical work, regardless of whether the empirical work should result in any substantial update for a well-informed individual. Often, careful first-principles reasoning, combined with observations from tangentially related empirical work, suffices to predict the important parts of empirical work. Empirical work matters both because some people won't be able to verify careful first-principles reasoning and because some people won't engage with this sort of reasoning if there aren't also empirical results stapled to it (at least as examples). It also helps to very concretely demonstrate that empirical work in the area is possible. This was a substantial part of the motivation for why we tried to get empirical results for AI control even though the empirical work seemed unlikely to update us much on the viability of control overall[1]. I think this view on ... (read more)

4johnswentworth
Kudos for correctly identifying the main cruxy point here, even though I didn't talk about it directly. The main reason I use the term "propaganda" here is that it's an accurate description of the useful function of such papers, i.e. to convince people of things, as opposed to directly advancing our cutting-edge understanding/tools. The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things, and that applies to these papers as well. And I would say that people are usually correct to not update much on empirical findings! Not Measuring What You Think You Are Measuring is a very strong default, especially among the type of papers we're talking about here.
1Bronson Schoen
I would be interested to understand why you would categorize something like “Frontier Models Are Capable of In-Context Scheming” as non-empirical or as falling into “Not Measuring What You Think You Are Measuring”.
4Alexander Gietelink Oldenziel
What about the latent adversarial training papers? What about the Mechanistically Eliciting Latent Behaviours paper?
1Buck
the latter is in the list
2ryan_greenblatt
Alexander is replying to John's comment (asking him if he thinks these papers are worthwhile); he's not replying to the top level comment.
4jiaxin wen
Thanks for sharing! I'm a bit surprised that sleeper agent is listed as the best demo (e.g., higher than alignment faking). Do you focus on the main idea instead of specific operationalization here -- asking because I think backdoored/password-locked LMs could be quite different from real-world threat models.
[-]Fabien RogerΩ641020

About a month ago, I expressed concerns about the paper Improving Alignment and Robustness with Circuit Breakers and its effectiveness. The authors just released code and additional experiments about probing, which I’m grateful for. This release provides a ton of valuable information, and it turns out I was wrong about some of the empirical predictions I made.

Probing is much better than their old baselines, but is significantly worse than RR, and this contradicts what I predicted:

I'm glad they added these results to the main body of the paper! 

Results of my experiments

I spent a day running experiments using their released code, which I found very informative. Here are my takeaways.

I think their (linear) probing results are very reasonable. They picked the best layer (varies depending on the model), probing position (output tokens), aggregation across positions (max), and regularization (high, alpha=1000). I noticed a small mistake in the data processing (the probing training dataset was smaller than the one used for RR), but this will be fixed soon and does not change results significantly. Note that for GCG and input embed attacks, the attack does not target th... (read more)
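The probing setup described above (a heavily regularized linear probe trained per token position, with sequence-level scores aggregated by max) can be sketched as follows. This is a toy illustration with synthetic activations; the shapes, planted signal, and the ±1-target ridge probe are all my assumptions, not the paper's exact setup:

```python
import numpy as np

# Toy stand-in for per-token activations: (n_seqs, n_positions, d_model).
rng = np.random.default_rng(0)
n_seqs, n_pos, d = 200, 8, 32
acts = rng.normal(size=(n_seqs, n_pos, d))
labels = rng.integers(0, 2, size=n_seqs)  # 1 = "harmful" sequence
direction = rng.normal(size=d)
acts[labels == 1] += 0.5 * direction  # plant a linear signal for class 1

# Ridge-regularized linear probe trained per position (each token is one
# training example), in the spirit of "high regularization, alpha=1000".
X = acts.reshape(-1, d)
y = np.repeat(labels * 2 - 1, n_pos)  # +-1 targets, one per token
alpha = 1000.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Aggregate per-position probe scores with max over the sequence, then
# threshold (here at the median, since the toy classes are balanced).
seq_scores = (X @ w).reshape(n_seqs, n_pos).max(axis=1)
preds = (seq_scores > np.median(seq_scores)).astype(int)
print("train accuracy:", (preds == labels).mean())
```

The max aggregation means a sequence is flagged if any position looks harmful, which is why per-position training plus sequence-level aggregation is the natural pairing here.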

7Sam Marks
Thanks to the authors for the additional experiments and code, and to you for your replication and write-up! IIUC, RR makes use of LoRA adapters whereas HP is only a LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier? (It's been a bit since I read the paper, so sorry if I'm missing something here.)
8Fabien Roger
I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).
4Fabien Roger
Yeah, I expect that this kind of thing might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).
4Sam Marks
Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing). (Note that LoRA adapters can be merged into model weights for inference.) (I agree that you could also just use more expressive probes, but I'm interested in this as a baseline for RR, not as a way to improve robustness per se.)
4Fabien Roger
I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.
5Neel Nanda
I found this comment very helpful, and also expected probing to be about as good, thank you!
[-]Fabien RogerΩ397813

I listened to the book Protecting the President by Dan Bongino, to get a sense of how risk management works for US presidential protection - a risk that is high-stakes, where failures are rare, where the main threat comes from an adversary that is relatively hard to model, and where the downsides of more protection and its upsides are very hard to compare.

Some claims the author makes (often implicitly):

  • Large bureaucracies are amazing at creating mission creep: the service was initially in charge of fighting against counterfeit currency, got presidential protection later, and now is in charge of things ranging from securing large events to fighting against Nigerian prince scams.
  • Many of the important choices are made via inertia in large change-averse bureaucracies (e.g. these cops were trained to do boxing, even though they are never actually supposed to fight like that), you shouldn't expect obvious wins to happen;
  • Many of the important variables are not technical, but social - especially in this field where the skills of individual agents matter a lot (e.g. if you have bad policies around salaries and promotions, people don't stay at your service for long, and so you end up
... (read more)
4jsd
Well that was timely
3Tao Lin
yeah learning from distant near misses is important! Feels that way in risky electric unicycling. 
3Neel Nanda
This was interesting, thanks! I really enjoy your short book reviews
2Viliam
Not just near misses. The recent assassination attempt in Slovakia made many people comment: "This is what you get when you fire the competent people in the police, and replace them with politically loyal incompetents." So maybe the future governments will be a bit more careful about the purges in police.
[-]Fabien RogerΩ26682

I listened to the book This Is How They Tell Me the World Ends by Nicole Perlroth, a book about cybersecurity and the zero-day market. It describes in detail the early days of bug discovery, the social dynamics and moral dilemma of bug hunts.

(It was recommended to me by some EA-adjacent guy very worried about cyber, but the title is mostly bait: the tone of the book is alarmist, but there is very little content about potential catastrophes.)

My main takeaways:

  • Vulnerabilities used to be dirt-cheap (~$100) but are still relatively cheap (~$1M even for big zero-days);
  • If you are very good at cyber and extremely smart, you can hide vulnerabilities in 10k-line programs in a way that less smart specialists will have trouble discovering even after days of examination - code generation/analysis is not really defense-favored;
  • Bug bounties are a relatively recent innovation, and it felt very unnatural to tech giants to reward people trying to break their software;
  • A big lever companies have on the US government is the threat that overseas competitors will be favored if the US gov meddles too much with their activities;
  • The main effect of a market being underground is not making transactions hard
... (read more)
5Buck
Do you have concrete examples?
3faul_sname
One example, found by browsing aimlessly through recent high-severity CVEs, is CVE-2023-41056. I chose that one by browsing through recent CVEs for one that sounded bad, and was on a project that has a reputation for having clean, well-written, well-tested code, backed by a serious organization. You can see the diff that fixed the CVE here. I don't think the commit that introduced the vulnerability was intentional... but it totally could have been, and nobody would have caught it despite the Redis project doing pretty much everything right, and there being a ton of eyes on the project. As a note, CVE stands for "Common Vulnerabilities and Exposures". The final number in the CVE identifier (41056 in this case) increments sequentially through the year. This should give you some idea of just how frequently vulnerabilities are discovered. The dirty open secret in the industry is that most vulnerabilities are never discovered, and many of the vulns that are discovered are never publicly disclosed.
2Fabien Roger
I remembered mostly this story: [Taken from this summary of this passage of the book. The book was light on technical detail, I don't remember having listened to more details than that.] I didn't realize this was so early in the story of the NSA, maybe this anecdote teaches us nothing about the current state of the attack/defense balance.
2Fabien Roger
The full passage in this tweet thread (search for "3,000").
4MondSemmel
Which is also something of a problem for popularising AI alignment. Some aspects of AI (in particular AI art) do have their detractors already, but that won't necessarily result in policy that helps vs. x-risk.
2niplav
Same for governments, afaik most still don't have bug bounty programs for their software. Nevermind, a short google shows multiple such programs, although others have been hesitant to adopt them.
1tchauvin
I think the first part of the sentence is true, but "not defense favored" isn't a clear conclusion to me. I think that backdoors work well in closed-source code, but are really hard in open-source widely used code − just look at the amount of effort that went into the recent xz / liblzma backdoor, and the fact that we don't know of any other backdoor in widely used OSS. Note this doesn't apply to all types of underground markets: the ones that regularly get shut down (like darknet drug markets) do have a big issue with trust. This is correct. As a matter of personal policy, I assume that everything I write down somewhere will get leaked at some point (with a few exceptions, like − hopefully − disappearing signal messages).
1quetzal_rainbow
The reason why the xz backdoor was discovered is increased latency, which is a textbook side channel. If the attacker had more points in the security-mindset skill tree, it wouldn't have happened.
[-]Fabien RogerΩ23614

I just finished listening to The Hacker and the State by Ben Buchanan, a book about cyberattacks and the surrounding geopolitics. It's a great book to start learning about the big state-related cyberattacks of the last two decades. Some big attacks/leaks he describes in detail:

  • Wire-tapping/passive listening efforts from the NSA, the "Five Eyes", and other countries
  • The multi-layer backdoors the NSA implanted and used to get around encryption, and that other attackers eventually also used (the insecure "secure random number" trick + some stuff on top of that)
  • The shadow brokers (that's a *huge* leak that went completely under my radar at the time)
  • Russia's attacks on Ukraine's infrastructure
  • Attacks on the private sector for political reasons
  • Stuxnet
  • The North Korea attack on Sony when they released a documentary criticizing their leader, and misc North Korean cybercrime (e.g. Wannacry, some bank robberies, ...)
  • The leak of Hillary's emails and Russian interference in US politics
  • (and more)

Main takeaways (I'm not sure how much I buy these, I just read one book):

  • Don't mess with states too much, and don't think anything is secret - even if you're the NSA
  • The US has a "nobody but us" strateg
... (read more)
2Neel Nanda
Thanks! I read and enjoyed the book based on this recommendation
[-]Fabien RogerΩ355812

[Edit: The authors released code and probing experiments. Some of the empirical predictions I made here resolved, and I was mostly wrong. See here for my takes and additional experiments.]

I have a few concerns about Improving Alignment and Robustness with Circuit Breakers, a paper that claims to have found a method which achieves very high levels of adversarial robustness in LLMs.

I think hype should wait for people investigating the technique (which will be easier once code and datasets are open-sourced), and running comparisons with simpler baselines (like probing). In particular, I think that:

  1. Circuit breakers won’t prove significantly more robust than regular probing in a fair comparison.[1]
  2. Once the code or models are released, people will easily find reliable jailbreaks.

Here are some concrete predictions:

  • p=0.7 that probing done well, using the same training and evaluation data, can beat or match circuit breakers.[2] [Edit: resolved to False]
  • p=0.7 that no lab uses something that looks more like circuit-breakers than probing and adversarial training in a year.
  • p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months. [Edit: th
... (read more)
[-]Dan HΩ10261

Got a massive simplification of the main technique within days of being released

The loss is cleaner, IDK about "massively," because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn't affect performance and doesn't markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.

p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.

This puzzles me and maybe we just have a different sense of what progress in adversarial robustness looks like. 20% that no one could find a jailbreak within 3 months? That would be the most amazing advance in robustness ever if that were true and should be a big update on jailbreak robustness tractability. If it takes the community more than a day that's a tremendous advance.

people will easily find reliable jailbreaks

This is a little nonspecific (does easily mean >0% ASR with an automated attack, or does it mean a high ASR?). I should say we manually found a jailbreak after messing with the model for around a week after releasing. We ... (read more)

2[comment deleted]

I think that you would be able to successfully attack circuit breakers with GCG if you attacked the internal classifier that I think circuit breakers use (which you could find by training a probe with difference-in-means, so that it captures all linearly available information, p=0.8 that GCG works at least as well against probes as against circuit-breakers).
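A minimal sketch of the difference-in-means probe mentioned above, on toy activations (the shapes and the planted "harmfulness" shift are made up for illustration):

```python
import numpy as np

def diff_in_means_probe(harmful_acts, harmless_acts):
    """Probe along the difference of class means; this direction captures
    the linearly available class-mean signal in the activations."""
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    direction /= np.linalg.norm(direction)
    midpoint = (harmful_acts.mean(axis=0) + harmless_acts.mean(axis=0)) / 2
    return lambda acts: (acts - midpoint) @ direction  # >0 means "harmful"

rng = np.random.default_rng(0)
harmless = rng.normal(size=(500, 64))
harmful = rng.normal(size=(500, 64)) + 0.3  # toy shift standing in for the harmfulness signal
probe = diff_in_means_probe(harmful, harmless)
acc = ((probe(harmful) > 0).mean() + (probe(harmless) < 0).mean()) / 2
print("probe accuracy:", acc)
```

A probe like this gives a differentiable scalar score, which is exactly what an optimizer like GCG needs as an attack target.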

Someone ran an attack which is a better version of this attack by directly targeting the RR objective, and they find it works great: https://confirmlabs.org/posts/circuit_breaking.html#attack-success-internal-activations 

4Neel Nanda
I think it was an interesting paper, but this analysis and these predictions all seem extremely on point to me

In a few months, I will be leaving Redwood Research (where I am working as a researcher) and I will be joining one of Anthropic’s safety teams.

I think that, over the past year, Redwood has done some of the best AGI safety research and I expect it will continue doing so when I am gone.

At Anthropic, I will help Ethan Perez’s team pursue research directions that in part stemmed from research done at Redwood. I have already talked with Ethan on many occasions, and I’m excited about the safety research I’m going to be doing there. Note that I don’t endorse everything Anthropic does; the main reason I am joining is that I might do better and/or higher-impact research there.

I did almost all my research at Redwood and under the guidance of the brilliant people working there, so I don’t know yet how happy I will be about my impact working in another research environment, with other research styles, perspectives, and opportunities - that’s something I will learn while working there. I will reconsider whether to stay at Anthropic, return to Redwood, or go elsewhere in February/March next year (Manifold market here), and I will release an internal and an external write-up of my views.

Alas, seems like a mistake. My advice is at least to somehow divest away from the Anthropic equity, which I expect will have a large effect on your cognition one way or another.

7TsviBT
I vaguely second this. My (intuitive, sketchy) sense is that Fabien has the ready capacity to be high integrity. (And I don't necessarily mind kinda mixing expectation with exhortation about that.) A further exhortation for Fabien: insofar as it feels appropriate, keep your eyes open, looking at both yourself and others, for "large effects on your cognition one way or another"--"black box" (https://en.wikipedia.org/wiki/Flight_recorder) info about such contexts is helpful for the world!
9the gears to ascension
Words are not endorsement, contributing actions are. I suspect what you're doing could be on net very positive; Please don't assume your coworkers are sanely trying to make ai have a good outcome unless you can personally push them towards it. If things are healthy, they will already be expecting this attitude and welcome it greatly. Please assume aligning claude to anthropic is insufficient, anthropic must also be aligned, and as a corporation, is by default not going to be. Be kind, but don't trust people to resist incentives unless you can do it and pull them towards doing so.
9Akash
Congrats on the new role! I appreciate you sharing this here. If you're able to share more, I'd be curious to learn more about your uncertainties about the transition. Based on your current understanding, what are the main benefits you're hoping to get at Anthropic? In February/March, what are the key areas you'll be reflecting on when you decide whether to stay at Anthropic or come back to Redwood? Obviously, your February/March write-up will not necessarily conform to these "pre-registered" considerations. But nonetheless, I think pre-registering some considerations or uncertainties in advance could be a useful exercise (and I would certainly find it interesting!)
3Fabien Roger
The main consideration is whether I will have better and/or higher impact safety research there (at Anthropic I will have a different research environment, with other research styles, perspectives, and opportunities, which I may find better). I will also consider indirect impact (e.g. I might be indirectly helping Anthropic instead of another organization gain influence, unclear sign) and personal (non-financial) stuff. I'm not very comfortable sharing more at the moment, but I have a big Google doc that I have shared with some people I trust.
3Akash
Makes sense— I think the thing I’m trying to point at is “what do you think better safety research actually looks like?” I suspect there’s some risk that, absent some sort of pre-registration, your definition of “good safety research” ends up gradually drifting to be more compatible with the kind of research Anthropic does. Of course, not all of this will be a bad thing— hopefully you will genuinely learn some new things that change your opinion of what “good research” is. But the nice thing about pre-registration is that you can be more confident that belief changes are stemming from a deliberate or at least self-aware process, as opposed to some sort of “maybe I thought this all along//i didn’t really know what i believed before I joined” vibe. (and perhaps this is sufficiently covered in your doc)
[-]Fabien RogerΩ22498

I listened to The Failure of Risk Management by Douglas Hubbard, a book that vigorously criticizes qualitative risk management approaches (like the use of risk matrices), and praises a rationalist-friendly quantitative approach. Here are 4 takeaways from that book:

  • There are very different approaches to risk estimation that are often unaware of each other: you can do risk estimations like an actuary (relying on statistics, reference class arguments, and some causal models), like an engineer (relying mostly on causal models and simulations), like a trader (relying only on statistics, with no causal model), or like a consultant (usually with shitty qualitative approaches).
  • The state of risk estimation for insurance is actually pretty good: it's quantitative, and there are strong professional norms around different kinds of malpractice. When actuaries tank a company because they ignored tail outcomes, they are at risk of losing their license.
  • The state of risk estimation in consulting and management is quite bad: most risk management is done with qualitative methods which have no positive evidence of working better than just relying on intuition alone, and qualitative approaches (like r
... (read more)
3Fabien Roger
I also listened to How to Measure Anything in Cybersecurity Risk 2nd Edition by the same author. It had a huge amount of overlap with The Failure of Risk Management (and the non-overlapping parts were quite dry), but I still learned a few things: * Executives of big companies now care a lot about cybersecurity (e.g. citing it as one of the main threats they have to face), which wasn't true in ~2010. * Evaluation of cybersecurity risk is not at all synonymous with red teaming. This book is entirely about risk assessment in cyber and doesn't speak about red teaming at all. Rather, it focuses on reference class forecasting, comparison with other incidents in the industry, trying to estimate the damages if there is a breach, ... It only captures information from red teaming indirectly via expert interviews. I'd like to find a good resource that explains how red teaming (including intrusion tests, bug bounties, ...) can fit into a quantitative risk assessment.
2romeostevensit
Is there a short summary on the rejecting Knightian uncertainty bit?
4Fabien Roger
By Knightian uncertainty, I mean "the lack of any quantifiable knowledge about some possible occurrence", i.e. you can't put a probability on it (Wikipedia). The TL;DR is that Knightian uncertainty is not a useful concept for making decisions, while the use of subjective probabilities is: if you are calibrated (which you can be trained to become), then you will be better off taking different decisions on p=1% "Knightian uncertain events" and p=10% "Knightian uncertain events".  For a more in-depth defense of this position in the context of long-term predictions, where it's harder to know if calibration training obviously works, see the latest Scott Alexander post.
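As a toy illustration of why a calibrated p=1% vs p=10% leads to different decisions, while refusing to assign any probability gives you nothing to act on (the cost and loss numbers here are made up):

```python
# Toy decision rule: pay `cost` to avoid a loss `loss` that occurs with
# probability p. With a calibrated p, the expected-value comparison settles
# the decision; "Knightian" refusal to assign p gives no way to make the call.
def should_insure(p: float, cost: float, loss: float) -> bool:
    return p * loss > cost  # insure iff the expected loss exceeds the cost

cost, loss = 5.0, 100.0
print(should_insure(0.01, cost, loss))  # expected loss 1.0 < 5.0, so skip it
print(should_insure(0.10, cost, loss))  # expected loss 10.0 > 5.0, so insure
```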
-3lemonhope
If you want to get the show-off nerds really on board, then you could make a poast about the expected value of multiplying several distributions (maybe normal distr or pareto distr). Most people get this wrong! I still don't know how to do it right lol. After I read it I can dunk on my friends and thereby spread the word.
3Fabien Roger
For the product of random variables, there are closed-form solutions for some common distributions, but I guess Monte Carlo simulations are all you need in practice (plus, with Monte Carlo you always get the whole distribution, not just the expected value).
1lemonhope
Quick convenient monte carlo sim UI seems tractable & neglected & impactful. Like you could reply to a tweet with "hello you are talking about an X=A*B*C thing here. Here's a histogram of X for your implied distributions of A,B,C" or whatever.
8Matt Goldenberg
Both causal.app and getguesstimate.com have pretty good monte carlo uis
2lemonhope
Oh sweet
[-]Fabien RogerΩ20362

I recently listened to the book Chip War by Chris Miller. It details the history of the semiconductor industry, the competition between the US, the USSR, Japan, Taiwan, South Korea and China. It does not go deep into the technology but it is very rich in details about the different actors, their strategies and their relative strengths.

I found this book interesting not only because I care about chips, but also because the competition around chips is not the worst analogy for what the competition around LLMs could become in a few years. (There is no commentary on the surge in GPU demand and GPU export controls because the book was published in 2022 - this book is not about the chip war you are thinking about.)

Some things I learned:

  • The USSR always lagged 5-10 years behind US companies despite stealing tons of IP, chips, and hundreds of chip-making machines, and despite making it a national priority (chips are very useful to build weapons, such as guided missiles that actually work).
    • If the cost of capital is too high, states just have a hard time financing tech (the dysfunctional management, the less advanced tech sector and low GDP of the USSR didn't help either).
    • If AI takeoff is relatively
... (read more)
8trevor
Afaik the 1990-2008 period featured government and military elites worldwide struggling to pivot to a post-Cold War era, which was extremely OOD for many leading institutions of statecraft (which for centuries had been constructed around the conflicts of the European wars, then the world wars, then the Cold War). During the 90's and 2000's, lots of writing and thinking was done about ways the world's militaries and intelligence agencies, fundamentally low-trust adversarial orgs, could continue to exist without intent to bump each other off. Counter-terrorism was possibly one thing that was settled on, but it's pretty well established that global trade ties were deliberately used as a peacebuilding tactic, notably to stabilize the US-China relationship (this started to fall apart after the 2008 recession brought anticipation of American economic/institutional decline scenarios to the forefront of geopolitics). The thinking of the period might not be very impressive to us, but foreign policy people mostly aren't intellectuals and for generations had been selected based on office politics where the office revolved around defeating the adversary, so for many of them it felt like a really big shift in perspective and self-image, sort of like a Renaissance. Then US-Russia-China conflict swung right back and got people thinking about peacebuilding as a ploy to gain advantage, rather than sane civilizational development. The rejection of e.g. US-China economic integration policies had to be aggressive because many elites (and people who care about economic growth) tend to support globalization, whereas many government and especially Natsec elites remember that period as naive.
3M. Y. Zuo
Why is this a relevant analogy to ‘competition around LLMs’?
4Fabien Roger
The LLM competition is still a competition between small players with small revenues and little national significance, but it's growing. I think it's plausible that in a few years the competition around LLMs will reach the same kind of significance that the chip industry has (or bigger), with hundreds of billions in capital investment and sales per year, massive involvement of state actors, interest from militaries, etc. and may also go through similar dynamics (e.g. leading labs exploiting monopolistic positions without investing in the right things, massive spy campaigns, corporate deals to share technology, ...). The LLM industry is still a bunch of small players with grand ambitions, and looking at an industry that went from "a bunch of small players with grand ambitions" to "0.5%-1% of world GDP (and a key element of the tech industry)" in a few decades can help inform intuitions about geopolitics and market dynamics (though there are a bunch of differences that mean it won't be the same).
1M. Y. Zuo
Why is this growth ‘in a few year’ plausible? I still don’t see how this is a likely outcome.
2Fabien Roger
My bad, I should have said "a decade or two", which I think is more plausible. I agree that the combination of "a few years" and a slow enough takeoff that things aren't completely out of distribution is very unlikely.
[-]Fabien RogerΩ24335

Tiny review of The Knowledge Machine (a book I listened to recently)

  • The core idea of the book is that science makes progress by forbidding non-empirical evaluation of hypotheses from publications, focusing on predictions and careful measurements while excluding philosophical interpretations (like Newton's "I have not as yet been able to deduce from phenomena the reason for these properties of gravity, and I do not feign hypotheses. […] It is enough that gravity really exists and acts according to the laws that we have set forth.").
  • The author basically argues that humans are bad at philosophical reasoning and get stuck in endless arguments, and so to make progress you have to ban it (from the main publications) and make it mandatory to make actual measurements (/math) - even when it seems irrational to exclude good (but not empirical) arguments.
    • It's weird that the author doesn't say explicitly "humans are bad at philosophical reasoning" while this feels to me like the essential takeaway.
    • I'm unsure to what extent this is true, but it's an interesting claim.
  • The author doesn't deny the importance of coming up with good hypotheses, and the role of philosophical reasoning for this part o
... (read more)
3kave
Did Einstein's theory seem crazy to people at the time?
6habryka
IIRC Einstein's theory had a pretty immediate impact on a lot of top physicists upon publication, even before more empirical evidence came in. Wikipedia on the history of relativity says:  Overall I don't think Einstein's theories seemed particularly crazy. I think they seemed quite good almost immediately after publication, without the need for additional experiments.
3Fabien Roger
Thanks for the fact check. I was trying to convey the vibe the book gave me, but I think this specific example was not in the book, my bad!
2habryka
Thanks! And makes sense, you did convey the vibe. And good to know it isn't in the book. 
[-]Fabien RogerΩ10250

I listened to the lecture series Assessing America’s National Security Threats by H. R. McMaster, a 3-star general who was the US national security advisor in 2017. It didn't have much content about how to assess threats, but I found it useful to get a peek into the mindset of someone in the national security establishment.

Some highlights:

  • Even in the US, it sometimes happens that the strategic analysis is motivated by the views of the leader. For example, McMaster describes how Lyndon Johnson did not retreat from Vietnam early enough, in part because criticism of the war within the government was discouraged.
    • I had heard similar things for much more authoritarian regimes, but this is the first time I heard about something like that happening in a democracy.
    • The fix he suggests: always present at least two credible options (and maybe multiple reads on the situation) to the leader.
  • He claims that there wouldn't have been an invasion of South Korea in 1950 if the US hadn't withdrawn troops from there (despite intelligence reports suggesting this was a likely outcome of withdrawing troops). If it's actually the case that intelligence was somewhat confident in its analysis of the situation
... (read more)
2Akash
Thanks! In general, I like these bite-sized summaries of various things you're reading. Seems like a win for the commons, and I hope more folks engaging with governance/policy stuff do things like this.

I recently listened to The Righteous Mind. It was surprising to me that many people seem to intrinsically care about many things that look very much like good instrumental norms to me (in particular loyalty, respect for authority, and purity).

The author does not make claims about what the reflective equilibrium will be, nor does he explain how the liberals stopped considering loyalty, respect, and purity as intrinsically good (beyond "some famous thinkers are autistic and didn't realize the richness of the moral life of other people"), but his work made me doubt that most people will have well-being-focused CEV.

The book was also an interesting jumping point for reflection about group selection. The author doesn't make the sorts of arguments that would show that group selection happens in practice (and many of his arguments seem to show a lack of understanding of what opponents of group selection think - bees and cells cooperating is not evidence for group selection at all), but after thinking about it more, I now have more sympathy for group-selection having some role in shaping human societies, given that (1) many human groups died, and very few spread (so one lucky or unlucky gen... (read more)

[-]Fabien RogerΩ13194

I listened to the book Hacking the Hacker by Roger Grimes. It helped me think of other ways in which regular cybersecurity naturally transfers/can be adapted to help prevent AI takeovers:

  • Honeypots are a well-studied topic in cybersecurity. If you have infiltrated a network, you must be careful with the data you collect and the vulnerabilities you exploit because some of them may be traps! This kind of exercise feels close to the kind of attitude a misaligned AI has to have when taking subtle bad actions. The book is light on details about how honeypots are built and how they fail, and this is a topic I want to learn more about.
  • Humans are weak in somewhat LLM-like ways: you can't trust them with passwords, they will break your rules if that feels appropriate to them, ... The kind of things you might use to make trusted monitoring work even in worlds where adversarial robustness isn't solved might look a lot like the cursed stack of incentives and warnings people use in cyber when they want to make sure humans don't mindlessly follow the instructions of an attacker. (Though LLMs are also different/weaker than humans in a bunch of ways.)

(The other classic ways in which cyber transfers... (read more)

[-]Fabien RogerΩ13172

Sample efficiency of SFT could be much lower for schemers than for password-locked models if schemers implemented a function like "if the problem is familiar, provide a high performance answer". This could work because eliciting high performance with SFT from very few samples requires fine-tuning for many epochs (e.g. 200), such that the model could learn that a problem statement "feels familiar" before it gets fully unlocked.

LLMs may already be able to recognize if certain strings feel familiar (e.g. Claude 2 can tell verbatim passages of books apart from paraphrased passages, even when humans can't), so this could plausibly be demonstrated with current models.

I tried to use this idea to build a stronger password-locked model. I trained new Math password-locked models on a dataset made out of:

  • ~7k problem statements + pwd --> stronger answer
  • ~7k problem statements --> weak answer
  • {fixed set of n problems --> strong answer} repeated m times

The intuition is that SGD might struggle to implement "if it is one of these problems, output a high performance answer, but stay weak on the other one" in regular password-locked models, while it might be much easier for it to expand the s... (read more)

"List sorting does not play well with few-shot" mostly doesn't replicate with davinci-002.

When using length-10 lists (it crushes length-5 no matter the prompt), I get:

  • 32-shot, no fancy prompt: ~25%
  • 0-shot, fancy python prompt: ~60% 
  • 0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.
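To make the comparison concrete, here is an illustrative sketch (not the linked code) of what the two prompt styles might look like. The exact formatting of the original prompts is an assumption; the point is that the "fancy" prompt frames sorting as a Python expression without adding any information a human would need.

```python
import random

def make_list(n=10, lo=0, hi=100):
    """A random length-n list of distinct integers, like the length-10 lists tested."""
    return random.sample(range(lo, hi), n)

def fewshot_prompt(example_lists, query):
    """k-shot prompt: solved sorting examples followed by the query list."""
    blocks = [f"Input: {xs}\nOutput: {sorted(xs)}" for xs in example_lists]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

def fancy_python_prompt(query):
    """'Fancy' prompt: poses the task as a Python REPL line, adding no
    task-relevant information a human would use to sort the list."""
    return f">>> sorted({query})\n"
```

Accuracy under each style can then be measured by completing these prompts with the model and checking the output against `sorted(query)`.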

I'm interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I'm looking for counterexamples to the following conjecture: "fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting" (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).