Some thoughts about the politics of AI safety, copied over (with slight modifications) from my recent Twitter thread:

Risks that seem speculative today will become common sense as AI advances. The pros and cons of different safety strategies will also become much clearer over time. So our main job now is to empower future common-sense decision-making. Understanding model cognition and behavior is crucial for making good decisions. But equally important is ensuring that key institutions are able to actually process that knowledge.

Institutions can lock in arbitrarily crazy beliefs via preference falsification. When someone contradicts the party line, even people who agree face pressure to condemn them. We saw this with the Democrats hiding evidence of Biden’s mental decline. It’s also a key reason why dictators can retain power even after almost nobody truly supports them.

I worry that DC has already locked in an anti-China stance, which could persist even if most individuals change their minds. We’re also trending towards Dems and Republicans polarizing on the safety/accelerationism axis. This polarization is hard to fight directly. But there will be an increasing number of “holy shit” moments that serve as Schelling points to break existing consensus. It will be very high-leverage to have common-sense bipartisan frameworks and proposals ready for those moments.

Perhaps the most crucial desideratum for these proposals is that they’re robust to the inevitable scramble for power that will follow those “holy shit” moments. I don’t know how to achieve that, but one important factor is: will AI tools and assistants help or hurt? E.g. truth-motivated AI could help break preference falsification. But conversely, centralized control of AIs used in govts could make it easier to maintain a single narrative.

This problem of “governance with AI” (as opposed to governance *of* AI) seems very important! Designing principles for integrating AI into human governments feels analogous in historical scope to writing the US constitution. One bottleneck in making progress on that: few insiders disclose how NatSec decisions are really made (though Daniel Ellsberg’s books are a notable exception). So I expect that understanding this better will be a big focus of mine going forward.

The basic partial counterpoint, as I see it, builds off of what Wei Dai has been writing about for a long time (see also 1, 2, etc.), which is most crisply summarized in his comment on Christiano's "What failure looks like" post:

Here are some additional scenarios that don't fit into this story or aren't made very salient by it.

  1. AI-powered memetic warfare makes all humans effectively insane.

I'm not sure how much I buy the specific details of trevor's theorizing about Clown Attacks, given that he routinely makes strong (and a priori unlikely) factual conspiracy claims without supplying any empirical evidence for them, but large parts of lc's comment on his most popular post ring importantly true to me:

The vast majority of people alive today are the effective mental subjects of some religion, political party, national identity, or combination of the three, no magical backdoor access necessary; the confirmed tools and techniques are sufficient to ruin lives or convince people to do things completely counter to their own interests. And there are intermediate stages of effectiveness that political lobbying can ratchet up along, between the ones they're at now and total control.

[...]

The premise of the above post is not that AI companies are going to try to weaponize "human thought steering" against AI safety. The premise of the above post is that AI companies are going to develop technology that can be used to manipulate people's affinities and politics, Intel agencies will pilfer it or ask for it, and then it's going to be weaponized, to a degree of much greater effectiveness than they have been able to afford historically. I'm ambivalent about the included story in particular being carried out, but if you care about anything (such as AI safety), it's probably necessary that you keep your utility function intact.

I already live in a world that feels to me qualitatively more insane and inadequate than it was even a mere 15 years ago. There could be selection effects, of course: a more interconnected world with a greater and faster distribution of information allows a higher percentage of misdeeds and bad events to be reported on than in the past, even if the "real" quantity has remained unchanged. But even correcting for those, I attribute a substantial part (in fact, a majority) of the negative change to technological improvements that have allowed people to be more connected with one another and to consume more and more maladaptive memes generated by misaligned processes (through social media and software and the Internet more broadly).

I find it rather unlikely that the continued rise of LLMs will reverse this trend; instead, I expect it to only become more and more amplified and accelerated, as sheer lunacy expands to cover more and more public discourse (as an illustration, consider the familiar example of e-acc's, a phenomenon that would have been mostly inconceivable even just a decade ago). So when you say the following:

So our main job now is to empower future common-sense decision-making.

And:

[these proposals] are robust to the inevitable scramble for power that will follow those “holy shit” moments

I realize that I am not at all optimistic about the continued prevalence and stability of "common-sense" as time goes by, particularly in the context of politicized discourse in which reasonable equilibria about decision-making were already getting more and more fragile even before LLMs came around the corner, ready to light the powder keg on fire...

I basically agree with this comment, and the basic reason I am much more pessimistic on sane AI governance than a lot of LWers is precisely because I expect LLMs to be more persuasive than humans, and there's very strong evidence for it.

Here's a pre-registered RCT on this topic. I find the sample size a little low (I'd like it to be more in the realm of 1000-2000 randomly selected people), but this is the only type of study that can establish that the effects are causal without relying much on your priors. So the fact that it shows large persuasion effects from LLMs is really strong evidence for the belief that AI systems are better than humans at persuading people when given access to personal data and interaction.

https://arxiv.org/abs/2403.14380

More generally, it provides evidence for Sam Altman's thesis that super-persuasive AI will come long before AI that's good in every other field.