Would you mind if I rewrote this in a less "manic" tenor, keeping the content and mood largely the same, and reposted? I like this essay and think the core of what you're suggesting is reasonable, for reasons both stated and unstated, but I would like to try to say it differently, in a way that I think will be taken better.
I've been hoping for years that someone else could do this instead of me; I did this research to donate it, and if I'm the wrong person to communicate it (e.g. I myself am noise/ambient-clowning the domain) then that's on me and I'd be grateful for that to be fixed.
This post is fun but I think it's worth pointing out that basically nothing in it is true.
-"Clown attacks" are not a common or particularly effective form of persuasion
-They are certainly not a zero day exploit; having a low-status person say X because you don't want people to believe X has been available to humans for our entire evolutionary history
-Zero day exploits in general are not a thing you have to worry about; it isn't an analogy that applies to humans because we're far more robust than software. A zero day exploit on an operating system can give you total control of it; a 'zero day exploit' like junk food can make you consume 5% more calories per day than you otherwise would.
-AI companies have not devoted significant effort to human thought steering, unless you mean "try to drive engagement on a social media website"; they are too busy working on AI.
-AI companies are not going to try to weaponize "human thought steering" against AI safety
-Reading the sequences wouldn't protect you from mind control if it did exist
-Attempts at manipulation certainly do exist but it will mostly be mass manipulation aimed at driving engagement and selling you things based off of your browser history, rather than a nefarious actor targeting AI safety in particular
Zero day exploits in general are not a thing you have to worry about; it isn't an analogy that applies to humans because we're far more robust than software. A zero day exploit on an operating system can give you total control of it; a 'zero day exploit' like junk food can make you consume 5% more calories per day than you otherwise do.
The "just five percent more calories" example reveals nicely how meaningless this heuristic is. The vast majority of people alive today are the effective mental subjects of some religion, political party, national identity, or combination of the three, no magical backdoor access necessary; the confirmed tools and techniques are sufficient to ruin lives or convince people to do things completely counter to their own interests. And there are intermediate stages of effectiveness that political lobbying can ratchet up along, between the ones they're at now and total control.
- AI companies have not devoted significant effort to human thought steering, unless you mean "try to drive engagement on a social media website"; they are too busy working on AI.
- AI companies are not going to try to weaponize "human thought steering" against AI safety
The prem...
On clown attacks, it's notable that activist egregores conduct them autonomically, by simply dunking on whoever they find easiest to dunk on; the dveshi hoists the worst representatives of ideas ahead of the good ones.
I absolutely agree! Part of what makes clown attacks so powerful is the plausible deniability; most clowns are not attacks. As a result, attackers have plenty of degrees of freedom to try things until something works, so much so that they can even automate that process with multi-armed bandit algorithms, because there's basically no risk of getting caught.
Scrolling down this almost stream-of-consciousness post against my better judgement, unable to look away, perfectly mimicked scrolling social media. I am sure you did not intend it, but I really liked that aspect.
Loads of good ideas in here; generally I think modelling the alphabet agencies is much more important than implied by discussion on LW. Clown attack is a great term, although I'm not entirely sure how much the personal-prevention layer of things really helps the AI safety community, because the nature of clown attacks seems like a blunt tool you can apply to the public at large to discredit groups. So, primarily the vulnerability of the public to these clown attacks is what matters, and that is much harder to change.
Two decades ago the Chaos Computer Club declared "We lost the War" and articulated the intellectual framework on which Wikileaks was built.
Today, many more people come to the yearly Chaos Computer Congress but on the other hand, a lot less political action is emerging from the club. While it's not clear to what extent the CIA is responsible for that, it would make sense from their perspective to act here.
While their campaign was about Julian Assange being seen as a clown, they didn't do the same thing to the Chaos Computer Club.
A few other people were targeted as well, but generally people who were seen as having done something explicitly blameworthy. For anyone who's interested in how this works in practice, Andy Müller-Maguhn's talk about what the CIA did against him is very worth watching.
There are lots of ways to simply distract someone from engaging in a powerful way with politics that are not about overt destruction.
If people associate having to think about AI safety with needing to be very paranoid that might in itself discourage people from thinking.
Although clown attacks may seem mundane on their own, they are a case study proving that powerful human thought steering technologies have probably already been invented, deployed, and tested at scale by AI companies, and are reasonably likely to end up being weaponized against the entire AI safety community at some point in the next 10 years.
I agree that clown attacks seem to be possible. I accept a reasonably high probability (~70%) that someone has already done this deliberately - the wilful denigration of the Covid lab leak seems like a good candidate, as you describe. But I don't see evidence that deliberate clown attacks are widespread. And specifically, I don't see evidence that these are being used by AI companies. (I suspect that most current uses are by governments.)
I think it's fair to warn against the risk that clown attacks might be used against the AI-not-kill-everyone community, and that this might have already happened, but you need a lot more evidence before asserting that it has already happened. If anything, the opposite has occurred, as the CEOs of all major AI companies signed onto the declaration stating that AGI is a potential existential risk. I don't have quantitative proof, but from reading a wide range of media across the last couple of years, I get the impression that the media and general public are increasingly persuaded that AGI is a real risk, and are mostly no longer deriding the AGI-concerned as being low-status crazy sci-fi people.
This is probably one of the most important articles in the modern era. Unbelievable how little engagement it's gotten.
Not a direct response (and I wouldn't want it to be read that way), but there is some old Gwern research on this topic: https://www.lesswrong.com/posts/TiG8cLkBRW4QgsfrR/notes-on-brainwashing-and-cults
...> “Brainwashing”, as popularly understood, does not exist or is of almost zero effectiveness. The belief stems from American panic over Communism post-Korean War combined with fear of new religions and sensationalized incidents; in practice, “cults” have retention rates in the single percentage point range and ceased to be an issue decades ago. Typic
The AI safety leaders currently see slow takeoff as humans gaining capabilities, and this is true; and also already happening, depending on your definition. But they are missing the mathematically provable fact that information processing capabilities of AI are heavily stacked towards a novel paradigm of powerful psychology research, which by default is dramatically widening the attack surface of the human mind.
I assume you do not have a mathematical proof of that, or you'd have mentioned it. What makes you think it is mathematically provable?
I would be ve...
(I think this'd have benefited from explaining what a clown attack is much earlier in the post)
Clown attacks include having right-wing clowns be the main people who are seen talking about the Snowden revelations
I strongly disbelieve that discussion of Snowden's findings has been limited to mostly fringe right-wing platforms. He has given widely watched interviews with John Oliver, MSNBC, and Vice, and has been covered favorably by news outlets such as the Guardian, the New York Times, and NPR.
You sometimes hear an argument like this in conspiracy theory groups. It goes something like this:
"My own pet conspiracy theory is sensible! But all the other conspiracy theories on here, they're completely stupid! Nobody could possibly believe that! In fact, I think they're all undercover agents sent by the government to make conspiracy theorists look stupid. Oh, wait, that's also a conspiracy theory, isn't it? Yes, I believe that one."
This post was difficult to take seriously when I read it but the "clown attack" idea very much stuck with me.
I think most "clown attacks" are performed by genuine clowns, not by competent intelligence agencies.
Does this make them better? Not really.
It's also an attack that's hard to pull off, especially against a plausible sounding idea that has been endorsed by someone high status.
Did we see an attempt at a clown attack against the lab leak hypothesis? Probably. Not a very successful one, but one that kind of worked for a few months. Because intelligence agencies aren't that competent.
In a world in which the replication attempts went the other direction and social priming turned out to be legit, I would probably agree with you. But even in controlled laboratory settings, human behavior can't be reliably "nudged" with subliminal cues. The human brain isn't a predictable computer program for which a hacker can discover "zero days." It's a noisy physical organ that's subject to chaotic dynamics and frequently does things that would be impossible to predict even with an extremely extensive set of behavioral data.
Consider targeted advertisin...
I saw this image shared on Twitter, which sees the same phenomenon (clowns) but takes a pretty opposite position on how it's used.
(I'm not linking to attribution because Twitter feels like a bad game and it's shared in a highly political context.)
Per the recent Nightshade paper, clown attacks would be a form of semantic poisoning on specific memeplexes, where 'memeplex' basically describes the architecture of some neural circuits. Those memeplexes at inference time would produce something designed to propagate themselves (a defence or description of some idea, submeme), and a clown attack would make that propagation less effective at transmitting to, e.g., specific audiences.
Possibly related: I think Yann LeCun is doing an excellent job of alerting people to the potential dangers of AI, by presenting conspicuously bad arguments to the effect of, "don't worry, it'll be fine."
This post is way too long. Forget clown attacks, we desperately need LLMs that can protect us from verbosity attacks.
What do you think - could AI-powered mind hacks be so powerful that they would themselves be an x-risk? For example, AI-generated messages that dissolve a person's value system and core beliefs, or even install AI on wetware?
Also, effective wireheading via AI-powered games etc. is also a form of mind-hack.
This can be used by intelligence agencies and governments to completely deny access to specific lines of thought. ... Clown attacks include having... degrowthers being the main people talking about the possibility that technological advancement/human capabilities will end its net-positive trend.
Does this mean you think intelligence agencies and/or governments are deliberately promoting the degrowth movement in order to discredit the idea of AGI x-risk?
If so, why do you think they are doing that?
And how do you think they are doing that? (For example, is the CIA secretly funneling dark money to organizations that promote degrowth?)
Strongly agree. To my utter bewilderment, Eliezer appears to be exacerbating this vulnerability by making no efforts whatsoever to appear credible to the casual person.
In nearly all of his public showings in the last 2 years, he has:
As a result, to the layperson, he comes off as an egotistical, pessimistic nerd with fringe views - a perfect clown from which to retreat to a "mid...
Epistemic status: High confidence (~>70%) that clown attacks are prevalent, and deliberately weaponized by governments and/or intelligence agencies in particular. Very high confidence (~>90%) that the human brain is highly vulnerable to clown attacks, and that a lack of awareness of clown attacks is a security risk, like using the word “password” as your password, except with control of your own mind at stake rather than control over your computer’s operating system and/or your files. This has been steelmanned; the 10 years ago/10 years from now error bars seem appropriately wide.
These concepts are complicated, and I have done my best to make them as easy as possible for most people in AI safety to understand, even people without a quant background (e.g. AI governance).
Clown attacks
The core dynamic of clown attacks is that perception of social status affects what thoughts the human brain is and isn't willing to think. This can be used by intelligence agencies and governments to completely deny access to specific lines of thought. Generally, there's a lot of ways to socially engineer someone's world model by taking a target concept and having the wrong people say it at specific times and in specific ways. Clown attacks include having right-wing clowns be the main people who are seen talking about the Snowden revelations, or degrowthers being the main people talking about the possibility that technological advancement/human capabilities will end its net-positive trend. These are only specific examples of cost-efficient ways to use a specific circumstance (clowns) to change the way that someone feels about a targeted concept.
With clown attacks, the exploit/zero day is the human tendency to associate specific lines of thought with low social status or low-status people, which will consistently inhibit the human brain from pursuing that targeted line of thought.
Although clown attacks may seem mundane on their own, they are a case study proving that powerful human thought steering technologies have probably already been invented, deployed, and tested at scale by AI companies, and are reasonably likely to end up being weaponized against the entire AI safety community at some point in the next 10 years.
AI safety is dropping the ball on clown attacks, at minimum
AI safety is basically a community of nerds who each chanced upon the engineering problem that the fate of this side of the universe revolves around. Many (~300) have decided to focus exclusively on that engineering problem, which seems like a very reasonable thing to do. However, in order for that to be a worthwhile course of action, the AI safety community must continue to exist without being destroyed or coopted by external adversaries or forces. Continued existence, without terminal failure, is an assumption that is currently unquestioned by virtually everyone in the AI safety community. We largely assume that everything will be ok, with AGI being the only turning point. This is a dangerous world model to have.
The history of religion and cryopreservation has informed us that there is an ambient phenomenon of bad consensus around terrible, untrue, and viciously self-destructive beliefs and practices. This ambient phenomenon is at the core of the AI safety situation. So if billions of people are getting something else really wrong too, in addition to ignoring AI safety, then that does not water down the overriding significance and priority of AI safety.
A large proportion of people, to this day, still think that smarter-than-human AI is merely science fiction; this is the kind of thing that happens when >99% of the money spent paying people to think about the future is spent on science fiction writers instead of researchers, which for AI, was true for all of human civilization for all of history until around a decade or two ago.
My argument here is that powerful human manipulation systems are already very easy to build, with 10-year-old technology, and also very easy for powerful people to deny access to people who are less powerful. However, the situation of the moment is that general purpose cognitive hacks like clown attacks can even deny awareness of this technology to targeted people with a surprisingly high success rate, not just access to the technology.
People like Gary Marcus might try to hijack valuable concepts like “p(doom)” for their own pet issue, such as job automation, but “slow takeoff” on the other hand is something that could transform the world in a wide variety of ways that has practical relevance to the continuity and survival of AI alignment efforts.
It’s important to reiterate that a large proportion of people, to this day, still think that smarter-than-human AI is merely science fiction; this is the kind of thing that happens when >99% of the money spent paying people to think about the future is spent on science fiction writers instead of researchers, which for AI, was true for all of human civilization for all of history until around a decade or two ago. In reality, smarter-than-human AI is the finish line for humanity, and being oriented towards that finish line is one of the best ways to do important or valuable things with your time, instead of unintentionally doing unimportant or less valuable things. It is also a good way to orient yourself towards reality (although orienting yourself towards orienting yourself towards reality competes strongly for that #1 slot, as that also tends to result in ending up oriented towards AI as the finish line for humanity, and thus you are oriented towards your own existence by extension as you are a subset of humanity). No amount of mind control technology can displace smarter-than-human AI as the finish line for humanity, but it can be an incredibly helpful gear in our world models for what to expect from the AI industry and from global affairs relevant to the AI race (e.g. US-China relations). Ultimately, however, influence technology is a near-term problem that has a lot of potential to distract a lot of people from the ultimate and inescapable problem of AI alignment; the only reason I’m writing about it here is because I think that the AI safety community will be much better off making stronger predictions and having stronger models of the AI industry and the AI race, as well as the slow takeoff environment that AI alignment researchers might be stuck living in.
The AI safety leaders currently see slow takeoff as humans gaining capabilities, and this is true; and also already happening, depending on your definition. But they are missing the mathematically provable fact that information processing capabilities of AI are heavily stacked towards a novel paradigm of powerful psychology research, which by default is dramatically widening the attack surface of the human mind.
Cognitive warfare is not a new X-risk or S-risk. It is a critical factor; we need to understand it to understand the factors driving AI geopolitics and AI race dynamics. Cognitive warfare is not a competitor to AI safety; it will not latch on and insert itself, and it must not be allowed to take away attention from AI safety.
With AI and massive amounts of human behavioral data, humans are now gaining profound capabilities to manipulate and steer individuals and systems, and the AI and human behavioral data stockpiles have been accumulating for over 10 years.
Here, I’m making the case that the conditions are already ripe for intensely powerful human thought steering and behavior manipulation technology, and have been for perhaps 10 years or more. Thus, the burden of proof should be on the claim that our minds are safe and that the attack surface is small, not on my claim that our minds are at risk and the attack surface is large. I don’t like to invoke this logic here. This logic should be prioritized for AI alignment, the final engineering problem that this side of the universe hinges on, and an engineering problem that could plausibly be as difficult for humans as rocket science is for chimpanzees. But the logic behind security is still fundamental and I think that I have made the case strongly that the AI safety community requires some threshold of resilience to hacking and that this threshold is probably very far from being met. I also think that most of the required solutions are quick and easy fixes, even if it doesn’t seem that way at first.
The existence of clown attacks is proof that there is at least one powerful cognitive attack, detectable and exploitable by intelligence agencies and large social media companies, which exploits a zero day in the human brain, which also works on AI safety-adjacent people until explicitly discovered and patched.
There are many other ways, especially when you combine human ingenuity with massive amounts of user data and multi-armed bandit algorithms. AI is merely a superior form of multi-armed bandit algorithms, and LLMs are just another increment forward, as they can actually read and understand the content of posts, not just measure changes in behavior from different kinds of people caused by specific combinations of posts.
Social media platforms are overwhelmingly capable of doing this; many people even look to social media as a bellwether to see what's popular, even though news feed algorithms have massive and precise control over what kinds of people are shown what things, how frequently specific things are shown at all, and which combinations steer people's thinking and preferences in measurable directions. Social media can even accurately reflect what's actually popular 98% of the time in order to gain that trust, reserving the remaining 2% to actively determine what becomes popular and what doesn't. As compute and algorithmic capabilities advance and trust is consolidated, that 2% share can be pushed steadily higher.
You can’t just brush off clown attacks because you’ll worry that, if you seriously entertain that line of thought, then other people will assume that you’re on the side of the clowns and you will lose status in their eyes. Sufficiently powerful clown attacks can make this a self-fulfilling prophecy by convincing everyone that a specific line of reasoning is low-status, thus making it low status and creating serious real-life consequences for pursuing a specific line of cognition. The social media news feed or other algorithm-controlled environment (e.g. TikTok, Reels) gives the appearance of being a randomly-generated environment, when in reality the platform (and people with backdoor access to the platform’s servers) are highly capable of altering algorithms in order to fabricate an environment making some sentiment appear orders of magnitude more prevalent than it actually is, among a specific demographic of people such as scientists or clowns. Or, even worse, they can run multi-armed bandit algorithms or gradient descent to find environments or combinations of posts that steer people’s thinking in measurable directions. Clown attacks are merely the most powerful technique that a multi-armed bandit algorithm could arrive at that I’m currently aware of; there are probably plenty of other exploits based off of social status alone, a very serious zero day in the human brain.
This strongly indicates the existence of other zero days and powerful, hard-to-spot exploits, including (but not limited to) exploiting the human instinct to pursue social status, and avoid specific lines of thought based on anticipations of social status gain and loss. These zero days and exploits are either discoverable or already discovered by powerful people who must surreptitiously deploy and refine these exploits at scale, as this is mathematically required in order for them to work at all. This is mathematically provable: covert, large-scale deployments are necessary to get the massive sample sizes of human behavior at the scale necessary to vastly outperform academic psychology, with perhaps 1/10th of the competent workforce or less.
If there was any such thing as a magic trick that could hack the minds of every person in a room at once with none of them noticing, like Eliezer Yudkowsky’s post What is the strangest thing an AI could tell you, it would be to totally deny people the ability to think a specific true thought or approach an obviously valuable line of inquiry, due to intense fear of losing social status if they are associated with that line of inquiry. Social status seems to be something that the human brain evolved to prioritize in the ancestral environment, and this trait alone makes our cognition hackable.
Conspiracy theorists are clowns. The JFK assassination may have been a critical factor bridging two separate events that were pivotal in US history: the Cuban Missile Crisis (1962) and the Vietnam War (1964-1975). Understanding the Cold War and the US Government’s history is critical for forming accurate models of the US government in its current form (e.g. knowing that the CIA sometimes hijacks entire regimes and orchestrates coups against the ones they can’t), including where AI safety fits in. The same goes for 9/11. And yet, these pivotal points in history and world modeling attract people and epistemics more like those surrounding Elvis’s death, than the Snowden revelations (hopefully, the Snowden revelations don’t end up getting completely sucked into that ugly pit, although there are already plenty of clowns on social media actively trying).
Understanding the current level of risk of cognitive warfare attacks doesn’t just require the security mindset, it requires the security mindset plus an adequate perspective. It requires a long list of examples of specific exploits, so that you can get an idea of what else might be out there: what things turned out to be easy to discover with current systems, powered by behavioral data from millions of people, and the continuous access required to perform AI-assisted experimentation and psychological research in real time. I hope that clown attacks were a helpful example towards this end.
Plausible deniability is something you should expect in the 2020s, a world where lawyers per capita are higher than ever before. Similarly, office politics are highly prevalent among elites, so the bar is much lower for a person to realize that it is a winning strategy to turn people against each other, via starting rumors and scapegoating, than it is for people to know that something came from you. Plausible deniability and false flag attacks did not begin with cyberattacks; they both became prevalent during the 20th century. This is another reason why clown attacks are so powerful; there is an overwhelming prevalence of ambient clowns in contemporary civilization, so it is incredibly difficult to distinguish a clown attack from noise. This plausible deniability further incentivizes clown attacks due to the incredibly low risk of detection; the expected cost of a clown attack basically comes down to server energy costs, since the net expected cost from being discovered is virtually zero.
Analysts were shocked by the swiftness with which critical information like the lab leak hypothesis and Covid censorship ended up relegated to the bizarre alternate reality of right-wing clowns, the same universe as pizza slave dungeons and first-trimester abortion being murder, even though the probability of a lab leak and the extent of information tampering on Covid were both obviously critical information for anyone trying to form an accurate world model from 2020-22. That obviousness was simply killed. Clown attacks can do things like that. The human mind hinges enough on social status for things like that to work. There is at least one zero day.
Deciding what people see as low-status or villains vs. high-status or heroes-like-you is generating a very powerful dynamic for shaping what people think, e.g. SBF as an atrocious villain that dominates most people’s understanding of EA and AI safety. Clown attacks are just the most powerful cognitive hack that I’m currently aware of, especially if screen refresh rate manipulation never ends up deployed.
A big element of the modern behavior manipulation paradigm is the ability to just try tons of things and see what works; not just brute forcing variations of known strategies to make them more effective, but to brute force novel manipulation strategies in the first place. This completely circumvents the scarcity and the research flaws that caused the replication crisis which still bottlenecks psychology research today. In fact, original psychological research in our civilization is no longer bottlenecked on the need for smart, insightful people who can do hypothesis generation so that the finite studies you can afford to fund each hopefully find something valuable. With the current social media paradigm alone, you can run studies, combinations of news feed posts for example, until you find something useful. Measurability is critical for this.
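To make the "just try tons of things and see what works" mechanic concrete, here is a minimal, self-contained sketch of a multi-armed bandit using Thompson sampling, the standard algorithm for this kind of blind search. Everything here is illustrative and synthetic: the "arms," the reward rates, and the variable names are invented for the example, not anything a real platform exposes.

```python
import random

class ThompsonBandit:
    """Thompson sampling over arms with unknown Bernoulli reward rates."""

    def __init__(self, n_arms):
        # Beta(1, 1) prior over each arm's unknown success rate
        self.successes = [1] * n_arms
        self.failures = [1] * n_arms

    def select_arm(self):
        # Draw one plausible success rate per arm; play the best draw
        draws = [random.betavariate(s, f)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

# Synthetic environment: three candidate "strategies", one of which
# happens to work far more often than the others.
random.seed(0)
true_rates = [0.02, 0.03, 0.30]
bandit = ThompsonBandit(len(true_rates))
pulls = [0] * len(true_rates)
for _ in range(5000):
    arm = bandit.select_arm()
    pulls[arm] += 1
    bandit.update(arm, random.random() < true_rates[arm])

# The algorithm concentrates its trials on the effective variant
# without ever being told why that variant works.
assert pulls[2] == max(pulls)
```

The point of the sketch is the one made in the paragraph above: nothing in the loop requires a hypothesis about *why* an arm works, only a measurable reward signal, which is why measurability is the critical ingredient.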
By comparing people to other people and predicting traits and future behavior, multi-armed bandit algorithms can predict whether a specific manipulation strategy is worth the risk of undertaking at all in the first place; resulting in a high success rate and a low detection rate (as detection would likely yield a highly measurable response, particularly with substantial sensor exposure such as uncovered webcams, due to comparing people’s microexpressions to cases of failed or exposed manipulation strategies, or working webcam video data into foundation models). When you have sample sizes of billions of hours of human behavior data and sensor data, millisecond differences in reactions from different kinds of people (e.g. facial microexpressions, millisecond differences at scrolling past posts covering different concepts, heart rate changes after covering different concepts, eyetracking differences after eyes passing over specific concepts, touchscreen data, etc) transform from being imperceptible noise to becoming the foundation of webs of correlations. Like it or not, unless you use the arrow keys, the rate at which you scroll past each social media post (either with a touchscreen/pad or a mouse wheel) is a curve; scrolling alone is linear algebra, which fits cleanly into modern AI systems, and trillions of those curves are generated every day.
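As a toy illustration of the claim that a scroll gesture is "a curve" that reduces to linear algebra: a per-frame trace of scroll position can be compressed, by an ordinary least-squares polynomial fit, into a handful of coefficients. The trace below is synthetic (a smooth ease-in/ease-out flick); no real sensor data or platform API is involved.

```python
import numpy as np

# Synthetic scroll trace: 60 frames of position over one second,
# shaped like a smoothstep flick (ease in, ease out).
t = np.linspace(0.0, 1.0, 60)
pos = 800 * (3 * t**2 - 2 * t**3)  # = 2400*t^2 - 1600*t^3 pixels

# Least-squares cubic fit: the whole gesture collapses to 4 numbers.
coeffs = np.polyfit(t, pos, deg=3)  # highest-degree coefficient first

# Because the trace is exactly cubic, the fit recovers it.
assert np.allclose(coeffs, [-1600.0, 2400.0, 0.0, 0.0], atol=1e-5)
```

A four-number summary per gesture, times trillions of gestures per day, is exactly the shape of input that modern ML pipelines are built to consume, which is the sense in which "scrolling alone is linear algebra."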
AI is not even needed to run clown attacks, let alone LLMs. Rather, AI is what’s needed in order to *invent* techniques like clown attacks. Automated information processing is all you need to find manipulation techniques that work on humans. That capability probably came online years ago. And it can’t be done unless you have human behavior data from millions of different people all using the same controlled environment, data that they would not give if they knew the risks.
I can’t know what techniques a multi-armed bandit algorithm will discover without running the algorithm itself; which I can’t do, because that much data is only accessible to the type of people who buy servers by the acre, and even for them, the data is monopolized by the big tech companies (Facebook, Amazon, Microsoft, Apple, and Google) and intelligence agencies large and powerful enough to prevent hackers from stealing and poisoning the data (NSA, etc). I also don’t know what multi-armed bandit algorithms will find when people on the team are competent psychologists, spin doctors, or other PR experts interpreting and labeling the human behavior in the data so that the human behavior can become measurable. Human insight from just a handful of psychological experts can be more than enough to train AI to work autonomously; although continuous input from those experts would be needed and plenty of insights, behaviors, and discoveries would fall through the cracks and take an extra 3 years or something to be discovered and labeled.
AI in global affairs
Clown attacks are not advancing in isolation; they are one strand of a broad acceleration in the understanding and exploitation of the human mind, which is itself a byproduct of accelerating AI capabilities research. For example, we are simultaneously entering a new era where intelligence agencies use AI to make polygraph tests actually work. That would be absolutely transformative for geopolitical affairs, which currently revolve around a decision-theory paradigm in which every single employee is a human who is vastly more capable of generating lies than detecting them, and who thus cannot be sorted by statements like “I am 100% loyal” or “I know who all the competent and corrupt people on this team are”.
My understanding of the geopolitical significance of influence technologies is that information warfare victories are currently understood as a major win condition in international conflict, similar to conquest by military force. This understanding has been prevalent among government and military elites for a long time: arguably starting as early as the Vietnam antiwar movement in the US, accelerating with the collapse of the Soviet Union, the fall of the Berlin Wall, and the rug being pulled out from under the Eastern European communist regimes, and reaching consensus among elites around the 2010s after the backlash to the War on Terror dominated the battlefield itself. Among many other places, this consensus is described in Robert Sutter’s books on US-China relations and in Joseph Nye’s book on elite persuasion, Soft Power. Unlike conventional and nuclear wars, information wars can be both fought and won; they strike at the human minds that are the most fundamental building block of government and military institutions, and they have a long, rich history as one of the most important goalposts determining winners and losers in great power conflicts between the US, Russia/USSR, and China. So we should consider information warfare one of the reasons that governments take AI safety seriously; anticipation of information warfare from foreign governments is a core feature of the contemporary American and Chinese regimes and militaries, and this is widely known among analysts.
Pivoting from social media exposure to 1-on-1 and group communication still carries substantial risk of psychological hacking, but minimizing the network’s surface area and exposure to social media will still reduce risk substantially and possibly adequately. (Access to the technology itself would be required to verify that adequacy. The data sets large and secure enough to actually run manipulation research are likely limited to sufficiently large tech companies and intelligence agencies, like Facebook and the NSA, and less accessible to smaller, weaker orgs like the Department of Homeland Security or Twitter/X, which are vulnerable to hacking and data poisoning by larger orgs; although there are hard-to-verify rumors of sophisticated sensor systems deployed by JP Morgan Chase, and reports that many state-linked Chinese companies and institutions have been experimenting with large sample sizes of deployed electroencephalograms.)
The current attack surface for psychological hacks in the AI safety community is excessive and extreme, and even the bare-minimum solutions will receive pushback: phone webcams are difficult to cover up, microphones are difficult to avoid, and social media uses gradient descent to find and exploit posts and combinations of posts that cause habit-forming behavior (e.g. optimizing to minimize quit rates causes the system to find bizarre combinations of posts that hook people in bizarre ways, such as creating a vague sense that life without social media is an ascetic, monk-like existence when in reality it is the default). Quitting major life habits is also difficult by default, so plausible deniability may already be baked in here, yet again.
I’ve included an optional description by Professor Mark Andrejevic, writing in the Routledge handbook of surveillance, as an intuition flood to help more people understand exactly why the world might literally already revolve around this technology. What’s impressive is that Andrejevic was writing in 2014 and gave no indication of knowing anything about AI; the systems he describes were feasible at the time with just data science and large amounts of user data. AI simply makes these systems even more capable of producing results.
Like most uses of multi-armed bandit algorithms to make social media steer people’s thinking in measurable directions, clown attacks do not require competent governments or intelligence agencies in order to be successfully deployed against the minds of millions of people. Everything required for this technology is easy to access except the massive amounts of human behavioral data. It simply requires sufficient access to social media platforms, plus the technical sophistication of one software engineer who understands multi-armed bandit algorithms and one data scientist who can statistically measure human behavior; for people with authority over, or involvement in, extant mass surveillance systems, enough data means that anyone could eyeball the effects. Measurability is still mandatory for multi-armed bandit algorithms to work, since nobody can see directly into the human mind (although there are many peripheral proxies, such as fMRI data, heart rate, blood pressure, verbal statements and tone, body posture, and subtle changes in hand and eye movements after reading certain concepts, and many of these can be constantly measured by a hacked smartphone).
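To illustrate why a measurable reward signal is the only hard requirement, here is a minimal epsilon-greedy bandit, a simpler cousin of the algorithms discussed here. The arm names, reward means, and the toy `pull` environment are all invented for illustration; the reward could stand in for any measurable signal, such as minutes of engagement.

```python
import random

def epsilon_greedy(arms, pull, rounds=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: mostly exploit the best-looking arm,
    occasionally explore a random one. Needs nothing but a reward number."""
    rng = random.Random(seed)
    counts = {a: 0 for a in arms}
    totals = {a: 0.0 for a in arms}
    for _ in range(rounds):
        if rng.random() < epsilon or not any(counts.values()):
            arm = rng.choice(arms)   # explore
        else:                        # exploit current best average
            arm = max(arms, key=lambda a: totals[a] / max(counts[a], 1))
        counts[arm] += 1
        totals[arm] += pull(arm, rng)
    return counts

# Toy environment: arm "B" yields slightly more "engagement" on average.
def pull(arm, rng):
    mean = {"A": 1.0, "B": 1.2, "C": 0.9}[arm]
    return rng.gauss(mean, 0.5)

counts = epsilon_greedy(["A", "B", "C"], pull)
```

After a few thousand pulls, the algorithm concentrates on the arm with the best measured payoff, with no model whatsoever of *why* that arm works; that indifference to mechanism is the whole point.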
Lie detectors and clown attacks are the two strongest case studies I’m aware of that would cause AI to dominate global affairs and put AI safety in the crossfire. Whether or not this has already happened is largely a question of the math: are the AI capabilities needed for these things obvious enough to engineers that we should conclude major governments have probably already built the tech? A recent study indicated that a massive proportion of computer vision research is heavily optimized for human behavior research, analysis, and manipulation, and is surreptitiously mislabeled and obfuscated in order to conceal the human research and human use cases at the core of the papers, e.g. a consistent norm of referring to human research subjects as “objects” instead of subjects.
There is simply a large number of human manipulation strategies that are trivial to discover and exploit, even without AI (although the situation is far more severe when you layer AI on top); they just weren’t accessible at all to 20th century institutions and technology such as academic psychology. If attackers get enough data on people who share traits with a specific human target, they don’t have to study the target much to predict the target’s behavior; they can run multi-armed bandit algorithms on those similar people and find manipulation strategies that already worked on individuals who share genetic or other traits. Although the average person here is much further out-of-distribution relative to the vast majority of people in the sample data, this becomes a technical problem as AI capabilities and compute are dedicated to sorting signal from noise and finding webs of correlation with less data. Clown attacks alone demonstrate that zero days in the brain are fairly consistent among humans, meaning that sample data from millions or billions of people is usable to find a wide variety of zero days in the brains that make up the AI safety community.
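The “people like you” step can be sketched as a crude nearest-neighbour estimate. All of the profile vectors and response numbers below are toy values I made up; the only point being illustrated is that the target is never studied directly.

```python
import math

def cosine(u, v):
    """Cosine similarity between two trait vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_response(target, profiles, responses, k=2):
    """Average the recorded responses of the k profiles most similar
    to the target: a crude k-nearest-neighbours estimate."""
    ranked = sorted(range(len(profiles)),
                    key=lambda i: cosine(target, profiles[i]),
                    reverse=True)
    return sum(responses[i] for i in ranked[:k]) / k

profiles = [[1.0, 0.1, 0.9], [0.9, 0.2, 1.0], [0.1, 1.0, 0.2]]
responses = [0.8, 0.7, 0.1]   # e.g. measured susceptibility to one stimulus
target = [1.0, 0.0, 1.0]      # a person never directly studied

estimate = predict_response(target, profiles, responses)  # averages the two similar profiles
```

With three profiles this is a toy; with millions of profiles, the same shape of computation is what makes "strategies that already worked on people like you" a querying problem rather than a research problem.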
Social media’s ability to mix and match (or mismatch) people in specific ways and at great scale, as well as to use bot accounts just to upvote and signal-boost specific messages so that they look popular, has already yielded powerful effects. Adding LLM bots into the equation merely introduces more degrees of freedom.
Among a wide variety of capabilities, platforms like Twitter/X are capable of fiddling with their news feed algorithms so that users are incentivised to output as many words as possible in order to gain more likes/points/status, or as high-quality combinations of words as they can muster, rather than whatever maximizes the compulsion to return to the platform the next day (a compulsion that is very, very easy to measure by looking at user retention and quit rates, and whatever is easy to measure is easy to maximize via gradient descent). However, if a platform like Twitter/X does not optimize for competitiveness against other social media platforms, and the other platforms do, then every subsequent day people will return to the other platforms more and to Twitter less. The state of Moloch resembles the thought experiment where 4 social media platforms run multi-armed bandit algorithms to find ways to increase user engagement by 3 minutes per person per day. If one of the four (let’s imagine it’s Twitter/X) eventually notices that people are spending 9 hours a day on social media, recoils in horror, and decides to cease that policy and instead aim to increase user engagement by 0 hours per person per day, then the autonomous multi-armed bandit algorithms running on the other platforms automatically select strategies that harvest minutes of that time from Twitter/X, the lone defector, in addition to harvesting minutes from users’ undefended off-screen IRL time, which is the undefended natural resource that is easiest for social media systems to steal time from, like an intelligent civilization harvesting an inanimate natural resource such as plants or oil. The defector platform then loses its market share and is crowded out of the gene pool.
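The defector thought experiment can be rendered as a toy simulation. The harvest rate, starting hours, and the model itself are invented and deliberately crude; the sketch only illustrates why the non-optimizing platform’s share decays while everyone else’s grows.

```python
def simulate(days=30, harvest=0.05):
    """Four platforms share user hours. Three optimize (each pulls a
    fraction of reachable time toward itself each day); one defects
    and stops optimizing entirely."""
    hours = {"P1": 2.0, "P2": 2.0, "P3": 2.0, "defector": 2.0}
    offscreen = 16.0  # undefended off-screen IRL time, the easiest resource
    for _ in range(days):
        for p in ["P1", "P2", "P3"]:
            # Each optimizer harvests a slice of off-screen time...
            gained = harvest * offscreen
            offscreen -= gained
            # ...and a slice of the defector's undefended share.
            taken = harvest * hours["defector"]
            hours["defector"] -= taken
            hours[p] += gained + taken
    return hours, offscreen

hours, offscreen = simulate()
```

After a month of this, the defector retains only a small fraction of its initial two hours per user per day, without any optimizer ever explicitly "targeting" it; the decay falls out of the incentive structure alone.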
People who can wield gradient descent as a weapon against other people are fearsome indeed (although only the type of person with access to user data and who buys servers by the acre can do this effectively). They not only have the ability to try things that already were demonstrated to work on people similar to you, they also have the ability to select attack avenues in places that you will not look and ways you will not notice, because they have a large enough amount of human behavioral data showing them all the places where people like you did end up looking in the past.
If true, this technology would, by default, be the darkest secret of the big 5 tech companies, and one of the darkest secrets of American and Chinese intelligence agencies. It would be the biggest deliberate abuse of psychological research in human history by far, and its effectiveness hinges on the current paradigm where billions of people output massive amounts of sensor data and social media scrolling data (e.g. the detailed pace at which different kinds of people scroll through different kinds of information); although the effectiveness of the clown attack on those not specifically aware of it demonstrates that a general awareness of the risk is not sufficient to protect oneself.
What would the world look like if human thought research and steering technology had won out and become hopelessly entrenched 5-10 years ago? Unfortunately, that world would look a lot like this one. There are billions of inward-facing webcams pointed at every face looking at every screen. The NSA stockpiles exploits in every operating system, and likely in chip firmware as well. There are microphones in every room and accelerometers in the palm of every hand (which gives access to heart rate and a wide variety of other peripheral biophysical data that correlates strongly with cognitive and emotional behavior). The very existence of mass surveillance is known, but only because of Snowden, which was a single occasion and probably just bad luck; and the main lesson of the Snowden revelations was not that mass surveillance happens at incredible scale, but that the intelligence agencies were wildly successful at concealing and lying about it for years (and subsequently reorganized around the principle of preventing more Snowdens). The epistemic environment declines further and further as social status, virtue signaling, and vague impressions dominate. And, last of all, international affairs comes apart at the seams as the old paradigms die and trust vanishes. This is what a world looks like where powerful people have already gained the ability to access the human mind at a far deeper level than any target has; unfortunately, it is also what more mundane worlds look like, where human thought and behavior manipulation capabilities remained at roughly 20th century levels. However, I’ve made the case very strongly that such technology exists, and that, due to fundamental mathematical/statistical dynamics and human genetic diversity, these systems fundamentally depend on covert large-scale deployment (e.g. social media) to get enough data to run at all: multi-armed bandit algorithms sufficient to find novel manipulation strategies in real time, and measurement and research of the human thought process sufficient to use multi-armed bandit algorithms and SGD to steer a target’s thoughts in measurable directions. Therefore, the burden of proof falls even more heavily on the claim that our minds are safe, safe enough for the AI safety community to survive the 2020s at all, than on the claim that our minds are not secure and represent a severe point of failure.
The question of whether the cognitive warfare situation has already become severe must be approached with sober analysis, not vibes and vague impressions. Vibes and vague impressions are by far the easiest thing to hack, as demonstrated by the clown attack; and in a world where the situation was acute, keeping people receptive and vulnerable to influence would be one of the most probable attacks for us to expect to be commonplace.
New technology really does make current civilization out-of-distribution relative to civilization over the last 100 or 1000 years, and thus risks terminating the norms, dynamics, and assumptions that have made everything in civilization go fine so far, such as humans being better at lying than at detecting lies, and thus not being capable of organizing themselves around statements of fact like “I am 100% loyal” or “here is an accurate list of the corrupt and incompetent people on this team”. The specific state of human controllability has dominated global affairs, e.g. via military recruitment; when that controllability ratcheted up slightly in the 19th century, it produced the total war paradigm of the World War era and the information war paradigm of the Cold War era. Assuming that history will repeat itself, and remain as sensible and intuitive as it always was, is like expecting a psychological study to replicate in an out-of-distribution environment. It certainly might.
With the superior capabilities to research the human mind offered by the combination of social media and AI, governments, tech companies, and intelligence agencies can now understand aggregate consumer demand better than ever before and manipulate consumer spending and saving in real time. These capabilities were first sought in the 1980s by the Reagan and Thatcher administrations and never fully reached, but that was with 20th century technology: no social media, no mass surveillance, no user data or sensor data, no AI; only psychology (much of which would not replicate, due to a study paradigm that remains inferior to mass surveillance), statistics, focus groups, and polls, each of which was new at the time, and each of which remains available to governments, tech companies, and intelligence agencies today to supplement their new capabilities. It is for this reason that the paradigm of recessions, solidified in the 20th century, is one we might expect to die, along with many other civilizational paradigms that were only established in the 20th century because of that century’s relative absence of human behavior measurement and thought steering technology.
Taking a step back and looking at a fundamental problem
If there were intelligent aliens, made of bundles of tentacles or crystals or plants that think incredibly slowly, their minds would also have zero days that could be exploited, because any mind that evolved naturally would probably be like the human brain: a kludge of spaghetti code operating outside its intended environment. They also would not even begin to scratch the surface of finding and labeling those zero days until, like human civilization today, they began surrounding thousands or millions of their kind with sensors that could record behavior several hours a day and find webs of correlations. Of course, if they had little or no genetic diversity, it would be even easier to find those zero days, and vice versa for intense amounts of diversity. However, the power of the clown attack demonstrates that genetic diversity in humans is not sufficient to prevent zero days from being discovered and exploited; the drive to gain and avoid losing social status is hackable with current levels of technology (social media algorithms, multi-armed bandit algorithms, and sentiment analysis), indicating that many other exploits are findable as well.
There isn’t much point in having a utility function in the first place if hackers can change it at any time. There might be parts that are resistant to change, but it’s easy to overestimate yourself here: for example, if you value the longterm future and think that no false argument can persuade you otherwise, but a social media news feed plants paranoia or distrust of Will Macaskill, then you are one increment closer to not caring about the longterm future; and if that doesn’t work, the multi-armed bandit algorithm will keep trying until it finds something that does. The human brain is a kludge of spaghetti code, so there’s probably something somewhere. The human brain has zero days, and the capability and cost for social media platforms to use massive amounts of human behavior data to find complex social engineering techniques is a profoundly technical matter; you can’t get a handle on it with intuition or pre-2010s historical precedent. Thus, you should assume that your utility function and values are at risk of being hacked at an unknown time, and should assign them a discount rate to account for that risk over the course of several years. Slow takeoff over the next 10 years alone guarantees that this discount rate is, in reality, too high for people in the AI safety community to go on believing that it is something like zero. I think that approaching zero is a reasonable target, but not with the current state of affairs, where people don’t even bother to cover up their webcams, have important and sensitive conversations about the fate of the earth in rooms with smartphones, and use social media for nearly an hour a day (scrolling past nearly a thousand posts). The discount rate in this environment cannot be considered “reasonably” close to zero when the attack surface is this massive and the world is changing this quickly.
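The discount-rate point reduces to simple arithmetic. The per-year hack probabilities below are made up purely for illustration; only the shape of the comparison matters.

```python
# If your values face some annual probability p of being silently hacked,
# the expected probability they survive intact for n years is (1 - p)**n.
def survival(p_hack_per_year, years):
    return (1 - p_hack_per_year) ** years

# Near-zero risk vs. a modest 5%/year risk, over a 10-year slow takeoff:
low = survival(0.001, 10)    # roughly 0.99: "something like zero" discount
high = survival(0.05, 10)    # roughly 0.60: a very different planning regime
```

The gap between the two numbers is the gap between treating the risk as negligible and budgeting for a two-in-five chance of compromise over a decade; the whole argument is about which regime we are actually in.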
If people have anything they value at all, and the AI safety community probably does, then the current AI safety paradigm of zero effort is wildly inappropriate; it is basically total submission to invisible hackers.
The sheer power of psychological influence technologies tells us that we should stop thinking of cybersecurity as a server-only affair. Humans can also make mistakes as extreme as using the word “password” as a password, except it is your mind, your values, and your impressions of different lines of reasoning that get hacked, not your files or your servers or your bank passwords/records. In order to survive slow takeoff and persist for as long as necessary, the AI safety community must acknowledge the risk that the 2020s will be intensely dominated by actors capable of using modern technology to stealthily hack the human mind and eliminate inconvenient people, such as those who try to pause AI, even though AI is basically the keys to their kingdom. We must acknowledge that the future of cyberwarfare doesn’t just determine who gets to keep their files and verbal conversations private; in the 2020s, it determines what kinds of thoughts people get to have, and which people do and don’t get to have them at all (as one single example, the designated clowns have the targeted thoughts and the rest don’t).
Everything that we’re doing here is predicated on the assumption that powerful forces, like intelligence agencies, will not disrupt the operations of the community, e.g. by inflaming factional conflict with false flag attacks attributed to each other through the use of anonymous proxies.
Most people in AI safety still think of themselves as ordinary members of the population. In reality, this stopped being true a while ago: a bunch of nerds discovered an engineering problem that, as it turns out, the universe actually does revolve around, and a bunch of nerds messing around with technology central to geopolitics bears a reasonable chance that geopolitics will bite back, especially in a world where intelligence agencies have, for decades, messed with and exploited influential NGOs in a wide variety of awful ways, and in a world where intensely powerful influence technology like clown attacks becomes a stronger and stronger determinant of geopolitical winners and losers.
If left to their own devices, people’s decisions about technology and social media will be dominated by their self-concept as an average member of the population, who considers things like news feeds and smartphone sensors/uncovered webcams safe because everyone’s doing it and of course nothing bad would happen to them. The cybersecurity reality is that the risk of psychological engineering is extreme, and in a mathematically provable way, and this nonchalance toward strangers tampering with and hacking your cognition and utility function is an unacceptable standard for the group of nerds who actually discovered an engineering problem that this side of the universe revolves around.
The attack surface of the AI safety community is like the surface area of the interior of a cigarette filter: thousands of square kilometers, wrinkled and folded together inside the spongy three-dimensional interior of a filter one inch long. This is not the kind of community that survives a transformative world.
AI pause as the turning point
From Shutting Down the Lightcone Offices:
I don't have much to contribute to the calculations behind this policy (which, I'd like to note, is just musing by Habryka intended to elicit further discussion, and which I might be taking out of context), other than describing in great detail what a "conflict with a bunch of AI capabilities organizations" would look like. I've been researching this for years, and it is not pretty; the asymmetry is so serious that even thinking about waging such a conflict could begin closing the window of opportunity for you to make moves, e.g. if some of the people you talk to end up using social media and scrolling past specific bits of information at a pace similar to, say, people who ultimately ended up seeing big tech companies as enemies in a cold, calculating, serious way rather than an advocacy way. Sample sizes of millions of people make that kind of prediction possible; even if there are only a dozen people in the data set of positive cases of cold big-tech enmity, there are millions of people making up the data set of negative cases, allowing analysts to get an extremely good idea of what a potential threat looks like by knowing what a potential threat doesn't look like. This is only one example, and it is entirely social-media based; it does not use, say, automated analysis of audio data from recorded conversations near hacked smartphones, which is very unambiguously the kind of thing that can be expected to happen to people who would "get into conflict with a bunch of AI capabilities organizations", as those organizations tend to have strong ties, and possibly even substantial revolving-door employment, with intelligence agencies; Facebook/Meta is a good example, as they routinely find themselves at the center of public opinion and information warfare conflicts around the world.
It's also unclear to me how sovereign these companies' security departments are without continued logistical support and staff from American intelligence agencies, as they have to contend with intense interest from a wide variety of foreign intelligence agencies. There is an entire second world here that is 1) parallel to the parts of the ML community that are visible to us, 2) vastly more powerful, privileged, and dangerous than the ML community, and 3) massively invested in the goings-on of the ML community. I've encountered dozens and dozens of people in the AI safety community in both SF/Berkeley and DC, and if a single one of them was aware of this second world, they were doing an incredibly good job of hiding their awareness. I think this is a recipe for disaster. I think the AI safety community is not even thinking about the kinds of manipulation, subterfuge, and sabotage that would take place here, just based off this world's lawyers-per-capita alone and the fact that this is a trillion-dollar industry, let alone that it is a trillion-dollar industry due in part to the human influence capabilities I've barely begun to describe here, let alone due to the interest those capabilities have attracted from all the murkiest people lurking within the US-China conflict.
The attempt to open-source Twitter/X’s news feed algorithm was months ago, but even if it was a step in the right direction, repeatedly attempting projects like it would cause excessive disruption and delegitimization for the industry, particularly for Facebook/Meta, which will never be able to honestly open-source its systems’ news feed algorithms. Facebook and the other 4 large tech companies (of which Twitter/X is not yet a member, due to vastly weaker data security) might be testing their own pro-democracy anti-influence technologies and paradigms, akin to Twitter/X’s open-sourcing of its algorithm, but behind closed doors, due to the harsher infosec requirements the big 5 tech companies face. Perhaps there are ideological splits among executives, e.g. some trying to find a solution to the influence problem because they’re worried about their children and grandchildren ending up as floor rags in a world ruined by mind control technology, and others nihilistically marching toward increasingly effective influence technologies so that they and their children personally have better odds of ending up on top instead of someone else. Twitter/X’s measured pace, open-sourcing the algorithm and then halting several months afterwards, is therefore potentially a responsible and moderate move in the right direction, especially considering the apparent success of the community notes paradigm at improving epistemics.
The AI safety community is now in a situation where it has to do everything right. The human race must succeed at this task, even though the human brain didn’t evolve to do well at things like having the entire species coordinate to succeed at a single task, and especially if that task, AI alignment, might be absurdly difficult for purely technical reasons, as difficult for the human mind to solve as expecting chimpanzees to figure out enough rocket science to travel to the moon and back, which would be a big ask regardless of the chimpanzees’ instinctive tendency to form factions and spend 90% of their thinking on clever plots to outmaneuver and betray each other. This means that vulnerability to clown attacks is unacceptable for the AI safety community, as is vulnerability to other widespread social engineering techniques that exploit zero days in the human brain. The degree of vulnerability is highly measurable by attackers, and increasingly so as technology advances; and since it is legible that attackers will be rewarded for exploiting vulnerabilities, attackers are incentivised to exploit those vulnerabilities and steer the AI safety community over a cliff.
The AI safety community has long passed a threshold where vulnerability to clown attacks is no longer acceptable; not only does it incentivize more clown attacks, and more ambitious clown attacks, where the attackers have more degrees of freedom, but the AI safety community is in a state where clown attacks can thwart many of the tasks required to do AI safety at all.
Lots of people approach AI social control as though solving AI alignment is priority #1 and preventing the use of AI for social control is priority #2. However, this attitude is not coherent. AI alignment is #1, and AI social control is #1a, a subset of #1 with virtually no intrinsic value of its own, only instrumental value to AI alignment: the use of AI for social control would incentivise accelerating AI and complicate alignment efforts in the meantime, whether by direct sabotage by intelligence agencies or AI companies, by causing totalitarianism or all-penetrating, crushing information warfare between the US and China, or by some other state of civilization that we might fail to adapt to.
How to protect yourself and others: