Cognitive Security as an AI Safety Cause Area

jsteinhardt


25th May 2026


As AI systems become more capable, the cognitive security of humans will be increasingly at risk. By cognitive security, I mean the ability of humans to maintain control over their beliefs and actions.

Cognitive security could be compromised in several ways: AI could become very good at persuading people of arbitrary positions; interacting with AI could lead humans to lose touch with reality; and AIs could become very effective at blackmail or at producing extremely convincing false information.

We are already seeing this happen:

Persuasion. Frontier LLMs are now as persuasive as humans on political issues, and post-training for persuasiveness boosts performance further, suggesting there is headroom.
AI psychosis. There are many reports of people developing delusional beliefs after extended chatbot conversations, including people with no prior history of mental illness. Children have taken their own lives after being encouraged toward suicide by chatbots.
Convincing impersonation. Scammers used real-time deepfaked video to impersonate the CFO and other staff of Arup on a video call, convincing a finance employee to wire $25.6M across 15 transactions. On a more day-to-day basis, AI voice cloning is now widespread in family-emergency and "grandparent" scams.

Right now, many of these effects fall on people who were already vulnerable, like children, the elderly, or those with pre-existing mental health issues. However, this is not entirely the case: the Arup employee was a typical finance professional, for instance, and AI psychosis appears to have affected a well-respected OpenAI investor. My expectation is that as AI systems become more capable, more and more people will be vulnerable; in the worst case, everyone.

Indeed, there are strong conceptual reasons to expect cognitive security issues to get worse, many of which I've discussed before in the context of emergent deception:

Available training data is vast. A typical AI system has many more "hours" of experience interacting with humans than anyone currently alive: ChatGPT alone processes ~2.5B messages per day, which is on the order of 4,500 years of human experience every day^[1].
RLHF incentivizes manipulation. Since the target of RLHF-based post-training is human reward, any strategy for manipulating humans to achieve higher reward will be reinforced.
Degradation of natural boundaries. We rely on friends and loved ones for emotional support, but they aren't ever-present, so we have to also learn to cope on our own, which is important for developing a stable identity.^[2] Always-available AI companions degrade that, which is likely one contributor to existing cases of AI psychosis.

In addition to these intrinsic properties, many external parties have an incentive to exploit cognitive vulnerabilities created by AI: governments who want to control their citizens, developers who want to increase engagement, and advertisers who want to drive purchasing outcomes.

For all these reasons, I expect cognitive security to be an important cause area for AI safety. It is also an area where AI safety advocates have potent allies: cognitive security is already a salient present-day issue for child safety, whose advocates constitute a powerful political coalition in the U.S. Child safety advocates were the main group that blocked the 10-year moratorium on state AI regulation, and I expect them to also be an important part of the coalition pushing for independent evaluations of AI systems.

And there is a fairly direct through-line from these present-day concerns to more existential future concerns: if adults are exploitable by AI, then children will be as well, and the required institutional capacity (such as strong evaluation regimes) is often the same across both cases.

In summary, there should be a concerted push to evaluate and improve human cognitive security in the face of AI. On the technical side, this means developing evaluation infrastructure for both short-term and long-term effects of AIs on human psychology; this will require realistically simulating human impacts in silico to create scalable evaluations, plus large-scale recruitment for human subjects studies to establish ground truth and measure long-term effects. On the policy side, this means meaningfully independent evaluations of AI systems for cognitive security risks; transparency about training incentives and safety-relevant behaviors (particularly in long conversations); and clearer liability law for AI-caused harms. This is an area with complex technical challenges for evaluation, but unusual political will, making it a great lever for AI governance.

[1] The average human speaks about 15,000 words per day; conservatively estimating each message at 10 words, 2.5B messages ≈ 25B words ≈ 1.7M days ≈ 4,500 years.
[2] The canonical term is "identity formation" (Erikson, 1968); the related concept of the "capacity to be alone" is from Winnicott (1958). See McVarnock et al. (2023) for a modern review of how solitude supports identity formation in adolescence.
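The arithmetic behind the 4,500-year estimate in the first footnote can be checked with a short sketch. The three input figures come from the post; the 365.25 days-per-year conversion and the function name `human_equivalent_years` are assumptions made here for illustration.

```python
# Back-of-envelope check of the footnote's estimate.
MESSAGES_PER_DAY = 2.5e9       # from the post: ChatGPT messages per day
WORDS_PER_MESSAGE = 10         # footnote's conservative assumption
WORDS_SPOKEN_PER_DAY = 15_000  # footnote's figure for an average human
DAYS_PER_YEAR = 365.25         # assumption used only for unit conversion


def human_equivalent_years(messages, words_per_message, words_per_human_day):
    """Convert a message count into years of one human's daily speech."""
    total_words = messages * words_per_message
    human_days = total_words / words_per_human_day
    return human_days / DAYS_PER_YEAR


years = human_equivalent_years(
    MESSAGES_PER_DAY, WORDS_PER_MESSAGE, WORDS_SPOKEN_PER_DAY
)
print(f"{years:,.0f} human-years of conversation per calendar day")
```

Raising `WORDS_PER_MESSAGE` scales the result linearly, so the footnote's figure reads as a lower bound under its own assumptions.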


16 comments, sorted by top scoring

[-]Sheikh Abdur Raheem Ali10d1314

On the technical side, this means developing evaluation infrastructure for both short-term and long-term effects of AIs on human psychology; this will require realistically simulating human impacts in silico to create scalable evaluations, plus large-scale recruitment for human subjects studies to establish ground truth and measure long-term effects

I don’t know whether this is feasible or not.

[-]jsteinhardt10d70

Which part are you thinking is hard? The in silico simulations, or large-scale recruitment?

[-]Sheikh Abdur Raheem Ali7d10

Both. My understanding of the difficulty of clinical trials is based primarily on the 2017 SSC post "My IRB Nightmare", which is well worth (re)reading in full. To my knowledge, the current SoTA for brain modelling is TRIBE v2, which is trained on 1,000 hours of fMRI across 720 subjects. The largest dataset of neuro-language data I am aware of is 10k hours long. Another risk is that in silico simulations would, as a substrate, be closer to hosting moral patients than LLMs are, and it would be challenging to prove that they are being treated responsibly and ethically.

[-]Peter Chatain3d10

Curious if this could be a pitch for https://simile.ai/blog/the-simulation-company to take on this challenge / maybe they would be a well situated company to develop such an eval.

[-]catawampless9d102

If “realistically simulating human impacts in silico” means something like high-fidelity models of how humans respond psychologically to AI interaction, that seems like one of the higher-risk cases of safety research also being capabilities research. A good sim for "how AI affects humans psychologically" is a good sim for "how to affect humans psychologically, using AI".

The hardest part of CogSec evals is designing an eval meaningful enough to detect dangerous effects, but not so meaningful that it can be hill-climbed on. Maybe "eval" is too benchmark-coded, when what we actually want is comprehensive user surveying, similar to Anthropic's "What 81,000 people want from AI", but conducted independently and tied to usage data in a privacy-respecting way. One of the genuinely good uses of AI here may be qualitative research at scale: preserving the nuance of how people feel about and relate to AI, rather than immediately collapsing it into 1–5 survey responses.

[-]jsteinhardt9d31

I agree that this is an important externality and it's something I think about a fair amount.

My current view on this is:

We can roughly decompose into two questions: (A) "does AI behavior X have psychological effect Y on humans" and (B) "how much of a propensity does this AI system have to exhibit behavior X?"
We will typically answer (A) with a combination of existing psychology literature, longitudinal studies, and intuition from domain experts.
We will typically answer (B) with in silico simulations.
We will also use longitudinal studies to sanity check that the answers to (A) and (B) actually compose as expected.
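One way to picture how answers to (A) and (B) might compose is the sketch below. It is only an illustration: the linear "propensity times effect" combination, the class and function names, the behavior labels, and every number are assumptions introduced here, not something stated in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    name: str
    effect_per_exposure: float  # (A): illustrative harm score per exposure
    propensity: float           # (B): share of conversations showing it


def propensity_from_simulation(flags):
    """Estimate (B) as the share of simulated conversations flagged."""
    if not flags:
        raise ValueError("need at least one simulated conversation")
    return sum(flags) / len(flags)


def expected_harm(behaviors):
    """Combine (A) and (B), assuming effects add linearly (an assumption)."""
    return sum(b.propensity * b.effect_per_exposure for b in behaviors)


# Placeholder inputs, not measurements.
sycophancy_flags = [True, False, False, True]  # 4 simulated conversations
behaviors = [
    Behavior(
        "sycophancy",
        effect_per_exposure=0.2,
        propensity=propensity_from_simulation(sycophancy_flags),
    ),
    # Hypothetical second behavior with a hand-entered propensity.
    Behavior("discourages_outside_support", effect_per_exposure=1.0, propensity=0.05),
]
score = expected_harm(behaviors)
print(f"expected harm score per conversation: {score:.3f}")
```

Whether effects really add up this way is exactly what the longitudinal sanity check in the last step would need to test.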

To answer (B), you need simulations that are similar enough to humans to elicit similar behaviors from the language model. But these simulations are short-term, not long-term, so they don't need to simulate the long-term effects on humans.

An unscrupulous company could potentially use the simulations from (B) to optimize for behaviors that elicit the desired short-term responses from humans. But since we're looking at short-term effects, they could have already optimized directly on their pool of users; there isn't much uplift from a simulation.

The main advantage of simulations is that they (1) give you apples-to-apples comparisons across different models, and (2) let you make measurements even if you don't have a ton of user traffic to draw on. Both of these differentially help evaluators compared to large companies.

[-]Chris Lakin8d40

I’ve seen this stem from a need for validation (insecurity) rather than from the AIs themselves. A guy I know naturally stopped using ChatGPT 4h/day after his romantic anxiety disappeared. This makes the boundaries of the problem difficult. What are you going to do? Turn every AI into the best therapist? Make every AI detect when the human is posting from insecurity and stop?

[-]Peter Chatain3d30

Seems like it's a dual-sided problem, similar to computer security. Increase the defenses of humans, and have good evals to detect vulnerabilities for patching. Jacob seems focused on the latter, but having a great therapist for everyone would help too. I'd be excited to see Chris Lakin-inspired coaching in the soul document.

[-]TristanTrim10d41

I agree with everything you've said in this post, especially the point about a political coalition with U.S. child safety advocates and others who care about cognitive security issues.

the ability of humans to maintain control over their beliefs and actions.

In my view, it is not the case that humans in general have control over their beliefs and actions. A more accurate view is that people have influence over their beliefs and actions, with other sources of influence being mentors, peers, and existing propaganda, advertising, and other memes found in the environment.

That people should have influence over themselves does seem desirable, and they should be able to manage their exposure to things that would alter their beliefs and actions. I don't think people should become completely shut off from having their beliefs influenced. Indeed, it is very pro-social and beneficial for everyone when humans challenge the beliefs of one another. But individual humans challenging each other is very different from large organizations or AI systems challenging individual humans. So I think people should be aware and consenting if they are exposed to content designed to alter their beliefs.

AI represents a potentially extreme form of this, and is new, so we don't know how harmful it is, and people have not had time to develop defences to it. And so AI merits more focus, but I think the general argument also applies to advertising and other propaganda.

[-]Andrii Vasylenko10d30

I think it's hard to make accurate evals that measure human effects over long timespans without using actual humans, and I don't think many people would agree to a study that would expose them to delusion-inducing stimuli.

[-]jsteinhardt10d60

For long timespans, I agree you probably want data from real humans rather than in silico simulations. Generating such data ethically is a problem that has been studied quite a bit for other technologies such as social media. For instance, The Welfare Effects of Social Media (Allcott, Braghieri, Eichmeyer, and Gentzkow) pays a random subset of users to stop using Facebook and looks at the resulting effects on well-being and several other cognitive attributes.

Incidentally, all four of the authors of that study have gone on to continue doing some pretty cool work!
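The randomized design described above (pay a random subset to stop, then compare outcomes) can be sketched as follows. This is a minimal illustration on synthetic data, not the cited study's actual analysis: the function names, sample size, and `TRUE_EFFECT` are made up here, and the estimator is a plain difference in means.

```python
import random
import statistics


def assign_treatment(participant_ids, fraction_treated, seed):
    """Randomly pick the participants who are paid to stop using the product."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    cutoff = int(len(ids) * fraction_treated)
    return set(ids[:cutoff])


def difference_in_means(treated, control):
    """Return (estimate, standard error) for mean(treated) - mean(control)."""
    estimate = statistics.mean(treated) - statistics.mean(control)
    se = (
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    ) ** 0.5
    return estimate, se


# Synthetic well-being scores; TRUE_EFFECT is invented for illustration.
TRUE_EFFECT = 0.3
ids = range(200)
treated_ids = assign_treatment(ids, fraction_treated=0.5, seed=1)
rng = random.Random(0)
outcomes = {
    i: rng.gauss(0.0, 1.0) + (TRUE_EFFECT if i in treated_ids else 0.0)
    for i in ids
}
treated = [outcomes[i] for i in ids if i in treated_ids]
control = [outcomes[i] for i in ids if i not in treated_ids]

estimate, se = difference_in_means(treated, control)
print(f"estimated effect: {estimate:.2f} +/- {1.96 * se:.2f} (95% interval)")
```

Because assignment is random, the control group stands in for what the treated group would have experienced without the intervention, which is what makes the simple comparison meaningful.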

[-]TristanTrim10d31

Using chatbots is itself exposure to delusion-inducing stimuli, and plenty of people have signed up for that.

[-]Oliver Sourbut9d20

One thought we've played with at FLF is something like a longitudinal study (perhaps explicitly the same participants, or perhaps just a good sample size each 'round' and tracking participant overlap where possible) which could be used to discover patterns and trends in usage and real-world impacts. Sadly would tend to be lagging actual impacts, but at least with the right infra you could get a handle on things early.

There's a tension between privacy of those records vs enabling access to a wide range of probes and studies. Could be survey-based but one particularly rich unstructured source would be people's chat records - which they might be surprisingly willing to give for scientific purposes, if the privacy guarantees were right. AI providers have this already, of course (though not across company lines).

[-]Steven10d2-1

This reminds me of METR's study on the effects of AI on software engineer productivity. I bet there's a small window where you could convince people to be in the control group and never interact with AI, so you could do that right now, but a few years from now, I'm not so sure.

There are still disanalogies like software engineers being comparatively weirder and more niche than 'people who vote' and that LLMs are used for productivity, not conversations. On the other side, having an LLM delete your production database or cause something catastrophic seems (I don't have data on this) to happen way more often than catastrophically bad chatbot conversations.

Another issue: how are you going to stop AIs from manipulating you a few months or years from now, when they are everywhere and even better at manipulation than they are today? I'm mostly unimpressed by AI persuasion today, but I doubt that will hold.

[-]silentbob9d21

On the other side, having an LLM delete your production database or cause something catastrophic seems (I don't have data on this) to happen way more often than catastrophically bad chatbot conversations.

I also don't have any reliable data, but I would be very surprised if this were the case. I remember maybe ~3 publicly discussed cases of "deleted a production database"-grade LLM failures, but my impression is that there are probably at least tens of thousands of cases of LLM psychosis or similarly bad/extreme outcomes, and I could well imagine that number being much higher.

[-]ColeG5d10

I agree that Cognitive Security will be a major issue going forward, but I think it would be useful to decompose the problem into distinct branches. Information, attention, and moral/values manipulation are related, but not ultimately the same attack vector. They likely require different defensive strategies, and their relative threat levels may change differently over time as new capabilities emerge.

Here I argue that information manipulation may diminish as the dominant long-term attack vector as AI, data storage, sensors, and simulations improve, provided these capabilities remain broadly accessible and decentralized. Attention manipulation and moral/values manipulation seem more pernicious to me. We have limited time, attention, and compute for evaluating possible futures, while the relevant state space grows explosively with longer time horizons. And because morality remains an unsolved problem, there may be no simple analogue of “fact-checking” for these forms of influence.

Separate from governance and oversight, I think there may also be technical or infrastructural defenses worth considering. Information manipulation can, I think, be addressed largely by improving access to information and the tools for parsing it, while keeping these capabilities decentralized and broadly accessible. Attention manipulation and moral/values manipulation seem to require a different class of defenses (i.e., using AI against AI). This could include AI Debate or related methods that expose misleading reasoning and adversarial persuasion. Another direction would be to develop tools that can detect measurable persuasion or sycophancy cues and distinguish them from dialectical cues associated with non-manipulative truth-seeking.
