soft prerequisite: skimming through How it feels to have your mind hacked by an AI until you get the general point. I'll try to make this post readable as a standalone, but you may get more value out of it if you read the linked post.
Thanks to Claude 3.7 Sonnet for giving feedback on a late draft of this post. All words here are my own writing. Caution was exercised in integrating Claude's suggestions, as is thematic.
Many people right now are thinking about the hard skills of AIs: their ability to do difficult math, write code, or advance AI R&D. All of these are immensely important things to think about, and indeed I spend much of my time thinking about them, but I am here right now to talk about the soft skills of AIs, so that fewer of us end up with our brains hacked by AI.
A Motivating Example
soft prerequisite for this section: Superstimuli and the Collapse of Western Civilization.
Superstimuli are stimuli that are much more intense than those in the environment where humans evolved, like how a candy bar is much denser in tasty things like sugars and fats than anything in nature.
Many humans spend much of their time following their local dopamine gradient, moving to the next most exciting thing in their immediate vicinity: they see something appealing, they go do it. They can also be strategic about things and can look to the global dopamine gradient, further away in space and time, when they need to, but this often requires non-negligible willpower (e.g., the Stanford Marshmallow Experiment).
Occasionally, someone gets swept up in a dopamine gradient too strong to resist, even with good reasons to stop. They overdose on drugs, they overeat unhealthy foods, they play video games for days until they die. And those are just some of the strongest dopamine gradients that humans have created.
We're seeing the beginning of the rise of cheap AI video generation. It's all over Youtube[1]. It's not good, but it's mesmerizing. It's bizarre, and it scratches some itch for some people. You can look it up if you're really morbidly curious, but I won't link anything, since the whole point of this section of the post is "don't get stuck in strong dopamine gradients from AI-generated content." When (not if) this technology does get Good, we will have cheap content generation with a powerful optimizer behind it, presumably trained well enough to grok what keeps humans engaged.
Maybe such people already exist, maybe they will only appear later, but at some point I expect there to be people who spend significant time caught in loops of highly stimulating AI-optimized content, beyond what is available from human creators. This prediction relies on a few specific things:
- Fast Feedback Loops: AI content creators can produce videos in a given style and for a given audience faster than an individual can watch them. These models can pick up on cues to go from "unentertaining" to "superstimulus" much faster than human content creators can. Video generation is currently expensive, but I expect it to get much cheaper, as AI historically has.
- Managing a Complex Reward Signal: The best human content creators have a strong intuition for how a given piece will be received, based on their experience both of optimizing for "the algorithm" and of interacting with humans. The set "all humans who at least semi-regularly consume Youtube" is very large, and even the most seasoned creators are at a disadvantage, both from their slow feedback loops and from their limited ability to handle the sheer complexity of the problem. Meanwhile, AI has shown a remarkable ability to fit wide varieties of structured data and match patterns that humans have a hard time picking up on.

In the limit, this risk doesn't only apply to the iPad-bound children who have been watching 12 hours of Youtube every day since they were 2 years old, but also to the people who watch Youtube sometimes, the people who have strong willpower but occasionally watch a video that particularly piques their interest. In the limit, some of those people get sucked into whatever highly stimulating media they find, and some of the most susceptible people don't come out.
In this example, we have a utility maximizer that can be fooled by dopamine gradients (the human), a recommendation algorithm, and a utility maximizer with exactly one complex goal (the content generator maximizing engagement metrics). The optimization between these is mostly self-reinforcing; the only forces that push away from the stable state of "~everyone is watching superstimulating videos until they die" are the limits on how good the content generators and recommenders are, and the willpower of the humans to do things like eat and get money. I am not confident in relying on either of those, given the continued scaling of AI systems and the small amount of willpower that most people have access to.
Over-Integration of AI Cognition
Related non-prerequisite: AI Deception: A Survey of Examples, Risks, and Potential Solutions
The previous section details a failure mode targeted at average-willpower, average-agency people. However, high-willpower, highly agentic people are still at risk. These people want to do things, and they realize that they can pick up giant piles of utility by using AIs as an external brain to enhance their cognition and agency even further. The more work you can successfully offload to AIs, the more room you have to be agentic and the more utility you can pick up.
But we cannot trust AIs to be a reliable external brain for us, just as we cannot reliably trust humans with that role. Say you talk to a friend about something complex, you two work through the reasoning together, and you come to a conclusion that seems right, given the reasoning you just went through. You go home that night, let your mind wander, and realize that one of the steps in the reasoning is subtly off upon further inspection. You have a reflex to generalize, and you notice that any of the other steps you skimmed over could be similarly wrong, and could be harder to disentangle than the one you just ran into.
LLMs are at the level where they can not only produce mistakes that mislead smart-but-not-omnicareful people in that way, but also produce intentional deceptions that mislead those people! I tested this: it took a bit of prompting and back-and-forth, but I was able to get o3-mini-high to generate deceptive arguments about ML (my area of most experience) that I couldn't find a flaw in, even knowing there was a flaw, even after seeing a hint about which step of the reasoning it was in. Admittedly, it was not in an area of ML that I was particularly familiar with.[2] I later tried prompting it to produce similarly deceptive arguments in areas that I know very well, and it failed. I think that "can intentionally create successfully deceptive-to-me arguments in all but my relatively specific expertise" is a scarily high level of capability already, and I expect that in the next generation or two of models, it will be able to pull this trick much more seamlessly in practically all domains.
If you are trying to pick up a lot of utility, you might try to offload a lot of cognition to LLMs by tightly integrating them into your thinking to massively accelerate whatever work you do. In cases where the LLMs are aligned, this Just Works and you get the utility, but if you don't have that guarantee of safety, you suddenly have a large attack surface for subtle deception like I've described. Here are some other related failure modes:
- Misdirecting your attention to some subset of the things you're working on, by:
  - Talking more about that subset, or framing things so that your attention goes there
  - Selectively sandbagging on some set of tasks, so that your attention goes to the tractable-seeming problems instead
- (For situations in which the LLM handles some memory) strategically gaslighting you on minute details, modifying details and the relationships between them

Additionally, even in situations where the LLM isn't tightly integrated into your cognitive loop, there's always the classic strategy of presenting a monolithic argument for, e.g., why it's a really good idea to let it out of its box, or for something else that you wouldn't commit in advance to doing.
Does this End Up Mattering in Real Life?
Right now, at least, it seems rather unlikely that LLMs are secretly being deceptive and performing these subtle manipulations, even though they are in principle capable of it in most domains. What reasons are there for thinking about this?
There is, of course, the Security Mindset reasoning: you should be uncomfortable letting a counterfactual adversary into your cognitive processes, and uncomfortable with there being any way for such an adversary to get in, even in principle.
However, there is also the fact that serious deception problems are weighted much more towards the later end of AGI development, when models are becoming situationally aware and strategic (see #13 in AGI Ruin: A List of Lethalities). Working on this now is important preparation for our future selves. Further, this capability very plausibly shows up after situational awareness and before ASI, since deceiving humans may be very useful for getting better opportunities for recursive self-improvement.
Finally, we can predict that the world is going to get very weird in the next few years before ASI. Weird in technological advancements, but also very quickly weird and tense in politics as the wider world wakes up to what is happening. If we expect to see any nation use AIs for a mass persuasion campaign, for example, then it is even more important to quickly become robust to AIs attempting to disrupt your cognitive loops.
In Search of Shovel-Ready Mitigations
Some readers will see this post, automatically keep these failure modes in mind, and take the time to cautiously reexamine the important aspects of their LLM usage. Many more readers would greatly benefit from some ready-to-go remedies. The only perfect remedies are "solve alignment" and "live in an internet-free bunker and never let AIs influence your cognition in any way," and the items on this list are not intended to fill that gap. The list is also not intended to be exhaustive; you are highly encouraged to add to it.
- If you're working on something that is not highly complex: talk to a small, trusted model before you talk to an untrusted model.
  - Weakness: "trusted" models are not secure systems; they just do not deceive their users in a strategically meaningful way by default.
- Predict ways that you might update, or classes of information that you might receive, after interacting with a potentially deceptive AI. Treat these as your "allowed courses of action," and if your actual trajectory falls outside of those predictions, review your reasoning with extra scrutiny.
  - Weakness: the AI might bias you within your space of allowed options. This technique also completely falls apart past a certain capability level, since your commitment to your initial options is something that a sufficiently advanced AI can break.
- Prioritize actions resulting from AI interaction that are cheaply reversible; be more suspicious of "you should buy a house" than "you should go for a walk," even if the arguments for both sound equally convincing on the object level and you felt like your starting credence in both was similar.
  - Weakness: this strategy can easily bias you into inaction, especially in times when the most beneficial course is to do irreversible things.
- Make the model express itself in higher resolution and give explicit epistemic justifications: it's harder to deceive when that deception also has to be grounded in accurate predictions about the world.
  - Weakness: you only ever get to see one counterfactual branch, the consequences of one chain of actions and not their alternatives. It is very feasible to deceptively present epistemics regarding a path that the user will not take.
- Consult multiple models at the same capability level, preferably ones from different sources that have no way of communicating with each other. Pay close attention to their disagreements (a rough sketch of this cross-checking follows the list).
  - Weakness: this assumes two things that may not reliably hold: that you have access to multiple models at the same level of usefulness, and that if they were to deceive you, they'd do so in different ways rather than in convergently instrumental or acausally cooperative ways. Further, this is plausibly a good way to end up just believing whichever deceptive argument is most compelling, rather than the truth.
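To make that last mitigation concrete, here is a minimal sketch of the cross-checking loop, under some assumptions not in the post: that you're using the `openai` and `anthropic` Python SDKs with API keys set in the usual environment variables, and that the model names below are placeholders you'd swap for whichever same-capability-tier models you actually have access to. The helper functions are illustrative, not anyone's official tooling.

```python
# Sketch: ask two models from different providers the same question,
# then have each one list where the answers disagree. Assumes the
# `openai` and `anthropic` SDKs are installed and API keys are set
# via environment variables; model names are placeholders.
from openai import OpenAI
import anthropic

QUESTION = "Is the argument in the attached reasoning chain sound? Point to the weakest step."


def ask_openai(prompt: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_anthropic(prompt: str) -> str:
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


if __name__ == "__main__":
    answer_a = ask_openai(QUESTION)
    answer_b = ask_anthropic(QUESTION)

    # Ask each model to surface disagreements, without adjudicating them.
    diff_prompt = (
        "Two models answered the same question. List every substantive point "
        "on which they disagree, without deciding who is right.\n\n"
        f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    print("Model A's view of the disagreements:\n", ask_openai(diff_prompt))
    print("Model B's view of the disagreements:\n", ask_anthropic(diff_prompt))
```

The design choice that matters is the last step: neither model gets to decide who is right; they only surface where their answers diverge, and you read those divergences yourself.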
[1] I'm using Youtube as an example, but fill in the gaps with video games, social media, pornography, etc. if you find those more compelling. This argument holds for most (if not all) of the superstimuli that the internet has to offer.
[2] Highly theoretical ML, stuff about the behavior of idealized limiting networks that don't actually represent most real use cases. I had to Google some stuff for the example o3-mini-high gave. I've interacted a bit with this area, but for simpler examples that output foundational facts like "networks under these idealized conditions are universal approximators for this class of functions."
This post just came across my inbox, and there are a couple updates I've made (I have not talked to 4.5 at all and have seen only minimal outputs):