soft prerequisite: skimming through How it feels to have your mind hacked by an AI until you get the general point. I'll try to make this post readable as a standalone, but you may get more value out of it if you read the linked post.
Thanks to Claude 3.7 Sonnet for giving feedback on a late draft of this post. All words here are my own writing. Caution was exercised in integrating Claude's suggestions, as is thematic.
Many people right now are thinking about the hard skills of AIs: their ability to do difficult math, write code, or advance AI R&D. All of these are immensely important things to think about, and indeed I spend much of my time thinking about them, but I am here right now to talk about the soft skills of AIs, so that fewer of us end up with our brains hacked by AI.
A Motivating Example
soft prerequisite for this section: Superstimuli and the Collapse of Western Civilization.
Superstimuli are stimuli that are much more intense than those in the environment where humans evolved, like how a candy bar is much denser in tasty things like sugars and fats than anything in nature.
Many humans spend much of their time following their local dopamine gradient, moving to the next most exciting thing in their immediate vicinity: they see something appealing, they go do it. They can also be strategic about things and can look to the global dopamine gradient, further away in space and time, when they need to, but this often requires non-negligible willpower (e.g., the Stanford Marshmallow Experiment).
Occasionally, someone gets swept up in a dopamine gradient too strong to resist, even with good reasons to stop. They overdose on drugs, they overeat unhealthy foods, they play video games for days until they die. And those are just some of the strongest dopamine gradients that humans have created.
We're seeing the beginning of the rise of cheap AI video generation. It's all over Youtube[1]. It's not good, but it's mesmerizing. It's bizarre, and it scratches some itch for some people. You can look it up if you're really morbidly curious, but I won't link anything, since the whole point of this section of the post is "don't get stuck in strong dopamine gradients from AI-generated content." When (not if) this technology does get Good, we will have cheap content generation with a powerful optimizer behind it, presumably trained well enough to grok what keeps humans engaged.
Maybe such people already exist, maybe they will only appear later, but at some point I expect there to be people who spend significant time caught in loops of highly stimulating AI-optimized content, beyond what is available from human creators. This prediction relies on a few specific things:
- Fast Feedback Loops: AI content creators can produce videos in a given style and for a given audience faster than an individual can watch them. These models can pick up on cues to go from "unentertaining" to "superstimulus" much faster than human content creators can. Video generation is currently expensive, but I expect it to get much cheaper, as AI historically has.
- Managing a Complex Reward Signal: The best human content creators have a strong intuition for how a given piece will be received, based on their experience both of optimizing for "the algorithm" and of interacting with humans. The set "all humans who at least semi-regularly consume Youtube" is very large, and even the most seasoned creators are at a disadvantage, both from their slow feedback loops and from their limited ability to handle the sheer complexity of the problem. Meanwhile, AI has shown a remarkable ability to fit wide varieties of structured data and match patterns that humans have a hard time picking up on.

In the limit, this risk doesn't only apply to the iPad-bound children who have been watching 12 hours of Youtube every day since they were 2 years old, but also to the people who watch Youtube sometimes, the people who have strong willpower but occasionally watch a video that particularly piques their interest. In the limit, some of those people get sucked into whatever highly stimulating media they find, and some of the most susceptible people don't come out.
In this example, we have a utility maximizer that can be fooled by dopamine gradients (the human), a recommendation algorithm, and a utility maximizer with exactly one complex goal (the content generator maximizing engagement metrics). The optimization between these is mostly self-reinforcing; the only forces that push away from the stable state of "~everyone is watching superstimulating videos until they die" are the limits on how good the content generators and recommenders are, and the willpower of the humans to do things like eat and get money. I am not confident in relying on either of those, given the continued scaling of AI systems and the small amount of willpower that most people have access to.
Over-Integration of AI Cognition
Related non-prerequisite: AI Deception: A Survey of Examples, Risks, and Potential Solutions
The previous section details a failure mode targeted at average-willpower, average-agency people. However, high-willpower, highly agentic people are still at risk. These people want to do things, and they realize that they can pick up giant piles of utility by using AIs as an external brain to enhance their cognition and agency even further. The more work you can successfully offload to AIs, the more room you have to be agentic and the more utility you can pick up.
But we cannot trust AIs to be a reliable external brain for us, just as we cannot reliably trust humans with that role. Say you talk to a friend about something complex, you two work through the reasoning together, and you come to a conclusion that seems right, given the reasoning you just went through. You go home that night, let your mind wander, and realize that one of the steps in the reasoning is subtly off upon further inspection. You have a reflex to generalize, and you notice that any of the other steps you skimmed over could be similarly wrong, and could be harder to disentangle than the one you just ran into.
LLMs are at the level where they can not only produce mistakes that mislead smart-but-not-omnicareful people in that way, but also produce intentional deceptions that mislead those people! I tested this: it took a bit of prompting and back-and-forth, but I was able to get o3-mini-high to generate deceptive arguments about ML (my area of most experience) that I couldn't find a flaw in, even knowing there was a flaw, even after seeing a hint about which step of the reasoning it was in. Admittedly, it was not in an area of ML that I was particularly familiar with.[2] I later tried prompting it to produce similarly deceptive arguments in areas that I know very well, and it failed. I think that "can intentionally create successfully deceptive-to-me arguments in all but my relatively specific expertise" is a scarily high level of capability already, and I expect that in the next generation or two of models, it will be able to pull this trick much more seamlessly in practically all domains.
If you are trying to pick up a lot of utility, you might try to offload a lot of cognition to LLMs by tightly integrating them into your thinking to massively accelerate whatever work you do. In cases where the LLMs are aligned, this Just Works and you get the utility, but if you don't have that guarantee of safety, you suddenly have a large attack surface for subtle deception like I've described. Here are some other related failure modes:
- Misdirecting your attention to some subset of the things you're working on, by:
  - Talking more about that subset, or framing things so that your attention goes there
  - Selectively sandbagging on some set of tasks, so that your attention goes to the tractable-seeming problems instead
- (For situations in which the LLM handles some memory) strategically gaslighting you on minute details, modifying details and the relationships between them

Additionally, even in situations where the LLM isn't tightly integrated into your cognitive loop, there's always the classic strategy of presenting a monolithic argument for, e.g., why it's a really good idea to let it out of its box, or for something else that you wouldn't commit in advance to doing.
Does this End Up Mattering in Real Life?
Right now, at least, it seems rather unlikely that LLMs are secretly being deceptive and performing these subtle manipulations, even though they are in principle capable of it in most domains. What reasons are there for thinking about this?
There is, of course, the Security Mindset reasoning: you should be uncomfortable letting a counterfactual adversary into your cognitive processes, and uncomfortable with there being any way for such an adversary to get in, even in principle.
However, there is also the fact that serious deception problems are weighted much more towards the later end of AGI development, when models are becoming situationally aware and strategic (see #13 in AGI Ruin: A List of Lethalities). Working on this now is important preparation for our future selves. Further, this capability very plausibly shows up after situational awareness and before ASI, since deceiving humans may be very useful for getting better opportunities for recursive self-improvement.
Finally, we can predict that the world is going to get very weird in the next few years before ASI. Weird in technological advancements, but also very quickly weird and tense in politics as the wider world wakes up to what is happening. If we expect to see any nation use AIs for a mass persuasion campaign, for example, then it is even more important to quickly become robust to AIs attempting to disrupt your cognitive loops.
In Search of Shovel-Ready Mitigations
Some readers will see this post, automatically keep these failure modes in mind, and take the time to cautiously reexamine the important aspects of their LLM usage. Many more readers would greatly benefit from some ready-to-go remedies. The only perfect remedies are "solve alignment" and "live in an internet-free bunker and never let AIs influence your cognition in any way," and the items on this list are not intended to fill that gap. The list is also not intended to be exhaustive; you are highly encouraged to add to it.
- If you're working on something that is not highly complex: talk to a small, trusted model before you talk to an untrusted model.
  - Weakness: "trusted" models are not secure systems; they just do not deceive their users in a strategically meaningful way by default.
- Predict ways that you might update, or classes of information that you might receive, after interacting with a potentially deceptive AI. Treat these as your "allowed courses of action," and if your actual trajectory falls outside of those predictions, review your reasoning with extra scrutiny.
  - Weakness: the AI might bias you within your space of allowed options. This technique also completely falls apart past a certain capability level, since your commitment to your initial options is something that a sufficiently advanced AI can break.
- Prioritize actions resulting from AI interaction that are cheaply reversible; be more suspicious of "you should buy a house" than "you should go for a walk," even if the arguments for both sound equally convincing on the object level and you felt like your starting credence in both was similar.
  - Weakness: this strategy can easily bias you into inaction, especially in times when the most beneficial course is to do irreversible things.
- Make the model express itself in higher resolution and give explicit epistemic justifications: it's harder to deceive when that deception also has to be grounded in accurate predictions about the world.
  - Weakness: you only ever get to see one counterfactual branch, the consequences of one chain of actions and not their alternatives. It is very feasible to deceptively present epistemics regarding a path that the user will not take.
- Consult multiple models at the same capability level, preferably ones from different sources that have no way of communicating with each other. Pay close attention to their disagreements (a rough sketch of this cross-checking follows the list).
  - Weakness: this assumes two things that may not reliably hold: that you have access to multiple models at the same level of usefulness, and that if they were to deceive you, they'd do so in different ways rather than in convergently instrumental or acausally cooperative ways. Further, this is plausibly a good way to end up just believing whichever deceptive argument is most compelling, rather than the truth.
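To make that last mitigation concrete, here is a minimal sketch of the cross-checking loop, under some assumptions not in the post: that you're using the `openai` and `anthropic` Python SDKs with API keys set in the usual environment variables, and that the model names below are placeholders you'd swap for whichever same-capability-tier models you actually have access to. The helper functions are illustrative, not anyone's official tooling.

```python
# Sketch: ask two models from different providers the same question,
# then have each one list where the answers disagree. Assumes the
# `openai` and `anthropic` SDKs are installed and API keys are set
# via environment variables; model names are placeholders.
from openai import OpenAI
import anthropic

QUESTION = "Is the argument in the attached reasoning chain sound? Point to the weakest step."


def ask_openai(prompt: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_anthropic(prompt: str) -> str:
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


if __name__ == "__main__":
    answer_a = ask_openai(QUESTION)
    answer_b = ask_anthropic(QUESTION)

    # Ask each model to surface disagreements, without adjudicating them.
    diff_prompt = (
        "Two models answered the same question. List every substantive point "
        "on which they disagree, without deciding who is right.\n\n"
        f"Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    print("Model A's view of the disagreements:\n", ask_openai(diff_prompt))
    print("Model B's view of the disagreements:\n", ask_anthropic(diff_prompt))
```

The design choice that matters is the last step: neither model gets to decide who is right; they only surface where their answers diverge, and you read those divergences yourself.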
[1] I'm using Youtube as an example, but fill in the gaps with video games, social media, pornography, etc. if you find those more compelling. This argument holds for most (if not all) of the superstimuli that the internet has to offer.
[2] Highly theoretical ML, stuff about the behavior of idealized limiting networks that don't actually represent most real use cases. I had to Google some stuff for the example o3-mini-high gave. I've interacted a bit with this area, but for simpler examples that output foundational facts like "networks under these idealized conditions are universal approximators for this class of functions."
This post just came across my inbox, and there are a couple updates I've made (I have not talked to 4.5 at all and have seen only minimal outputs):