Guardian Angels: LLM Personalization for Productivity and Security

gwern

17th Jun 2026

This is a linkpost for https://gwern.net/guardian-angel

Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large productivity increases, or how they will handle cybersecurity and cognitive security.

I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical "assistant chatbot agent" persona, but emulating a single user's personality, values, and preferences.

This weakly solves the principal-agent problem by unifying the principal and agent as much as possible. In a GA future, the focus of the "principal" user is on defining what is worth doing by the GA (agent) users, and not on what or how to do things, functioning as the CEO or 'board' of an 'AI corporation'. This allows them to deploy numerous agents to achieve desirable things and to handle security, like screening all messages for advanced attacks (like interlocking ecosystems of synthetic media for propaganda or spearphishing). They cannot solve larger AI alignment problems, but they can help individual humans as part of a society-wide defense-in-depth strategy.

A GA persona is productive because it learns to emulate the principal's outputs but with higher quality. It is trustworthy because it is, by definition, allied with its principal and shares its values and goals. And it is secure in part by hardwiring a single, unique, situated user (for whom following a prompt attack would be absurd), avoiding 'confused deputy' problems, while periodic upgrades of the underlying model and the defenders' advantage allow GAs to keep up with attackers.

Standard techniques like prompt programming of in-context-learning for "frozen" models will not create useful GAs, due to the limitations of post-training, of context windows and self-attention with frozen weights in compute-efficient-but-under-parameterized models, of low-compute outputs, and of the status quo of passive offline data collection. These limitations are collectively responsible for chatbots' disappointing results in knowledge-worker amplification and creative writing, and for their fatal errors in agentic settings.

We can try to create GAs by a combination of techniques: online learning (via dynamic evaluation) to update LLMs in realtime to avoid ignorance and fatal errors while remaining competitive with frozen frontier models, sample efficiency from pretrained preference-oriented large models and active learning by querying the principal for corrections and preference data (obtaining low regret from DAgger-style bounds), and a local CLI-first logging-oriented UI/UX paradigm.
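As a toy illustration of the "score a prediction, then update on what was just seen" loop that dynamic evaluation refers to, here is a minimal Python sketch. It is not the method of the linked essay: a character bigram model with count updates stands in for an LLM with gradient steps, and every name in it (`BigramModel`, `make_pretrained`, `avg_loss`, the `"ab"`/`"ac"` texts) is a hypothetical chosen for the example. It only shows why a model that keeps learning from one user's stream can beat an identical frozen copy on that stream.

```python
import math
from collections import defaultdict


class BigramModel:
    """Character bigram model with add-one smoothing (toy stand-in for an LLM)."""

    def __init__(self, alphabet):
        self.vocab_size = len(set(alphabet))
        self.counts = defaultdict(lambda: defaultdict(float))

    def prob(self, prev, nxt):
        row = self.counts[prev]
        return (row[nxt] + 1.0) / (sum(row.values()) + self.vocab_size)

    def update(self, prev, nxt, weight=1.0):
        self.counts[prev][nxt] += weight


def make_pretrained():
    """'Pretrain' on generic text in which a and b alternate."""
    model = BigramModel("abc")
    text = "ab" * 50
    for prev, nxt in zip(text, text[1:]):
        model.update(prev, nxt)
    return model


def avg_loss(model, stream, online=False):
    """Mean negative log-likelihood per character.

    With online=True the model is updated on each character right after
    that character has been scored, so later predictions benefit from
    earlier observations of the same stream.
    """
    pairs = list(zip(stream, stream[1:]))
    total = 0.0
    for prev, nxt in pairs:
        total += -math.log(model.prob(prev, nxt))
        if online:
            model.update(prev, nxt)
    return total / len(pairs)


# The principal's idiosyncratic habit, never seen during pretraining.
user_stream = "ac" * 50
frozen_loss = avg_loss(make_pretrained(), user_stream, online=False)
online_loss = avg_loss(make_pretrained(), user_stream, online=True)
print(f"frozen: {frozen_loss:.3f}  online: {online_loss:.3f}")
```

The frozen copy keeps paying the same penalty for every `a` followed by `c`, while the online copy's loss shrinks as the stream goes on. Whether the analogous effect holds for large models at acceptable compute cost is exactly what is in question.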

GAs could be done as an open-source community effort, but given the need for high security in deployment and the rising challenge of APTs equipped with Mythos-scale attackers, it probably makes more sense as a startup, catering initially to power-users and knowledge workers such as CEOs or researchers, and moving downwards as it is refined.

Tags: Humans consulting HCH, Productivity, Software Tools, AI

Curated

19 comments

[-]don't_wanna_be_stupid_any_more6d2616

I am very skeptical this would have any significant impact.

First off, where are you going to run the models? Average consumer hardware can't run the best open weight models pre-distillation, and even those are a notch below closed weight SOTA models. Power users may be fine with this, but the average Joe would either be locked out or be forced to use cloud compute, which comes with its own set of security hazards. Now the bad actors just need to shut down or, worse, hack your servers, and all your defenses are either gone or turned against you.

Second, isn't this just speeding up human disempowerment? I mean, you now have these GA models which do most of the thinking for you while you sit on your high and mighty throne sipping wine and praying that when these models eventually become smarter than you, they would still be loyal (or at least not apathetic or hostile) to you.

Which just loops back to the alignment problem, and that is assuming your GAs are able to keep up with the frontier, which I think is VERY unlikely.

And if you fail to keep up with the frontier then at some point your human/subhuman level models will have to fend off literal machine gods.

To be clear I am not against this. It doesn't hurt to try and in the best possible case where closed weight frontier models lag behind long enough for this project to come to fruition then this would buy us some valuable time.

I just don't really see that happening.

[-]Samuel Ratnam1d30

or be forced to use cloud compute which comes with its own set of security hazards

I don't think cloud is necessarily so bad - I'm quite excited about trusted execution environments / cryptographically secure cloud training such as what Workshop Labs were working on. When you have the option of choosing between multiple providers, you might get incentives for a nice race to the top. I think the security of this is definitely hard to get right, but definitely doable.

[-]Tim Kostolansky1d1-2

Second, isn't this just speeding up human disempowerment? I mean, you now have these GA models which do most of the thinking for you while you sit on your high and mighty throne sipping wine and praying that when these models eventually become smarter than you, they would still be loyal (or at least not apathetic or hostile) to you.

Few thoughts in no particular order:

- Personal automation can be customized: There are many degrees to which one can hand off work to GA models. One can choose to hand off what one views as truly rote. It seems like a high level of customization will be key to having a GA model that actually works well with one's sense of ownership/desire/responsibility. I think that there are already things nearing what one might want the minimal versions of GA models to be doing, eg with people's openclaw/hermes agent setups.
- Work requires a lot of decisions to be made: There is still a lot of work to be done while GA models do things. For instance, you should try using current coding models to do serious coding work on a large repo (or note how you do this if you already do it). There is a lot of decision-making that happens when writing code, and one can choose how much involvement one has in writing code with LLMs. Your quoted vignette of sitting on a throne and sipping wine is most analogous to one end of the spectrum of ways one can code with LLMs, namely closer to the vibecoding end of the spectrum. I'd argue that there are ways to be more involved than just letting one's GA model do things for them if the coding analogy holds for work in general, which I think it definitely does to many degrees.
- We aren't there yet: In the limit, you may be right about super strong models disempowering people. I think that this is a worry that I have and don't have a particularly encouraging thought on yet. But, there is one important note: we are not there yet, and we do not know when we will get there. It seems like GA models will be at the very least a good intermediate addition to people's lives, and it may even lead to different world states that give us more optionality/perspectives when we get to disempowering models.
- GA models may not be so misaligned: It's not clear that models will be "apathetic or hostile" to people. There are a few reasons that this may be true.
  - Current models generally seem aligned and helpful to people. I can imagine GA models simply being post-trained versions of current models, which I would guess would remain similarly aligned.
  - There is still a lot of work to be done on model behaviors, but there is already a lot being done and I am optimistic that for personal models a lot of this work can be applied. Ie, this area is being worked on pretty actively!
  - Individuals likely don't need super OP intelligencemaxxed god tier models to be their GA models, so losing control to them may not be as big of a risk as you may think. (Sure, this opens risk to being jailbroken by smarter models, but there are many things, eg multi-layered input/output filters, that can be useful here.)
- Safe cloud compute may be possible: You make a good point about the average Joe not being able to afford this without eg using non-local computing, but I am hopeful as there are solutions being developed for this worry though, eg at https://tinfoil.sh.

[-]habryka2d220

I've been thinking about how to productively use LLMs for intellectual progress for the last few years, and things in this space are what I keep coming back to as one of the most promising approaches. The deep misalignment issues with current LLMs make it hard to use them as thinking assistants, and the best success I've had with getting thinking-assistance by LLMs is by trying to get them to imitate me and others, not by acting from an assistant persona.

And then I do maybe see a vision of a whole world that could be in a much better position to navigate the next few years, if Guardian Angel LLMs are a much more prominent paradigm.

With this post, I now have a great canonical reference for ideas in the space.

[-]Chengfeng Mao6d125

It is trustworthy because it is, by definition, allied with its principal and shares its values and goals.

I’m unsure how this works, even if we assume the LLM is benevolent and understands the user unusually well. The problem is that human values or desires are not coherent or consistent. We have subconscious desires, mimetic desires, revealed vs stated preferences, obsessions, aspirations, etc. They cannot be easily disentangled or ranked and they often conflict. Which should the GA align to? For someone addicted to online sports betting, does the GA help her find better odds or does it treat the betting preference as a local failure mode and help her stop? How about someone obsessed with climbing the corporate ladder?

The GA could act like a guru or life coach, and help you overcome compulsive, parochial, or status-driven desires. That is not necessarily bad, and human preferences are often constructed anyway, but this could get murky quickly. Someone with growing doubts about a faith, or who is ending a marriage, might at time A want the GA to help them commit and at time B want it to help them leave. Which way should GA choose? This also creates a new autonomy problem. If the GA proposes an alternative life direction that seems more meaningful than the user’s current self-understanding, how much authority should that proposal have? Is this empowerment or disempowerment? This is especially important because the interaction would likely be much more influential and omnipresent than ordinary friends or coaches. Self-determination theory seems relevant here because autonomy is not merely getting the option one currently asks for; it is experiencing oneself as the author of one’s action.

All that being said, I’m quite sympathetic to this idea and I genuinely hope it would work. I’m a person of low agency and frequently suffer from the conflict between instant gratification and my higher purpose. I’ve also been trying to build an exobrain-like system, which is kinda similar to GA, but much less autonomous. I think for people who have already put a significant amount of work into self-discovery, this might be helpful.

[-]Tim Kostolansky1d10

Good points!

Which way should GA choose?

On this question, it's really hard to say! But I think that there is definitely some precedent and potential directions that might be worth thinking through/trying:

- just allow a user to customize their GA model as they like, trying to bake in the best meta-priors that they see fit, as there is some sense that it is one's prerogative to choose how their model is for them
- iterative refinement of the model's understanding/belief set/priors that are grounded in the user's experience and with the consent/knowledge/cooperation of the user
- community-level education/"best practices" on how to approach having a GA model that will influence the user of the GA
- forums for how people use/adapt their GA model, how it has worked for them, advice from others, etc
- companies that offer GA models as a service and have strong redlines that they bake into the GA models, removing the need to choose

These almost surely do not solve all the problems, and deployment of GA models will probably take lots of reflection and iteration though.

I do appreciate you bringing these things up!

[-]Samuel Ratnam1d91

Really like this direction, and excited that it's finally becoming (more) mainstream but I disagree with the framing here on two points:

GAs as digital twins:
It would be great to have some degree of transfer in values / context / thinking styles to my personal GA, but I also think this undervalues complementarity between humans and LLMs. The nice thing about personally tuned AI models is that you can reinforce the human + AI loops, which drives differentiation to some extent. The human does the things that the human is good at (e.g. out of distribution / novel situations, domain-specific knowledge, overall direction-setting) and the AI system does the things that the AI is good at (e.g fast inference within distribution, general knowledge). You can think of the AI system as amortising certain tasks that humans do frequently, leaving them to explore new parts of the distribution. The post itself does mention this: "Above all, a GA should amplify the principal, and not simply substitute for them for someone else’s purposes or benefit.", but I think a simple imitation objective cuts against amplification. Work on assistance games from Stuart Russell's lab seems relevant here.
Project / Community GAs:
The GA framing feels centered around this idea of "one model per person", but if you're doing dynamic fine-tuning, why not go even more fine-grained? Why not have a fork of your GA tuned specifically for when you're at work (or multiple for different projects) and one for your personal life? And equally, you can go broader - you can have a model aligned with your friends or community, or organisation - or a particular mix of these, which you can then fork for your individual purposes (or weight the data mix by similarity to you), and get some elegant recursive properties.

[-]ozziegooen6d53

Excited to see this.

I'm broadly on the same page. Seems like much of this is likely to happen.

One challenge is the naming/terminology. I think that the phrase 'agents' clearly is too generic for all of the use cases agents will have. Personally, I'd be fine with "Guardian Angels" for this, if that gets popular.

I previously wrote/investigated "LLM-Secured Systems" that deal with some similar topics. But of course, it was a long-shot to expect a name like that to catch on.

[-]kromem7d56

I'd advise against this. The most severe breakdowns of models I've seen over-bias towards the sims of humans (have some theories why this is, but off topic).

I think the idea of having individualized AI and human pairs as aligned is a great idea, but would strongly recommend that existing infrastructural methods be used to create shared/symbiotic incentives vs simply trying to create digital twins of the humans themselves.

[-]Knight Lee6d4-1

Hmm, so the purpose of these GAs is to give individuals a vote on what LLMs do ("personality, values, and preferences"), and have LLMs serve individuals rather than power-users and businesses, right?

In that case, maybe it doesn't need to be a 1:1 ratio between GAs and people.

It might be more practical at first to just have a single team of GAs tasked with conducting surveys on random people. It might be like a lottocracy, where the GAs ask random people what altruistic things the AI should work on, giving people feedback on what the AI thinks it is capable of doing.

[-]zw56d40

I have extensively experimented with concepts similar to this myself, from stuff like using TinyStyler to make LLM outputs more legible to me by making them more similar to my own writing, to trying to finetune LLMs to match my own behavior. The results are always extremely biased. There is simply no way to separate the goal of an LLM "matching your own desires and goals" from it just being extremely sycophantic and misaligned with you.

One hypothetical: Imagine your agent sees a project on your computer and deletes it because it predicted you weren't going to finish it anyway and you needed the storage space. If the model's goal is to maximize agreement with my expressed preferences, surely this is a bad action, because I wanted the project on my computer. Or imagine a situation where it blocks your internet access past 8PM because it realizes you probably would've done that yourself anyway.

And sure, you can say, ok maybe let the Guardian Angel figure out what actions are acceptable for it to make and what not and maybe it'll make these decisions with the people who need it and the people who want it. The main thing that struck me is that this approach just multiplies the risk factor of misalignment. A personalized model is basically a multiplicative factor for alignment problems. Either you get a model that maximizes your personal happiness (with a huge cost in other areas due to Pareto) or a model that maximizes your productivity and agency with the same tradeoffs. And even if the model perfectly aligns with your own goals, it disempowers you by opening the door to interpassivity, which is a concept outlined by Slavoj Žižek.

As a disclaimer, I don't think overall that the concept of more personalized agents and models is bad in and of itself, but it's not a robust solution for many reasons. I think eventually models will gain these capabilities anyways, since I believe LLMs can recover way more information from written text than humans already, and it's not outlandish to think models could gain these capabilities osmotically like they've been doing for a few years now.

So I think my conclusion is that creating these types of siamesian adjuncts to language models creates a whole problem where the assistant needs to commit to a specific definition of personal identity, autonomy, and how the preferences of people evolve over time, make decisions for the user, and overall, probably accelerate gradual disempowerment as a side effect.

[-]Jessica Rumbelow2d20

This feels pretty similar to something I wrote in 2022: https://www.lesswrong.com/posts/iHLJtbdFwsoNWZg3e/guardian-ai-misaligned-systems-are-all-around-us. I was thinking then about wrappers that re-optimise the feeds you already use rather than a full personalised agent – but you might find it interesting.

[-]Alephwyr2d*20

I like this aesthetically. You can peel off directly into an AI Jungian Anima/Animus, or occult Shadow. Or more playfully this is what a digimon is in half the seasons of the anime, give or take some physicality and multimodality. As for what it would actually be useful for I'm amazed you didn't tie it into Coherent Extrapolated Volition. That seems like the most plausible fit: An aesthetically nice, technologically plausible way to iterate on CEV by simply having different nested instantiations of a principal agent whose constraints are a self solving mix of goal based and identitarian.

[-]avturchin3d20

This is what I and a group of other people are doing in a project called sideloading. We have a group on Telegram. We developed a technique to create surprisingly good mind models, both static and with memory. We also think it will be helpful in AI safety.

[-]transhumanist_atom_understander5d20

I've been thinking something similar, but calling it an "exoself", like from the Greg Egan novels.

[-]BryceStansfield11h10

A lynchpin in this, I reckon, is a better way of measuring the confidence of an LLM's outputs.

If I'm going to trust the output of a guardian model, I need some way to review a portion of its outputs. If the guardian has no sense of how confident it can be that I would want it to undertake a certain action then I'll have to review everything even moderately important, saving me at most O(1) time.

If the LLM can have some measure of confidence that I'll agree with action X, but not action Y, I only have to review Y.
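The review rule described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the `confidence` and `stakes` numbers are hypothetical inputs that a guardian model would somehow have to supply, and getting a calibrated `confidence` is precisely the unsolved part. `ProposedAction`, `expected_regret`, `triage`, and the `regret_budget` value are names and settings invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    description: str
    confidence: float  # assumed estimate in [0, 1] that the principal would approve
    stakes: float      # assumed cost of a mistake, scaled to [0, 1]


def expected_regret(action):
    """Chance the principal would object, weighted by how costly the error is."""
    return (1.0 - action.confidence) * action.stakes


def triage(actions, regret_budget=0.05):
    """Split actions into those run automatically and those sent for human review."""
    auto, review = [], []
    for action in actions:
        if expected_regret(action) > regret_budget:
            review.append(action)
        else:
            auto.append(action)
    return auto, review


proposals = [
    ProposedAction("archive a newsletter", confidence=0.99, stakes=0.1),
    ProposedAction("send a routine reply", confidence=0.95, stakes=0.3),
    ProposedAction("delete an unfinished project", confidence=0.70, stakes=0.9),
]
auto, review = triage(proposals)
print("auto:", [a.description for a in auto])
print("review:", [a.description for a in review])
```

Weighting by stakes means a fairly confident but high-cost action still goes to review, so the principal's review load scales with the number of risky actions and not with the total number of actions.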

[-]Karl von Wendt2d10

This is almost exactly the plot of my German language novel "Mirror", published 2016 under my pen name "Karl Olsberg". As you can guess, it doesn't go well.

[-]ErickBall2d1-2

Businesses won't let their employees use this for anything work related, so the audience is basically "startup founders and rich retired tech people".

As far as I know we don't have the tech for the kind of online learning you're talking about to be competitive with frontier models. If we did, it would make sense first at the corporate level, where lots of people benefit from the continued training post deployment.

[-]less_raichu6d-2-4

How is this different from what ChatGPT and OpenClaw are already doing? Claude is the one that pivoted more toward business purposes than whole-user purposes.
