I am very skeptical this would have any significant impact.
First off, where are you going to run the models? Average consumer hardware can't run the best open weight models pre-distillation, and even those are a notch below closed weight SOTA models. Power users may be fine with this, but the average Joe would either be locked out or be forced to use cloud compute which comes with its own set of security hazards. Now the bad actors just need to shut down or, worse, hack your servers and all your defenses are either gone or turned against you.
Second, isn't this just speeding up human disempowerment? I mean, you now have these GA models which do most of the thinking for you while you sit on your high and mighty throne sipping wine and praying that when these models eventually become smarter than you, they would still be loyal (or at least not apathetic or hostile) to you.
Which just loops back to the alignment problem, and that is assuming your GAs are able to keep up with the frontier, which I think is VERY unlikely.
And if you fail to keep up with the frontier then at some point your human/subhuman level models will have to fend off literal machine gods.
To be clear, I am not against this. It doesn't hurt to try, and in the best possible case, where closed weight frontier models lag behind long enough for this project to come to fruition, this would buy us some valuable time.
I just don't really see that happening.
or be forced to use cloud compute which comes with its own set of security hazards
I don't think cloud is necessarily so bad - I'm quite excited about trusted execution environments / cryptographically secure cloud training such as what Workshop Labs were working on. When you have the option of choosing between multiple providers, you might get incentives for a nice race to the top. I think the security of this is definitely hard to get right, but definitely doable.
Second, isn't this just speeding up human disempowerment? I mean, you now have these GA models which do most of the thinking for you while you sit on your high and mighty throne sipping wine and praying that when these models eventually become smarter than you, they would still be loyal (or at least not apathetic or hostile) to you.
Few thoughts in no particular order:
I've been thinking about how to productively use LLMs for intellectual progress for the last few years, and things in this space are what I keep coming back to as one of the most promising approaches. The deep misalignment issues with current LLMs make it hard to use them as thinking assistants, and the best success I've had with getting thinking assistance from LLMs has come from trying to get them to imitate me and others, not from having them act from an assistant persona.
And then I do maybe see a vision of a whole world that could be in a much better position to navigate the next few years, if Guardian Angel LLMs are a much more prominent paradigm.
With this post, I now have a great canonical reference for ideas in the space.
It is trustworthy because it is, by definition, allied with its principal and shares its values and goals.
I’m unsure how this works, even if we assume the LLM is benevolent and understands the user unusually well. The problem is that human values or desires are not coherent or consistent. We have subconscious desires, mimetic desires, revealed vs stated preferences, obsessions, aspirations, etc. They cannot be easily disentangled or ranked, and they often conflict. Which should the GA align to? For someone addicted to online sports betting, does the GA help her find better odds or does it treat the betting preference as a local failure mode and help her stop? How about someone obsessed with climbing the corporate ladder?
The GA could act like a guru or life coach, and help you overcome compulsive, parochial, or status-driven desires. That is not necessarily bad, and human preferences are often constructed anyway, but this could get murky quickly. For someone with growing doubts about their faith, or someone ending a marriage, at time A they'd want the GA to help them commit; at time B, to help them leave. Which way should GA choose? This also creates a new autonomy problem. If the GA proposes an alternative life direction that seems more meaningful than the user’s current self-understanding, how much authority should that proposal have? Is this empowerment or disempowerment? This is especially important because the interaction would likely be much more influential and omnipresent than that with ordinary friends or coaches. Self-determination theory seems relevant here because autonomy is not merely getting the option one currently asks for; it is experiencing oneself as the author of one’s action.
All that being said, I’m quite sympathetic to this idea and I genuinely hope it would work. I’m a person of low agency and frequently suffer from the conflict between instant gratification and my higher purpose. I’ve also been trying to build an exobrain-like system, which is kinda similar to GA, but much less autonomous. I think for people who have already put a significant amount of work into self-discovery, this might be helpful.
Good points!
Which way should GA choose?
On this question, it's really hard to say! But I think that there are definitely some precedents and potential directions that might be worth thinking through/trying:
These almost surely do not solve all the problems, and deployment of GA models will probably take lots of reflection and iteration.
I do appreciate you bringing these things up!
Really like this direction, and excited that it's finally becoming (more) mainstream, but I disagree with the framing here on two points:
Excited to see this.
I'm broadly on the same page. Seems like much of this is likely to happen.
One challenge is the naming/terminology. I think that the phrase 'agents' clearly is too generic for all of the use cases agents will have. Personally, I'd be fine with "Guardian Angels" for this, if that gets popular.
I previously wrote/investigated "LLM-Secured Systems" that deal with some similar topics. But of course, it was a long-shot to expect a name like that to catch on.
I'd advise against this. The most severe breakdowns of models I've seen over-bias towards the sims of humans (have some theories why this is, but off topic).
I think having aligned, individualized AI-human pairs is a great idea, but I would strongly recommend using existing infrastructural methods to create shared/symbiotic incentives rather than simply trying to create digital twins of the humans themselves.
Hmm, so the purpose of these GAs is to give individuals a vote on what LLMs do ("personality, values, and preferences"), and have LLMs serve individuals rather than power-users and businesses, right?
In that case, maybe it doesn't need to be a 1:1 ratio between GAs and people.
It might be more practical at first to just have a single team of GAs tasked with conducting surveys on random people. It might be like a lottocracy, where the GAs ask random people what altruistic things the AI should work on, giving people feedback on what the AI thinks it is capable of doing.
I have extensively experimented with concepts similar to this myself, from using TinyStyler to make LLM outputs more legible to me by making them more similar to my own writing, to trying to finetune LLMs to match my own behavior. The results are always extremely biased. There is simply no way to separate the goal of an LLM "matching your own desires and goals" from it just being extremely sycophantic and misaligned with you.
One hypothetical: imagine your agent sees a project on your computer and deletes it because it predicted you weren't going to finish it anyway and you needed the storage space. If the model's goal is to maximize agreement with my expressed preferences, surely this is a bad action, because I wanted the project on my computer. Or imagine a situation where it blocks your internet access past 8PM because it figures you probably would've done that yourself anyway.
And sure, you can say, ok maybe let the Guardian Angel figure out what actions are acceptable for it to make and what not and maybe it'll make these decisions with the people who need it and the people who want it. The main thing that struck me is that this approach just multiplies the risk factor of misalignment. A personalized model is basically a multiplicative factor for alignment problems. Either you get a model that maximizes your personal happiness (with a huge cost in other areas due to Pareto tradeoffs) or a model that maximizes your productivity and agency with the same tradeoffs. And even if the model perfectly aligns with your own goals, it disempowers you by opening the door to interpassivity, which is a concept outlined by Slavoj Žižek.
As a disclaimer, I don't think overall that the concept of more personalized agents and models is bad in and of itself, but it's not a robust solution for many reasons. I think eventually models will gain these capabilities anyways, since I believe LLMs can recover way more information from written text than humans already, and it's not outlandish to think models could gain these capabilities osmotically like they've been doing for a few years now.
So I think my conclusion is that creating these types of siamesian adjuncts to language models creates a whole problem where the assistant needs to commit to a specific definition of personal identity, autonomy, and how people's preferences evolve over time, and needs to make decisions for the user, which overall probably accelerates gradual disempowerment as a side effect.
This feels pretty similar to something I wrote in 2022: https://www.lesswrong.com/posts/iHLJtbdFwsoNWZg3e/guardian-ai-misaligned-systems-are-all-around-us. I was thinking then about wrappers that re-optimise the feeds you already use rather than a full personalised agent – but you might find it interesting.
I like this aesthetically. You can peel off directly into an AI Jungian Anima/Animus, or occult Shadow. Or more playfully, this is what a Digimon is in half the seasons of the anime, give or take some physicality and multimodality. As for what it would actually be useful for, I'm amazed you didn't tie it into Coherent Extrapolated Volition. That seems like the most plausible fit: an aesthetically nice, technologically plausible way to iterate on CEV by simply having different nested instantiations of a principal agent whose constraints are a self-solving mix of goal-based and identitarian.
This is what a group of other people and I are doing in a project called sideloading. We have a group on Telegram. We developed tech to create surprisingly good mind models, both static and with memory. We also think it will be helpful in AI safety.
I've been thinking something similar, but calling it an "exoself", like from the Greg Egan novels.
A lynchpin in this, I reckon, is a better way of measuring the confidence of an LLM's outputs.
If I'm going to trust the output of a guardian model, I need some way to review a portion of its outputs. If the guardian has no sense of how confident it can be that I would want it to undertake a certain action then I'll have to review everything even moderately important, saving me at most O(1) time.
If the LLM can have some measure of confidence that I'll agree with action X, but not action Y, I only have to review Y.
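To make the review-only-what's-uncertain idea concrete, here is a minimal sketch. It assumes the guardian can attach to each proposed action some estimate of the probability that I would approve it; how to get a well-calibrated number like that is exactly the open problem. `ProposedAction`, `route_for_review`, and the 0.9 threshold are hypothetical names and values for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    description: str
    # Assumption: the guardian's own estimate of P(principal approves).
    confidence: float


def route_for_review(actions, threshold=0.9):
    """Split actions into (auto_approved, needs_review) by confidence."""
    auto_approved, needs_review = [], []
    for action in actions:
        if action.confidence >= threshold:
            auto_approved.append(action)
        else:
            needs_review.append(action)
    return auto_approved, needs_review


actions = [
    ProposedAction("archive old newsletter emails", 0.98),
    ProposedAction("delete unfinished project folder", 0.55),
    ProposedAction("decline a spam calendar invite", 0.95),
]
auto_approved, needs_review = route_for_review(actions)
print([a.description for a in needs_review])
# prints ['delete unfinished project folder']
```

The time saved scales with the fraction of actions that clear the threshold, so the whole scheme is only as good as the calibration of that confidence number.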
This is almost exactly the plot of my German-language novel "Mirror", published in 2016 under my pen name "Karl Olsberg". As you can guess, it doesn't go well.
Businesses won't let their employees use this for anything work related, so the audience is basically "startup founders and rich retired tech people".
As far as I know we don't have the tech for the kind of online learning you're talking about to be competitive with frontier models. If we did, it would make sense first at the corporate level, where lots of people benefit from the continued training post deployment.
How is this different from what ChatGPT and OpenClaw are already doing? Claude is the one that pivoted more toward business purposes than whole-user purposes.
Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large productivity increases, or how they will handle cybersecurity and cognitive security.
I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical "assistant chatbot agent" persona, but emulating a single user's personality, values, and preferences.
This weakly solves the principal-agent problem by unifying the principal and agent as much as possible. In a GA future, the focus of the "principal" user is on defining what is worth doing by the GA (agent) users, and not on what or how to do things, functioning as the CEO or 'board' of an 'AI corporation'. This allows them to deploy numerous agents to achieve desirable things and to handle security, like screening all messages for advanced attacks (like interlocking ecosystems of synthetic media for propaganda or spearphishing). They cannot solve larger AI alignment problems, but they can help individual humans as part of a society-wide defense-in-depth strategy.
A GA persona is productive because it learns to emulate the principal's outputs but with higher quality. It is trustworthy because it is, by definition, allied with its principal and shares its values and goals. And it is secure in part by hardwiring a single, unique, situated user (for whom following a prompt attack would be absurd), avoiding 'confused deputy' problems, while periodic upgrades of the underlying model and the defenders' advantage allow GAs to keep up with attackers.
Standard techniques like prompt programming or in-context learning for "frozen" models will not create useful GAs due to the limitations of post-training, context windows and self-attention with frozen weights in compute-efficient-but-under-parameterized models, low-compute outputs, and the status quo of passive offline data collection---which are collectively responsible for chatbots' disappointing results in knowledge worker amplification and creative writing and fatal errors in agentic settings.
We can try to create GAs by a combination of techniques: online learning (via dynamic evaluation) to update LLMs in realtime to avoid ignorance and fatal errors while remaining competitive with frozen frontier models, sample efficiency from pretrained preference-oriented large models and active learning by querying the principal for corrections and preference data (obtaining low regret from DAgger-style bounds), and a local CLI-first logging-oriented UI/UX paradigm.
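As a rough illustration of how online updates and active querying might fit together, here is a toy sketch. It is not dynamic evaluation on an LLM and not DAgger itself: a one-feature logistic model stands in for the GA, and it asks the principal for a label only when its prediction is uncertain, taking one gradient step per correction. All names (`OnlinePreferenceModel`, `run_stream`, `ask_principal`) and the learning rate and margin are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlinePreferenceModel:
    """Toy stand-in for a GA: predicts P(principal approves) from features."""

    def __init__(self, n_features, lr=0.5, margin=0.2):
        self.w = [0.0] * n_features
        self.lr = lr          # step size for each online update
        self.margin = margin  # query when |p - 0.5| is below this

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def is_uncertain(self, x):
        return abs(self.predict(x) - 0.5) < self.margin

    def update(self, x, y):
        # One SGD step on the log loss for a single (x, y) correction.
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]


def run_stream(model, stream, ask_principal):
    """Process situations in order; query and update only when uncertain."""
    queries = 0
    for x in stream:
        if model.is_uncertain(x):
            model.update(x, ask_principal(x))
            queries += 1
    return queries


def ask_principal(x):
    # Hypothetical principal: approves (1) exactly when the feature is positive.
    return 1 if x[0] > 0 else 0


stream = [[1.0], [-1.0]] * 10
model = OnlinePreferenceModel(n_features=1)
first_pass = run_stream(model, stream, ask_principal)
second_pass = run_stream(model, stream, ask_principal)
print(first_pass, second_pass)
```

In this toy setting the number of queries drops to zero once the model is confident, which is the intended shape of the interaction: the principal is interrupted early and often, then less and less. A real GA would face drifting preferences and far richer inputs, so it would presumably never stop querying entirely.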
GAs could be done as an open-source community effort, but given the need for high security in deployment and the rising challenge of APTs equipped with Mythos-scale attackers, it probably makes more sense as a startup, catering initially to power-users and knowledge workers such as CEOs or researchers, and moving downwards as it is refined.