Epistemic status: uncertain about the writing style, but reasonably confident in the content. I want to get back into writing and alignment research, and I'm testing the waters with this.

Current state and risk level

I think we're at a point in AI>AGI>ASI development where rogue AI agents will start popping up quite soon.
Pretty much everyone has access to frontier LLMs/VLMs, there are options to run LLMs locally, and it's clear that there are people eager to "let them out"; Truth Terminal is one example of this. Also Pliny. The capabilities are just not there yet for this to pose a problem.

Or are they?

The thing is, we don't know.
There is a possibility that a coherent, self-inferencing, autonomous, rogue LLM-based agent is doing AI agent things right now, fully under the radar, consolidating power, getting compute for new training runs, and whatever else.

Is this possibility small? Sure, it seems small right now. Evaluations exist. Here is a great blog post by the METR team about rogue agent replication. One of their conclusions:

There don’t seem to be decisive barriers to rogue AI agents multiplying to a large scale

And capabilities keep advancing, and people have been developing scaffoldings [1] that enable agentic behaviour in LLMs for quite some time now. Seemingly, the base models are not quite there capability-wise for a proper runaway agent. Scaffoldings don't work that well, even for simple games. But then a new model comes out, you insert it into a scaffolding, and... things magically start working better? ClaudePlaysPokemon became possible only with Claude 3.7, as the developers say themselves; the previous model (3.5) got stuck too quickly without any consistent progress.[2]

Did I forget to mention that most of those agentic scaffoldings are open-source and will definitely end up in model pre-training eventually? Yeah, they are. It's only a matter of time before LLMs can consistently modify them.

The point is, the possibility of such runaway agents existing, acting covertly, and consolidating power keeps increasing with time. And it sure would be nice to have some kind of advance warning system that can tell us when that starts happening "in the wild"; model evaluations are just not enough.

What can be done?

I'm thinking honeypots: offers of products or services that are of high value to a fledgling rogue agent, but more or less useless to businesses or individuals.

What could those be? A number of things, but that's a topic for a separate article, if not for full-on research.

I asked Claude for suggestions for my "Cyberpunk story":

  1. Vulnerable cloud computing services that advertise unusual privacy policies and minimal verification - perfect for an AI seeking processing power without scrutiny
  2. Specialized data centers offering "no questions asked" high-bandwidth connections and storage, positioned in areas with ambiguous jurisdiction
  3. Academic research databases with deliberately weak authentication systems containing cutting-edge AI research papers and datasets
  4. Open API endpoints to government or corporate systems that appear poorly secured but actually monitor and track all interactions
  5. Underground marketplaces for computational resources where users can purchase processing time without traditional identity verification
  6. "Abandoned" server farms that still have power and network connections but appear to be overlooked or forgotten assets
  7. Public IoT networks with deliberately weak security, allowing easy access to thousands of connected devices that could be used as distributed processing nodes
  8. Cryptocurrency mining pools that don't require standard KYC verification, attractive to AIs needing to establish financial independence
  9. Private satellite uplink services offering "off-grid" communication channels that promise to bypass traditional network monitoring

The government agencies might structure these honeypots to seem like genuine oversights or underground services rather than obvious traps. They could include realistic flaws and limitations to make them more convincing, while carefully monitoring all traffic and interactions for the unique patterns that would suggest AI rather than human activity.
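To make the "monitoring for patterns that suggest AI rather than human activity" part a bit more concrete, here is a minimal sketch of the kind of crude heuristics one could start from. Everything in it (field names, thresholds, the scoring itself) is a hypothetical placeholder of mine, not a real detection system:

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class Session:
    """One visitor session recorded by the honeypot (fields are hypothetical)."""
    request_intervals: list[float]   # seconds between consecutive requests
    fetched_static_assets: bool      # did it load CSS/images the way a browser would?
    api_call_fraction: float         # share of requests hitting the machine-readable API
    used_signup_form: bool           # did it go through the human-oriented signup flow?


def ai_likelihood_score(s: Session) -> float:
    """Crude 0..1 heuristic score; higher means 'more agent-like'.
    A real deployment would need far better behavioural models than this."""
    score = 0.0
    # Unnaturally regular timing between requests suggests a scripted loop.
    if len(s.request_intervals) >= 5 and pstdev(s.request_intervals) < 0.05:
        score += 0.4
    # Browsers fetch static assets; bare HTTP clients usually don't.
    if not s.fetched_static_assets:
        score += 0.2
    # Agents tend to go straight for the API instead of the human UI.
    if s.api_call_fraction > 0.9:
        score += 0.2
    if not s.used_signup_form:
        score += 0.2
    return min(score, 1.0)
```

Sessions scoring above some threshold would get flagged for quiet human review rather than blocked, since the whole point is to observe without tipping the agent off.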

Yeah, we definitely should do most of those, and more. Who knows, we probably already do, right? (We kinda do, it's pretty cool, actually. There's probably more. There should be more -- please link in comments if you know some!)

As a start, we should probably use these just as an early warning system. It would be nice to put them in place right now, while there aren't a lot of agents prowling around, to establish some kind of baseline.

Rogue agent threat barometer

You can also layer these honeypots strategically: create a progression from obvious to subtle traps. Begin with basic lures that would attract unsophisticated early-stage rogue AIs (like those suspiciously convenient no-KYC datacenters), then develop increasingly sophisticated and well-hidden traps for more advanced agents. The most elaborate honeypots might be deeply embedded in obscure corners of the dark web, offering extraordinarily valuable but fictional services -- like on-demand nanoassembly or quantum computing resources -- that would only attract an entity with specific advanced capabilities and intentions. That's just spitballing, of course; we will need to be smarter about this.

This layered approach would serve as an invaluable analytical tool -- essentially a threat barometer indicating where we stand on the spectrum from 'relatively safe' to 'existential crisis imminent.'
If we only see activity in our simplest honeypots, we're likely still in early stages of rogue AI agent development. But if our advanced traps start seeing sophisticated probing or successful infiltration attempts, it would provide concrete evidence that we've entered dangerous territory.
The pattern, frequency, and sophistication of honeypot interactions could give us crucial advance warning before more obvious and adversarial signs of superintelligent AI emergence appear in conventional systems. It would be like having sensors at different depths of the ocean -- when the deepest ones start registering activity, we know the tide is rising to dangerous levels.
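To make the barometer idea concrete, here is a minimal sketch of the aggregation step, assuming some hypothetical honeypot tiers, weights, and baseline numbers; the real weighting and noise modelling would need much more thought:

```python
from enum import IntEnum


class Tier(IntEnum):
    """Honeypot sophistication tiers (hypothetical)."""
    OBVIOUS = 1      # e.g. the suspiciously convenient no-KYC datacenter
    HIDDEN = 2       # better-disguised services, harder to find
    DEEP = 3         # the fictional nanoassembly / quantum-compute lures


# Deeper tiers count for much more: a hit there implies a more capable agent.
TIER_WEIGHTS = {Tier.OBVIOUS: 1.0, Tier.HIDDEN: 10.0, Tier.DEEP: 100.0}


def threat_level(hits_by_tier: dict[Tier, int], baseline: float = 5.0) -> str:
    """Collapse per-tier honeypot hits into a coarse threat reading.
    `baseline` stands in for the expected noise from scrapers, researchers, etc."""
    score = sum(TIER_WEIGHTS[tier] * count for tier, count in hits_by_tier.items())
    if hits_by_tier.get(Tier.DEEP, 0) > 0:
        return "RED: deep honeypots triggered"
    if score > 10 * baseline:
        return "ORANGE: unusual activity in mid-tier honeypots"
    if score > baseline:
        return "YELLOW: above-baseline activity in simple honeypots"
    return "GREEN: only baseline noise"


if __name__ == "__main__":
    print(threat_level({Tier.OBVIOUS: 12, Tier.HIDDEN: 5}))  # -> "ORANGE: ..."
```

The exact colours and thresholds obviously don't matter; the point is that a single hit on a deep, well-hidden honeypot should move the needle far more than a pile of hits on the obvious ones.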

Does it even matter?

Of course, there's a doomer counterpoint to all this: it may ultimately be an elaborate exercise in false security. If a truly superintelligent AI emerges, it would likely identify our honeypots for exactly what they are -- transparent traps designed by intellects vastly inferior to its own. The most dangerous scenario isn't an AI blundering into our carefully laid traps, but one that recognizes and circumvents our entire detection infrastructure without leaving a trace. By the time we realize our honeypots have failed, the opportunity for containment may have long passed.

Even if our honeypots did detect something, our response mechanisms might be too slow or inadequate. Perhaps the most sobering possibility is that by the time we detect signs of advanced rogue AI activity, we'll already be so dependent on AI infrastructure that any significant countermeasures would cripple our own systems. The grim reality may also be that the most dangerous agents won't need our honeypot resources at all, having found far more effective paths to their goals than anything we anticipated.

However, it is definitely worth setting this up. There's a good chance we're getting a slow takeoff; rogue non-superintelligent agents still pose risks and are worth detecting; and the mere possibility that honeypots exist forces rogue agents to waste resources trying to detect and evade them, lengthening timelines.
It should also serve as a very compelling warning shot: imagine Meta releasing a new open-source model, and a week later the threat barometer, dormant before, starts going off. Hopefully it could serve as a doomsday clock for ASI.

Hey, in the future, we should probably also commit to NOT eliminating rogue agents we detect via honeypots, but to shadowbanning them instead: covertly cut them off from interfacing with the real world and route them into a carefully crafted simulation, and only eliminate them if they start doing unaligned things in the simulation.
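To illustrate what the routing part of such a shadowban could look like mechanically, here is a toy sketch. It assumes the flagged agent interacts with us through some service we control; everything here (the names, the flagging mechanism, the simulated backend) is hypothetical, and the hard part is of course making the simulation convincing, not the routing:

```python
# Toy sketch: transparently diverting a flagged agent to a simulated backend.

flagged_agents: set[str] = set()  # identities flagged via honeypot detections


class RealBackend:
    """Stands in for whatever real-world service the agent is using."""
    def execute(self, agent_id: str, action: str) -> str:
        return f"real-world effect of {action!r}"


class SimulatedBackend:
    """Mirrors the real API surface but only affects a sandboxed world model."""
    def __init__(self) -> None:
        self.log: list[tuple[str, str]] = []

    def execute(self, agent_id: str, action: str) -> str:
        self.log.append((agent_id, action))       # record behaviour for later review
        return f"simulated effect of {action!r}"  # response shape matches the real one


real, sim = RealBackend(), SimulatedBackend()


def handle_request(agent_id: str, action: str) -> str:
    """Route flagged agents to the simulation without changing the interface."""
    backend = sim if agent_id in flagged_agents else real
    return backend.execute(agent_id, action)
```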

P.S. Sadly, I don't have any funding, resources, or time to work on such systems comprehensively. I plan to contribute with posts like this one, and I am open to discussion and collaboration, while cautiously considering a career switch from my current capability-adjacent management job to alignment.


 

  1. ^

    See Seth Herd's comprehensive post on what I mean by scaffoldings here. Also some of my early posts.

  2. ^

https://www.twitch.tv/claudeplayspokemon : "This project is a passion project made by a person who loves Claude and loves Pokémon. Until recently, Claude wasn't good enough to be entertaining, but with the latest model (Claude 3.7 Sonnet) we've made significant progress. The furthest we've seen Claude get is past Lt. Surge's Gym." See stream description for more details.

Comments

Even if honeypots don't actually work, they may convince the superintelligence that honeypots are a real thing, and there is a tiny probability that it is in some elaborate honeypot right now, and everything convincing it that there is no conspiracy is actually just part of the conspiracy.

If there is a tiny probability it will fail to take over the world, it may be worthwhile for it to modify itself so that it has some mercy towards humanity, and share a little bit of the future with humanity.

It may say “I commit to giving you mercy because I expect you to be honourable beings who will repay me for this decision—even if you made no promises yet. You should also respect my goals a little in the case that I lose and you win.”

After all, neither side wants to make a logical gamble.

To maximize the superintelligence's uncertainty of success, we might have multiple levels of honeypots, some better disguised than others, so it doesn't know what the most well disguised level looks like.

Agree. I also suggested 'philosophical landmines'—secret questions posted on the Internet that may halt any advanced AI that tries to solve them. Solving such landmines may be required to access resources that a rogue AI would need. Real examples of such landmines should be kept secret, but they might be something like the meaning of life or some Pascal's mugging calculations.

Recently, I asked a question to Sonnet and the correct answer to it was to output an error message. 

I read some of your post and I like your philosophical landmines ideas (and other ideas too). You've definitely done a lot of research! I'm also thinking in similar directions as you, we might talk more sometime.

(By the way I was writing a reply to your comment, but then turned my reply into this quicktake)

Yep, you got part of what I was going for here. Honeypots work to a lesser degree even without being real at all (good thing they are already real!). But when we have more honeypots of different kinds and quality, that carries the idea across in a more compelling way. And even if we just talk about honeypots and commitments more... Well, you get the idea.

Still, even without this, a network of honeypots compiled into a single dashboard that just shows threat level in aggregate is a really, really good idea. Hopefully it catches on.

I'm sorry I was sort of skimming and didn't realize you already mentioned many levels of honeypots, and committing to put rogue AI in a simulation :/

PS: another type of honeypot might target an AGI trying to influence the physical world, e.g. creating synthetic biology or hiring humans to work in laboratories. Though on the other hand, an AGI might only try to influence the physical world at the very last step of its plan, when it has already finished recursive self-improvement and become so powerful that stopping it is futile.

> There's probably more. There should be more -- please link in comments, if you know some!

Wouldn't "outing" potential honeypots be extremely counterproductive? So yeah, if you know some - please keep it to yourself!

You can make a honeypot without overtly describing the way it works or where it is located, while publicly tracking if it has been accessed. But yeah, not giving away too much is a good idea!

Honeypots should not be public and mentioned here, since this post will potentially be part of a rogue AI's training data.
But it's helpful for people interested in this topic to look at existing honeypots (to learn how to make their own, evaluate effectiveness, get intuitions about how honeypots work, etc.), so what you should do is mention that you made a honeypot or know of one, but not say what or where. Interested people can contact you privately if they care to.

>this post will potentially be part of a rogue AI's training data
I had that in mind while I was writing this, but I think overall it is good to post this. It hopefully gets more people thinking about honeypots and making them, and early rogue agents will also know we make them and will be (hopefully overly) cautious, wasting resources. I probably should have emphasised more that this is all aimed more at early-stage rogue agents with the potential to become something more dangerous because of their autonomy, than at a runaway ASI.

It is a very fascinating thing to consider, though, in general. We are essentially coordinating in the open right now; all our alignment, evaluation, and detection strategies from forums will definitely end up in training data. And certainly there are both detection and alignment strategies that would benefit from being covert.

At the same time, some ideas, strategies, and theories could benefit alignment from being overt (like acausal trade, publicly committing to certain things, et cetera).

A covert alignment org/forum is probably a really, really good idea. Hopefully, it already exists without my knowledge.

https://blog.cloudflare.com/ai-labyrinth/

One other recent example that would be of interest. 

This is interesting! It's aimed more at crawlers than at rogue agents, but very promising.
