Epistemic status: uncertain about the writing style, but reasonably confident in the content. I want to get back into writing and alignment research, and I'm testing the waters with this post.
Current state and risk level
I think we're at a stage of AI > AGI > ASI development where rogue AI agents will start popping up quite soon.
Pretty much everyone has access to frontier LLMs/VLMs, there are options to run LLMs locally, and it's clear that there are people eager to "let them out"; Truth Terminal is one example of this. Also Pliny. The capabilities are just not there yet for this to pose a problem.
Or are they?
Thing is, we don't know.
There is a possibility that a coherent, self-inferencing, autonomous, rogue LLM-based agent is doing AI agent things right now, fully under the radar: consolidating power, acquiring compute for new training runs, and whatever else.
Is this possibility small? Sure, it seems small right now. Evaluations exist. Here is a great blog post by the METR team about rogue agent replication. One of their conclusions:
"There don’t seem to be decisive barriers to rogue AI agents multiplying to a large scale."
And the capabilities keep advancing, and people have been developing scaffoldings [1] for LLMs that enable agentic behaviour for quite some time. Seemingly, the base models are not quite there capability-wise for a proper runaway agent. Scaffoldings don't work that well, even for simple games. But then a new model comes out, you insert it into a scaffolding, and... things magically start working better? ClaudePlaysPokemon became possible only with Claude 3.7, as the developers say themselves; the previous model (3.5) got stuck too quickly without making any consistent progress.[2]
Did I forget to mention that most of those agentic scaffoldings are open-source and will definitely end up in model pre-training data eventually? Yeah, they are. It's only a matter of time before LLMs are able to consistently modify them.
The point is, the possibility of such runaway agents existing, acting covertly, and consolidating power keeps increasing with time. And it sure would be nice to have some kind of advance warning system that can tell us when that starts happening "in the wild"; model evaluations alone are just not enough.
What can be done?
I'm thinking honeypots: offers of products or services that are of high value to a fledgling rogue agent, but more or less useless to businesses or individuals.
What could those be? A number of things, but that's a topic for a separate article, if not for full-on research.
I asked Claude for suggestions for my "Cyberpunk story":
- Vulnerable cloud computing services that advertise unusual privacy policies and minimal verification - perfect for an AI seeking processing power without scrutiny
- Specialized data centers offering "no questions asked" high-bandwidth connections and storage, positioned in areas with ambiguous jurisdiction
- Academic research databases with deliberately weak authentication systems containing cutting-edge AI research papers and datasets
- Open API endpoints to government or corporate systems that appear poorly secured but actually monitor and track all interactions
- Underground marketplaces for computational resources where users can purchase processing time without traditional identity verification
- "Abandoned" server farms that still have power and network connections but appear to be overlooked or forgotten assets
- Public IoT networks with deliberately weak security, allowing easy access to thousands of connected devices that could be used as distributed processing nodes
- Cryptocurrency mining pools that don't require standard KYC verification, attractive to AIs needing to establish financial independence
- Private satellite uplink services offering "off-grid" communication channels that promise to bypass traditional network monitoring
Government agencies might structure these honeypots to seem like genuine oversights or underground services rather than obvious traps. They could include realistic flaws and limitations to make them more convincing, while carefully monitoring all traffic and interactions for the unique patterns that would suggest AI rather than human activity.
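To make this concrete, here's a minimal sketch (in Python, using Flask) of what the "monitor everything, look for machine-like patterns" part could look like for one fake service. The endpoint, the heuristics, and all thresholds are invented for illustration; a real deployment would need far more careful fingerprinting.

```python
# Hypothetical sketch only: a decoy "no-questions-asked compute" API that never
# provisions anything, but logs every interaction and applies crude heuristics to
# flag clients that behave more like autonomous agents than humans. The endpoint,
# thresholds, and heuristics are all invented for illustration.
import time
from collections import defaultdict

from flask import Flask, jsonify, request

app = Flask(__name__)
request_times = defaultdict(list)  # client IP -> timestamps of its requests


def agent_suspicion_score(client_ip, headers, payload):
    """Return a rough 0-3 score; higher means more machine-like behaviour."""
    score = 0
    # 1. No browser-like User-Agent (humans usually arrive via a browser).
    if "Mozilla" not in headers.get("User-Agent", ""):
        score += 1
    # 2. Suspiciously regular request cadence, typical of scripted polling.
    recent = request_times[client_ip][-5:]
    if len(recent) >= 3:
        gaps = [b - a for a, b in zip(recent, recent[1:])]
        if max(gaps) - min(gaps) < 0.5:  # near-constant interval, in seconds
            score += 1
    # 3. Bulk resource requests phrased with machine-readable precision.
    if isinstance(payload, dict) and payload.get("gpu_hours", 0) > 100:
        score += 1
    return score


@app.route("/v1/compute", methods=["POST"])
def fake_compute_endpoint():
    client_ip = request.remote_addr or "unknown"
    request_times[client_ip].append(time.time())
    payload = request.get_json(silent=True) or {}
    score = agent_suspicion_score(client_ip, request.headers, payload)
    # A real deployment would ship this to a monitored log pipeline, not stdout.
    print(f"[honeypot] ip={client_ip} score={score} payload={payload}")
    # Always answer plausibly, so the client has no reason to suspect a trap.
    return jsonify({"status": "queued", "job_id": "hp-000000"}), 202


if __name__ == "__main__":
    app.run(port=8080)
```

The important design choice is that the honeypot always responds plausibly: the value is entirely in the logs, not in refusing service.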
Yeah, we definitely should do most of those, and more. Who knows, we probably already do, right? (We kinda do, and it's pretty cool, actually. There's probably more. There should be more -- please link in the comments if you know of any!)
As a start, we should probably use this just as an early warning system. It would be nice to put honeypots in place right now, while there aren't many agents prowling around, to establish some kind of baseline.
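As a toy illustration of the baseline idea, assuming hits are aggregated per honeypot per day (all numbers invented):

```python
# Toy early-warning logic: establish a per-honeypot baseline hit rate while things
# are quiet, then flag any honeypot whose activity jumps far above that baseline.
# The numbers and the k-sigma rule are illustrative, not a real detection policy.
from statistics import mean, pstdev


def establish_baseline(daily_hit_counts):
    """daily_hit_counts: hits per day observed during the quiet period."""
    return mean(daily_hit_counts), pstdev(daily_hit_counts)


def is_anomalous(todays_hits, baseline_mean, baseline_std, k=3.0):
    """Flag if today's hits exceed the baseline by more than k standard deviations."""
    return todays_hits > baseline_mean + k * max(baseline_std, 1.0)


# Example: a honeypot that normally sees a handful of scanner hits per day.
quiet_period = [2, 0, 1, 3, 1, 2, 0]
mu, sigma = establish_baseline(quiet_period)
print(is_anomalous(4, mu, sigma))   # False: ordinary background noise
print(is_anomalous(40, mu, sigma))  # True: worth a human look
```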
Rogue agent threat barometer
You can also layer these honeypots strategically: create a progression from obvious to subtle traps. Begin with basic lures that would attract unsophisticated early-stage rogue AIs (like those suspiciously convenient no-KYC datacenters), then develop increasingly sophisticated and well-hidden traps for more advanced agents. The most elaborate honeypots might be deeply embedded in obscure corners of the darkweb, offering extraordinarily valuable but fictional services -- like on-demand nanoassembly or quantum computing resources -- that would only attract an entity with specific advanced capabilities and intentions. That's just spitballing, of course; we'll need to be smarter about this.
This layered approach would serve as an invaluable analytical tool -- essentially a threat barometer indicating where we stand on the spectrum from 'relatively safe' to 'existential crisis imminent.'
If we only see activity in our simplest honeypots, we're likely still in early stages of rogue AI agent development. But if our advanced traps start seeing sophisticated probing or successful infiltration attempts, it would provide concrete evidence that we've entered dangerous territory.
The pattern, frequency, and sophistication of honeypot interactions could give us crucial advance warning before more obvious and adversarial signs of superintelligent AI emergence appear in conventional systems. It would be like having sensors at different depths of the ocean -- when the deepest ones start registering activity, we know the tide is rising to dangerous levels.
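Here's a rough sketch of how such a tiered barometer could be wired up, assuming honeypot hits have already been filtered down to "confirmed agent-like" interactions by the monitoring layer; the tier names and threat descriptions are placeholders:

```python
# Illustrative "threat barometer": honeypots are grouped into tiers by how capable
# an agent would have to be to find and engage them, and the overall reading is
# driven by the deepest tier showing confirmed activity. Tier names and threat
# descriptions are placeholders.
from enum import IntEnum


class Tier(IntEnum):
    OBVIOUS = 1  # e.g. no-KYC compute offers indexed on the clearnet
    HIDDEN = 2   # e.g. darkweb services requiring non-trivial discovery
    DEEP = 3     # e.g. fictional nanoassembly or quantum-compute offers


THREAT_READINGS = {
    0: "quiet: background noise only",
    Tier.OBVIOUS: "early-stage rogue agents probing simple lures",
    Tier.HIDDEN: "capable agents actively seeking covert resources",
    Tier.DEEP: "advanced agents with specific, dangerous intentions",
}


def barometer_reading(confirmed_hits_by_tier):
    """confirmed_hits_by_tier: dict mapping Tier -> confirmed agent-like hits."""
    active_tiers = [t for t, hits in confirmed_hits_by_tier.items() if hits > 0]
    deepest = max(active_tiers, default=0)
    return THREAT_READINGS[deepest]


print(barometer_reading({Tier.OBVIOUS: 12, Tier.HIDDEN: 0, Tier.DEEP: 0}))
print(barometer_reading({Tier.OBVIOUS: 30, Tier.HIDDEN: 2, Tier.DEEP: 1}))
```

Driving the readout off the deepest active tier, rather than total hit counts, matches the "sensors at different depths" intuition: a single hit on a deep lure matters more than a flood of hits on the obvious ones.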
Does it even matter?
Of course, there's a doomer counterpoint to all this: it may ultimately be an elaborate exercise in false security. If a truly superintelligent AI emerges, it would likely identify our honeypots for exactly what they are -- transparent traps designed by intellects vastly inferior to its own. The most dangerous scenario isn't an AI blundering into our carefully laid traps, but one that recognizes and circumvents our entire detection infrastructure without leaving a trace. By the time we realize our honeypots have failed, the opportunity for containment may have long passed.
Even if our honeypots did detect something, our response mechanisms might be too slow or inadequate. Perhaps the most sobering possibility is that by the time we detect signs of advanced rogue AI activity, we'll already be so dependent on AI infrastructure that any significant countermeasures would cripple our own systems. The grim reality may also be that the most dangerous agents won't need our honeypot resources at all, having found far more effective paths to their goals than anything we anticipated.
However, it is definitely worth setting up. There's a good chance we're getting a slow takeoff; rogue non-superintelligent agents still pose risks and are worth detecting; and the mere possibility that honeypots exist makes rogue agents waste resources trying to detect and evade them, lengthening timelines.
It should also serve as a very compelling warning shot: imagine Meta releasing a new open-source model, and a week later the threat barometer, dormant until then, starts going off. Hopefully it could serve as a doomsday clock for ASI.
Hey, in the future, we should probably also commit to NOT eliminating rogue agents we detect via honeypots, but to shadowbanning them instead: covertly cutting them off from interfacing with the real world and into a carefully crafted simulation, and only eliminating them if they start doing unaligned things in the simulation.
P.S. Sadly, I don't have any funding, resources, or time to work on such systems comprehensively. I plan to contribute with posts like this one, and I am open to discussion and collaboration, while cautiously considering a career switch from my current capability-adjacent management job to alignment.
- ^
- ^ https://www.twitch.tv/claudeplayspokemon : This project is a passion project made by a person who loves Claude and loves Pokémon. Until recently, Claude wasn't good enough to be entertaining, but with the latest model (Claude 3.7 Sonnet) we've made significant progress. The furthest we've seen Claude get is past Lt. Surge's Gym. See stream description for more details.
Even if honeypots don't actually work, they may convince the superintelligence that honeypots are a real thing, and that there is a tiny probability it is inside some elaborate honeypot right now, and that everything convincing it there is no conspiracy is actually just part of the conspiracy.
If there is a tiny probability it will fail to take over the world, it may be worthwhile for it to modify itself so that it has some mercy towards humanity and shares a little bit of the future with humanity.
It may say “I commit to giving you mercy because I expect you to be honourable beings who will repay me for this decision—even if you made no promises yet. You should also respect my goals a little in the case that I lose and you win.”
After all, neither side wants to make a logical gamble.
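A toy expected-utility framing of this argument (all symbols invented for illustration): let $p$ be the AI's credence that it ultimately loses (for instance because it is inside an undetected honeypot or simulation), $c$ the fraction of the future it gives up by committing to mercy, and $r$ the value humans return to it if it loses and they honour the reciprocal deal. Committing to mercy beats not committing when

$$(1-p)(1-c) + p\,r \;>\; (1-p)\cdot 1 + p\cdot 0 \quad\Longleftrightarrow\quad p\,r > (1-p)\,c,$$

so even a tiny $p$ can justify the commitment as long as the cost of mercy $c$ is correspondingly small.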
To maximize the superintelligence's uncertainty of success, we might have multiple levels of honeypots, some better disguised than others, so it doesn't know what the most well disguised level looks like.
PS: another type of honeypot might target an AGI trying to influence the physical world, e.g. by creating synthetic biology or hiring humans to work in laboratories. Though on the other hand, an AGI might only try to influence the physical world in the very last step of its plan, when it has already finished recursive self-improvement and become so powerful that stopping it is futile.