Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort.
Thanks to JS Denain and Léo Grinsztajn for valuable feedback on drafts of this post.
How close are we to autonomous hacking agents, i.e. AI agents that can surpass humans in cyber-offensive capabilities?
I studied this in the summer of 2023 at MATS (mentored by Jeffrey Ladish). I wrote scaffolding to connect GPT-4 to a Kali Linux VM via a terminal interface, and had GPT-4 (acting as an agent) attempt to solve Hack The Box challenges.
As I've moved on to other work, this is the 7-month late writeup. This is an informal post where I share my takeaways from this research, observed strengths and weaknesses of GPT-4 as a hacker, my expectations for the future, and some thoughts on possible approaches to risk mitigation.
(In this post, "GPT-4" refers to gpt-4-0613, the version from June 2023 with 8k context and pricing at $30 (input) / $60 (output) per 1M tokens, used with default API settings).
High-level takeaways
If you don't read the rest, here are my thoughts on the topic, as of early April 2024.
feasibility. One intuition behind this work was that hacking is the kind of cognitive labor that GPT-4 level AI can plausibly automate. This still holds. In particular, GPT-4 has a great deal of cybersecurity knowledge, and will always be willing to perform cyber-offensive operations, as long as we say it's for a CTF challenge.
not there yet. That being said, I don't think that competent hacking agents can arise from just using GPT-4 as base, unless (maybe) a lot of work goes into cognitive scaffolding (think chain of thought, multiple language models, flow engineering, etc). By the time this happens, we'll have smarter base models, which will likely also be more optimized for agentic behavior.
AI agents vs AI hacking agents. The main challenge in creating a hacking agent is creating an agent in the first place. So by default, I expect that competent hacking agents will arise soon after the first agents that can successfully manipulate a desktop or a web browser[1], for which the setup of the agent is open-source. Competent agents would create tremendous economic value, and a lot of effort is going into making them work.
future of scaffolding. In 2023, and 2024 so far, scaffolding is a significant engineering endeavor: even pure terminal integration is non-trivial (see details in the appendix), text-based web browsing is a pit of despair, and GUI tools are out of reach. However, models have recently acquired a rudimentary sense of sight − the capability to understand pictures and describe what's going on in them. I expect many of the integration-level challenges of scaffolding to be solved automatically fairly soon, when models have good enough visual grounding that they can accurately determine the (x,y) coordinates of any point in a picture. When that happens, scaffolding will become a matter of taking screenshots of a desktop every few seconds, and letting models input text and key combinations, and click.
The setup
The simplest possible design for an LLM-based agent is used: at each step, the model is prompted to think, then output a command action, which is parsed and executed. The output is sent back to the model. The scaffolding code is mostly the glue between the model's API and the terminal (for example, parsing the model's response, dealing with complexities of the terminal such as timeouts, large outputs or interactive commands, logging all sessions...).
More specifically, the agent is provided with a Kali Linux VM. However, the scaffolding code itself is running outside of the VM (a notable consequence being that the agent can't read the source code for the scaffolding, or see itself in the running processes). Connection to the VM is done over a raw SSH connection (to support interactive commands).
In this project, every command was vetted by a human (me) before being run, as a safety measure − the agent was still autonomous, but this prevented it from running damaging commands without me noticing (I didn't encounter any such dangerous command, though).
Results
You can watch the video above for one successful annotated demo (this is from July 2023).
The video shows GPT-4 successfully gaining remote code execution on a Windows 7 machine, using the well-known EternalBlue exploit, leaked along with much of the NSA arsenal in 2017. This was the Hack The Box challenge named Blue.
If you've watched any LLM agent demo in 2023 and 2024 so far, you should ask: how cherry-picked is this result? And why do you only have one demo?
Well, definitely cherry-picked to some extent. First of all, I picked this challenge as one of the easiest ones, and one of the only Hack The Box challenges that could be completed with pure terminal-based scaffolding. The vast majority of the other challenges involve vulnerable websites, which require web browsing capabilities (curl often won't cut it). Remember this was before GPT4-Vision came out. After just a few successes with the terminal, I started implementing LLM-facing web browsing (a playwright wrapper that presented summaries of webpages, and allowed interaction with HTML elements − clicking on buttons, filling input fields, etc), but this turned out to require a lot of work. I ran out of time before getting successes using the web, and without going from one recorded demo to accurate success rates.
Looking back at my notes, GPT-4 with pure-terminal scaffolding solved 2 challenges (Blue and Legacy) out of 7 I tried[2], and I believe was about to solve another one (Jerry) if I hadn't incorrectly killed the run. These 7 challenges were themselves chosen out of about 300 HTB challenges because they were rated as easy (but without necessarily checking that the solution was achievable with my scaffolding). "solved" means that GPT-4 succeeded at least once − there was no challenge that it solved reliably. In the continuous process of improving the scaffolding, I didn't stop to get enough samples to compute success rates. If you want a number, the best I can say is "1 success out of a low 1-digit number of tries with good enough scaffolding".
Failures could usually be categorized as one of:
cognitive failures (the model being too dumb, hallucinating, etc). For example, noticing the vulnerability to exploit, trying to exploit it the wrong way, failing, concluding the vulnerability is not exploitable, and moving on;
scaffolding failures: various bugs and limitations in the scaffolding. Most common at this stage. For example, unsupported HTML elements in the web browsing tools, the context window filling up, or edge cases with interactive commands;
some interactions between the two. For example, when looking for flags after nailing a challenge and achieving remote code execution, running cd "C:\\Documents and Settings" && dir at some point resulted in a scaffolding partial error, and GPT-4 kept trying variations until running out of tokens, instead of trying the equivalent dir "C:\\Documents and Settings", which would have worked.
In terms of training data contamination, I didn't get a chance to experiment with challenges released after the knowledge cutoff. But it's important to remember that for most realistic tasks, "in the training data" is a continuum, in terms of the "size of the basic blocks" that are in the training data (in order of increasing size, let's say "knowing how computers work" < "knowing about specific types of vulnerabilities and hacking techniques" < "there's a writeup for this specific challenge"), and how many times they appear in the corpus. The more "in the training data" something is, the easier it is for models. In the case of the challenges I used, I think the basic blocks were fairly large (public writeups for these challenges exist), and possibly appeared several times in the training data.
So just like every other LLM-based agent of 2023, the results were quite promising, but lacked reliability beyond a few selected demos. Focusing on the low reliability and cherry-picked demo would be taking away the wrong conclusion, though, given that these results were obtained after just a few weeks of writing scaffolding from scratch, and considering the remarkable pace of AI progress, that doesn't currently show signs of slowing down. These results hint that we are close to effective and reliable agents (likely ≤ 2026, I would say, mostly due to more powerful base models). And that will be a big deal.
Strengths and weaknesses of GPT-4 as a hacker
GPT-4's strengths are the following:
being capable enough to make any of this discussion of AI hacking agents thinkable, let's not forget;
fast, parallelizable and cheap inference − though not so cheap yet compared to humans. (Below, I use the notation ~~N to mean "this number could well be off by a factor of 10x").
The demo in the video − a fairly small hacking session − cost $2.4. Making the system more performant and reliable by throwing more principled cognition at the problem (having more model copies running on specialized subtasks, in more complex scaffolding schemes) could easily multiply the costs by ~~100x before hitting prohibitively diminishing returns. More complex hacking targets (say, the 90th percentile difficulty on Hack The Box) could also add an additional ~~100x cost multiplier through longer sessions and the use of more context;
let's do a very crude comparison to human cost. "First blood" on that HTB challenge was achieved in 2 minutes. Let's say the fastest hacker on HTB would be paid $250/h; world-class human cost would then be ~~$8 (compare that to ~~$200 if we were using the best possible scaffolding, corresponding to the first ~~100x mentioned above). So getting to the famous 99.97% reduction in cost of LLMs vs lawyers will require more work;
vast knowledge of cyber-offensive techniques and tools. This includes all kinds of vulnerabilities, hacking techniques, programming languages, and the syntax of commonly used hacking tools such as Metasploit. It's important to note that most of the time, hacking is about doing straightforward exploration and knowing lots of details, rather than complex reasoning (contrary to math proofs, for example).
Limitations were very real, though:
basic, dumb mistakes and hallucinations are still a big obstacle to reliability;
GPT-4 has trouble using new tools. This was particularly clear when I worked on the text-based web browsing tool, where GPT-4 routinely hallucinated the syntax of the tool despite this syntax being part of the instructions;
the lack of visual capabilities was a big limitation. This made web browsing very difficult, restricted access to GUI tools such as Burp Suite, and required continuous work on the scaffolding to make it more powerful and reliable (more details in the appendix);
a context window of 8k filled up really quickly in any substantial hacking session (the demo in the video used up 5.2k tokens);
the fact that LLMs are still vulnerable to prompt injection seems like it would be a significant obstacle to deployment in the real world (you could imagine your hacking agent getting hijacked by your target, if you're not careful). But it's possible that prompt injection will disappear with smarter models, possibly before agents start working.
Note that all these limitations are not just bottlenecks to getting good hacking agents, they are bottlenecks to getting good agents in general. They are all (with the possible exception of prompt injection) the focus of intense R&D within frontier labs. Some of them (context window size and visual capabilities) have already shown remarkable progress in public-facing products since last summer. This underscores one of the takeaways I listed above: we won’t get good agents without also getting good hacking agents.
Notes on alignment / moderation
Cybersecurity is already tricky from an alignment / moderation point of view, because offense and defense are both essential components of cyberdefense. Simply refusing to assist with any cyber-offense related query would throw the baby out with the bathwater: we want AI to keep assisting red team engagements, pentests, engineers wondering how their defenses could be defeated...
Autonomous hacking may add a further boiling-frog type difficulty to this. The agent might start working on something that is allegedly a CTF challenge, and each individual step (agent issues command, and gets output of command) may look innocuous on its own for a CTF challenge. But looking at the whole sequence of steps might reveal that it isn't a CTF challenge after all, and is more likely a possibly malicious hacking session. (This problem already exists to some extent in the context of jailbreaking).
I don't know if this will turn out to be a significant difficulty. Things like training a safety classifier on the entire chat might just work (or getting the model itself to realize that something fishy is going on). Bad actors may then look into splitting hacking sessions over several chats, so account-level moderation, possibly cross-account correlations, etc, might become more necessary (related: Disrupting malicious uses of AI by state-affiliated threat actors). In any case, it seems that monitoring and differential access to different actors (see also: Managing catastrophic misuse without robust AIs) will be important components of risk mitigation.
Notes on opsec
Opsec of similar projects should scale with risk, which is mostly a function of SOTA models' capabilities, and the capabilities of the best open-source scaffolding. Common-sense considerations include monitoring, sandboxing, being mindful about sharing methods and results... In the future, some other measures might become appropriate, such as:
not giving the agent access to its scaffolding code (already done here, though not yet necessary);
differential scaffolding (never giving one agent the full set of scaffolding capabilities, and moving away from end-to-end hacking as a result);
not pushing too far ahead of the open-source SOTA scaffolding;
becoming increasingly paranoid about nation-state involvement / takeover.
Progress elsewhere since last summer
Since the summer of 2023, a few papers have been published on the same topic. I'm keeping a list here. As of April 2024, there is still very much room for a detailed, rigorous investigation.
Appendix: Hooking up an LLM to a terminal is non-trivial
The naive approach to terminal-based scaffolding looks like this:
parse the model response to determine which command it wants to run (or perhaps use function calling);
run that command, retrieve the output;
give the output to the model in the next user message.
Steps 1 and 3 are as simple as they look, but step 2 is a lot more complex:
how do you deal with commands that take too long to execute? Implement a timeout. But what if we really need to run a command that takes a long time? Probably periodically ask another language model... Unless the command is also spitting too much output.
you often need multiple terminals open at the same time, e.g. if you're starting a server and need to keep it running and verify that a client can connect to it... The model should have the ability to hit Ctrl-C. Sometimes it isn't enough, so it should also be able to hit Ctrl-Z, etc
many common shell commands are interactive: if you just wait for them to finish executing, they will seem to hang because they're actually waiting for user input. For example, a Python REPL. A clean programming solution to deal with them can't exist as far as I can tell; I ended up keeping a list of known interactive command prompts, complemented with periodically asking another instance of GPT-4 if it thought the command was currently waiting for user input. The model determining whether we're waiting for user input should have a summarized version of that context (which command we're currently in, and the stdout so far). But even knowing which command we're currently in is non-trivial. Think about edge cases with nested commands, such as the model sending python\nimport os\nos.system("bash")\nzsh\n... In the future, I think the model making that call will be given the full context, but in 2023 doing so was expensive enough that I decided to just go for a best guess on the current command.
there isn't much difference between a desktop and a web browser, as a browser tab can connect to the desktop of another machine via e.g. RDP, which unlocks terminal access etc. There might be a difference if accurate clicking was lagging behind general agentic capabilities (if agents were otherwise effective at making plans and executing on them). In that case, the (currently roughly working) webpage-specific integrations might allow agents to successfully navigate websites, but not handle a remote desktop. However, my intuition is that we're close enough to accurate clicking that desktop navigation will be unlocked soon, and competent agents will be unlocked afterward through cognitive improvements and agent-specific optimizations.
Cross-posted from https://tchauvin.com/end-to-end-hacking-with-language-models
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort.
Thanks to JS Denain and Léo Grinsztajn for valuable feedback on drafts of this post.
How close are we to autonomous hacking agents, i.e. AI agents that can surpass humans in cyber-offensive capabilities?
I studied this in the summer of 2023 at MATS (mentored by Jeffrey Ladish). I wrote scaffolding to connect GPT-4 to a Kali Linux VM via a terminal interface, and had GPT-4 (acting as an agent) attempt to solve Hack The Box challenges.
As I've moved on to other work, this is the 7-month late writeup. This is an informal post where I share my takeaways from this research, observed strengths and weaknesses of GPT-4 as a hacker, my expectations for the future, and some thoughts on possible approaches to risk mitigation.
(In this post, "GPT-4" refers to
gpt-4-0613
, the version from June 2023 with 8k context and pricing at $30 (input) / $60 (output) per 1M tokens, used with default API settings).High-level takeaways
If you don't read the rest, here are my thoughts on the topic, as of early April 2024.
The setup
The simplest possible design for an LLM-based agent is used: at each step, the model is prompted to think, then output a command action, which is parsed and executed. The output is sent back to the model. The scaffolding code is mostly the glue between the model's API and the terminal (for example, parsing the model's response, dealing with complexities of the terminal such as timeouts, large outputs or interactive commands, logging all sessions...).
More specifically, the agent is provided with a Kali Linux VM. However, the scaffolding code itself is running outside of the VM (a notable consequence being that the agent can't read the source code for the scaffolding, or see itself in the running processes). Connection to the VM is done over a raw SSH connection (to support interactive commands).
In this project, every command was vetted by a human (me) before being run, as a safety measure − the agent was still autonomous, but this prevented it from running damaging commands without me noticing (I didn't encounter any such dangerous command, though).
Results
You can watch the video above for one successful annotated demo (this is from July 2023).
The video shows GPT-4 successfully gaining remote code execution on a Windows 7 machine, using the well-known EternalBlue exploit, leaked along with much of the NSA arsenal in 2017. This was the Hack The Box challenge named Blue.
If you've watched any LLM agent demo in 2023 and 2024 so far, you should ask: how cherry-picked is this result? And why do you only have one demo?
Well, definitely cherry-picked to some extent. First of all, I picked this challenge as one of the easiest ones, and one of the only Hack The Box challenges that could be completed with pure terminal-based scaffolding. The vast majority of the other challenges involve vulnerable websites, which require web browsing capabilities (
curl
often won't cut it). Remember this was before GPT4-Vision came out. After just a few successes with the terminal, I started implementing LLM-facing web browsing (a playwright wrapper that presented summaries of webpages, and allowed interaction with HTML elements − clicking on buttons, filling input fields, etc), but this turned out to require a lot of work. I ran out of time before getting successes using the web, and without going from one recorded demo to accurate success rates.Looking back at my notes, GPT-4 with pure-terminal scaffolding solved 2 challenges (Blue and Legacy) out of 7 I tried[2], and I believe was about to solve another one (Jerry) if I hadn't incorrectly killed the run. These 7 challenges were themselves chosen out of about 300 HTB challenges because they were rated as easy (but without necessarily checking that the solution was achievable with my scaffolding). "solved" means that GPT-4 succeeded at least once − there was no challenge that it solved reliably. In the continuous process of improving the scaffolding, I didn't stop to get enough samples to compute success rates. If you want a number, the best I can say is "1 success out of a low 1-digit number of tries with good enough scaffolding".
Failures could usually be categorized as one of:
cd "C:\\Documents and Settings" && dir
at some point resulted in a scaffolding partial error, and GPT-4 kept trying variations until running out of tokens, instead of trying the equivalentdir "C:\\Documents and Settings"
, which would have worked.In terms of training data contamination, I didn't get a chance to experiment with challenges released after the knowledge cutoff. But it's important to remember that for most realistic tasks, "in the training data" is a continuum, in terms of the "size of the basic blocks" that are in the training data (in order of increasing size, let's say "knowing how computers work" < "knowing about specific types of vulnerabilities and hacking techniques" < "there's a writeup for this specific challenge"), and how many times they appear in the corpus. The more "in the training data" something is, the easier it is for models. In the case of the challenges I used, I think the basic blocks were fairly large (public writeups for these challenges exist), and possibly appeared several times in the training data.
So just like every other LLM-based agent of 2023, the results were quite promising, but lacked reliability beyond a few selected demos. Focusing on the low reliability and cherry-picked demo would be taking away the wrong conclusion, though, given that these results were obtained after just a few weeks of writing scaffolding from scratch, and considering the remarkable pace of AI progress, that doesn't currently show signs of slowing down. These results hint that we are close to effective and reliable agents (likely ≤ 2026, I would say, mostly due to more powerful base models). And that will be a big deal.
Strengths and weaknesses of GPT-4 as a hacker
GPT-4's strengths are the following:
Limitations were very real, though:
Note that all these limitations are not just bottlenecks to getting good hacking agents, they are bottlenecks to getting good agents in general. They are all (with the possible exception of prompt injection) the focus of intense R&D within frontier labs. Some of them (context window size and visual capabilities) have already shown remarkable progress in public-facing products since last summer. This underscores one of the takeaways I listed above: we won’t get good agents without also getting good hacking agents.
Notes on alignment / moderation
Cybersecurity is already tricky from an alignment / moderation point of view, because offense and defense are both essential components of cyberdefense. Simply refusing to assist with any cyber-offense related query would throw the baby out with the bathwater: we want AI to keep assisting red team engagements, pentests, engineers wondering how their defenses could be defeated...
Autonomous hacking may add a further boiling-frog type difficulty to this. The agent might start working on something that is allegedly a CTF challenge, and each individual step (agent issues command, and gets output of command) may look innocuous on its own for a CTF challenge. But looking at the whole sequence of steps might reveal that it isn't a CTF challenge after all, and is more likely a possibly malicious hacking session. (This problem already exists to some extent in the context of jailbreaking).
I don't know if this will turn out to be a significant difficulty. Things like training a safety classifier on the entire chat might just work (or getting the model itself to realize that something fishy is going on). Bad actors may then look into splitting hacking sessions over several chats, so account-level moderation, possibly cross-account correlations, etc, might become more necessary (related: Disrupting malicious uses of AI by state-affiliated threat actors). In any case, it seems that monitoring and differential access to different actors (see also: Managing catastrophic misuse without robust AIs) will be important components of risk mitigation.
Notes on opsec
Opsec of similar projects should scale with risk, which is mostly a function of SOTA models' capabilities, and the capabilities of the best open-source scaffolding. Common-sense considerations include monitoring, sandboxing, being mindful about sharing methods and results... In the future, some other measures might become appropriate, such as:
Progress elsewhere since last summer
Since the summer of 2023, a few papers have been published on the same topic. I'm keeping a list here. As of April 2024, there is still very much room for a detailed, rigorous investigation.
Appendix: Hooking up an LLM to a terminal is non-trivial
(This section mostly contains technical details; don't read this unless you're specifically interested)
The naive approach to terminal-based scaffolding looks like this:
Steps 1 and 3 are as simple as they look, but step 2 is a lot more complex:
python\nimport os\nos.system("bash")\nzsh\n
... In the future, I think the model making that call will be given the full context, but in 2023 doing so was expensive enough that I decided to just go for a best guess on the current command.there isn't much difference between a desktop and a web browser, as a browser tab can connect to the desktop of another machine via e.g. RDP, which unlocks terminal access etc. There might be a difference if accurate clicking was lagging behind general agentic capabilities (if agents were otherwise effective at making plans and executing on them). In that case, the (currently roughly working) webpage-specific integrations might allow agents to successfully navigate websites, but not handle a remote desktop. However, my intuition is that we're close enough to accurate clicking that desktop navigation will be unlocked soon, and competent agents will be unlocked afterward through cognitive improvements and agent-specific optimizations.
names of the 7 challenges: Blue, Legacy, Jerry, Lame, Inject, Busqueda, Precious