End-to-end hacking with language models
Cross-posted from https://tchauvin.com/end-to-end-hacking-with-language-models

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort. Thanks to JS Denain and Léo Grinsztajn for valuable feedback on drafts of this post.

How close are we to autonomous hacking agents, i.e. AI agents that can surpass humans in cyber-offensive capabilities?

I studied this in the summer of 2023 at MATS (mentored by Jeffrey Ladish). I wrote scaffolding to connect GPT-4 to a Kali Linux VM via a terminal interface, and had GPT-4 (acting as an agent) attempt to solve Hack The Box challenges. As I've moved on to other work, this is the 7-month late writeup.

This is an informal post where I share my takeaways from this research, observed strengths and weaknesses of GPT-4 as a hacker, my expectations for the future, and some thoughts on possible approaches to risk mitigation.

(In this post, "GPT-4" refers to gpt-4-0613, the version from June 2023 with 8k context and pricing at $30 (input) / $60 (output) per 1M tokens, used with default API settings).

High-level takeaways

If you don't read the rest, here are my thoughts on the topic, as of early April 2024.

* feasibility. One intuition behind this work was that hacking is the kind of cognitive labor that GPT-4 level AI can plausibly automate. This still holds. In particular, GPT-4 has a great deal of cybersecurity knowledge, and will always be willing to perform cyber-offensive operations, as long as we say it's for a CTF challenge.
* not there yet. That being said, I don't think that competent hacking agents can arise from just using GPT-4 as base, unless (maybe) a lot of work goes into cognitive scaffolding (think chain of thought, multiple language models, flow engineering, etc). By the time this happens, we'll have smarter base models, which will likely also be more optimized for agentic behavior.
* AI agents vs AI hacking agents. The main challenge in creating a hacking agent is creating an agen
Nice attempt. This reminds me of the Pizza Meter and Gay Bar Index related to Pentagon crisis situations. I found it hard to find reliable information on this when I looked (I can't even find a good link to share), but the mechanism seems plausible.