This post collects methods to exclude internet resources from LLM training datasets.

I plan to at least try to keep this post up-to-date with respect to new things I learn on the topic. Please feel encouraged to suggest any additions or amendments.

This post is about how to do it; discussion of whether to apply these techniques belongs in a separate post: Should we exclude alignment research from LLM training datasets?

Link preview image by Steve Douglas on Unsplash.

Documentation from model vendors

OpenAI (ChatGPT)

See docs for GPTBot and ChatGPT-User.

GPTBot is for training data, and ChatGPT-User is used by plugins which can access the internet during inference. They document the user-agents used, the robots.txt identities, and the IP ranges they access from. There's some commentary about how ChatGPT-User is used in training, which I didn't find very illuminating.
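
As a sketch of the IP-range option, the published ranges can be checked with Python's standard ipaddress module. The CIDR blocks below are documentation placeholders, not OpenAI's real ranges; fetch the current list from their GPTBot docs before relying on anything like this.

```python
import ipaddress

# Placeholder CIDR blocks -- substitute the ranges published in OpenAI's
# GPTBot documentation; they change over time.
GPTBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "192.0.2.0/24",      # reserved documentation range, not a real GPTBot block
    "198.51.100.0/24",   # reserved documentation range, not a real GPTBot block
)]

def is_gptbot_ip(remote_addr: str) -> bool:
    """Return True if the client IP falls inside any published GPTBot range."""
    ip = ipaddress.ip_address(remote_addr)
    return any(ip in net for net in GPTBOT_RANGES)

# Example: in your web framework, refuse the request if
# is_gptbot_ip(request.remote_addr) returns True.
```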

Anthropic (Claude)

Does Anthropic crawl data from the web, and how can site owners block the crawler?

Pretty similar to OpenAI's offering, except that they don't have fixed IP ranges, and Claude (as far as I understand?) doesn't directly access the internet, so the training-versus-inference distinction isn't relevant.

Some nice touches are that their crawler will not scrape any site that already blocks the Common Crawl bot (see below), and that Anthropic specifically commits to not trying to bypass CAPTCHAs (see further below).

Google (Gemini)

Appears to use the Google-Extended control, which can be blocked with robots.txt. It doesn't use a distinct user-agent header relative to other Google products, so user-agent blocking is only possible if you're willing to block the Search crawler as well. I assume they also don't use fixed IP ranges, but I haven't really tried to check, since that always seemed like the clumsiest method anyway.

Meta (LLaMA)

LLaMA 2's model card doesn't disclose their training data, but LLaMA 1 (per its model card) was trained on a variety of sources including CCNet and C4, which are both derived from the Common Crawl dataset (see below), so it seems likely that excluding your data from LLaMA at a minimum requires excluding it from Common Crawl as well.

Common Crawl

Common Crawl is a large, publicly-available dataset which in principle any training process could use (and Meta did), so keeping out of AI training datasets necessitates keeping out of Common Crawl. Their FAQ documents their user-agent and robots.txt identifier.

xAI (Grok)

At first glance, I wasn't able to find any documentation about this. I'll update if I do.

External resources that aim to help block AI-related crawlers

https://darkvisitors.com/ collects and categorises crawlers of various kinds, and among other things offers a service where you can create an account with them and fetch a pre-made robots.txt from their API that includes all bots of a given type.
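
The general workflow with such a service is to fetch the generated file periodically and install it at your web root. The endpoint URL, token, and path below are hypothetical placeholders, not Dark Visitors' actual API; consult their documentation for the real request shape.

```python
import urllib.request

# Hypothetical endpoint and token -- check darkvisitors.com's docs for the
# real API before using this; only the fetch-and-install pattern is the point.
API_URL = "https://example.com/api/robots-txt"  # placeholder endpoint
API_TOKEN = "YOUR_API_TOKEN"                    # placeholder credential

req = urllib.request.Request(API_URL, headers={"Authorization": f"Bearer {API_TOKEN}"})
with urllib.request.urlopen(req) as resp:
    robots_txt = resp.read().decode("utf-8")

# Install at the web root so it is served as https://yoursite.example/robots.txt
with open("/var/www/html/robots.txt", "w", encoding="utf-8") as f:
    f.write(robots_txt)
```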

Discussion of general methods

robots.txt

See Wikipedia for full details on the technique. For the sake of this article, here are the main points (a combined example file follows the list):

  • Using this method requires the ability to place a file at the root of your website, and one file for each origin (e.g. subdomain) that serves your site.
  • Selectively excluding individual pages would be a little tedious unless you can place them all under a single path. Wikipedia does mention per-page metadata equivalents (HTML meta tags and response headers), but those are more explicitly about search engine indexing, so I doubt they apply to crawlers for other purposes.
  • You ban scrapers one-by-one, and it'll be hard to know about a new scraper before it's already read your site.
  • Compliance with the file is voluntary, and some major scrapers (e.g. the Internet Archive) have elected to ignore it in the past. (I see a lot of confusion online about the fact that Google may list your page as a search result even if it's blocked in robots.txt. But I think this is consistent with them respecting robots.txt: they can list search results if other pages link to them, even if they never read the page themselves.)
  • If content on your site is mirrored to another site, the mirror may not have your robots.txt. This would be IMO bad behaviour by the mirror, but (again, IMO) not especially surprising. (Probably this isn't relevant for most users; GitHub and Wikipedia are notoriously unilaterally mirrored, but that's probably only because they're such large and well-known resources.)
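
Putting the vendor-specific tokens together, a robots.txt blocking the crawlers discussed above might look like the output of the short script below. The token list reflects my reading of the vendor docs at the time of writing; double-check the current identifiers (especially Anthropic's) before relying on it.

```python
# Sketch: generate a robots.txt that disallows the AI-related crawlers
# discussed above. Verify the tokens against each vendor's current docs.
AI_CRAWLER_TOKENS = [
    "GPTBot",           # OpenAI training crawler
    "ChatGPT-User",     # OpenAI inference-time browsing
    "Google-Extended",  # Google's Gemini training control
    "CCBot",            # Common Crawl
    "anthropic-ai",     # Anthropic (check their docs for the current token)
]

rules = "\n\n".join(
    f"User-agent: {token}\nDisallow: /" for token in AI_CRAWLER_TOKENS
)

# Write the file so it is served at the root of each origin.
with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(rules + "\n")
```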

User-Agent or IP-based blocking

  • Using this method likely requires sysadmin-level access to your webserver (similar to robots.txt).
  • In principle it shouldn't be too hard to apply user-agent filtering based on arbitrary request criteria, e.g. on a per-page basis (see the sketch after this list).
  • Again, it's specific to each scraper, and not protective against future scrapers.
  • While compliance with user-agent or IP-based filtering isn't voluntary in the same way as robots.txt, both methods are relatively easy to evade for a motivated adversary. I'd guess this is a bit too blatant for good-faith scrapers, but it seems relevant that user-agent strings from modern browsers all pretend to be Mozilla/5.0 in order to work around historical instances of bad user-agent filters.
  • Mirroring likely defeats this method too.
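
Here's the sketch referred to above: a minimal application-level user-agent filter in Flask (in practice you'd more likely configure this in the webserver itself). The substring list carries the same caveats as the robots.txt tokens above; note Google-Extended is absent because it has no distinct user-agent.

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Substrings to look for in the User-Agent header; verify against current
# vendor documentation, as with the robots.txt tokens above.
BLOCKED_UA_SUBSTRINGS = ("GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai")

@app.before_request
def block_ai_crawlers():
    user_agent = request.headers.get("User-Agent", "")
    if any(token.lower() in user_agent.lower() for token in BLOCKED_UA_SUBSTRINGS):
        abort(403)  # refuse the request outright

@app.route("/")
def index():
    return "Hello, human (probably)."
```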

Inclusion canaries

BIG-bench is a benchmark suite to run against LLMs which, by nature, is only valid if the LLMs did not train on the benchmark data. To this end, the benchmark files include a canary UUID, intended both to make it easy to exclude the data from training sets and to detect whether it has been trained on after all. Per a thread on niplav's shortform, Claude and GPT-4-base (but not, it seems, GPT-4o) have learned the canary.

From the name "canary", I originally guessed that these strings were meant to indicate whether some other exclusion method had worked or not. But noting that this string is in a GitHub repo that can be forked to any user, and public mirrors of GitHub exist (e.g. GitCode), surely it's hopeless to exclude the benchmarks by URL, and the "canary" can only work by actually causing exclusion from the training data – by AI vendors configuring their scrapers to drop documents that contain it.

Empirically, this seems not to be happening. I'm not sure how it's supposed to happen. Are there attempts to coordinate on this that I don't know about? Or is the idea that these canaries are primarily relevant for research on non-frontier models with more carefully chosen training sets?

ARC also maintains their own evals canary, about which I have the same questions, though I don't think Claude (currently) knows about it. (I've contacted them using their form to ask for more information, and I'll add any answers here if they reply.)

Canaries would have some unique strengths, if they worked: they are intrinsic to your content, so are realistically the only option for content submitted to websites you don't control (e.g. GitHub, or uh, LessWrong), and are robust to content mirroring.

CAPTCHAs

See Wikipedia for general discussion. This is the first technique in my list that attempts to make crawler access impossible rather than merely discouraged. The disadvantages are that it's annoying for human users, and it prevents all scraping, including e.g. search engine indexing. Also, many CAPTCHA schemes have eventually been broken by automated systems, and this is only likely to get worse over time (though again, it's hard to imagine a good-faith scraper doing this, and as mentioned above, Anthropic has explicitly promised not to).

User accounts

Require users to prove their identity before accessing your resource (implicitly relying on the fact that automated users won't want to, or won't be able to, create accounts). This is a serious technical hurdle if your website doesn't already support authentication, it again prevents search engine indexing, and it presents a higher barrier to entry for humans who don't want to (or can't be bothered to) share their identity with you. For some content, though, a mild barrier to human entry may be no bad thing.

Authorised user lists

This is the final level of secrecy that I considered: a system that not only requires users to prove their identity, but also whitelists specific identities for access according to some notion you have of who should be able to read your resource. You then use whatever method you deem appropriate to choose an audience who will appropriately guard its contents. This seems primarily relevant for documents that either are extremely critical not to leak, or are a potential target for anonymous human adversaries to deliberately leak. Of course, it has the highest maintenance requirements, and the biggest cost to human access, of any method on this list.
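
As a minimal sketch combining the last two methods (user accounts plus an allowlist), here is an HTTP Basic Auth gate in Flask. The accounts, password handling, and allowlist are placeholders; a real deployment would plug into your existing identity system and store hashed credentials rather than plaintext.

```python
from flask import Flask, Response, request

app = Flask(__name__)

# Placeholder credentials and allowlist; a real system would use an identity
# provider and hashed passwords, never plaintext literals like these.
ACCOUNTS = {"alice": "correct horse battery staple", "bob": "hunter2"}
ALLOWLIST = {"alice"}  # only these identities may read the content

@app.before_request
def require_allowlisted_user():
    auth = request.authorization
    if not auth or ACCOUNTS.get(auth.username) != auth.password:
        # No or invalid credentials: demand Basic authentication.
        return Response("Authentication required", 401,
                        {"WWW-Authenticate": 'Basic realm="private"'})
    if auth.username not in ALLOWLIST:
        # Authenticated, but not on the allowlist.
        return Response("Not authorised", 403)

@app.route("/")
def private_page():
    return "Visible only to allowlisted, authenticated users."
```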

What non-public data could be trained on?

This section is primarily speculative, but seemed worth raising. In principle:

  • OpenAI could have privileged access to data from Microsoft, e.g. Outlook or Office data, or GitHub, e.g. private repositories,
  • Google obviously has privileged access to data in Gmail or Google Drive,
  • Meta has an enormous volume of data in Facebook and Instagram.

I expect that these companies generally promise not to use private data in training, but I haven't reviewed their promises for their specifics or robustness. Happy to hear takes on this in the comments.
