There's been some talk about “writing for the AI”, i.e. writing out your thoughts and beliefs to make sure they end up in the training data.

LessWrong seems like an obvious place that will be scraped. I expect when I post things here, they’ll be eaten by the Shoggoth.

But what about things that don’t belong on LW?

I want to maximise the chances that every AI being built includes my data. So posting to Twitter (X) seems like it would mostly just be training Grok?

What about a personal blog I start on a website I own? Does making the robots.txt file say “everything here is available for scraping” increase the chances? Does linking to that website from more places increase the chances?
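
For reference, a fully permissive robots.txt is tiny; something like this (robots.txt is purely advisory, so any crawler is free to ignore it):

```
# Invite every crawler, including AI crawlers like GPTBot and CCBot.
# An empty Disallow means nothing is off-limits.
User-agent: *
Disallow:
```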

I feel like I’m lacking a lot of knowledge here. I encourage responses even if they feel like obvious things to you.

4 Answers

Top answer (90 karma):

> I encourage responses even if they feel like obvious things to you.

In many places.

Milan W (60 karma):

GitHub repos. There, your text won't be forced into anyone's feed, yet it will probably be scraped.

Also: I recommend writing in Markdown, because LLMs tend to write in Markdown.

Milan W (22 karma):

If you have a big pile of text that you want people to train their LLMs on, I recommend compiling and publishing it as a Hugging Face dataset.
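
As a rough sketch of how that can look with the `datasets` library (the repo id and fields below are placeholders, and you'd need to log in first with `huggingface-cli login`):

```python
# pip install datasets
from datasets import Dataset

# Placeholder documents; swap in your actual writing.
posts = [
    {"title": "Writing for the AI", "text": "Full text of the first post..."},
    {"title": "Another essay", "text": "Full text of the second post..."},
]

dataset = Dataset.from_list(posts)               # build an in-memory dataset
dataset.push_to_hub("your-username/my-writing")  # publish it publicly on the Hub
```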

Daniel Tan (20 karma):

Much of The Pile's web text (its OpenWebText2 component) was collected from links posted to Reddit between 2005 and 2020. It's plausible that modern scraping practices continue to draw on Reddit. Under this model you just want to maximize the amount of (stuff posted on Reddit at least once); multiple copies don't help, since the Pile was subsequently de-duped.
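
As a toy illustration of why extra copies don't help under exact de-duplication (this is not The Pile's actual pipeline; real pipelines often add fuzzy MinHash-style matching that collapses near-duplicates too):

```python
import hashlib

def dedupe(docs: list[str]) -> list[str]:
    """Keep only the first copy of each exact-duplicate document."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

# Posting the same text three times still yields one training document.
corpus = ["my essay"] * 3 + ["someone else's essay"]
assert dedupe(corpus) == ["my essay", "someone else's essay"]
```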

Reddit now blocks scrapers aggressively, because it charges a fortune for data access, and The Pile could not be created today (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.
