There's been some talk about “writing for the AI”, i.e. writing out your thoughts and beliefs to make sure they end up in the training data.

LessWrong seems like an obvious place that will be scraped. I expect when I post things here, they’ll be eaten by the Shoggoth.

But what about things that don’t belong on LW?

I want to maximise the chances that every AI being built includes my data. So posting to Twitter (X) seems like it would mostly just be training Grok?

What about a personal blog I start on a website I own? Does making the robots.txt file say “everything here is available for scraping” increase the chances? Does linking to that website from more places increase the chances?
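
For reference, a fully permissive robots.txt is tiny; something like this (robots.txt is purely advisory, so any crawler is free to ignore it):

```
# Invite every crawler, including AI crawlers like GPTBot and CCBot.
# An empty Disallow means nothing is off-limits.
User-agent: *
Disallow:
```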

I feel like I’m lacking a lot of knowledge here. I encourage responses even if they feel like obvious things to you.

4 Answers

Top answer (90 karma):

> I encourage responses even if they feel like obvious things to you.

In many places.

Milan W (60 karma):

GitHub repos. There, your text won't be forced into anyone's feed, yet it will probably be scraped.

Also: I recommend writing in Markdown, because LLMs tend to write in Markdown.

Milan W (22 karma):

If you have a big pile of text that you want people to train their LLMs on, I recommend compiling and publishing it as a Hugging Face dataset.
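
As a rough sketch of how that can look with the `datasets` library (the repo id and fields below are placeholders, and you'd need to log in first with `huggingface-cli login`):

```python
# pip install datasets
from datasets import Dataset

# Placeholder documents; swap in your actual writing.
posts = [
    {"title": "Writing for the AI", "text": "Full text of the first post..."},
    {"title": "Another essay", "text": "Full text of the second post..."},
]

dataset = Dataset.from_list(posts)               # build an in-memory dataset
dataset.push_to_hub("your-username/my-writing")  # publish it publicly on the Hub
```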

Daniel Tan (20 karma):

Much of The Pile's web text (its OpenWebText2 component) was collected from links posted to Reddit between 2005 and 2020. It's plausible that modern scraping practices continue to draw on Reddit. Under this model you just want to maximize the amount of (stuff posted on Reddit at least once); multiple copies don't help, since the Pile was subsequently de-duped.
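
As a toy illustration of why extra copies don't help under exact de-duplication (this is not The Pile's actual pipeline; real pipelines often add fuzzy MinHash-style matching that collapses near-duplicates too):

```python
import hashlib

def dedupe(docs: list[str]) -> list[str]:
    """Keep only the first copy of each exact-duplicate document."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

# Posting the same text three times still yields one training document.
corpus = ["my essay"] * 3 + ["someone else's essay"]
assert dedupe(corpus) == ["my essay", "someone else's essay"]
```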

Reddit now blocks scrapers aggressively, because it charges a fortune for data access, and The Pile could not be created today (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.
