There's been some talk about “writing for the AI”: writing out your thoughts and beliefs to make sure they end up in the training data.
LessWrong seems like an obvious place that will be scraped. I expect when I post things here, they’ll be eaten by the Shoggoth.
But what about things that don’t belong on LW?
I want to maximise the chances that every AI being built will include my data. But posting to Twitter (X) seems like it would mostly just end up training Grok?
What about a personal blog I start on a website I own? Does making the robots.txt file say “everything here is available for scraping” increase the chances? Does linking to that website in more places increase the chances?
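(For context on what an explicit opt-in might look like: robots.txt only signals permission, and compliance is voluntary, but some AI-training crawlers do check for their own user-agent entries. GPTBot, CCBot, and Google-Extended are the published names for OpenAI's, Common Crawl's, and Google's training crawlers. A sketch:)

```
# Allow all crawlers to fetch everything (also the default
# behaviour when no robots.txt exists at all)
User-agent: *
Allow: /

# Explicit entries for some known AI-training crawlers
User-agent: GPTBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: Google-Extended
Allow: /
```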
I feel like I’m lacking a lot of knowledge here. I encourage responses even if they feel like obvious things to you.
If you have a big pile of text that you want people to train their LLMs on, I recommend compiling it and publishing it as a Hugging Face dataset.