Viliam_Bur comments on Open Thread, March 16-31, 2012 - Less Wrong Discussion

2 Post author: OpenThreadGuy 16 March 2012 04:53AM

Comment author: Viliam_Bur 20 March 2012 12:10:28PM 1 point

Is the LW database structure available? If yes, you could prepare some SELECT queries and ask admins to run them for you and send you the result.

Anonymization: Replace user ids with "f(id+c)" where "f" is a hash function and "c" is a constant that will be modified by the admin before running your script. Replace times of karma clicks with "ym(time+r)" where "r" is a random value between 0 and 30 days, and "ym" is a function that returns only month and year. Select only data from the most recent year and only from users who were active during the whole year (made at least one vote in both the first and the last month of the time period). Would such data still be useful to you?
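To make the proposal concrete, here is a minimal Python sketch of the three pieces: f(id+c), ym(time+r), and the "active during the whole year" filter. The function names, the choice of SHA-256, and the 16-character truncation are illustrative assumptions, not anything taken from the LW codebase.

```python
import hashlib
import random
from datetime import datetime, timedelta


def anonymize_user_id(user_id: int, secret_c: int) -> str:
    """f(id + c): hash the id after adding the admin's secret constant.

    SHA-256 and the 16-character truncation are assumptions of this sketch.
    """
    digest = hashlib.sha256(str(user_id + secret_c).encode("utf-8"))
    return digest.hexdigest()[:16]


def blur_vote_time(ts: datetime, rng: random.Random) -> str:
    """ym(time + r): shift by a random 0 to 30 days, keep only year and month."""
    shifted = ts + timedelta(days=rng.uniform(0, 30))
    return shifted.strftime("%Y-%m")


def active_whole_period(vote_times, start: datetime, end: datetime) -> bool:
    """True if the user voted in both the first and the last month of the period."""
    months = {(t.year, t.month) for t in vote_times}
    return (start.year, start.month) in months and (end.year, end.month) in months
```

One caveat: if "c" is drawn from a small range, anyone who knows "f" can enumerate candidate values and reverse the mapping, so a long random secret would be the safer choice.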

Comment author: sixes_and_sevens 20 March 2012 01:45:24PM 0 points

My day job is DB admin and development. In the unlikely event of LW back-end admin-types being comfortable running a query sent in by some dude off the site, I wouldn't be comfortable giving it to them. The effort of due diligence on a foreign script is probably greater than that required to put it together.

The data I want correspond to:

  • the IDs (i.e. primary key, not the username) of all the users
  • the IDs (PK) and authorship (user ID) of all posts and comments in a contiguous ~3 month period
  • the adjacency of users and posts as upvotes and downvotes over this period (I assume this is a single junction table)
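For illustration only, the three extracts listed above might look like this against a toy schema. The table and column names (users, items, votes, created, voted, direction) are hypothetical, since the real LW schema is not known here, and restricting votes to those cast within the window on items created within the window is just one reading of "over this period".

```python
import sqlite3

# Hypothetical toy schema; the real LW tables are unknown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE items (id INTEGER PRIMARY KEY, author_id INTEGER, created TEXT);
CREATE TABLE votes (user_id INTEGER, item_id INTEGER, direction INTEGER, voted TEXT);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO items VALUES (10, 1, '2012-01-15'), (11, 2, '2011-06-01');
INSERT INTO votes VALUES (2, 10, 1, '2012-01-16'), (1, 11, -1, '2012-02-01');
""")

START, END = "2012-01-01", "2012-04-01"  # contiguous ~3 month window

# 1. Primary keys of all users (no usernames).
user_ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]

# 2. IDs and authorship of posts and comments created in the window.
items = conn.execute(
    "SELECT id, author_id FROM items "
    "WHERE created >= ? AND created < ? ORDER BY id",
    (START, END),
).fetchall()

# 3. User-to-item adjacency: +1 for an upvote, -1 for a downvote.
votes = conn.execute(
    "SELECT v.user_id, v.item_id, v.direction FROM votes v "
    "JOIN items i ON i.id = v.item_id "
    "WHERE i.created >= ? AND i.created < ? "
    "AND v.voted >= ? AND v.voted < ? "
    "ORDER BY v.user_id, v.item_id",
    (START, END, START, END),
).fetchall()
```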

If I were providing this data, I would also scramble the IDs in some fashion while maintaining the underlying relationships, as consecutive IDs could provide some small clue as to the identity and chronology of users or posts. While this is pretty straightforward, the mechanism for such scrambling should not be known to recipients of the data.
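One way such scrambling could work, sketched in Python under assumptions of my own (the function name and the sample rows are made up): assign every distinct ID a surrogate from a randomly shuffled range, apply the same mapping wherever that ID appears, and never release the mapping.

```python
import secrets


def build_scramble_map(ids):
    """Map each distinct original ID to a surrogate ID in random order."""
    unique = sorted(set(ids))
    surrogates = list(range(1, len(unique) + 1))
    # SystemRandom draws from the OS entropy source, so the order is not
    # reproducible from a seed that a recipient might guess.
    secrets.SystemRandom().shuffle(surrogates)
    return dict(zip(unique, surrogates))


# Made-up sample rows.
users = [101, 102, 103]
posts = [(9001, 101), (9002, 103)]          # (post_id, author_id)
votes = [(102, 9001, 1), (101, 9002, -1)]   # (voter_id, post_id, direction)

user_map = build_scramble_map(users)
post_map = build_scramble_map(post_id for post_id, _ in posts)

# Apply the same map everywhere an ID occurs, so relationships survive.
scrambled_posts = [(post_map[p], user_map[a]) for p, a in posts]
scrambled_votes = [(user_map[u], post_map[p], d) for u, p, d in votes]
```

Because the surrogates are a shuffled range rather than a function of the original values, their order carries no information about when a user registered or a post was made. The maps themselves stay with whoever runs the export.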