ChristianKl comments on What should normal people do? - Less Wrong

23 Post author: seez 25 October 2013 02:28AM




Comment author: gwern 28 October 2013 05:21:17PM 1 point [-]

The Mnemosyne data has been lying around for years without anyone analysing it. Going through that data and doing a bit of modeling with it should be easy for anyone who's looking for a computer science bachelor's thesis topic or otherwise seeking a project.

It's a real pain, though, because it's so big. A month after I started, I'm still only halfway through the logs->SQL step.

Comment author: ChristianKl 28 October 2013 06:28:51PM 0 points [-]

It's a real pain, though, because it's so big. A month after I started, I'm still only halfway through the logs->SQL step.

That sounds like you're doing one insert per transaction, which is the default way SQL operates. It's possible to batch multiple inserts together into one transaction.

If I remember right, the data was somewhere in the region of 10GB. I think a computer should be able to do the logs->SQL step in less than a day, provided one doesn't do one insert per transaction.
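A minimal sketch of the batching idea in Python's sqlite3 module: instead of committing after every row, group many inserts into one transaction and commit once per batch. The table name, schema, and batch size here are made up for illustration, not taken from the actual Mnemosyne import script.

```python
import sqlite3

def insert_batched(rows, batch_size=10000):
    """Insert rows in large batches, one commit (and thus one disk
    sync) per batch, rather than one commit per row."""
    con = sqlite3.connect(":memory:")  # hypothetical target database
    con.execute("CREATE TABLE logs (user TEXT, grade INTEGER)")
    for i in range(0, len(rows), batch_size):
        con.executemany("INSERT INTO logs VALUES (?, ?)",
                        rows[i:i + batch_size])
        con.commit()  # one fsync per 10,000 rows instead of per row
    return con

# Example usage with synthetic data:
rows = [("user%d" % i, i % 6) for i in range(50000)]
con = insert_batched(rows)
count = con.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
```

The speedup comes from the commit, not the insert: each commit forces the database to sync to disk, so committing per row makes the import disk-bound regardless of how fast the parsing is.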

Comment author: gwern 28 October 2013 10:31:44PM *  0 points [-]

I believe so, yeah. You can see an old copy of the script at http://github.com/bartosh/pomni/blob/master/mnemosyne/science_server/parse_logs.py (or download the Mnemosyne repo with bzr). My version is slightly different in that I made it a little more efficient by shifting the self.con.commit() call up into the exception handler, which is about as far as my current Python & SQL knowledge goes. I don't see anything in http://docs.python.org/2/library/sqlite3.html mentioning 'union', so I don't know how to improve the script.
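The change described above — moving the commit out of the per-row path and up into the exception handler — can be sketched roughly as follows. This is not the actual parse_logs.py code; the comma-separated log format, the `reviews` schema, and the function name are hypothetical stand-ins.

```python
import sqlite3

def import_logs(lines, con):
    """Insert parsed log lines, committing only when a malformed line
    is encountered and once at the very end, instead of after every
    single insert.

    The line format and `reviews` schema are hypothetical stand-ins
    for the real Mnemosyne log format.
    """
    cur = con.cursor()
    for line in lines:
        try:
            user, grade = line.split(",")
            cur.execute("INSERT INTO reviews VALUES (?, ?)",
                        (user, int(grade)))
        except ValueError:
            # Flush the rows inserted so far, then skip the bad line.
            con.commit()
    con.commit()  # final commit covers any remaining uncommitted rows
```

Because commits now only happen on parse errors (and once at the end), long runs of well-formed lines are absorbed into a single transaction, which is the same win as explicit batching.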

If I remember right the data was something in the size of 10GB.

The .bz2 logs are ~4GB; the half-done SQL database is ~18GB so I infer the final database will be ~36GB.

EDIT: my ultimate solution was to just spend $540 on an SSD, which finished the import process in a day; the final uploaded dataset was 2.8GB compressed and 18GB uncompressed (I'm not sure why it was half the size I expected).