
gwern comments on Worse Than Random


Comment author: gwern, 08 May 2011 02:22:37AM

I recently ran into a nice example of how randomness can be improved upon, but where you may not want to de-randomize.

One of my personal tools is a setup which archives & preserves URLs for later reference, since pretty much every web page will disappear or die eventually. The core is a loop which reads a URL out of a text file and then runs the network requests to archive it.

The problem is that I want to archive hundreds or thousands of URLs a week, but I can only archive one every 20 seconds. The file is sorted to eliminate duplicates. The obvious approach is to archive the first URL in the file, which is also the alphabetically first.
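In Haskell, the naive version of that loop might look something like the following minimal sketch. The file name urls.txt and the wget invocation are illustrative stand-ins, not the actual setup:

    import Control.Concurrent (threadDelay)
    import Control.Monad (unless)
    import System.Process (callProcess)

    -- Hypothetical stand-in for the real archiving request;
    -- here it just shells out to wget.
    archiveURL :: String -> IO ()
    archiveURL url = callProcess "wget" ["--quiet", url]

    main :: IO ()
    main = do
      contents <- readFile "urls.txt"
      length contents `seq` return ()  -- force the lazy read before rewriting the file
      let urls = lines contents
      unless (null urls) $ do
        archiveURL (head urls)                      -- always the alphabetically first URL
        writeFile "urls.txt" (unlines (tail urls))  -- drop it from the queue
        threadDelay (20 * 1000000)                  -- wait 20 seconds between requests
        main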

The problem with this is that as ever more URLs are dumped into the file, we wind up with a sort of resource starvation, where URLs beginning with an 'x' or 'z' may never get archived because the archiver keeps processing the 'a' and 'b' URLs first. More generally, we don't want to disproportionately favor any particular domain.

So what do we do? Select a random URL, of course. This avoids the bias problem: now the 'x' URLs have the same chance of being archived as a similar number of 'a' URLs.
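The change to the selection step is tiny. A sketch, using randomRIO from the standard random package (pickRandom is a hypothetical name, and it assumes a non-empty queue):

    import System.Random (randomRIO)

    -- Select a uniformly random URL, returning it together with the
    -- rest of the queue. Precondition: the list is non-empty.
    pickRandom :: [a] -> IO (a, [a])
    pickRandom xs = do
      i <- randomRIO (0, length xs - 1)
      let (before, chosen:after) = splitAt i xs  -- safe: i is within bounds
      return (chosen, before ++ after)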

But is random selection optimal? Nope. As with the lock example, we can improve the distribution by adding some sort of 'memory' of which domains have already gotten a URL archived: one could randomly select a URL, so long as its domain hasn't had a URL archived within the last n archives.
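A sketch of what that might look like, with a hypothetical pickFresh that takes the list of the n most recently archived domains as its memory:

    import Data.List (isPrefixOf)
    import System.Random (randomRIO)

    -- Crude domain extraction; a real version would use a proper URI
    -- parser such as Network.URI from the network-uri package.
    domain :: String -> String
    domain url = takeWhile (/= '/') (strip "https://" (strip "http://" url))
      where strip p s = if p `isPrefixOf` s then drop (length p) s else s

    -- Randomly select a URL whose domain has not been archived recently,
    -- falling back to plain random choice if every domain is 'hot'.
    pickFresh :: [String] -> [String] -> IO String
    pickFresh recentDomains urls =
      case filter (\u -> domain u `notElem` recentDomains) urls of
        []   -> pick urls  -- every candidate recently served; fall back
        free -> pick free
      where
        pick xs = do
          i <- randomRIO (0, length xs - 1)
          return (xs !! i)

After each archive, the caller would update the memory with something like take n (domain chosen : recentDomains).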

Why haven't I done this? Because the archiver is meant to survive frequent crashes and shutdowns (it runs on my laptop), and to maintain a memory over more than a few days, I would have to implement code to regularly serialize the memory out to disk and read it back in. This wouldn't be too bad in Haskell (perhaps another 5 or 6 lines of code), but it represents a sizable increase in the source code and the complexity of the design, and the benefit is fairly small. In this case, I'm happy to use the simpler, dumber randomized solution.
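For what it's worth, that persistence code might look roughly like this, on the order of the 5 or 6 lines mentioned above (the file name memory.txt is hypothetical):

    import System.IO.Error (catchIOError, isDoesNotExistError)

    -- Dump the recent-domain memory to disk after every archive.
    saveMemory :: [String] -> IO ()
    saveMemory = writeFile "memory.txt" . unlines

    -- Reload the memory on startup, treating a missing file as empty.
    loadMemory :: IO [String]
    loadMemory =
      (do contents <- readFile "memory.txt"
          length contents `seq` return (lines contents))  -- force the lazy read
      `catchIOError` \e ->
        if isDoesNotExistError e
          then return []  -- first run: no memory yet
          else ioError e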