EHeller comments on Welcome to Less Wrong! (5th thread, March 2013) - Less Wrong

27 Post author: orthonormal 01 April 2013 04:19PM

Comment author: HumanitiesResearcher 17 April 2013 01:14:57AM *  7 points

Hi everyone,

I'm a humanities PhD who's been reading Eliezer for a few years, and who's been checking out LessWrong for a few months. I'm well-versed in the rhetorical dark arts, due to my current education, but I also have a BA in Economics (yet math is still my weakest suit). The point is, I like facts despite the deconstructivist tendency of humanities since the eighties. Now is a good time for hard-data approaches to the humanities. I want to join that party. My heart's desire is to workshop research methods with the LW community.

It may break protocol, but I'd like to offer a preview of my project in this introduction. I'm interested in associating the details of print production with an unnamed aesthetic object, which we'll presently call the Big Book, and which is the source of all of our evidence. The Big Book had multiple unknown sites of production, which we'll call Print Shop(s) [1-n]. I'm interested in pinning down which parts of the Big Book were made in which Print Shop. Print Shop 1 has Tools (1), and those Tools (1) leave unintended Marks in the Big Book. Likewise with Print Shop 2 and their Tools (2). Unfortunately, people in the present don't know which Print Shop had which Tools. Even worse, multiple sets of Tools can leave similar Marks.

The most obvious solution that I can see is

  • to catalog all Marks in the Big Book by sheet (a unit of print production, as opposed to the page), then
  • sort sheets by patterns of Marks, then
  • make some associations between the patterns of Marks and Print Shops, and then
  • propose Print Shops [x,y,z] to be the sites of production for the Big Book.

If nothing else, this method can define the n-number of Print Shops responsible for the Big Book.

The Bayesian twist on the obvious solution is to add some testing onto the associations, above. Specifically,

  • find some books strongly associated with Print Shops [x,y,z], in order to

  • assign probability of patterns of Marks to each Print Shop, then

  • revise initial associations between Print Shops [x,y,z] and the Big Book proportionally.
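The three Bayesian steps above can be sketched roughly as follows. This is only a minimal illustration, not a worked-out method: it assumes each sheet is reduced to the set of Marks seen on it, that Marks are independent given the Print Shop (a naive Bayes simplification), and that the function names, the smoothing constant `alpha`, and the toy data are all hypothetical.

```python
from collections import Counter

def mark_likelihoods(reference_sheets, all_marks, alpha=1.0):
    """Estimate P(mark present | shop) from sheets of books tied to one shop.

    alpha is an assumed Laplace smoothing constant so that a mark never seen
    in the reference books does not get probability exactly zero.
    """
    n = len(reference_sheets)
    counts = Counter(m for sheet in reference_sheets for m in set(sheet))
    return {m: (counts[m] + alpha) / (n + 2 * alpha) for m in all_marks}

def posterior(sheet_marks, likelihoods_by_shop, prior):
    """Return P(shop | marks on this sheet) via Bayes' rule."""
    unnormalized = {}
    for shop, likelihood in likelihoods_by_shop.items():
        p = prior[shop]
        for mark, p_mark in likelihood.items():
            # Multiply in P(present) or P(absent) for every catalogued mark.
            p *= p_mark if mark in sheet_marks else (1.0 - p_mark)
        unnormalized[shop] = p
    total = sum(unnormalized.values())
    return {shop: p / total for shop, p in unnormalized.items()}

# Hypothetical reference books: sheets known to come from shops "x" and "y".
reference = {
    "x": [{"A", "B"}, {"A"}, {"A", "B"}],
    "y": [{"C"}, {"C", "D"}, {"D"}],
}
all_marks = {"A", "B", "C", "D"}
likelihoods = {s: mark_likelihoods(sheets, all_marks) for s, sheets in reference.items()}
prior = {"x": 0.5, "y": 0.5}  # initial association, assumed uniform here

# Revise the association for one Big Book sheet showing Marks A and B.
print(posterior({"A", "B"}, likelihoods, prior))
```

The uniform prior is just a placeholder for the "initial association" step; any prior knowledge about which Print Shops are plausible would go there instead.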

I'm far from an expert in Bayesian methods, but it seems already that there's something missing here. Is there some stage where I should take a control sample? Also, how can I find a logical basis for the initial association step, when there are many potential Print Shops? Lastly, how can I account for the decay of Tools, thus increasing Marks, over time?

Comment author: EHeller 17 April 2013 01:28:24AM *  3 points

Any time you are doing statistical analysis, you want a sample of data that you don't use to tune the model and for which you know the right answer (a 'holdout' sample).

In this case, you should have several books related to the various print shops that you don't feed into your Bayesian algorithm. You can then assess the algorithm by seeing if it gets these books correct.
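A rough sketch of that holdout check, under stated assumptions: the labeled data are (marks, shop) pairs from books of known origin, and the overlap-based classifier below is only a hypothetical stand-in for whatever Bayesian algorithm is actually used. All names and the toy data are illustrative.

```python
import random

def split_holdout(labeled_sheets, holdout_fraction=0.25, seed=0):
    """Shuffle labeled (marks, shop) pairs and set a fraction aside as holdout."""
    items = list(labeled_sheets)
    random.Random(seed).shuffle(items)
    k = max(1, int(len(items) * holdout_fraction))
    return items[k:], items[:k]  # (training, holdout)

def train_overlap_classifier(training):
    """Stand-in model: pick the shop whose known marks overlap the sheet most."""
    marks_by_shop = {}
    for marks, shop in training:
        marks_by_shop.setdefault(shop, set()).update(marks)

    def classify(marks):
        return max(sorted(marks_by_shop),
                   key=lambda shop: len(marks_by_shop[shop] & marks))
    return classify

def holdout_accuracy(classify, holdout):
    """Fraction of holdout sheets assigned to their known shop."""
    correct = sum(1 for marks, shop in holdout if classify(marks) == shop)
    return correct / len(holdout)

# Hypothetical sheets from books whose print shop is already known.
labeled = [
    (frozenset("AB"), "x"), (frozenset("A"), "x"),
    (frozenset("AB"), "x"), (frozenset("B"), "x"),
    (frozenset("CD"), "y"), (frozenset("C"), "y"),
    (frozenset("D"), "y"), (frozenset("CD"), "y"),
]
training, holdout = split_holdout(labeled)
classify = train_overlap_classifier(training)  # tuned on training data only
print(holdout_accuracy(classify, holdout))     # scored on unseen sheets only
```

The point of the design is that the holdout pairs never touch `train_overlap_classifier`, so the accuracy figure is not inflated by the model having already seen the answers.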

To account for the decay of the tools, you need books that you know not only came from print shop x, y, or z, but also you'd need to know how old the tools were that made those books. Either that, or you'd have to have some understanding of how the tools decay from a theoretical model.

Comment author: Vaniver 17 April 2013 02:48:32PM 1 point

To account for the decay of the tools, you need books that you know not only came from print shop x, y, or z, but also you'd need to know how old the tools were that made those books. Either that, or you'd have to have some understanding of how the tools decay from a theoretical model.

If you assume that the marks result from defects in the tool that accumulate, it should be relatively easy to build (and test) a monotonic model. Suppose we have an unordered collection of sheets, with some variable number of defects per sheet. If the defects are repeated (i.e. we can recognize defect A whenever we see it, as well as B, and so on), then we can piece together paths: all of the sheets without defects pointing towards all of the sheets with just defect A, then defects A and B, and so on. There should be divergence: if we never see sheets with both defect A and defect C, then we can conclude the 0-A-B path is one tool (with only some of the 0-defect sheets coming from that tool, obviously), the 0-C-D-E path is another tool, and the 0-F-G path is a third tool. (Noting that here 'tool' refers to one repair cycle, not the entire lifecycle.)
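One way to sketch this path-building, assuming each sheet is recorded as a set of recognizable defect labels: defects that ever co-occur on a sheet are grouped into one tool (one repair cycle), and within each group the observed defect sets should form a chain ordered by inclusion if the monotonic model holds. The function name and the sample sheets are hypothetical.

```python
def defect_paths(sheets):
    """Group defects into tools and order each tool's observed defect states.

    Returns a list of (states, is_monotonic) pairs, one per inferred tool.
    Zero-defect sheets are skipped, since they cannot be tied to any one tool.
    """
    groups = []  # each group: set of defects that co-occur, directly or not
    for sheet in sheets:
        if not sheet:
            continue
        touching = [g for g in groups if g & sheet]
        merged = set(sheet).union(*touching)
        groups = [g for g in groups if not (g & sheet)] + [merged]

    paths = []
    for group in groups:
        # Distinct defect sets seen for this tool, fewest defects first.
        states = sorted({frozenset(s) for s in sheets if s and s <= group}, key=len)
        # Monotonic accumulation means each state strictly contains the previous.
        is_monotonic = all(a < b for a, b in zip(states, states[1:]))
        paths.append((states, is_monotonic))
    return paths

# Hypothetical unordered sheets matching the 0-A-B, 0-C-D-E, 0-F-G example.
sheets = [
    {"C", "D"}, set(), {"A"}, {"F", "G"}, {"A", "B"},
    {"C"}, {"C", "D", "E"}, {"F"}, set(),
]
for states, ok in defect_paths(sheets):
    print(" -> ".join("".join(sorted(s)) for s in states), "| monotonic:", ok)
```

If some group comes back with `is_monotonic` false (say, sheets with A+B and A+C but never all three), that is evidence against the simple accumulation model for that tool, or a sign that two tools share a similar-looking defect.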

Comment author: EHeller 17 April 2013 06:26:47PM 1 point

If you assume that the marks result from defects in the tool that accumulate, it should be relatively easy to build (and test) a monotonic model

The first assumption seems bad to me- I would assume defects accumulate only until equipment is reset or repaired, which is why I think you'd want some actual data.

Comment author: Vaniver 17 April 2013 07:09:14PM 1 point

The first assumption seems bad to me- I would assume defects accumulate only until equipment is reset or repaired, which is why I think you'd want some actual data.

That looks to me like it agrees with my assumption; I suspect my grammar is somehow unclear. (Note the last line of the grandparent.)

Comment author: HumanitiesResearcher 18 April 2013 05:18:51AM 0 points

Yes, I see an accord between your statement and Vaniver's. As I said below, most tools have very slow repair cycles.

Comment author: HumanitiesResearcher 17 April 2013 01:57:49PM 1 point

Very helpful points, thanks. The scholarly community already has a pretty good working knowledge of the Tools, and thus the theoretical model of Tool breakage ("breakage" may be more accurate than "decay," since the decay is non-incremental and stochastic). We know the order in which parts of the Tools break, and we have some hypotheses correlating breakage to gross usage. The twist is that we don't know when any Print Shops produced the Big Book, so we can only extrapolate a timeline based on Tool breakage.

Can you say more about the holdout sample? Should the holdout sample be a randomly selected sample of data, or something suspected to be associated with Print Shops [x,y,z]? Or with Print Shops [a,b,c]?