gwern comments on Welcome to Less Wrong! (5th thread, March 2013) - Less Wrong

27 points. Post author: orthonormal 01 April 2013 04:19PM

Comment author: HumanitiesResearcher 17 April 2013 01:14:57AM * 7 points

Hi everyone,

I'm a humanities PhD who's been reading Eliezer for a few years, and who's been checking out LessWrong for a few months. I'm well-versed in the rhetorical dark arts, due to my current education, but I also have a BA in Economics (though math is still my weakest suit). The point is, I like facts despite the deconstructionist tendency of the humanities since the eighties. Now is a good time for hard-data approaches to the humanities. I want to join that party. My heart's desire is to workshop research methods with the LW community.

It may break protocol, but I'd like to offer a preview of my project in this introduction. I'm interested in associating the details of print production with an unnamed aesthetic object, which we'll presently call the Big Book, and which is the source of all of our evidence. The Big Book had multiple unknown sites of production, which we'll call Print Shop(s) [1-n]. I'm interested in pinning down which parts of the Big Book were made in which Print Shop. Print Shop 1 has Tools (1), and those Tools (1) leave unintended Marks in the Big Book. Likewise with Print Shop 2 and their Tools (2). Unfortunately, people in the present don't know which Print Shop had which Tools. Even worse, multiple sets of Tools can leave similar Marks.

The most obvious solution that I can see is

  • to catalog all Marks in the Big Book by sheet (a unit of print production, as opposed to the page), then
  • sort sheets by patterns of Marks, then
  • make some associations between the patterns of Marks and Print Shops, and then
  • propose Print Shops [x,y,z] to be the sites of production for the Big Book.

If nothing else, this method can estimate n, the number of Print Shops responsible for the Big Book.
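The cataloging-and-sorting steps above can be sketched in a few lines. (All sheet IDs and Mark names below are invented for illustration; this is a minimal sketch, not the actual catalog.)

```python
from collections import defaultdict

# Hypothetical catalog: each sheet of the Big Book mapped to the set of
# Marks found on it. Sheet IDs and Mark names are invented for illustration.
sheet_marks = {
    "sheet_01": frozenset({"bent_serif", "ink_blot"}),
    "sheet_02": frozenset({"bent_serif", "ink_blot"}),
    "sheet_03": frozenset({"cracked_W"}),
    "sheet_04": frozenset({"cracked_W"}),
    "sheet_05": frozenset({"bent_serif", "ink_blot"}),
}

# Sort sheets into groups that share an identical pattern of Marks.
pattern_groups = defaultdict(list)
for sheet, marks in sorted(sheet_marks.items()):
    pattern_groups[marks].append(sheet)

# The number of distinct patterns gives a first, crude estimate of n, the
# number of Print Shops. (Shops can share or change patterns, so this is
# only a starting point, not a conclusion.)
n_patterns = len(pattern_groups)
```

The grouping step is where the real difficulty hides: since multiple sets of Tools can leave similar Marks, identical patterns do not guarantee identical shops.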

The Bayesian twist on the obvious solution is to add testing to the associations above. Specifically,

  • find some books strongly associated with Print Shops [x,y,z], in order to
  • estimate the probability of each pattern of Marks for each Print Shop, then
  • revise the initial associations between Print Shops [x,y,z] and the Big Book proportionally.
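The proposed revision step is ordinary Bayes' rule. A toy version, with the shop names and every probability invented for illustration:

```python
# Toy version of the proposed update. Assume Marks on securely attributed
# books let us estimate how often each candidate Print Shop leaves a given
# Mark. Shop names and all probabilities below are invented.

priors = {"shop_x": 1 / 3, "shop_y": 1 / 3, "shop_z": 1 / 3}

# P(a sheet shows the mark "bent_serif" | it was printed at each shop),
# estimated from books strongly associated with each shop:
likelihood = {"shop_x": 0.60, "shop_y": 0.10, "shop_z": 0.05}

# Bayes' rule: posterior is proportional to prior times likelihood.
unnormalized = {s: priors[s] * likelihood[s] for s in priors}
total = sum(unnormalized.values())
posterior = {s: p / total for s, p in unnormalized.items()}
# The initially uniform association is revised heavily toward shop_x.
```

In practice the update would run over whole patterns of Marks rather than a single Mark, but the arithmetic is the same.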

I'm far from an expert in Bayesian methods, but it seems already that there's something missing here. Is there some stage where I should take a control sample? Also, how can I find a logical basis for the initial association step when there are many potential Print Shops? Lastly, how can I account for the decay of Tools over time, which increases the Marks they leave?

Comment author: gwern 17 April 2013 02:10:50AM 6 points

I'm interested in associating the details of print production with an unnamed aesthetic object, which we'll presently call the Big Book, and which is the source of all of our evidence.

It's the Bible, isn't it.

Print Shop 1 has Tools (1), and those Tools (1) leave unintended Marks in the Big Book. Likewise with Print Shop 2 and their Tools (2). Unfortunately, people in the present don't know which Print Shop had which Tools. Even worse, multiple sets of Tools can leave similar Marks.

How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.

I'm far from an expert in Bayesian methods, but it seems already that there's something missing here.

Have you considered googling for previous work? 'Bayesian inference in phylogeny' and 'Bayesian stylometry' both seem like reasonable starting points.

Comment author: Vaniver 17 April 2013 02:26:50PM 2 points

How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.

Not quite. You can get quite a bit of insight out of unsupervised clustering.
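A minimal illustration of that point: with no shop labels anywhere, sheets can still be grouped by the similarity of their Mark patterns. (The data and the 0.5 similarity threshold are invented; real work would use proper clustering methods.)

```python
# Unsupervised grouping: no Print Shop labels, only the sheets' Mark sets.
# All data and the similarity threshold are invented for illustration.

def jaccard(a, b):
    """Jaccard similarity between two sets of Marks."""
    return len(a & b) / len(a | b)

sheets = {
    "s1": {"bent_serif", "ink_blot"},
    "s2": {"bent_serif", "ink_blot", "smudge"},
    "s3": {"cracked_W", "pale_ink"},
    "s4": {"cracked_W"},
}

# Greedy single-link clustering: a sheet joins the first cluster containing
# any sheet at least 50% similar to it, otherwise it starts a new cluster.
clusters = []
for name, marks in sheets.items():
    for cluster in clusters:
        if any(jaccard(marks, sheets[member]) >= 0.5 for member in cluster):
            cluster.append(name)
            break
    else:
        clusters.append([name])
# Two clusters emerge, hinting at (at least) two sources, with no labels used.
```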

Comment author: gwern 17 April 2013 03:43:00PM 1 point

'No free lunches', right? If you're getting anything out of your unsupervised methods, that just means they're making some sort of assumptions and proceeding based on those.

Comment author: Vaniver 17 April 2013 04:20:38PM 4 points

Right, but this isn't a free lunch so much as "you can see a lot by looking."

Comment author: HumanitiesResearcher 18 April 2013 05:29:38AM 4 points

Sorry to interrupt a perfectly lovely conversation. I just have a few things to add:

  • I may have overstated the case in my first post. We have some information about print shops. Specifically, we can assign very small books to print shops with a high degree of confidence. (The catch is that small books don't tend to survive very well. The remaining population is rare and intermittent in terms of production date.)

  • There are some hypotheses that could be treated as priors, but they're very rarely quantified (projects like this are rare in today's humanities).

Comment author: HumanitiesResearcher 17 April 2013 01:47:45PM * 2 points

Interesting feedback.

It's the Bible, isn't it.

Ha, I wish. No, it's more specific to literature.

How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.

We have minimal information about Print Shops. I wouldn't say the existing data are garbage, just mostly unquantified.

Have you considered googling for previous work?

Yes, but thanks to you I know the shibboleth of "Bayesian stylometry." Makes sense, and I've already read some books in a similar vein, but there are some problems. Most fundamentally, I have trouble translating the methods to a different type of data: from textual data like word length to the aforementioned Marks. Otherwise, my understanding of most stylometric analysis was that it favors frequentist methods. Can you clear any of this up?

EDIT: I have a follow-up question regarding GIGO: How can you tell what data are garbage? Are the degrees of certainty based on significant digits of measurement, or what?

Comment author: gwern 17 April 2013 03:47:18PM 1 point

Most fundamentally, I have trouble translating the methods to a different type of data: from textual data like word length to the aforementioned Marks.

Have to define your features somehow.

Otherwise, my understanding of most stylometric analysis was that it favors frequentist methods.

Really? I was under the opposite impression: that stylometry has been one of the areas of triumph for Bayesianism since the '60s or so, starting with Mosteller & Wallace's Bayesian investigation of the Federalist Papers.
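The flavour of that approach can be sketched with a toy log-odds comparison. (Mosteller & Wallace's actual model was richer, using negative binomial distributions over many function words; the rates and counts below are invented.)

```python
import math

# Toy log-odds test in the spirit of the Federalist study. Invented rates
# of the word "upon" per 1,000 words for the two candidate authors:
rate = {"hamilton": 3.0, "madison": 0.2}

def poisson_log_lik(count, expected):
    """Log-likelihood of seeing `count` occurrences under a Poisson model."""
    return count * math.log(expected) - expected - math.lgamma(count + 1)

# A disputed 1,000-word paper containing "upon" twice:
count = 2
log_odds = (poisson_log_lik(count, rate["hamilton"])
            - poisson_log_lik(count, rate["madison"]))
# log_odds > 0 shifts belief toward Hamilton; < 0 toward Madison.
```

The same machinery transfers from word counts to Mark counts: only the choice of features changes, not the inference.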

I have a follow-up question regarding GIGO: How can you tell what data are garbage? Are the degrees of certainty based on significant digits of measurement, or what?

No, not really. I think I would describe GIGO in this context as 'data which is equally consistent with all theories'.

Comment author: HumanitiesResearcher 18 April 2013 12:52:41AM 0 points

Have to define your features somehow.

I don't understand what this means. Can you say more?

Comment author: gwern 18 April 2013 01:04:52AM 0 points

http://en.wikipedia.org/wiki/Feature_%28machine_learning%29 A specific concrete variable you can code up, like 'total number of commas'.
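A feature in this sense is just a countable property computed from each unit of evidence; for example (sample text and feature names invented):

```python
# Concrete feature extraction: each unit of evidence (here, a snippet of
# text) becomes a dictionary of countable variables. The sample text and
# feature names are invented for illustration.

def extract_features(text):
    return {
        "n_commas": text.count(","),
        "n_semicolons": text.count(";"),
        "n_words": len(text.split()),
    }

features = extract_features("In the beginning, the press was set; the type, worn.")
```

For the Big Book, the analogous function would map each sheet's physical description to counts of each kind of Mark.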

Comment author: HumanitiesResearcher 18 April 2013 05:12:18AM 1 point

I have just such a thing, referred to as "Marks." I haven't yet included that in the code, because I wanted to explore the viability of the method first. So to retreat to the earlier question, why does my proposal strike you as a GIGO situation?

Comment author: gwern 18 April 2013 04:26:36PM 1 point

So to retreat to the earlier question, why does my proposal strike you as a GIGO situation?

You claimed to not know what printers there were, how many there were, and what connection they had to 'Marks'. In such a situation, what on earth do you think you can infer at all? You have to start somewhere: 'we have good reason to believe there were not more than 20 printers, and we think the London printer usually messed up the last page. Now, from this we can start constructing these phylogenetic trees indicating the most likely printers for our sample of books...' There is no view from nowhere, you cannot pick yourself up by your bootstraps, all observation is theory-laden, etc.
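That hypothetical starting point translates directly into an update (all numbers invented, per the example above):

```python
# Gwern's hypothetical assumptions made explicit: 20 equally likely
# candidate printers, and the London printer botches the last page 70% of
# the time versus 10% for the others. All numbers are invented.
p_london = 1 / 20
p_other = 19 / 20
p_botch_given_london = 0.70
p_botch_given_other = 0.10

# Observing a botched last page, Bayes' rule lifts the London hypothesis:
p_botch = p_london * p_botch_given_london + p_other * p_botch_given_other
p_london_given_botch = p_london * p_botch_given_london / p_botch
# From a 5% prior to roughly a 27% posterior: weak assumptions, real leverage.
```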

Comment author: HumanitiesResearcher 21 April 2013 04:27:25PM 0 points

This all sounds good to me. In fact, I believe that researchers in the humanities are especially (perhaps overly) sensitive to the reciprocal relationship between theory and observation.

I may have overstated the ignorance of the current situation. The scholarly community has already made some claims connecting the Big Book to Print Shops [x,y,z]. The problem is that those claims are either made on non-quantitative bases (e.g., "This mark seems characteristic of this Print Shop's status.") or on a very naive frequentist basis (e.g., "This mark comes up N times, and that's a big number, so it must be from Print Shop X"). My project would take these existing claims as priors. Is that valid?

Comment author: gwern 21 April 2013 05:14:54PM 0 points

I have no idea. If you want answers like that, you should probably go talk to a statistician at sufficient length to convey the domain-specific knowledge involved or learn statistics yourself.