Comment author: gwern 17 April 2013 03:47:18PM 1 point

Most fundamentally, I have trouble translating the methods to a different type of data: from textual data like word length to the aforementioned Marks.

Have to define your features somehow.

Otherwise, my understanding was that most stylometric analysis favors frequentist methods.

Really? I was under the opposite impression, that stylometry was, since the '60s or so with the Bayesian investigation of Mosteller & Wallace into the Federalist papers, one of the areas of triumph for Bayesianism.

I have a follow-up question regarding GIGO: How can you tell what data are garbage? Are the degrees of certainty based on significant digits of measurement, or what?

No, not really. I think I would describe GIGO in this context as 'data which is equally consistent with all theories'.

Comment author: HumanitiesResearcher 18 April 2013 12:52:41AM 0 points

Have to define your features somehow.

I don't understand what this means. Can you say more?

Comment author: Vaniver 17 April 2013 02:36:17PM 5 points

This is a problem that machine learning can tackle. Feel free to contact me by PM for technical help.

To make sure I understand your problem:

We have many copies of the Big Book. Each copy is a collection of many sheets. Each sheet was produced by a single tool, but each tool produces many sheets. Each shop contains many tools, but each tool is owned by only one shop.

Each sheet has information in the form of marks. Sheets created by the same tool at similar times have similar marks. It may be the case that the marks monotonically increase until the tool is repaired.

Right now, we have enough to take a database of marks on sheets and figure out how many tools we think there were, how likely it is each sheet came from each potential tool, and to cluster tools into likely shops. (Note that a 'tool' here is probably only one repair cycle of an actual tool, if they are able to repair it all the way to freshness.)

We can either do this unsupervised, and then compare to whatever other information we can find (if we have a subcollection of sheets with known origins, we can see how well the estimated probabilities did), or we can try to include that information for supervised learning.
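
For concreteness, here is a minimal sketch of the unsupervised route in Python, assuming the Marks have already been quantified into one numeric feature vector per sheet (the sheet_features array below is a random stand-in for real measurements):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # Hypothetical input: one row per sheet, one column per Mark feature.
    rng = np.random.default_rng(0)
    sheet_features = rng.normal(size=(500, 12))  # stand-in for real data

    # A Dirichlet-process mixture prunes components it doesn't need, so
    # n_components only has to be an upper bound on the number of tools.
    mix = BayesianGaussianMixture(
        n_components=30,
        weight_concentration_prior_type="dirichlet_process",
        random_state=0,
    ).fit(sheet_features)

    tool_probs = mix.predict_proba(sheet_features)  # P(tool | sheet), per row
    n_tools = int((mix.weights_ > 0.01).sum())      # components that survive

Clustering the inferred tools into shops would then be a second pass of the same idea over the tool-level parameters.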

Comment author: HumanitiesResearcher 17 April 2013 03:42:14PM *  5 points

That's a hell of a summary, thanks!

I'm glad you mentioned the repair cycle of tools. There are some tools that are regularly repaired (let's just call them "Big Tools") and some that aren't ("Little Tools"). Both are expensive to acquire and to repair, but it seems the Print Shops chose to repair Big Tools because they were subject to breakage that significantly reduced performance.

I should add another twist since you mentioned sheets of known origins: Assume that we can only decisively assign origins to single sheets. There are two problems stemming from this assumption: first, not all relevant Marks are left on such sheets; second, very few single sheet publications survive. Collations greater than one sheet are subject to all of the problems of the Big Book.

I'm most interested in the distinction between unsupervised and supervised learning. And I will very likely PM you to learn more about machine learning. Again, thanks for your help!

EDIT: I just noticed a mistake in your summary. Each sheet is produced by a set of tools, not a single tool. Each mark is produced by a single tool.

Comment author: EHeller 17 April 2013 01:28:24AM *  3 points

Any time you are doing statistical analysis, you want a sample of data that you don't use to tune the model and where you know the right answer (a 'holdout' sample).

In this case, you should have several books related to the various print shops that you don't feed into your Bayesian algorithm. You can then assess the algorithm by seeing if it gets these books correct.
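
A minimal sketch of that assessment in Python, with placeholder data and a placeholder classifier; the only point is that the holdout rows never touch the fitting step:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Hypothetical labeled sheets: Mark features plus known shop of origin.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 12))
    y = rng.integers(0, 3, size=300)  # shops x, y, z coded as 0, 1, 2

    # Keep 25% of the known-origin sheets out of model fitting entirely.
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)

    model = GaussianNB().fit(X_fit, y_fit)
    print("holdout accuracy:", model.score(X_hold, y_hold))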

To account for the decay of the tools, you need books for which you know not only the print shop (x, y, or z) but also how old the tools that made them were. Either that, or you'd have to have some understanding of how the tools decay from a theoretical model.
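
If you go the theoretical-model route, even a toy generative story helps, because it can be written down explicitly and later fit to data. A sketch with invented parameters: each impression has a small chance of adding a permanent defect, and a repair resets the count.

    import numpy as np

    rng = np.random.default_rng(2)

    def simulate_defects(n_impressions, p_new_defect=0.002, repair_every=5000):
        """Toy wear model: defects accumulate monotonically; repair resets them."""
        defects, history = 0, []
        for i in range(1, n_impressions + 1):
            if i % repair_every == 0:
                defects = 0                   # the shop repairs the tool
            elif rng.random() < p_new_defect:
                defects += 1                  # a new permanent Mark appears
            history.append(defects)
        return np.array(history)

    marks_over_time = simulate_defects(20000)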

Comment author: HumanitiesResearcher 17 April 2013 01:57:49PM 1 point

Very helpful points, thanks. The scholarly community already has a pretty good working knowledge of the Tools, and thus a theoretical model of Tool breakage ("breakage" may be more accurate than "decay," since the process is non-incremental and stochastic). We know the order in which parts of the Tools break, and we have some hypotheses correlating breakage to gross usage. The twist is that we don't know when any Print Shops produced the Big Book, so we can only extrapolate a timeline based on Tool breakage.
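
If breakage really is monotone between repairs, the defect state alone induces a relative printing order, even with no dates anywhere. A sketch (the sheet names and defect counts are invented):

    # Hypothetical observed defect counts for sheets printed by one Tool.
    defect_counts = {"sheet_D4": 3, "sheet_A1": 1, "sheet_F6": 5}

    # More defects = printed later within one repair cycle, so sorting
    # yields a relative, not absolute, timeline.
    relative_order = sorted(defect_counts, key=defect_counts.get)
    print(relative_order)  # ['sheet_A1', 'sheet_D4', 'sheet_F6']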

Can you say more about the holdout sample? Should it be a randomly selected sample of the data, or something suspected to be associated with Print Shops [x,y,z]? Or with Print Shops [a,b,c]?

Comment author: gwern 17 April 2013 02:10:50AM 6 points

I'm interested in associating the details of print production with an unnamed aesthetic object, which we'll presently call the Big Book, and which is the source of all of our evidence.

It's the Bible, isn't it.

Print Shop 1 has Tools (1), and those Tools (1) leave unintended Marks in the Big Book. Likewise with Print Shop 2 and their Tools (2). Unfortunately, people in the present don't know which Print Shop had which Tools. Even worse, multiple sets of Tools can leave similar Marks.

How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.

I'm far from an expert in Bayesian methods, but it already seems that there's something missing here.

Have you considered googling for previous work? 'Bayesian inference in phylogeny' and 'Bayesian stylometry' both seem like reasonable starting points.

Comment author: HumanitiesResearcher 17 April 2013 01:47:45PM *  2 points

Interesting feedback.

It's the Bible, isn't it.

Ha, I wish. No, it's more specific to literature.

How can you possibly get off the ground if you have no information about any of the Print Shops, much less how many there are? GIGO.

We have minimal information about Print Shops. I wouldn't say the existing data are garbage, just mostly unquantified.

Have you considered googling for previous work?

Yes, but thanks to you I now know the shibboleth of "Bayesian stylometry." Makes sense, and I've already read some books in a similar vein, but there are some problems. Most fundamentally, I have trouble translating the methods to a different type of data: from textual data like word length to the aforementioned Marks. Otherwise, my understanding was that most stylometric analysis favors frequentist methods. Can you clear any of this up?

EDIT: I have a follow-up question regarding GIGO: How can you tell what data are garbage? Are the degrees of certainty based on significant digits of measurement, or what?

Comment author: PrawnOfFate 17 April 2013 02:14:27AM *  0 points

How about talking clearly about whatever you are currently hinting at?

Comment author: HumanitiesResearcher 17 April 2013 01:42:44PM 0 points

Thanks for the feedback. I actually cleared up the technical language considerably. I don't think there's any need to get lost in the weeds of the specifics while I'm still hammering out the method.

Comment author: HumanitiesResearcher 17 April 2013 01:14:57AM *  7 points

Hi everyone,

I'm a humanities PhD who's been reading Eliezer for a few years, and who's been checking out LessWrong for a few months. I'm well-versed in the rhetorical dark arts, due to my current education, but I also have a BA in Economics (yet math is still my weakest suit). The point is, I like facts despite the deconstructivist tendency of the humanities since the eighties. Now is a good time for hard-data approaches to the humanities. I want to join that party. My heart's desire is to workshop research methods with the LW community.

It may break protocol, but I'd like to offer a preview of my project in this introduction. I'm interested in associating the details of print production with an unnamed aesthetic object, which we'll presently call the Big Book, and which is the source of all of our evidence. The Big Book had multiple unknown sites of production, which we'll call Print Shop(s) [1-n]. I'm interested in pinning down which parts of the Big Book were made in which Print Shop. Print Shop 1 has Tools (1), and those Tools (1) leave unintended Marks in the Big Book. Likewise with Print Shop 2 and their Tools (2). Unfortunately, people in the present don't know which Print Shop had which Tools. Even worse, multiple sets of Tools can leave similar Marks.

The most obvious solution that I can see is

  • to catalog all Marks in the Big Book by sheet (a unit of print production, as opposed to the page; this step is sketched below the list), then
  • sort sheets by patterns of Marks, then
  • make some associations between the patterns of Marks and Print Shops, and then
  • propose Print Shops [x,y,z] to be the sites of production for the Big Book.

If nothing else, this method can pin down n, the number of Print Shops responsible for the Big Book.
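
A minimal sketch of the cataloging step in Python, with an invented toy catalog; the real input would be one row per Mark observed on a sheet:

    import pandas as pd

    # Hypothetical catalog of Mark sightings.
    catalog = pd.DataFrame({
        "sheet": ["A1", "A1", "B2", "B2", "C3"],
        "mark":  ["bent_serif", "cracked_rule", "bent_serif", "gouge", "gouge"],
    })

    # Sheet-by-Mark incidence matrix: sheets with similar rows have similar
    # patterns of Marks, which is what the sorting step operates on.
    incidence = pd.crosstab(catalog["sheet"], catalog["mark"])
    print(incidence)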

The Bayesian twist on the obvious solution is to add some testing to the associations above. Specifically,

  • find some books strongly associated with Print Shops [x,y,z], in order to

  • estimate the probability of each pattern of Marks for each Print Shop, then

  • revise the initial associations between Print Shops [x,y,z] and the Big Book proportionally (a worked sketch of this update follows the list).
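
The proportional revision in the last step is just Bayes' rule. A worked sketch with invented numbers, for a single sheet and three candidate shops:

    import numpy as np

    # Hypothetical likelihoods, estimated from securely attributed books:
    # P(observed pattern of Marks | shop), for shops x, y, z.
    likelihood = np.array([0.08, 0.02, 0.01])

    # Initial association (prior) over shops, uniform for illustration.
    prior = np.array([1 / 3, 1 / 3, 1 / 3])

    # Bayes' rule: posterior is proportional to likelihood times prior.
    posterior = likelihood * prior
    posterior /= posterior.sum()
    print(posterior.round(3))  # [0.727 0.182 0.091]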

I'm far from an expert in Bayesian methods, but it already seems that there's something missing here. Is there some stage where I should take a control sample? Also, how can I find a logical basis for the initial association step when there are many potential Print Shops? Lastly, how can I account for the decay of Tools, and thus the increase in Marks, over time?
