Comment author:[deleted]
11 August 2015 02:31:02PM
3 points

Suppose a research team works on the methods comparability issue. (Specifically: 'is it possible to compare the percent of fungal colonization [the length of roots occupied by fungus divided by the total length of the root system] obtained by using Technique A [staining with fuchsin] with the figure for Technique B […Trypan blue]?' For further specifics, see Gange et al., 'A comparison of visualisation techniques for recording arbuscular mycorrhizal colonisation', although I promise it is not required for this SQ.)

The problem with both Techniques A and B is that different researchers' estimates may vary by ±10% (for the same slide). The team, aware of that, assigned just one experienced member to the recording part and, I think, doubled the sample (the number of viewed fields of vision per slide). They showed that the figures for A and B can differ by about 20%.

If I have to guesstimate how my own results for a different set of roots of the same species of plant can be compared with somebody else's, and I use A and he uses B, do I have to allow for a possible 40% gap of total variation? Because that would make it a bloody useless comparison...

Suppose I do a similar methodology study myself, and I can ask an (even less experienced) partner to score all those fields of vision that I view myself. Would our composite estimates for A and B be more, well, robust than if I did it alone? It seems so, but then again, maybe it is better if she scored different fields of vision on the same slide. I am confused.

Comment author:[deleted]
11 August 2015 04:56:42PM
1 point

There is a variable X; x belongs to [0, 100]. There are n ways of measuring it, among which A and B are widely used. For any given measurer, the difference between x(A) and x(B) can be up to 20 points. Between any two measurers, x(A)1 and x(A)2 can differ by 10 points on average, likewise with B.

Measurer 1 wants to know if it is meaningful to compare her results, x(A)1, with Measurer 2's results, x(B)2. Does the interval in which the true x lies include 40 points?

If Measurer 1 herself establishes the difference between x(C)1 and x(D)1, where C and D are two other ways to measure x, how much more useful for any given Measurer 3 will be her results, if she also invites Measurer 2's opinion - that is, x(C)2 and x(D)2?

Comment author:gwern
11 August 2015 07:37:37PM
3 points

Factor analysis/measurement error/multilevel models/Value of Information: X is a latent variable, which yields more latent variables (one for each kind of method), which are themselves measured with error by the raw datapoints. So you have multiple kinds of measurements, each with their own error, giving you a multilevel model. You can write a multilevel model expressing this in a Bayesian language like JAGS or you could use a SEM library like lavaan, where it'd be something like 'x ~ A, x ~ B, A ~ a-datapoints, B ~ b-datapoints'...

(To give an analogy: imagine you are measuring Gf. Gf is a latent variable which is predicted from things like WM or executive function; WM and executive function are themselves latent variables, which are measured by tests like forwards digit span. The graph would look like a little pyramid. So you measure someone's intelligence by doing forwards digit span several times, giving you a reasonably precise estimate of the latent variable WM, which then gives you an imprecise estimate of the highest latent variable Gf.)
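
A minimal Python sketch of that pyramid, for illustration only (it is not the R code mentioned in this thread). It assumes a hypothetical true colonization `true_x`, a made-up additive offset per staining method in `biases`, and normally distributed scoring noise; every name and number here is an assumption.

```python
import random
import statistics

def simulate_slide(true_x, method_bias, noise_sd, n_fields, rng):
    """Return n_fields noisy scores of one slide under one staining method."""
    method_level = true_x + method_bias          # method-level latent value
    return [rng.gauss(method_level, noise_sd) for _ in range(n_fields)]

rng = random.Random(0)
true_x = 50.0                                    # top-level latent value, in percent (hypothetical)
biases = {"A": +10.0, "B": -10.0}                # hypothetical method offsets
scores = {m: simulate_slide(true_x, b, 5.0, 200, rng) for m, b in biases.items()}

# Many datapoints give a tight estimate of each method-level latent value...
method_means = {m: statistics.fmean(s) for m, s in scores.items()}
# ...but the two methods still disagree by roughly the difference of their offsets.
gap = method_means["A"] - method_means["B"]
print(method_means, gap)
```

With many fields per slide each method-level value is pinned down tightly, yet the gap between methods remains, so the top-level x stays uncertain unless the method offsets are modelled as well.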

Measurer 1 wants to know if it is meaningful to compare her results, x(A)1, with Measurer 2's results, x(B)2. Does the interval in which the true x lies include 40 points?

Comparing measurer 1 and measurer 2's results is not really the same thing as simply asking for the posterior distribution of the latent x, but yes, with the posterior, it's easy to calculate the probability of anything you like such as '>=40' or '<=90'.
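
As a rough Python illustration of reading probabilities off a posterior, here is a conjugate normal update. It assumes a normal prior, a known scoring standard deviation, and it ignores the [0, 100] bounds on x; the function name `posterior` and all numbers are hypothetical.

```python
from statistics import NormalDist

def posterior(prior_mean, prior_sd, obs, obs_sd):
    """Conjugate normal update for a mean with known observation sd."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = len(obs) / obs_sd**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + sum(obs) / obs_sd**2)
    return NormalDist(post_mean, post_var**0.5)

# Hypothetical vague prior and three hypothetical scores of the same root system.
post = posterior(prior_mean=50.0, prior_sd=30.0, obs=[62.0, 58.0, 65.0], obs_sd=10.0)

p_at_least_40 = 1.0 - post.cdf(40.0)   # P(x >= 40 | data)
p_at_most_90 = post.cdf(90.0)          # P(x <= 90 | data)
print(post.mean, post.stdev, p_at_least_40, p_at_most_90)
```

A full multilevel model would give a posterior by sampling rather than in closed form, but the last step is the same: count how much posterior mass lies in the region of interest.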

If Measurer 1 herself establishes the difference between x(C)1 and x(D)1, where C and D are two other ways to measure x, how much more useful for any given Measurer 3 will be her results, if she also invites Measurer 2's opinion - that is, x(C)2 and x(D)2?

I only know a Bayesian approach here: it sounds like Expected Value of Sample Information. You need a loss function on error (mean squared, perhaps?) and then you can repeatedly sample from the posterior based on all of Measurer 3's data as a hypothetical, and then look at how much loss is reduced based on an additional sample (or more) from Measurer 2.
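
A toy Python sketch of that preposterior simulation, under strong simplifying assumptions: the current belief about x is normal, one extra score has known standard deviation `obs_sd`, and the loss is squared error of the posterior mean. The names `update` and `evsi_squared_error` are made up for this example.

```python
import random
from statistics import NormalDist, fmean

def update(belief, y, obs_sd):
    """Conjugate normal update of `belief` after one observation y."""
    prec = 1.0 / belief.stdev**2 + 1.0 / obs_sd**2
    mean = (belief.mean / belief.stdev**2 + y / obs_sd**2) / prec
    return NormalDist(mean, prec**-0.5)

def evsi_squared_error(belief, obs_sd, n_sims=20000, seed=0):
    """Monte Carlo estimate of the expected loss reduction from one more observation."""
    rng = random.Random(seed)
    loss_now, loss_after = [], []
    for _ in range(n_sims):
        x = rng.gauss(belief.mean, belief.stdev)   # hypothetical true value
        y = rng.gauss(x, obs_sd)                   # hypothetical extra score
        loss_now.append((x - belief.mean) ** 2)
        loss_after.append((x - update(belief, y, obs_sd).mean) ** 2)
    return fmean(loss_now) - fmean(loss_after)

current = NormalDist(60.0, 6.0)                    # hypothetical posterior from existing data
evsi = evsi_squared_error(current, obs_sd=10.0)
# In this conjugate case the answer is also available exactly as a drop in variance.
exact = current.variance - update(current, 0.0, 10.0).variance
print(evsi, exact)
```

In the normal case the simulation is unnecessary because the variance reduction is known in closed form; the loop is shown because the same recipe carries over to models where no closed form exists.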

Comment author:[deleted]
11 August 2015 08:13:40PM
1 point

Thank you! (In part, for such faith in my abilities:) Have to go hunt myself a programmer for dinner...)

It seems that if M-r 1 gives M-r 2 the same subsample of the middle latent variables (photos of fields of vision, scoring them gives you the datapoints), and the x1 is compared with x2, they can see the least difference between them, which is (largely?) sample-independent. If, however, M-r 1 and M-r 2 each draw their subsamples independently, the difference between x1 and x2 should be larger due to chance, right?.. So if we look at the difference in differences between x1 and x2, and it is greater for some middle latent variables (ways of staining) than for others, can we use it as a measure of 'the overall variability of the measuring method'? Say, if we have ten measurers and four measuring methods...

(I'm asking you this because it is relatively simple to do in practice, not because I think this would be the most efficient way.)

Comment author:gwern
12 August 2015 12:57:15AM
3 points

You can estimate the bias of each measurer much more efficiently if you have them measure the same sample, yes, analogous to crossover: now the differences are due less to the wide diversity of the sampled population and more to the particular measurer.

(To put it a little more mathily, when each measurer measures different samples, then the measurements will be spread very widely because it's Var(measurer-bias) + Var(population); but if we have the measurers measure the same sample, then Var(population) drops out and now there's just Var(measurer-bias). If I measure a sample and get 2.9 and you measure it as well and get 3.1, then probably the sample is really ~3.0 and my bias is -0.1 and your bias is +0.1. If I measure one sample and get 2.9 and you measure a different sample and get 3.1, then my bias and your bias are... ???)
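
The variance argument can be checked with a small Python simulation. This is only a sketch: the biases of -0.1 and +0.1 come from the example above, while the population spread, the scoring noise and all names are assumptions.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(1)
POP_SD, NOISE_SD = 2.0, 0.2            # hypothetical population spread and scoring noise
BIAS = {"me": -0.1, "you": +0.1}       # per-measurer bias, as in the 2.9 vs 3.1 example

def score(sample_value, who):
    """One measurer's reading of a sample: truth + personal bias + noise."""
    return sample_value + BIAS[who] + rng.gauss(0.0, NOISE_SD)

same, different = [], []
for _ in range(5000):
    s1, s2 = rng.gauss(3.0, POP_SD), rng.gauss(3.0, POP_SD)
    same.append(score(s1, "you") - score(s1, "me"))        # both read the same sample
    different.append(score(s2, "you") - score(s1, "me"))   # each reads their own sample

bias_gap_estimate = fmean(same)        # estimates BIAS["you"] - BIAS["me"]
print(bias_gap_estimate, pvariance(same), pvariance(different))
```

When the sample is shared, its true value cancels in the difference, so the spread of `same` reflects only scoring noise; in `different` the population variance enters twice and swamps the bias.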

For example, the classic example for MLMs is you have n classrooms' test scores, and you want to figure out the teachers' effects. It's hard to tell because the classrooms' average scores will differ a lot on their own. This is analogous to your original description: each measurer gets their own batch of samples. But what if you had a crossed design of one classroom with test scores after it's taught by each teacher? Then much of the differences in the average score will be due to the particular effect of each teacher and that will be much easier to estimate.

So if we look at the difference in differences between x1 and x2, and it is greater for some middle latent variables (ways of staining) than for others, can we use it as a measure of 'the overall variability of the measuring method'? Say, if we have ten measurers and four measuring methods...

I guess. From a factor analysis perspective, you just want to pick the one with the highest loading on X, I think.

Comment author:[deleted]
17 August 2015 12:12:21PM
1 point

Huh. Your answer was even more useful for me than I expected. My 'secret agenda' is to put forth another mountant medium, which might have advantages over the one in use, but I will have to show that they do not differ in preparation quality. I think I am going to do a 2-by-2 crossover.

Comment author:[deleted]
12 August 2015 03:49:21AM
1 point

The problem is that whatever one I will find the most desirable, other people will continue using the methods they are good at. And I will have to somehow compare x(A)1, x(B)32 and x(C)3...

And this is a relatively straightforward situation; things are often much less clear in environmental science, already at the methodology level.

Comment author:gwern
12 August 2015 03:34:48PM
2 points

The problem is that whatever one I will find the most desirable, other people will continue using the methods they are good at. And I will have to somehow compare x(A)1, x(B)32 and x(C)3...

I don't really understand the problem. Yes, maybe you can't control them and get everyone onto the same method page. But I've already explained how you deal with that, given you the relevant keywords to search for like 'measurement error', and also given you example R code implementing several approaches.

They all take the basic approach of treating it as data/measurements which load on a latent variable for each method, and each method loads on the latent variable which is what you actually want; then you can infer whatever you need to. The first level of latent variables helps you estimate the biases of each category, some of which may be smaller than others, and then you collectively use them to estimate the final latent variable. Now you have a principled way to unify all your data from disparate methods which measure in similar but not identical ways the variable you care about. If someone else comes up with a new method, it can be incorporated like the rest.
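
A much cruder stand-in for that latent-variable model, sketched in Python: estimate each method's offset from slides scored under every method, then shift readings onto a common scale before pooling. The calibration data, the method labels and the helper `to_common_scale` are all hypothetical, measurer effects are ignored, and only biases relative to the cross-method average can be recovered this way.

```python
from statistics import fmean

# Hypothetical calibration data: the same three slides scored under each method.
calibration = {
    "A": [62.0, 41.0, 75.0],
    "B": [44.0, 20.0, 53.0],
    "C": [52.0, 32.0, 64.0],
}
n_slides = 3

# Per-slide average across methods serves as the common reference.
slide_means = [fmean(calibration[m][i] for m in calibration) for i in range(n_slides)]
# Offset of each method relative to that reference, averaged over slides.
offsets = {
    m: fmean(v[i] - slide_means[i] for i in range(n_slides))
    for m, v in calibration.items()
}

def to_common_scale(value, method):
    """Remove the estimated method offset from a single reading."""
    return value - offsets[method]

# Hypothetical new readings from three labs using three different methods.
readings = [(58.0, "A"), (39.0, "B"), (50.0, "C")]
pooled = fmean(to_common_scale(v, m) for v, m in readings)
print(offsets, pooled)
```

A new method would be added by scoring the calibration slides with it and computing one more offset; the multilevel or SEM versions do the same job while also carrying the uncertainty of each offset through to the final estimate.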

Comment author:[deleted]
12 August 2015 04:22:01PM
0 points

Right - sorry, melting brain. (Also, I had just thought that the assumed 10% difference between two measurers has not, in fact, been established rigorously, and that derailed the still-solid brain...)

## Comments (130)

For a stupid questions thread the language sounds remarkably domain-specific. Consider rephrasing in ELI5.

(Is this ok?)

(You could always go ask the Statistics Stack Overflow.)

So - thank you! Analogies for the win!

Here are some examples in R of the mentioned approaches including EVSI. Syntax-highlighted on Pastebin: http://pastebin.com/8090dgvB

For another example of VoI, see my draft essay http://gwern.net/Mail%20delivery