In the 2012 paper "How To Grade a Test Without Knowing the Answers — A Bayesian Graphical Model for Adaptive Crowdsourcing and Aptitude Testing" (PDF), Bachrach et al. describe a mathematical model for starting with

  • a set of questions without answers
  • a set of answerers (participants) of unknown quality

and ending up with assessments of:

  • correct answers for the questions
  • assessed difficulty for the questions
  • evaluations of the participants, including overall reliability and areas of expertise

Can the model actually do this? Is it the best model for doing this? What kind of problems does it handle well and poorly?


Answer by DanielFilan

One example of the ability of the model: in the paper, the model is run on 120 responses to a quiz consisting of 60 Raven's Progressive Matrices questions, each question with 8 possible answers. As it happens, no responder got more than 50 questions right. The model correctly inferred the answers to 46 of the questions.

A key assumption in the model is that errors are random: so, in domains where you're only asking a small number of questions, and for most questions a priori you have reason to expect some wrong answers to be more common than the right one (e.g. "What's the capital of Canada/Australia/New Zealand"), I think this model would not work (although if there were enough other questions such that good estimates of responder ability could be made, that could ameliorate the problem). If I wanted to learn more, I would read this 2016 review paper of the general field.

That being said, as long as you know which answers will be more common among people who don't know the right answer, and roughly how much more common they will be, you can probably add that knowledge to the algorithm without too much difficulty and have it still work as long as there are some questions where you don't expect the uninformed to reliably lean one way.
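To make that concrete, here's a toy computation. Everything in it is made up for illustration: the question, the response counts, the distractor distribution, and the simplifying assumption that every responder has the same known probability of knowing the answer (in the real model that's inferred per participant). It brute-forces the likelihood of each candidate answer, once under the paper's uniform-error model and once under an "informed" error model:

```python
import numpy as np

# Hypothetical question: "What's the capital of Australia?" with 4 options.
options = ["Sydney", "Canberra", "Melbourne", "Brisbane"]
counts = np.array([70, 25, 3, 2])   # made-up response counts; Sydney is a popular distractor
p_know = 0.3                        # assumed shared chance a responder knows the answer

def log_likelihood(correct, guess_dist):
    # P(response r) = p_know * [r == correct] + (1 - p_know) * guess_dist[r]
    probs = (1 - p_know) * guess_dist
    probs[correct] += p_know
    return (counts * np.log(probs)).sum()

uniform = np.full(4, 0.25)                       # the paper's error model
informed = np.array([0.80, 0.15, 0.03, 0.02])    # made-up guessing distribution

for name, g in [("uniform errors", uniform), ("informed errors", informed)]:
    best = max(range(4), key=lambda c: log_likelihood(c, g))
    print(name, "->", options[best])
# uniform errors -> Sydney    (the popular distractor wins)
# informed errors -> Canberra (25 Canberra answers exceed what guessing alone predicts,
#                              while 70 Sydney answers fall short of what y=Sydney implies)
```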

1 comment:
What kind of problems does it handle well and poorly?

Presumably ones that meet the assumptions it's built around, which include:

We model a situation in which a set P of participants is available to answer a set Q of multiple choice questions. We assume that for each question q ∈ Q there are R_q possible answers, only one of which, y_q ∈ R_q, is correct. We model the process by which participants p ∈ P produce responses r_pq ∈ R_q to questions q ∈ Q.

*(They also make assumptions about how participant ability and question difficulty determine the probability of a given participant getting it right. Gaussian noise is involved.)*

The response r_pq is modeled as a mixture of two distributions. If participant p knows the correct answer to question q, c_pq = T, we constrain the response r_pq to match the correct answer, r_pq = y_q; otherwise we assume that r_pq is sampled uniformly at random from the available answers, r_pq ∼ DiscreteUniform(R_q)
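My reading of that generative story, as a runnable sketch. The mixture itself is straight from the quote; treating "knows the answer" as a noisy threshold on the ability/difficulty gap a_p − d_q, with Gaussian noise of precision τ_q, is my rendering of the paper's description:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_response(a_p, d_q, tau_q, y_q, R_q):
    """One response r_pq under the mixture model quoted above."""
    # Noisy "advantage" of participant over question: Gaussian noise with
    # precision tau_q around the ability/difficulty gap.
    t_pq = rng.normal(a_p - d_q, 1.0 / np.sqrt(tau_q))
    c_pq = t_pq > 0                  # does p know the answer to q?
    if c_pq:
        return y_q                   # constrained to match the correct answer
    return int(rng.integers(R_q))    # DiscreteUniform over the R_q answers
```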


In order to do inference on the model, we need to define prior distributions for the variables of interest. We assume factorizing Gaussian priors for the abilities a_p ∼ Normal(µ_p, σ_p²) and difficulties d_q ∼ Normal(µ_q, σ_q²). We choose a Gaussian prior as it lets us specify a range of plausible values based on two parameters (mean and variance) per variable, and admits a relatively simple approximate inference. The factorization assumption reflects the belief that a priori knowing the difficulty of one question would not be informative about the difficulty of another question, and similarly for the abilities of participants. We also assume factorizing discrete uniform priors for the true answers y_q ∼ DiscreteUniform(R_q) and for the responses r_pq ∼ DiscreteUniform(R_q) for participant-question pairs. Finally, we define factorizing Gamma priors for the precision parameters τ_q ∼ Gamma(k, θ). The Gamma prior is conveniently parameterized by a shape parameter k and a scale parameter θ, and is the conjugate prior for the precision parameter τ := σ⁻² of the normal distribution if the mean µ is known. This choice simplifies inference by approximate message passing because the posterior also takes the functional form of the Gamma distribution.

The language here seems in conflict with the section wrapped in *s. (The actual method is one of factor graphs and probabilistic inference. They also made an assumption that's analogous to one explicitly not made in voting: that participants are choosing answers honestly.)
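As code, those factorizing priors are just independent draws per participant and per question. A sketch: the hyperparameter values are placeholders rather than the paper's, and I share one hyperparameter across variables where the paper allows one per variable:

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q, R = 120, 60, 8                 # participants, questions, answers per question

mu_a, sigma_a = 0.0, 1.0             # placeholder mean/std for abilities
mu_d, sigma_d = 0.0, 1.0             # placeholder mean/std for difficulties
k, theta = 2.0, 0.5                  # placeholder Gamma shape/scale

a = rng.normal(mu_a, sigma_a, size=P)    # a_p ~ Normal(mu_p, sigma_p^2)
d = rng.normal(mu_d, sigma_d, size=Q)    # d_q ~ Normal(mu_q, sigma_q^2)
tau = rng.gamma(k, theta, size=Q)        # tau_q ~ Gamma(k, theta)
y = rng.integers(R, size=Q)              # y_q ~ DiscreteUniform(R_q)
```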

As shown in Section 4, providing some ground-truth question-answer pairs (a "gold set") can improve the accuracy of the inferred answers

The method can use additional information. (I wonder what happens if participants do that - or coordinate in advance with some set of falsehoods...)
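Mechanically, a gold set just means clamping those y_q to their known values instead of inferring them, so those questions serve only to pin down participant abilities. A sketch (the indices and answers are made up):

```python
import numpy as np

Q, R = 60, 8
posterior_y = np.full((Q, R), 1.0 / R)   # current beliefs over each question's answer

gold = {3: 5, 17: 0}                     # question index -> known correct answer (made up)
for q, ans in gold.items():
    posterior_y[q] = np.eye(R)[ans]      # point mass on the known answer
```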

Inference in the model is done using approximate message passing

This is how they estimate the unobserved variables (ability, difficulty, and the correct answer).

We used Infer.NET (Minka et al., 2010), a package for probabilistic inference. Specifically, we used the expectation-propagation (EP) algorithm presented in (Minka, 2001). EP allows us to calculate marginal distributions of interest on a given factor graph by iteratively calculating messages along edges that propagate information across the factor graph. In our case, EP provides only an approximation to the exact solution because a) the underlying factor graph is loopy, and b) the messages at the junction between c_pq, t_pq, and τ_q are approximations, and so are the messages going in and out of the gate connected to c_pq. Thus EP is run iteratively until convergence, so its running time is linear in the input size (variables and observations).
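The EP details are the involved part, but the "iterate until convergence" flavor can be imitated with a much cruder alternating scheme. This is a stand-in I wrote, not the paper's algorithm: estimate each participant's reliability against the currently believed answers, then re-vote on the answers weighting by reliability, and repeat:

```python
import numpy as np

def infer_answers(responses, R, iters=50):
    """Crude alternating scheme (not the paper's EP).

    responses: (P, Q) int array, responses[p, q] in range(R).
    Returns (believed answers, participant weights).
    """
    P, Q = responses.shape
    weight = np.ones(P)                          # start by trusting everyone equally
    y_hat = np.zeros(Q, dtype=int)
    for _ in range(iters):
        # Answer step: weighted vote on each question.
        belief = np.zeros((Q, R))
        for p in range(P):
            belief[np.arange(Q), responses[p]] += weight[p]
        y_hat = belief.argmax(axis=1)
        # Reliability step: agreement rate with the current believed answers.
        new_weight = (responses == y_hat).mean(axis=1)
        if np.allclose(new_weight, weight):
            break                                # converged
        weight = new_weight
    return y_hat, weight
```

Starting from equal weights, the first pass is just majority vote; later passes let participants who agree with the consensus dominate it.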

It's possible that a closed-form solution or better approximations exist.

Can the model actually do this? Is it the best model for doing this? What kind of problems does it handle well and poorly?

Part of the point of doing things the way outlined in the paper is for active learning:

Having a joint probabilistic model of the data has a number of advantages, including the ability to query different distributions of interest and to handle missing data in a principled way. In addition, maintaining information about the uncertainty present in the model allows us to reason about the impact of future observations on the model uncertainty. This idea forms the basis of active learning, a variant of which is known in the psychometrics literature as adaptive testing.
Often there is a considerable cost associated with obtaining additional data points, so one can use the model of the data to determine which measurements to take next so as to improve the inferred knowledge according to a pre-determined criterion. In the absence of problem specific information, a reasonable goal is reducing uncertainty in estimates of model parameters as measured by the entropy of the posterior distributions, an idea put forward in (MacKay, 1992).
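A simplified version of that idea as code. This is my sketch, not the paper's procedure: the paper scores candidate observations by the expected reduction in posterior entropy, whereas this just asks the question whose answer is currently most uncertain, a cheaper greedy stand-in:

```python
import numpy as np

def pick_next_question(posterior_y, asked):
    """posterior_y: (Q, R) current beliefs over answers; asked: indices already used."""
    eps = 1e-12                                   # avoid log(0)
    entropy = -(posterior_y * np.log(posterior_y + eps)).sum(axis=1)
    entropy[list(asked)] = -np.inf                # never re-ask a question
    return int(entropy.argmax())
```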

It's unclear what the best way is; the paper just argues that dynamic (adaptive) testing is better than static. They start discussing results in and after the "Empirical Analysis" section. The end notes some limitations:

Our approach is subject to several limitations. Our evaluation used an IQ dataset, whereas crowdsourcing tasks may exhibit different properties, such as a greater homogeneity in task difficulty levels. Also, we assume that participants answer to the best of their ability, but participants may be selfish agents with varying motives. For a game theoretic treatment of such issues see (DiPalantino & Vojnovic, 2009; Gao et al., 2012).
Many questions are open for future research. Are there better models for aggregating responses, or models better tailored to other domains? How can one tractably compute the optimal non-adaptive test for a given population? Can we use similar models to infer the ability levels of individuals when only their performance within the context of a group is known?