Supervised Learning from Multiple Experts: Whom to trust when everyone lies a bit

Vikas C. Raykar1 vikas.raykar@siemens.com Shipeng Yu1 shipeng.yu@siemens.com Linda H. Zhao2 lzhao@wharton.upenn.edu Anna Jerebko1 anna.jerebko@siemens.com Charles Florin1 charles.florin@siemens.com Gerardo Hermosillo Valadez1 gerardo.hermosillovaladez@siemens.com Luca Bogoni1 luca.bogoni@siemens.com Linda Moy3 Linda.Moy@nyumc.org 1 CAD and Knowledge Solutions (IKM CKS), Siemens Healthcare, Malvern, PA 19355 USA 2 Department of Statistics, University of Pennsylvania, Philadelphia, PA 19104 USA 3 Department of Radiology, New York University School of Medicine, New York, NY 10016 USA

Abstract
We describe a probabilistic approach for supervised learning when we have multiple experts/annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline.

1. Introduction
A typical two-class supervised classification scenario consists of a training set D = {(xi , yi )}N containing i=1 N instances, where xi  Rd is an instance (the ddimensional feature vector) and yi  Y = {0, 1} is the corresponding known class label. The task is to learn a classification function f : Rd  Y which generalizes well on unseen data. However, for many tasks, it may not be possible, or may be too expensive to acquire the actual label yi (gold standard ) for training. Instead, we may have 1 R multiple (possibly noisy) labels yi , . . . , yi provided by R different experts or annotators. In practice, there might be a substantial amount of disagreement among the experts, and hence it is of great practical interest to determine the optimal way to learn a classifier.
Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Our motivation for this work comes from the area of computer-aided diagnosis (CAD) (see § 7.1), where the task is to build a classifier to predict whether a suspicious region on a medical image is malignant or benign. In order to train such a classifier, a set of images is collected from hospitals. The actual gold standard (whether it is cancer or not) can be obtained from biopsies, but since it is an expensive and an invasive process, often CAD systems are built from labels assigned by multiple radiologists who identify the locations of malignant lesions. Each radiologist visually examines the medical images and provides a subjective (possibly noisy) version of the gold standard. The domain of text classification offers another scenario. In this context the task is to predict the category for a token of text. The labels for training are assigned by human annotators who read the text and attribute their subjective category. With the advent of services like Amazon's Mechanical Turk, it is quite inexpensive to acquire labels from a large number of annotators in a short time (Sheng et al., 2008; Snow et al., 2008; Sorokin & Forsyth, 2008). In situations like these, the performance of different annotators can vary widely, and without the actual gold standard, it may not be possible to evaluate the annotators. In this work, we provide some principled probabilistic solutions to the following question: "How do we learn and evaluate classifiers when we have multiple annotators providing labels but no absolute gold standard?" A closely related problem­particularly relevant when there are a large number of annotators­ is to estimate how reliable/trustworthy is each annotator.

Supervised Learning from Multiple Experts

1.1. Majority Voting For binary classification problems, a common strategy is to use the majority label, i.e., yi = ^ 1 0 if (1/R) otherwise
R j=1 j yi  0.5

2. A two-coin model for annotators
Let y j  {0, 1} be the label assigned to the instance x by the j th annotator/expert. Let y be the actual (unobserved) label for this instance. Each annotator provides a version of this hidden true label based on two biased coins. If the true label is one, she flips a coin with bias j (sensitivity). If the true label is zero, she flips a coin with bias  j (specificity). In each case, if she gets heads she keeps the original label, otherwise she flips the label. If the true label is one, the sensitivity (true positive rate) for the j th annotator is defined as the probability that she labels it as one. j := Pr[y j = 1|y = 1].
j yi . j=1

,

(1)

as an estimate of the hidden true label and use this estimate to learn and evaluate classifiers/annotators. Another strategy is that of considering every pair (instance, label) provided by each expert as a separate example. Note that this amounts to using a soft probabilistic estimate of the actual ground truth to learn the classifier, i.e.,
R

(3)

Pr[yi =

1 R 1|yi , . . . , yi ]

= (1/R)

(2)

Majority voting assumes all experts are equally good. However, for example, if there is only one true expert and the majority are novices, and if novices give the same incorrect label to a specific instance, then the majority voting method would favor the novices since they are in a majority. One could address this problem by introducing a weight capturing how good each expert is. But how would one measure the performance of an expert when there is no gold standard available? 1.2. Proposed approach To address the apparent chicken-and-egg problem, we present a maximum-likelihood estimator (§ 4) that jointly learns the classifier, the annotator accuracy, and the actual true label. The performance of each annotator is measured in terms of the sensitivity and specificity with respect to the gold standard (§ 2). The proposed algorithm automatically discovers out the best experts and assigns a higher weight to them. In order to incorporate prior knowledge about each annotator, we impose a beta prior on the sensitivity and specificity and derive the maximum-a-posteriori estimate (§ 5). The final estimate is an EM-algorithm that iteratively establishes a particular gold standard, measures the performance of the experts given that gold standard, and refines the gold standard based on the performance measures. While the proposed approach is described using logistic regression as the base classifier (§ 3), it is quite general, and can be used with any black-box classifier (§ 6), and can also handle missing labels (i.e., each expert is not required to label all the instances). Furthermore, it can be extended to handle categorical, ordinal, and regression problems (§ 8). We extensively validate our approach using both simulated data and real data (§ 7).

On the other hand, if the true label is zero, the specificity (1-false positive rate) is defined as the probability that she labels it as zero.  j := Pr[y j = 0|y = 0]. (4)

The assumption introduced is that j and  j do not depend on the instance x. For example, in the CAD domain, this means that the radiologist's performance is consistent across different sub-groups of data. 1

3. Classification model
While the proposed method can be used for any classifier, for ease of exposition, we consider the family of linear discriminating functions: F = {fw }, where for any x, w  Rd , fw (x) = w x. The final classifier can be written in the following form: y = 1 if w x   ^ and 0 otherwise. The threshold parameter  determines the operating point of the classifier. The ROC curve is obtained as  is swept from - to . The posterior probability for the positive class is modeled as a logistic sigmoid acting on fw , i.e., Pr[y = 1|x, w] = (w x), (5)

where the logistic sigmoid function is defined as (z) = 1/(1 + e-z ). This classification model is known as logistic regression. Given the training data D consisting of N instances with annotations from R experts, i.e., D = 1 R {xi , yi , . . . , yi }N , the task is to estimate the weight i=1 vector w and also the sensitivity  = [1 , . . . , R ] and the specificity  = [ 1 , . . . ,  R ].
1 While this is a reasonable assumption, it is not entirely true. It is known that some radiologists are good at detecting certain kinds of malignant lesions based on their training and experience.

Supervised Learning from Multiple Experts

4. Maximum likelihood estimator
Assuming the instances are independently sampled, the likelihood of the parameters  = {w, , } given the observations D can be factored as
N

efficient iterative procedure to compute the maximumlikelihood solution in presence of missing/hidden data. We will use the unknown hidden true label yi as the missing data. Let y = [y1 , . . . , yN ], be the complete data log-likelihood can be written as
N

Pr[D|] =
i=1

1 R Pr[yi , . . . , yi |xi , ].

(6)

ln Pr[D, y|] =
i=1

yi ln pi ai +(1-yi ) ln(1-pi )bi . (10)

Conditioning on the true label yi , and also using the assumption that j and  j do not depend on the instance xi , the likelihood can be written as

Each iteration of the EM algorithm consists of two steps: an Expectation(E)-step and a Maximization(M)-step. The M-step involves maximization of a lower bound on the log-likelihood that is N 1 R Pr[D|] = Pr[yi , . . . , yi |yi = 1, ] ˇ Pr[yi = 1|xi , w] refined in each iteration by the E-step. i=1 (1) E-step. Given the observation D and the current estimate of the model parameters , the conditional 1 R + Pr[yi , . . . , yi |yi = 0, ] ˇ Pr[yi = 0|xi , w] . expectation (which is a lower bound on the true like(7) lihood) is computed as
1 R Given the true label yi , we assume that yi , . . . , yi are independent, i.e., the annotators make their decisions independently. Hence, R R 1 Pr[yi , . . . , yi |yi = 1, ] = j=1 R j Pr[yi |yi = 1, j ] N

E {ln Pr[D, y|]} =
i=1

ľi ln pi ai + (1 - ľi ) ln(1 - pi )bi ,

(11) where the expectation is with respect to Pr[y|D, ], 1 R and ľi = Pr[yi = 1|yi , . . . , yi , xi , ]. Using Bayes theorem we can compute ľi
1 R  Pr[yi , . . . , yi |yi = 1, ] ˇ Pr[yi = 1|xi , ] ai pi = . (12) ai pi + bi (1 - pi )

=
j=1

[j ]yi [1 - j ]1-yi .

j

j

Similarly, we have
R 1 R Pr[yi , . . . , yi |yi = 0, ] = j=1

[ j ]

j 1-yi

[1 -  j ] .

j yi

Hence the likelihood can be written as
N

(2) M-step. Based on the current estimate ľi and the observations D, the model parameters  are then estimated by maximizing the conditional expectation. By equating the gradient of (11) to zero we obtain the following estimates for the sensitivity and specificity: (8) j =
N j i=1 ľi yi N i=1 ľi

Pr[D|] =
i=1

ai pi + bi (1 - pi ) ,

,

j =

N i=1 (1 - ľi )(1 - N i=1 (1 - ľi )

j yi )

.

where we define pi = Pr[yi = 1|xi , w] = (w xi ), and
R

ai =
j=1

[j ]yi [1 - j ]1-yi ,

j

j

R

bi =
j=1

[ j ]1-yi [1 -  j ]yi .

j

j

The maximum-likelihood estimator is found by maximizing the log-likelihood, i.e., ^ ^ ^ ^ ML = {, , w} = arg max{ln Pr[D|]}.


(9)

4.1. The EM algorithm This maximization problem can be simplified a lot if we use the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm is an

Due to the non-linearity of the sigmoid, we do not have a closed form solution for w and we have to use gradient ascent based optimization methods. We use the Newton-Raphson update given by wt+1 = wt - H -1 g, where g is the gradient vector, H is the Hessian matrix, and  is the step length. The gradient vector is given by g(w) = N i=1 ľi - (w xi ) xi . The Hessian matrix is given N by H(w) = - i=1 (w xi ) 1 - (w xi ) xi xi . Essentially, we are solving a regular logistic regression problem with probabilistic labels ľi . These two steps (the E- and the M-step) can be iterated till convergence. We use majority voting ľi = R j 1/R j=1 yi as the initialization for ľi to start the EM-algorithm.

Supervised Learning from Multiple Experts

5. A Bayesian approach
In some applications we may want to trust a particular expert more than the others. This can be done by imposing priors on the sensitivity and specificity of the experts. Since j and j represent the probability of a binary event, a natural choice of prior is the beta prior. For any a > 0, b > 0, and   [0, 1] the beta distribution is given by Beta(|a, b) =
1

6. Discussions
6.1. Obtaining actual ground truth The value of the posterior probability ľi is a soft probabilistic estimate of the actual ground truth yi , i.e., 1 R ľi = Pr[yi = 1|yi , . . . , yi , xi , ]. The actual hidden label yi can be estimated by applying a threshold on ľi , i.e., yi = 1 if ľi   and zero otherwise. We can use  = 0.5 as the threshold. By varying  we can change the miss-classification costs. 6.2. Insight of the proposed framework A particularly revealing insight can be obtained in terms of the log-odds or the logit of the posterior probability ľi . From (15) the logit of ľi can be written as logit(ľi ) = log
1 R ľi Pr[yi = 1|yi , . . . , yi , xi , ] = log 1 , . . . , y R , x , ] 1 - ľi Pr[yi = 0|yi i i R j yi [logit(j ) + logit( j )]. j=1 1- is a constant term which where b = j=1 log  j does not depend on i. This indicates that the estimated ground truth (in the logit form of the posterior probability) is a weighted linear combination of the labels from all the experts. The weight of each expert is the sum of the logit of the sensitivity and specificity. R
j

 a-1 (1 - )b-1 , B(a, b)

(13)

where B(a, b) = 0  a-1 (1 - )b-1 d is the beta function. We assume a beta prior for both the sensitivity and the specificity as Pr[j |aj , aj ] = Beta(j |aj , aj ) 1 2 1 2 and Pr[j |bj , bj ] = Beta(j |bj , bj ). For sake of com1 2 1 2 pleteness we also assume a zero mean Gaussian prior on the weights w with inverse covariance matrix , i.e., Pr[w] = N (w|0, -1 ). Assuming that {j }, {j }, and w have independent priors, the maximum-a-posteriori (MAP) estimator is ^ found by maximizing the log-posterior, i.e., MAP = arg max {ln Pr[D|] + ln Pr[]}. An EM algorithm can be derived in a similar fashion for MAP estimation by relying on the interpretation of (Neal & Hinton, 1998). (1) Initialize ľi = (1/R)
R j=1 j yi by majority voting.

= w xi + b +

(2) Given ľi , estimate the sensitivity and specificity of each annotator as follows. j j = =
N j aj - 1 + i=1 ľi yi 1 . N aj + aj - 2 + i=1 ľi 1 2 N j bj - 1 + i=1 (1 - ľi )(1 - yi ) 1 . N bj + bj - 2 + i=1 (1 - ľi ) 1 2

6.3. Using any other classifier For ease of exposition we used logistic regression. However, the proposed algorithm can be used with any generalized linear model or in fact with any classifier that can be trained with soft probabilistic labels. 6.4. Obtaining ground truth with no features In some scenarios we may not have features xi and we wish to obtain an estimate of the actual ground truth based only on the labels from multiple annotators. Here instead of learning a classifier we estimate p which is the prevalence of the positive class, i.e., p = Pr[yi = 1]. We further assume a beta prior for the prevalence, i.e., Pr[p|p1 , p2 ] = Beta(p|p1 , p2 ). (1) Initialize ľi = (1/R)
R j=1 j yi by majority voting.

(14)

The Newton-Raphson update for optimizing w is given by wt+1 = wt - H -1 g, with step length , gradient vector g(w) = N i=1 ľi - (w xi ) xi - w and Hessian matrix N H(w) = - i=1 (w xi ) 1 - (w xi ) xi xi - . (3) Given the sensitivity and specificity of each annotator and the model parameters, update ľi as ľi = ai p i , ai pi + bi (1 - pi )
j R j yi j=1 [ ] [1 j

(15) - j ]1-yi ,
j

(2) Given ľi , estimate the sensitivity and specificity of each annotator using (14). The prevalence of the positive class is estimated as follows. p = p1 - 1 + i=1 ľi . p1 + p2 - 2 + N
N

where pi = (w xi ), ai = and bi =
j R j 1-yi [1 j=1 [ ]

(16)

-  j ]yi .

Iterate (2) and (3) till convergence.

(3) Given the sensitivity and specificity of each anno-

Supervised Learning from Multiple Experts

tator and prevalence, refine ľi as follows. ľi = ai p . ai p + bi (1 - p) (17)

Iterate (2) and (3) till convergence. This algorithm is similar to the one proposed in (Dawid & Skeene, 1979; Smyth et al., 1995). 6.5. Handling missing labels The proposed approach can easily handle missing labels. Let Ri be the number of radiologists labeling the ith instance, and let Nj be the number of instances labeled by the j th radiologist. Then in the EM algorithm, we just need to replace N by Nj for estimating the sensitivity and specificity in (14), and replace R by Ri for updating ľi in (15). 6.6. Evaluating a classifier We can use the probability scores ľi directly to evaluate classifiers. If zi are the labels obtained from any other classifier, then sensitivity and specificity can be estimated as =
N i=1 ľi zi , N i=1 ľi

instance is described by a set of 27 morphological features. In order to validate our proposed algorithm, we simulate the multiple radiologists according to the two-coin model described in § 2. Based on the labels from multiple radiologists, we can simultaneously (1) learn a logistic-regression classifier, (2) estimate the sensitivity and specificity of each radiologist, and (3) estimate the golden ground truth. We compare the results with the classifier trained using the biopsy proved ground truth as well as the majority-voting baseline. For the first set of experiments we use 5 radiologists with sensitivity  = [0.90 0.80 0.57 0.60 0.55] and specificity  = [0.95 0.85 0.62 0.65 0.58]. In this scenario the first two radiologists are experts and the last three are novices. The results are as follows: (1) Classifier performance Figure 1(Left) plots the ROC of the classifier on the training set. The dotted (black) line is the ROC for the classifier learnt using the actual ground truth. The solid (red) line is the ROC for the proposed algorithm and the dashed (blue) line is for the majority-voting scheme. The classifier learnt using the proposed method is as good as the one learnt using the golden ground truth. The AUC for the proposed algorithm is around 3.5% greater than that learnt using the majority-voting scheme. (2) Radiologist performance The actual sensitivity and specificity of each radiologist is marked as a black × in Figure 1(Right). The end of the solid red line shows the estimates of the sensitivity and specificity from the proposed method. We used a uniform prior on all the parameters. The ellipse plots the contour of one standard deviation as obtained from the beta posterior estimates. 2 The end of the dashed red line shows the estimate obtained from the majority- voting algorithm. We see that the proposed method is much closer to the actual values of sensitivity and specificity. (3) Actual ground truth Since the estimates of the actual ground truth are probabilistic scores, we can also plot the ROC curves of the estimated ground truth. From Figure 1(Right) we can see that the ROC curve for the proposed method dominates the majority voting ROC curve. Furthermore, the area under the ROC curve (AUC) is around 3% higher. The estimate obtained by majority voting is closer to the novices since they form a majority (3/5). The proposed algorithm appropriately weights each radiologist based on their estimated sensitivity and specificity.
At the end of each EM iteration, a good approximation to the posterior distribution can be obtained as j  Beta j |aj + 1 Beta j |bj + 1
N i=1 j ľi y i , a j + 2 N i=1 j ľi (1 - yi ) and j  N i=1 (1 j - ľi )yi . N i=1 (1 j - ľi )(1 - yi ), bj + 2 2

=

N i=1 (1 - ľi )(1 - zi ) . N i=1 (1 - ľi )

(18)

7. Experiments
We use two CAD and one text dataset in our experiments. The CAD datasets include a digital mammography data and a Breast MRI data, both of which are biopsy proven, i.e., the gold standard is available. For the digital mammography dataset we simulate the radiologists in order to validate our methods. The Breast MRI data has annotations from four radiologists. We also report results on a Recognizing Textual Entailment data collected by (Snow et al., 2008) using the Amazon's Mechanical Turk which has annotations from 164 annotators. 7.1. Digital Mammography Mammograms are used as a screening tool to detect early breast cancer. CAD systems search for abnormal areas (lesions) in a digitized mammographic image. In classification terms, given a set of descriptive morphological features for a region in a image, the task is to predict whether it is potentially malignant (1) or not (0). In order to train such a classifier, a set of mammograms is collected from hospitals. The ground truth (whether it is cancer or not) is obtained from biopsy. We use a proprietary biopsy-proven dataset containing 497 positive and 1618 negative examples. Each

Supervised Learning from Multiple Experts
ROC Curve for the classifier 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 Golden ground truth AUC=0.915 Proposed EM algorithm AUC=0.913 Majority voting baseline AUC=0.882 0.4 0.6 False Positive Rate (1-specifcity) 0.8 1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 0.4 0.6 False Positive Rate (1-specifcity) 0.8 1 Proposed EM algorithm AUC=0.991 Majority voting baseline AUC=0.962 ROC Curve for the estimated true labels

True Positive Rate (sensitivity)

Figure 1. Results for the digital mammography dataset with annotations from 5 simulated radiologists. (Left) The ROC curve of the learnt classifier using the golden ground truth (dotted black line), the majority voting scheme (dashed blue line), and the proposed EM algorithm (solid red line). (Right) The ROC curve for the estimated ground truth. The actual sensitivity and specificity of each of the radiologist is marked as a ×. The end of the dashed blue line shows the estimates of the sensitivity and specificity obtained from the majority voting algorithm. The end of the solid red line shows the estimates from the proposed method. The ellipse plots the contour of one standard deviation.
ROC Curve for the classifier 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 Proposed EM algorithm [Joint Estimation] AUC=0.905 Decoupled Estimation AUC=0.884 0.4 0.6 False Positive Rate (1-specifcity) 0.8 1 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 Proposed EM algorithm [Joint Estimation] AUC=0.972 Decoupled Estimation AUC=0.921 0.4 0.6 False Positive Rate (1-specifcity) 0.8 1 ROC Curve for the estimated true labels

True Positive Rate (sensitivity)

Figure 2. ROC curves comparing the proposed algorithm (solid red line) with 'Decoupled Estimation' procedure (dotted blue line), which refers to the algorithm where the ground truth is first estimated using just the lables from the five radiologists (§ 6.4) and then a logistic regression classifier is trained using the soft probabilistic labels. In contrast the proposed EM algorithm estimates the ground truth and learns the classifier simultaneously during the EM algorithm.
Leave-One-Out ROC Curve for the classifier 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 Golden ground truth AUC=0.909 Majority voting baseline AUC=0.828 Proposed EM algorithm AUC=0.879 1 0.9 0.8 ROC Curve for the estimated true labels

True Positive Rate (sensitivity)

True Positive Rate (sensitivity)

True Positive Rate (sensitivity)

True Positive Rate (sensitivity)

0.7 0.6 0.5 0.4 0.3 0.2 0.1 Proposed EM algorithm AUC=0.944 Majority voting baseline AUC=0.937

0.4 0.6 False Positive Rate (1-specifcity)

0.8

1

0 0

0.2

0.4 0.6 False Positive Rate (1-specifcity)

0.8

1

Figure 3. Breast MRI results. (Left) The leave-one-out cross validated ROC. (Right) ROC for the estimated ground truth.

Supervised Learning from Multiple Experts
0.95 0.9 0.85 0.8

Accuracy

0.75 0.7 0.65 0.6 0.55 0.5 20 40 60 80 100 120 Number of Annotators 140 160 Majority Voting EM Algorithm

the first. The data has 800 tasks and 164 distinct readers. The majority of the entries (94 %) in the 800x164 matrix are missing. Figure 4 plots the accuracy of the estimated ground truth as a function of the number of annotators. The proposed EM algorithm achieves a higher accuracy than majority voting.

8. Extensions
We briefly describe how the proposed approach can be extended to categorical, ordinal, and continuous data. 8.1. Categorical labels

Figure 4. The mean and the one standard deviation error bars for the accuracy of the estimated ground truth for the Recognizing Textual Entailment task as a function of the number of annotators. The plot was generated by randomly sampling R annotators 100 times.

(4) Joint Estimation To learn a classifier, Smyth (1995) proposed to first estimate the golden ground truth (§ 6.4) and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly as a part of the EM algorithm. Figure 2 shows that the classifier and the ground truth learnt obtained by the proposed algorithm is superior than that obtained by other procedures which first estimates the ground truth and then learns the classifier. 7.2. Breast MRI In this example, each radiologist reviews the breast MRI data and assesses the malignancy of each lesion on a BIRADS scale of 1 to 5. Our dataset comprises of 75 lesions with annotations from four radiologists, and the true labels from biopsy. Based on eight morphological features, we predict whether a lesion is malignant. We reduce the BIRADS scale to a binary one: any lesion with a BIRADS > 3 is considered malignant and benign otherwise. The set included 28 malignant and 47 benign lesions. Figure 3 summarizes the results. We show the leave-one-out cross validated ROC for the classifier. The cross-validated AUC of the proposed method is approximately 6% better than the majority voting baseline. 7.3. Recognizing Textual Entailment Finally we report results on Recognizing Textual Entailment data collected by (Snow et al., 2008) using the Amazon's Mechanical Turk. In this task, the annotator is presented with two sentences and given a choice of whether the second sentence can be inferred from

Suppose there are K  2 categories. An example for categorical data from the CAD domain is in LungCAD, where the radiologist needs to label whether a nodule (known to be precursors of cancer) is a solid, a part-solid, or a ground glass opacity. We can extend the previous model and introduce a multinomial paj j rameter j = (c1 , . . . , cK ) for each annotator, where c K j j j ck := Pr[y j = k|y = c] and k=1 ck = 1. Here ck denotes the probability that the annotator j assigns class k to an instance given the true class is c. When j j K = 2, 11 and 00 are sensitivity and specificity, respectively. A similar EM algorithm can be derived. 8.2. Ordinal labels In some situations, the outputs have an ordering j among the labels. Let yi  {1, . . . , K} be the label th assigned to the i instance by the j th expert. Note that there is an ordering in the labels 1 < . . . < K. A simple approach is to convert the ordinal data into a series of binary data (Frank & Hall, 2001). Specifically the K class ordinal labels are transformed into K - 1 binary class labels as follows: For c = 1, . . . , K - 1, jc j yi = 1 if yi > c and 0 otherwise. Applying the same procedure used for binary labels we can estimate Pr[yi > c] for c = 1, . . . , K - 1. The probability of the actual class values can then be obtained as Pr[yi = c] = Pr[yi > c - 1] - Pr[yi > c]. 8.3. Continuous labels As a part of the annotation process a task for a radiologist is also to measure the diameter of a nodule. This constitutes an example where the labels are real numbers. This situation can be handled as follows: Let j yi  R be the target value assigned to the ith instance by the j th annotator. The annotator provides a noisy version of the actual value yi . We will assume a Gaussian noise model with mean yi and inverse-variance j j (precision)  j , i.e., Pr[yi |yi ,  j ]  N (yi |yi , 1/ j ). The unknown precision  j is a measure of the accuracy

Supervised Learning from Multiple Experts

of each annotator. A similar EM algorithm can be derived­ (1) Given yi learn a regression function and estimate the precision for each annotator. (2) Given the precision for each annotator refine yi .

nary/categorical/ordinal classification and regression. We made two key assumptions­(1) the experts performance does not depend on the feature vector and (2) the experts are independent. We are currently exploring strategies to relax these two assumptions.

9. Related work
There has been some work in the biostatistics community­see (Dawid & Skeene, 1979; Hui & Zhou, 1998) and references therein­on latent variable models where the task is to get an estimate of the error rates based on the results from multiple diagnostic tests without a gold standard. In the machine learning community (Smyth et al., 1995) first addressed the same problem in the context of labeling volcanoes. There has been recent interest in the natural language processing (Sheng et al., 2008; Snow et al., 2008) and computer vision (Sorokin & Forsyth, 2008) communities where they show that using annotations from many people can be potentially as good as that provided by an expert. There is also some theoretical work (see (Lugosi, 1992) and reference therein) dealing with multiple experts. We differ from the previous body of work in the following aspects­(1) Unlike (Dawid & Skeene, 1979; Smyth et al., 1995) which just focused on estimating the ground truth, we specifically address the issue of learning a classifier. Estimating the ground truth and the expert/classifer performance is a byproduct of our proposed algorithm. (2) To learn a classifier (Smyth, 1995) propose to first estimate the ground truth and then use the probabilistic ground truth to learn a classifier. In contrast, our proposed algorithm learns the classifier and the ground truth jointly. Our experiments (see Figure 2) show that the classifier learnt and ground truth obtained by the proposed algorithm is superior to that obtained by other procedures which first estimates the ground truth and then learns the classifier. (3) Our solution is more general and can be easily extended to categorical, ordinal, and continuous data. It can also used in conjunction with any supervised learning algorithm.

References
Dawid, A. P., & Skeene, A. M. (1979). Maximum likeihood estimation of observed error-rates using the EM algorithm. Applied Statistics, 28, 20­28. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39, 1­38. Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. Lecture Notes in Computer Science, 145­156. Hui, S. L., & Zhou, X. H. (1998). Evaluation of diagnostic tests without a gold standard. Statistical Methods in Medical Research, 7, 354­370. Lugosi, G. (1992). Learning with an unreliable teacher. Pattern Recognition, 25, 79­87. Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models (pp. 355­368). Kluwer Academic Publishers. Sheng, V. S., Provost, F., & Ipeirotis, P. G. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 614­ 622). Smyth, P. (1995). Learning with probabilistic supervision. In Computational learning theory and natural learning systems 3, 163­182. MIT Press. Smyth, P., Fayyad, U., Burl, M., Perona, P., & Baldi, P. (1995). Inferring ground truth from subjective labelling of venus images. In Advances in neural information processing systems 7, 1085­1092. Snow, R., O'Connor, B., Jurafsky, D., & Ng, A. (2008). Cheap and Fast - But is it Good? Evaluating NonExpert Annotations for Natural Language Tasks. Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 254­263). Sorokin, A., & Forsyth, D. (2008). Utility data annotation with Amazon Mechanical Turk. Proceedings of the First IEEE Workshop on Internet Vision at CVPR 08 (pp. 1­8).

10. Conclusions and future work
In this paper we proposed a Bayesian framework for supervised learning in the presence of multiple annotators providing labels but no absolute gold standard. The proposed algorithm iteratively establishes a particular gold standard, measures the performance of the annotators given that gold standard, and then refines the gold standard based on the performance measures. The proposed algorithm can handle bi-