Estimating Labels from Label Proportions

Novi Quadrianto (novi.quad@gmail.com), Alex J. Smola (alex.smola@gmail.com), Tiberio S. Caetano (tiberio.caetano@gmail.com)
Statistical Machine Learning, NICTA and RSISE, Australian National University
Quoc V. Le (quocle@stanford.edu)
Computer Science Department, Stanford University

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like e-commerce, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice.

1 Introduction

Assume that a web services company wants to increase its profit in sales. Obviously sending out discount coupons will increase sales, but sending coupons to customers who would have purchased the goods anyway decreases the margins. Alternatively, failing to send coupons to customers who would only buy in case of a discount reduces overall sales. We would like to identify the class of would-be customers who are most likely to change their purchase decision when receiving a coupon. The problem is that there is no direct access to a sample of would-be customers. Typically only a sample of people who buy regardless of coupons (those who bought when there was no discount) and a mixed sample (those who bought when there was a discount) are available. The mixing proportions can be reliably estimated using random assignment to control and treatment groups. How can we use this information to determine the would-be customers?

Likewise, consider the problem of spam filtering. Datasets of spam are likely to contain almost pure spam (this is achieved e.g. by listing e-mail addresses as spam bait), while users' inboxes typically contain a mix of spam and non-spam. We would like to use the inbox data to improve estimation of spam. In many cases it is possible to estimate the proportions of spam and non-spam in a user's inbox much more cheaply than the actual labels. We would like to use this information to categorize e-mails into spam and non-spam.

Similarly, consider the problem of filtering images with "improper content". Datasets of such images are readily accessible thanks to user feedback, and it is reasonable to assume that this labeling is highly reliable. However, the rest of the images on the web (those not labeled) constitute a far larger dataset, albeit without labels (after all, these are the images whose labels we would like to estimate). That said, it is considerably cheaper to obtain a good estimate of the proportions of proper and improper content in addition to having one dataset of images of likely improper content. We would like to obtain a classifier based on this information.

In this paper we present a method to estimate labels directly in such situations, assuming that only label proportions are known. In the above examples, this would be helpful in identifying potential customers, spam e-mails and improper images. We prove bounds indicating that the estimates obtained are close to those from a fully labeled scenario. The formal setting, though, is more general than the above examples might suggest: we do not require any label to be known, only their proportions within each of the involved datasets. Also, we are not restricted to the binary case but can deal with large numbers of classes.

Problem Formulation Assume that we have n sets of observations X_i = {x_1^i, ..., x_{m_i}^i} of respective sample sizes m_i (our calibration sets), as well as a set X = {x_1, ..., x_m} (our test set). Moreover, assume that we know the fractions π_iy of patterns with label y ∈ Y (where |Y| ≤ n) contained in each set X_i, and assume that we also know the marginal probability p(y) of the test set X.¹ It is our goal to design algorithms which are able to obtain conditional class probability estimates p(y|x) solely based on this information.

¹ The label dictionaries Y_i need not be the same across all sets i: define Y := ∪_i Y_i and allow for π_iy = 0 as needed.

As an illustration, take the spam filtering example. We have X_1 = "mail in spam box" (only spam) and X_2 = "mail in inbox" (spam mixed with non-spam). The test set X may then be X_2 itself, for example. The goal is to find p(spam|mail) in X_2. Note that (for general π_iy) this is more difficult than transduction, where we have at least one dataset with actual labels plus an unlabeled test set for which we might have an estimate of the relative fractions of class labels.

Table 1. Notation Conventions
X_i       ith set of observations: X_i = {x_1^i, ..., x_{m_i}^i}
m_i       number of observations in X_i
X         test set of observations: X = {x_1, ..., x_m}
Y         test set of labels: Y = {y_1, ..., y_m}
m         number of observations in the test set X
π_iy      proportion of label y in set i
φ(x, y)   map from (x, y) to a Hilbert space

2 Mean Operators

Our idea relies on uniform convergence properties of the expectation operator and of corresponding risk functionals (Altun & Smola, 2006). In doing so, we are able to design estimators with the same performance guarantees in terms of uniform convergence as those with full access to the label information.

2.1 Exponential Families

Denote by X the space of observations and let Y be the space of labels. Moreover, let φ(x, y): X × Y → H be a feature map into a Reproducing Kernel Hilbert Space (RKHS) H with kernel k((x, y), (x', y')). In this case we may state conditional exponential models via

p(y|x, θ) = exp(⟨φ(x, y), θ⟩ − g(θ|x))                          (1)
g(θ|x) = log Σ_{y∈Y} exp(⟨φ(x, y), θ⟩),                         (2)

where the normalization g is called the log-partition function. For {(x_i, y_i)} drawn iid from a distribution p(x, y) on X × Y the conditional log-likelihood is

log p(Y|X, θ) = Σ_{i=1}^m [⟨φ(x_i, y_i), θ⟩ − g(θ|x_i)] = m⟨µ_XY, θ⟩ − Σ_{i=1}^m g(θ|x_i),   (3)

where µ_XY is defined as in Table 2. In order to avoid overfitting one commonly maximizes the log-likelihood penalized by a prior p(θ). This means that we need to solve the following optimization problem:

θ* := argmin_θ [− log p(Y|X, θ)p(θ)].                           (4)

For instance, for a Gaussian prior on θ, i.e. for − log p(θ) = (λ/2)‖θ‖² + const., we have

θ* = argmin_θ [Σ_{i=1}^m g(θ|x_i) − m⟨µ_XY, θ⟩ + (λ/2)‖θ‖²].     (5)

Table 2. Major quantities of interest in the paper. Numbers on the left represent the order in which the corresponding quantity is computed in the algorithm (letters denote the variant of the algorithm: 'a' for a general feature map φ(x, y) and 'b' for a factorizing feature map φ(x, y) = φ(x) ⊗ φ(y)). Lowercase subscripts refer to model expectations, uppercase subscripts to sample averages.

Expectations with respect to the model:
      µ_xy := E_{(x,y)∼p(x,y)}[φ(x, y)]
      µ_x^class[y', y] := E_{x∼p(x|y')}[φ(x, y)]
      µ_x^set[i, y] := E_{x∼p(x|i)}[φ(x, y)]
      µ_x^class[y'] := E_{x∼p(x|y')}[φ(x)]
      µ_x^set[i] := E_{x∼p(x|i)}[φ(x)]

Expectations with respect to data:
      µ_XY := (1/m) Σ_{i=1}^m φ(x_i, y_i)
(1a)  µ_X^set[i, y] := (1/m_i) Σ_{x∈X_i} φ(x, y)   (known)
(1b)  µ_X^set[i] := (1/m_i) Σ_{x∈X_i} φ(x)         (known)

Estimates:
(2)   µ̂_x^class := (π^⊤π)^{−1} π^⊤ µ_X^set
(3a)  µ̂_XY = Σ_{y∈Y} p(y) µ̂_x^class[y, y]
(3b)  µ̂_XY = Σ_{y∈Y} p(y) φ(y) ⊗ µ̂_x^class[y]
(4)   θ̂*: solution of (5) for µ_XY = µ̂_XY.

The problem is that in our setting we do not know the labels y_i, so the sufficient statistics µ_XY cannot be computed exactly. The only place where the labels enter the estimation process is via the mean µ_XY. Our strategy is to exploit the fact that this quantity is statistically well behaved and converges under relatively mild technical conditions at rate O(m^{−1/2}) to its expected value (see Theorem 2)

µ_xy := E_{(x,y)∼p(x,y)}[φ(x, y)].                              (6)

Our goal therefore will be to estimate µ_xy and use it as a proxy for µ_XY, and only then solve (5) with the estimated µ̂_XY instead of µ_XY. We will discuss explicit convergence guarantees in Section 3 after describing how to compute the mean operator in detail.

2.2 Estimating the Mean Operator

In order to obtain θ* we would need µ_XY, which is impossible to compute exactly, since we do not have Y. However, we know that µ_XY and µ_xy are close. Hence, if we are able to approximate µ_xy then this, in turn, will be a good estimate for µ_XY. Our quest is therefore as follows: first, express µ_xy as a linear combination of expectations with respect to the distributions of the datasets X_1, ..., X_n (where n ≥ |Y|). Secondly, show that the expectations with respect to the distributions having generated the sets X_i (µ_x^set[i, y], see Table 2) can be approximated by empirical means (µ_X^set[i, y], also see Table 2). Finally, we combine both steps to provide guarantees for µ̂_XY. It will turn out that in certain cases some of the algebra can be sidestepped, in particular whenever we may be able to identify several sets with each other (e.g. the test set X is one of the training datasets X_i) or whenever φ(x, y) factorizes into φ(x) ⊗ φ(y).
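To make the role of the mean operator concrete, the objective (5) can be sketched in a few lines of numpy for the common multiclass choice φ(x, y) = e_y ⊗ φ(x) (i.e. multinomial logistic regression). This is an illustration of where µ_XY enters, not the authors' implementation; note that the labels Y appear only through `mu_XY`:

```python
import numpy as np

def mean_operator(Psi, Y, n_classes):
    """Empirical mean operator mu_XY = (1/m) sum_i phi(x_i, y_i) for the
    factorizing map phi(x, y) = e_y (outer) psi(x); shape (n_classes, d)."""
    m, d = Psi.shape
    mu = np.zeros((n_classes, d))
    np.add.at(mu, Y, Psi)          # scatter-add psi(x_i) into row y_i
    return mu / m

def penalized_neg_log_lik(theta, Psi, mu_XY, lam):
    """(1/m) sum_i g(theta|x_i) - <mu_XY, theta> + (lam/2) ||theta||^2,
    a 1/m rescaling of (5); the labels enter only through mu_XY."""
    scores = Psi @ theta.T                       # <phi(x_i, y), theta> for all y
    smax = scores.max(axis=1, keepdims=True)     # stable log-partition g(theta|x_i)
    g = smax[:, 0] + np.log(np.exp(scores - smax).sum(axis=1))
    return g.mean() - np.sum(mu_XY * theta) + 0.5 * lam * np.sum(theta * theta)
```

At θ = 0 every log-partition term equals log |Y| and the linear and regularization terms vanish, which gives a quick sanity check of the implementation.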
Mean Operator Since µ_xy is a linear operator mapping p(x, y) into a Hilbert space, we may expand

µ_xy = Σ_{y∈Y} p(y) E_{x∼p(x|y)}[φ(x, y)] = Σ_{y∈Y} p(y) µ_x^class[y, y],

where the shorthand µ_x^class[y', y] is defined in Table 2. This means that if we were able to compute µ_x^class[y', y] we would be able to "reassemble" µ_xy from its individual components. We now show that µ_x^class[y', y] can be estimated directly.

Key to our assumptions is that p(x|y, i) = p(x|y). In other words, we assume that the conditional distribution of x is independent of the index i, as long as we know the label y. This yields the following:

p(x|i) = Σ_y π_iy p(x|y).                                       (7)

This allows us to define the following means:

µ_x^set[i, y] := E_{x∼p(x|i)}[φ(x, y)] = Σ_{y'} π_{iy'} µ_x^class[y', y].

Note that in order to compute µ_x^set[i, y] we do not need any label information with respect to p(x|i). However, since we have at least |Y| such equations and we assumed that π has full column rank, they allow us to solve a linear system of equations and compute µ_x^class[y', y] from the µ_x^set[i, y] for all i. That is, we may use

µ_x^set = π µ_x^class  and hence  µ_x^class = (π^⊤π)^{−1} π^⊤ µ_x^set      (8)

to compute µ_x^class[y', y] for all y' ∈ Y. Whenever π ∈ R^{n×n} is invertible, (8) reduces to µ_x^class = π^{−1} µ_x^set. With some slight abuse of notation we let µ_x^class and µ_x^set represent the matrices of terms µ_x^class[y', y] and µ_x^set[i, y] respectively.

Obviously we cannot compute µ_x^set[i, y] explicitly, since we only have samples from p(x|i). However, the same convergence results governing the convergence of µ_XY to µ_xy also hold for the convergence of µ_X^set[i, y] to µ_x^set[i, y]. Hence we may use the empirical average µ_X^set[i, y] as the estimate for µ_x^set[i, y] and from that find an estimate for µ_XY (see Algorithm 1).

Algorithm 1
Input: datasets X, {X_i}, probabilities π_iy and p(y)
for i = 1 to n and y ∈ Y do
  Compute empirical means µ_X^set[i, y]
end for
Compute µ̂_x^class = (π^⊤π)^{−1} π^⊤ µ_X^set
Compute µ̂_XY = Σ_{y∈Y} p(y) µ̂_x^class[y, y]
Solve the minimization problem
  θ̂* = argmin_θ [Σ_{i=1}^m g(θ|x_i) − m⟨µ̂_XY, θ⟩ + (λ/2)‖θ‖²]
Return θ̂*.

2.3 Special Cases

In some cases the calculations described in Algorithm 1 can be carried out more efficiently.

Minimal number of sets, i.e. |Y| = n: Provided that π has full rank, (π^⊤π)^{−1}π^⊤ = π^{−1}. This means that the inverse can be computed more directly.

Testing on one of the calibration sets, i.e. X = X_i: This means that X is one of the training sets, so we need one fewer set of observations. This is particularly useful for factorizing feature maps.

Special feature map φ(x, y) = φ(x) ⊗ φ(y): In this case the calculations of µ̂_x^class[y] and µ_X^set[i, y] are greatly simplified, since we may pull the dependency on y out of the expectations. Defining µ_x^class[y'], µ_x^set[i] and µ_X^set[i] as in Table 2 allows us to simplify

µ̂_XY = Σ_{y∈Y} p(y) φ(y) ⊗ µ̂_x^class[y],                       (9)
where  µ̂_x^class = (π^⊤π)^{−1} π^⊤ µ_X^set.                     (10)

A significant advantage of (10) is that we only need to perform O(n) averaging operations rather than O(n · |Y|). Obviously the cost of computing (π^⊤π)^{−1}π^⊤ remains unchanged, but the latter is negligible in comparison to the operations in Hilbert space.

Binary classification: One may show that the feature map φ(x, y) takes on the particularly appealing form φ(x, y) = yφ(x), where y ∈ {±1}. This follows since we can always re-calibrate ⟨φ(x, y), θ⟩ by an offset independent of y such that φ(x, 1) + φ(x, −1) = 0. If we moreover assume that X_1 only contains class 1 and X_2 = X contains a mixture of classes with labels 1 and −1 in proportions p(1) =: ρ and p(−1) = 1 − ρ respectively, we obtain the mixing matrix

π = [ 1      0
      ρ    1−ρ ]    and    π^{−1} = [ 1           0
                                      −ρ/(1−ρ)   1/(1−ρ) ].

Plugging this into (10) and the result into (9) yields

µ̂_XY = ρ µ_X^set[1] − (1−ρ) [ −(ρ/(1−ρ)) µ_X^set[1] + (1/(1−ρ)) µ_X^set[2] ]
      = 2ρ µ_X^set[1] − µ_X^set[2].                              (11)
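Steps (1b)-(3b) of Algorithm 1 reduce to a pseudoinverse and a weighted sum. The following numpy sketch is illustrative (the function name and argument layout are ours, not the paper's):

```python
import numpy as np

def mean_map_estimate(set_means, pi, p_y, phi_y=None):
    """Sketch of Algorithm 1 for the factorizing map phi(x, y) = phi(x) * phi(y).
    set_means: (n, d) empirical means mu_X^set[i] of phi(x) over each X_i.
    pi: (n, |Y|) mixing matrix; p_y: test-set class marginals p(y);
    phi_y: |Y| label features, e.g. (+1, -1) for the binary map y*phi(x)."""
    set_means = np.asarray(set_means, dtype=float)
    if phi_y is None:
        phi_y = np.ones_like(p_y)
    # step (2): mu_hat_class = (pi^T pi)^{-1} pi^T mu_X^set, via the pseudoinverse
    mu_class = np.linalg.pinv(pi) @ set_means
    # step (3b): mu_hat_XY = sum_y p(y) phi(y) mu_hat_class[y]
    return (p_y * phi_y) @ mu_class
```

For the binary setup of (11), with X_1 pure class 1 and X_2 = X a ρ/(1−ρ) mixture, this reproduces µ̂_XY = 2ρ µ_X^set[1] − µ_X^set[2].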
Consequently, taking a simple weighted difference between the averages on two sets, e.g. one set containing spam and the other an unlabeled mix of spam and non-spam, allows one to obtain the sufficient statistics needed for estimation.

3 Convergence Bounds

The obvious question is how well µ̂_XY manages to approximate µ_XY and, secondly, how badly any error in estimating µ_XY affects the overall quality of the solution. We approach this problem as follows: first we state the uniform convergence properties of µ_XY and similar empirical operators relative to µ_xy. Secondly, we apply those bounds to the cases discussed above, and thirdly, we show that the approximate minimizer of the log-posterior has a bounded deviation from what we would have obtained by knowing µ_XY exactly.

3.1 Uniform Convergence for Mean Operators

In order to introduce the key result we need Rademacher averages:

Definition 1 (Rademacher Averages) Let X be a domain and p a distribution on X, and assume that X := {x_1, ..., x_m} is drawn iid from p. Moreover, let F be a class of functions X → R. Furthermore, denote by σ_i Rademacher random variables, i.e. {±1}-valued with zero mean. The Rademacher average is

R_m(F, p) := E_X E_σ [ sup_{f∈F} | (1/m) Σ_{i=1}^m σ_i f(x_i) | ].   (12)

This quantity measures the flexibility of the function class F, in our case linear functions in φ(x, y).

Theorem 2 (Convergence of Empirical Means) Denote by φ: X → B a map into a Banach space B, denote by B* its dual space, and let F be the class of linear functions on B with B*-norm bounded by 1. Let R > 0 be such that for all f ∈ F we have |f(x)| ≤ R. Moreover, assume that X is an m-sample drawn from p on X. For ε > 0 we have that with probability at least 1 − exp(−ε²m/2R²) the following holds:

‖µ_X − µ_x‖_B ≤ 2 R_m(F, p) + ε.                                 (13)

For kernels k ≥ 0 the failure probability is only exp(−ε²m/R²).

Theorem 3 (Bartlett & Mendelson (2002)) Whenever B is a Reproducing Kernel Hilbert Space with kernel k(x, x'), the Rademacher average can be bounded from above by R_m(F) ≤ m^{−1/2} [E_x[k(x, x)]]^{1/2}.

Our approximation error can be bounded as follows. From the triangle inequality we have

‖µ̂_XY − µ_XY‖ ≤ ‖µ̂_XY − µ_xy‖ + ‖µ_xy − µ_XY‖.

For the second term we may employ Theorem 2 directly. To bound the first term note that by linearity

Δ := µ̂_XY − µ_xy = Σ_y p(y) [(π^⊤π)^{−1} π^⊤ δ̂]_{y,y},          (14)

where we define the matrix δ̂ of coefficients

δ̂[i, y] := µ_X^set[i, y] − µ_x^set[i, y].                        (15)

Now note that all δ̂[i, y] also satisfy the conditions of Theorem 2, since the sets X_i are drawn iid from the distributions p(x|i) respectively. We may bound each term individually in this fashion and subsequently apply the union bound to ensure that all n · |Y| components satisfy the constraints. Hence each of the terms needs to satisfy its constraint with probability 1 − δ/(n|Y|) to obtain an overall bound with probability 1 − δ. To obtain norm bounds on Δ we then need to bound the linear operator mapping δ̂ into Δ.

3.2 Special Cases

A closed form solution in the general case is not particularly useful. However, we give an explicit analysis for two special cases: firstly the situation where φ(x, y) = φ(x) ⊗ φ(y), and secondly the binary classification setting where φ(x, y) = yφ(x) and X_2 = X, where much tighter bounds are available.

Special feature map: We only need to deal with n rather than n × |Y| empirical estimates, i.e. µ_X^set[i] vs. µ_X^set[i, y]. Hence (14) and (15) specialize to

Δ = Σ_y p(y) φ(y) ⊗ Σ_i [(π^⊤π)^{−1}π^⊤]_{yi} δ̂[i]              (16)
δ̂[i] := µ_X^set[i] − µ_x^set[i].                                 (17)

Assume that with high probability each δ̂[i] satisfies ‖δ̂[i]‖ ≤ c_i (we will deal with the explicit constants c_i later). Moreover, assume for simplicity that |Y| = n and that π has full rank (otherwise we need to carry the expansion through using (π^⊤π)^{−1}π^⊤ instead of π^{−1}). This implies that

‖Δ‖² = Σ_{i,j} ⟨δ̂[i], δ̂[j]⟩ Σ_{y,y'} [π^{−1}]_{yi} [π^{−1}]_{y'j} p(y) p(y') k(y, y')
     ≤ Σ_{i,j} c_i c_j [π^{−⊤} K^{y,p} π^{−1}]_{ij},             (18)

where K^{y,p}_{y,y'} := k(y, y') p(y) p(y'). Combining the several bounds we have the following theorem:

Theorem 4 Assume that we have n sets of observations X_i of size m_i, each drawn from a distribution with probabilities π_iy of observing data with label y. Moreover, assume that k((x, y), (x', y')) = k(x, x') k(y, y') ≥ 0, where k(x, x) ≤ 1 and k(y, y) ≤ 1. Finally, assume that m = |X|. In this case the mean operator µ_XY can be estimated by µ̂_XY with probability at least 1 − δ with precision

‖µ_XY − µ̂_XY‖ ≤ [2 + √(log((n+1)/δ))] × [ m^{−1/2} + ( Σ_{i,j} m_i^{−1/2} m_j^{−1/2} [π^{−⊤} K^{y,p} π^{−1}]_{ij} )^{1/2} ].

Proof We begin our argument by noting that both for φ(x, y) and for φ(x) the corresponding Rademacher average R_m for functions of RKHS norm bounded by 1 is bounded by m^{−1/2}. This is a consequence of all kernels being bounded by 1 in Theorem 3 and of k ≥ 0. Next note that in Theorem 2 we may set R = 1, since for ‖f‖ ≤ 1, k((x, y), (x, y)) ≤ 1 and k(x, x) ≤ 1 it follows from the Cauchy-Schwarz inequality that |f(x)| ≤ 1. Solving δ' = exp(−mε²) for ε yields ε ≤ √(m^{−1} log(1/δ')). Finally, note that we have n + 1 deviations which we need to bound: one between µ_XY and µ_xy, and one for each of the δ̂[i] respectively. Dividing the failure probability δ into n + 1 cases yields bounds of the form m^{−1/2}[2 + √(log((n+1)/δ))] and m_i^{−1/2}[2 + √(log((n+1)/δ))] respectively. Plugging all error terms into (18) and summing over terms yields the claim, and substituting this back into the triangle inequality proves the theorem.

Binary classification: Next we consider the special case of binary classification where X_2 = X. Using (11) we see that the corresponding estimator is given by

µ̂_XY = 2ρ µ_X^set[1] − µ_X^set[2].                               (19)

Since µ̂_XY shares a significant fraction of terms with µ_XY we are able to obtain tighter bounds:

Theorem 5 With probability 1 − δ (for 1 > δ > 0) the following bound holds:

‖µ_XY − µ̂_XY‖ ≤ 2ρ [2 + √(log(2/δ))] [m_1^{−1/2} + m_+^{−1/2}],

where m_+ is the number of observations with y = 1 in X_2.

Proof Denote by µ[X_+] and µ[X_−] the averages over the subsets of X_2 with positive and negative labels respectively. By construction we have

µ_XY = ρ µ[X_+] − (1−ρ) µ[X_−]
µ̂_XY = 2ρ µ_X^set[1] − ρ µ[X_+] − (1−ρ) µ[X_−].

Taking the difference yields 2ρ [µ_X^set[1] − µ[X_+]]. To prove the claim note that we may use Theorem 2 both for ‖µ_X^set[1] − E_{x∼p(x|y=1)}[φ(x)]‖ and for ‖µ[X_+] − E_{x∼p(x|y=1)}[φ(x)]‖. Taking the union bound over the two terms and summing proves the claim.

The bounds we provided show that µ̂_XY converges to µ_xy at the same rate as µ_XY does, assuming that the sizes of the sets X_i increase at the same rate as X.

3.3 Stability Bounds

To complete our reasoning we need to show that those bounds translate into guarantees in terms of the minimizer of the log-posterior. In other words, estimates using the correct mean µ_XY vs. its estimate µ̂_XY do not differ by a significant amount. For this purpose we make use of (Altun & Smola, 2006, Lemma 17).

Lemma 6 Denote by f a convex function on H and let µ, µ̂ ∈ H. Moreover, let λ > 0. Finally, denote by θ* ∈ H the minimizer of

L(θ, µ) := f(θ) − ⟨µ, θ⟩ + (λ/2)‖θ‖²                             (20)

with respect to θ, and by θ̂* the minimizer of L(θ, µ̂) respectively. In this case the following inequality holds:

‖θ* − θ̂*‖ ≤ λ^{−1} ‖µ − µ̂‖.                                     (21)

This means that a good estimate for µ immediately translates into a good estimate for the minimizer of the approximate log-posterior. This leads to the following bound on the risk minimizer:

Corollary 7 The deviation between θ*, as defined in (4), and θ̂*, the minimizer of the approximate log-posterior using µ̂_XY rather than µ_XY, is bounded by O(λ^{−1}(m^{−1/2} + Σ_i m_i^{−1/2})).

Finally, we may use (Altun & Smola, 2006, Theorem 16) to obtain bounds on the quality of θ̂* when considering how well it minimizes the true negative log-posterior.
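The O(m^{−1/2}) behaviour claimed above is easy to probe by simulation. The toy Monte Carlo below (our construction, not the paper's experiment: φ(x) = x, unit-variance Gaussian class-conditionals, |X_1| = m) compares the binary estimator (19) against the full-label mean operator µ_XY:

```python
import numpy as np

def binary_mu_hat_error(m, rho, rng):
    """One draw of |mu_hat_XY - mu_XY| for estimator (19), with phi(x) = x,
    class-conditionals N(+1, 1) and N(-1, 1), and |X1| = m (toy assumptions)."""
    y = rng.random(m) < rho                           # labels of X2 = X
    x2 = np.where(y, rng.normal(1.0, 1.0, m), rng.normal(-1.0, 1.0, m))
    x1 = rng.normal(1.0, 1.0, m)                      # X1: pure class 1
    mu_hat = 2 * rho * x1.mean() - x2.mean()          # estimator (19)
    mu_XY = np.where(y, x2, -x2).mean()               # (1/m) sum_i y_i phi(x_i)
    return abs(mu_hat - mu_XY)

rng = np.random.default_rng(0)
err_small = np.mean([binary_mu_hat_error(100, 0.3, rng) for _ in range(200)])
err_large = np.mean([binary_mu_hat_error(10000, 0.3, rng) for _ in range(200)])
# the error shrinks roughly like m^{-1/2} as the sample sizes grow together
```

Both quantities converge to µ_xy, so their gap vanishes as both m_1 and m_+ grow, in line with Theorem 5.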
Using the bound

L(θ̂*, µ) − L(θ*, µ) ≤ ‖θ̂* − θ*‖ ‖µ̂ − µ‖                          (22)

yields the following bound for the log-posterior:

Corollary 8 The minimizer θ̂* of the approximate log-posterior using µ̂_XY rather than µ_XY incurs a penalty of at most λ^{−1} ‖µ̂_XY − µ_XY‖².

4 Extensions

Note that our analysis so far focused on a specific setting, namely maximum-a-posteriori estimation in exponential families. While this is a common and popular setting, the derivations are by no means restricted to it. We have the entire class of (conditional) models described by Altun & Smola (2006) and Dudík & Schapire (2006) at our disposal. They are characterized via

minimize_p  −H(p)  subject to  ‖E_{z∼p}[φ(z)] − µ‖ ≤ ε,

where p is a distribution, H is an entropy-like quantity defined on the space of distributions, and φ(z) is some evaluation map into a Banach space. This means that the optimization problem can be viewed as an approximate maximum entropy estimation problem, where we do not enforce exact moment matching of µ but rather allow ε slack. In both Altun & Smola (2006) and Dudík & Schapire (2006) the emphasis lay on unconditional density models: the dual of the above optimization problem. In particular, it follows that for H being the Shannon-Boltzmann entropy, the dual optimization problem is the maximum a posteriori estimation problem, which is what we are solving here.

In the conditional case, p denotes the collection of probabilities p(y|x_i) and the operator E_{z∼p}[φ(z)] = (1/m) Σ_{i=1}^m E_{y∼p(y|x_i)}[φ(x_i, y)] is the conditional expectation operator on the set of observations. Finally, µ = (1/m) Σ_{i=1}^m φ(x_i, y_i), that is, it describes the empirical observations. We have two design parameters:

Function Space: Depending on which Banach space norm we choose, we may measure the deviation between µ and its expectation with respect to p in terms of e.g. the ℓ2 norm, the ℓ1 norm or the ℓ∞ norm. The latter would lead to sparse coding and convex combinations.

Entropy and Regularity: Depending on the choice of entropy and divergence functionals we obtain a range of diverse estimators. For instance, if we were to choose the unnormalized entropy instead of the Shannon entropy, we would obtain AdaBoost-style problems. We may also use Csiszár and Bregman divergences. The key point is that our reasoning of estimating µ_XY based on an aggregate of samples with unknown labels but known label proportions is still applicable. This means that it should be possible to design boosting algorithms and sparse coding methods which could operate on similarly restricted sets of observations.

5 Related Work and Alternatives

Transduction Gärtner et al. (2006) and Mann & McCallum (2007) performed transduction by enforcing a proportionality constraint on the unlabeled data via a Gaussian Process model. At first glance these methods might seem applicable to our problem, but as stated in Section 1, they do require at least some labeled instances of all classes at our disposal, which need to be drawn in an unbiased fashion. This is clearly not the case in our setting.

Self-consistent proportions Kück & de Freitas (2005) introduced a more informative variant of binary multiple-instance learning, in which groups of instances are given along with estimates of the fraction of positively-labeled instances per group. This is then used to design a hierarchical probabilistic model which generates consistent fractions. The optimization is solved via an MCMC sampler. While only described for a binary problem, it could easily be extended to multiclass settings. Chen et al. (2006) and Musicant et al. (2007) use a similar approach with similar drawbacks, since we typically only have as many sets as classes.

Conditional probabilities A seemingly valid alternative approach is to build a classifier for p(i|x) and subsequently recalibrate the probabilities to obtain p(y|x). At first sight this may appear promising, since the method is easily applicable in conjunction with most discriminative methods. The idea would be to reconstruct p(y|x) by

p(y|x) = Σ_i π_iy p(i|x).                                        (23)

However, this is not a useful estimator in our setting for a simple reason: it assumes the conditional independence y ⫫ x | i, which obviously does not hold. A simple example illustrates the problem. Assume that X, Y = {1, 2} and that p(y=1|x=1) = p(y=2|x=2) = 1. In other words, the estimation problem is solvable since the classes are well separated. Moreover, assume that π is given by

π = [ 0.5 − ε   0.5 + ε
      0.5       0.5     ]   for 0 < ε ≪ 1.

Here p(i|x) is useless for estimating p(y|x), since we will only exceed random guessing by at most ε.

Reduction to binary For binary classification and real-valued classification scores we may resort to yet another fairly straightforward method: build a classifier which is able to distinguish between the sets X_1 and X_2, and subsequently threshold the scores such that the appropriate fraction of observations in X_1 and X_2 respectively receives its proper label. Unfortunately, multiclass variants of this reduction are nontrivial, and experiments show that even in the binary case this method is inferior to our approach.

Density estimation Finally, one way of obtaining the probabilities p(x|i) is to perform density estimation for each set of observations X_i. Subsequently we may use

p(x|y) = Σ_i [π^{−1}]_{yi} p(x|i)                                (24)

to re-calibrate the estimates; Bayes' theorem is finally invoked to compute the posterior probabilities. This tends to fail for high-dimensional data due to the curse of dimensionality in density estimation.

6 Experiments

Datasets: We use binary and three-class classification datasets from the UCI repository (http://archive.ics.uci.edu/ml/) and the LibSVM site (http://www.csie.ntu.edu.tw/cjlin/libsvmtools/). If separate training and test sets are available, we merge them before performing nested 10-fold cross-validation. Since we need to generate as many splits as classes, we limit ourselves to three classes.

For the binary datasets we use half of the data for X_1 and the rest for X_2. We also remove all instances of class 2 from X_1. That is, the conditional class probabilities in X_2 match those from the repository, whereas in X_1 their counterparts are deleted.

For three-class datasets we investigate two different partitions. In scenario A we use class 1 exclusively in X_1, class 2 exclusively in X_2, and a mix of all three classes weighted by (0.5·p(1), 0.7·p(2), 0.8·p(3)) to generate X_3. In scenario B we use the following splits:

X_1:  c_1·0.4·p(1)   c_1·0.15·p(2)   c_1·0.14·p(3)
X_2:  c_2·0.1·p(1)   c_2·0.15·p(2)   c_2·0.06·p(3)
X_3:  c_3·0.5·p(1)   c_3·0.7·p(2)    c_3·0.8·p(3)

Here the constants c_1, c_2 and c_3 are chosen such that the probabilities are properly normalized. As before, X_3 contains half of the data.

Model selection: As stated, we carry out a nested 10-fold cross-validation procedure: 10-fold cross-validation to assess the performance of the estimators; within each fold, 10-fold cross-validation is performed to find a suitable value for the parameters. For supervised classification, i.e. discriminative sorting, such a procedure is quite straightforward because we can directly optimize for classification error. For kernel density estimation (KDE), we use the log-likelihood as our criterion. Due to the high number of hyperparameters (at least 8) in MCMC, it is difficult to perform nested 10-fold cross-validation; instead, we choose the best parameters from a simple 10-fold cross-validation run. In other words, we give the MCMC method an unfair advantage over our approach by reporting the best performance encountered during model selection. Finally, for the re-calibrated sufficient statistics µ̂_XY we use the estimate of the log-likelihood on the validation set as the cross-validation criterion, since no other quantity, such as the classification error, is readily available for estimation.

Algorithms: For discriminative sorting we use an SVM with a Gaussian RBF kernel whose width is set to the median distance between observations (Schölkopf, 1997); the regularization parameter is chosen by cross-validation. The same strategy applies to our algorithm. For KDE, we use Gaussian kernels with diagonal covariances; cross-validation is performed over the kernel width. For MCMC, 10³ samples are generated after a burn-in period of 10³ steps (Kück & de Freitas, 2005).

Optimization: Bundle methods (Smola et al., 2007; Teo et al., 2007) are used to solve the optimization problem in Algorithm 1.

Results: The experimental results are summarized in Table 3. Our method outperforms KDE and discriminative sorting. In terms of computation, our approach is also somewhat more efficient, since it only needs to deal with a smaller sample size (only X rather than the union of all X_i). The training time for our method is less than 2 minutes in all cases, whereas MCMC on average takes 15 minutes, and possibly much longer when the number of active kernels and/or observations is high, plus additional time to re-calibrate the probability estimates. However, for a large number of partitions n the MCMC procedure might potentially have an edge over our method, as we do not take full advantage of this setting; this could, however, be achieved easily by optimizing the condition number of the pseudoinverse of the redundant system of linear equations. Our method also performs well on multiclass datasets.
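The dependence on the mixing matrix can be made concrete: the estimate µ̂_class = π^{−1} µ_X^set amplifies sampling noise in the set means by up to the condition number of π. The matrices below are illustrative stand-ins for a sparse (scenario-A-like) and a dense (scenario-B-like) mixing matrix, not the paper's actual proportions:

```python
import numpy as np

# Sparse, scenario-A-like mixing matrix: two pure sets plus one mixture.
pi_sparse = np.array([[1.00, 0.00, 0.00],
                      [0.00, 1.00, 0.00],
                      [0.40, 0.35, 0.25]])
# Dense, scenario-B-like mixing matrix: three similar mixtures.
pi_dense = np.array([[0.40, 0.35, 0.25],
                     [0.35, 0.40, 0.25],
                     [0.30, 0.35, 0.35]])

cond_sparse = np.linalg.cond(pi_sparse)
cond_dense = np.linalg.cond(pi_dense)
# Noise of size eps in mu_X^set can grow to roughly cond(pi) * eps
# in mu_hat_class, degrading the effective sample size.
print(cond_sparse, cond_dense)
```

The dense matrix, whose rows are nearly collinear, is markedly worse conditioned, which matches the observed drop in accuracy from scenario A to scenario B.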
As described in Section 3.2, the quality of our minimizer of the log-posterior depends on the mixing matrix π, and this is noticeable in the reduced performance for the dense mixing matrix (scenario B) in comparison to the better conditioned sparse mixing matrix (scenario A). In other words, for ill-conditioned π even our method has its limits, simply due to numerical considerations of effective sample size.

Table 3. Classification error on UCI/LibSVM datasets. Errors are reported as mean ± standard error. The best result, and those not significantly worse than it, are highlighted in boldface; we use a one-sided paired t-test with 95% confidence. MM: Mean Map (our method); KDE: Kernel Density Estimation; DS: Discriminative Sorting (only applicable for binary classification); MCMC: the sampling method; BA: Baseline, obtained by predicting the major class. †: program fails (too high dimensional data - only KDE). ‡: program fails (large datasets - only MCMC).

Data    MM         KDE        DS         MCMC       BA
iono    18.4±3.2   17.5±3.2   12.2±2.6   18.0±2.1   35.8
iris    10.0±3.6   16.8±3.4   15.4±1.1   21.1±3.6   29.9
optd    1.8±0.5    0.7±0.4    9.8±1.2    2.0±0.4    49.1
page    3.8±2.3    7.1±2.8    18.5±5.6   5.4±2.8    43.9
pima    27.5±3.0   34.8±0.6   34.4±1.7   23.8±1.8   34.8
tic     31.0±1.5   34.6±0.5   26.1±1.5   31.3±2.5   34.6
yeast   9.3±1.5    6.5±1.3    25.6±3.6   10.4±1.9   39.9
wine    7.4±3.0    12.1±4.4   18.8±6.4   8.7±2.9    40.3
wdbc    7.8±1.3    5.9±1.2    10.1±2.1   15.5±1.3   37.2
sonar   24.2±3.5   35.2±3.5   31.4±4.0   39.8±2.8   44.5
heart   30.0±4.0   38.1±3.8   28.4±2.8   33.7±4.7   44.9
brea    5.3±0.8    14.2±1.6   3.5±1.3    4.8±2.0    34.5
aust    17.0±1.7   33.8±2.5   15.8±2.9   30.8±1.8   44.4
svm3    20.4±0.9   27.2±1.3   25.5±1.5   24.2±0.8   23.7
adult   18.9±1.2   24.5±1.3   22.1±1.4   18.7±1.2   24.6
cleve   19.1±3.6   35.9±4.5   23.4±2.9   24.3±3.1   22.7
derm    4.9±1.4    27.4±2.6   4.7±1.9    14.2±2.8   30.5
musk    25.1±2.3   28.7±2.6   22.2±1.8   19.6±2.8   43.5
ger     32.4±1.8   41.6±2.9   37.6±1.9   32.0±0.6   32.0
cove    37.1±2.5   41.9±1.7   32.4±1.8   41.1±2.2   45.9
spli    25.2±2.0   35.5±1.5   26.6±1.7   28.8±1.6   48.4
giss    10.3±0.9   †          12.2±0.8   50.0±0.0   50.0
made    44.1±1.5   †          46.0±2.0   49.6±0.2   50.0
cmc     37.5±1.4   43.8±0.7   45.1±2.3   46.9±2.6   49.9
bupa    48.5±2.9   50.8±5.1   40.3±4.9   50.4±0.8   49.7
protA   44.6±0.3   60.2±0.1   N/A        65.3±1.9   61.2
protB   45.7±0.6   61.2±0.0   N/A        67.7±1.8   61.2
dnaA    16.6±1.0   30.7±0.8   N/A        37.7±0.8   40.5
dnaB    29.1±1.0   33.0±0.7   N/A        40.5±0.0   40.5
sensA   19.8±0.1   43.1±0.0   N/A        ‡          43.2
sensB   21.0±0.1   43.1±0.0   N/A        ‡          43.2

7 Conclusion

We have shown a rather surprising result, namely that it is possible to consistently reconstruct the labels of a dataset if we can only obtain information about the proportions of occurrence of each class (in at least as many data aggregates as there are classes). In particular, we proved that, up to constants, our algorithm enjoys the same rates of convergence afforded to methods which have full access to all label information. This has a range of potential applications in e-commerce, spam filtering and improper content detection. It also suggests that techniques used to anonymize observations, e.g. demographic data, may not be really safe. Experiments show that our algorithm is fast and outperforms competitive methods.

Acknowledgments We thank Hendrik Kück and Choon Hui Teo. NICTA is funded by the Australian Government's Backing Australia's Ability and the Centre of Excellence programs. We received funding of the FP7 Network of Excellence by the European Union.

References

Altun, Y., & Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. COLT'06, LNCS, 139-153. Springer.
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3.
Chen, B., Chen, L., Ramakrishnan, R., & Musicant, D. (2006). Learning from aggregate views. ICDE'06, 3-12.
Dudík, M., & Schapire, R. E. (2006). Maximum entropy distribution estimation with generalized regularization. COLT'06, Springer.
Gärtner, T., Le, Q., Burton, S., Smola, A., & Vishwanathan, S. (2006). Large-scale multiclass transduction. NIPS'06.
Kück, H., & de Freitas, N. (2005). Learning about individuals from group statistics. UAI'05, 332-339.
Mann, G., & McCallum, A. (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. ICML'07.
Musicant, D., Christensen, J., & Olson, J. (2007). Supervised learning by training on aggregate outputs. IEEE ICDM.
Schölkopf, B. (1997). Support Vector Learning. Oldenbourg Verlag.
Smola, A., Vishwanathan, S. V. N., & Le, Q. (2007). Bundle methods for machine learning. NIPS'07.
Teo, C. H., Le, Q. V., Smola, A. J., & Vishwanathan, S. V. N. (2007). A scalable modular convex solver for regularized risk minimization. KDD'07.