Learning from Noisy Side Information by Generalized Maximum Entropy Model

Tianbao Yang (yangtia1@cse.msu.edu), Rong Jin (rongjin@cse.msu.edu), Anil K. Jain (jain@cse.msu.edu)
Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We consider the problem of learning from noisy side information in the form of pairwise constraints. Although many algorithms have been developed to learn from side information, most of them assume perfect pairwise constraints. Because pairwise constraints are often extracted from data sources such as paper citations, they tend to be noisy and inaccurate. In this paper, we introduce a generalization of the maximum entropy model and propose a framework for learning from noisy side information based on the generalized maximum entropy model. Our theoretical analysis shows that, under certain assumptions, the classification model trained from noisy side information can be very close to the one trained from perfect side information. Extensive empirical studies verify the effectiveness of the proposed framework.

1. Introduction

Learning from side information has been studied extensively and has found applications in distance metric learning (Xing et al., 2003), constrained clustering (Basu et al., 2004b), and kernel learning (Hoi et al., 2007). The side information is usually cast in the form of pairwise constraints: pairs in the same class, called positive (pairwise) constraints, and pairs in different classes, called negative (pairwise) constraints. The side information can often be derived from data, making it more attractive than the standard setup of supervised learning. For instance, in classifying research articles, we can derive the pairwise constraints from the citations between papers.

Although various algorithms have been proposed for learning from side information, most of them assume perfect side information. In contrast, in this study we focus on the problem of learning from noisy side information, in which some of the pairwise constraints are labeled incorrectly. This setting is important because pairwise constraints extracted from data tend to be noisy and inaccurate. In the example of classifying research articles with pairwise constraints constructed from paper citations, a cited paper may not share the same research topic as the citing paper. To handle the noisy side information, we introduce a generalization of the maximum entropy model and propose a framework for learning from noisy side information based on the generalized maximum entropy model. We show that, under certain assumptions, the conditional probabilistic model trained from noisy side information converges to the one trained from perfect side information. Extensive experimental results verify the efficacy of the proposed framework for learning from noisy side information.

The remainder of this paper is organized as follows. In Section 2, we review related work. In Section 3, we present generalized maximum entropy learning from noisy side information. We present experimental results in Section 4 and conclude our study in Section 5.

2. Related Work

In this section, we review related work on learning from side information and its application to distance metric learning, constrained clustering, and kernel learning.
We also discuss the relation between this work and previous work on learning from noisy label information.

2.1. Learning from Side Information

The objective of learning from side information is to learn a statistical model that is consistent with the given pairwise constraints. There are three major applications of learning from side information: distance metric learning, constrained clustering, and kernel learning. Below we briefly review each of the three applications.

For distance metric learning, an appropriate distance metric is learned so that data points in positive constraints are separated by a short distance while data points in negative constraints are separated by a large distance. Many algorithms (Xing et al., 2003; Shental et al., 2002; Hoi et al., 2006a; Goldberger et al., 2004; Yang et al., 2006; Weinberger et al., 2006) have been developed for distance metric learning. More work on distance metric learning from side information can be found in the survey (Yang & Jin, 2006) and references therein.

For constrained clustering, the objective is to improve the accuracy of data clustering by exploiting the pairwise constraints. Based on how the constraints are used, constrained clustering methods fall into three categories: constraint-based (Davidson & Ravi, 2005; Basu et al., 2004a), distance-based (Bilenko & Mooney, 2003; Cohn et al., 2003), and a mixture of the two (Basu et al., 2004b). More work on constrained clustering with side information can be found in (Davidson & Sugato, 2007) and the references therein.

For kernel learning, an appropriate kernel matrix/function for a given data set is learned from a set of pairwise constraints. Recent work on kernel learning from pairwise constraints includes (Hoi et al., 2007; Hoi & Jin, 2008).

Most studies on learning from side information assume the side information is noise-free. However, in many applications the pairwise constraints are derived from data, making them prone to errors. Although some work (Basu et al., 2004b; Pelleg & Baras, 2007) claims that their approaches are robust to noise in side information, they do not have systematic approaches for handling noisy constraints. In contrast, we present a principled framework for learning from noisy side information based on the generalized maximum entropy model, and we provide a theoretical analysis to further justify the proposed approach.

2.2. Learning from Noisy Label Information

Learning from noisy label information (not noisy pairwise constraints) was considered in several studies. (Lawrence & Schölkopf, 2001) estimates a kernel Fisher discriminant in the presence of label noise. Their work is based on two assumptions: (a) a Gaussian distribution for the input patterns, and (b) a known noise model for the corruption of class labels. (Pal et al., 2007) presents a probabilistic model for extracting location information for events from noisy training labels. Our work differs from these studies in that we deal with noisy pairwise constraints, not noisy label information. In addition, we present a theoretical analysis showing that, under certain assumptions, the solution found by our algorithm using noisy side information converges to the one trained from perfect side information. Finally, it is worthwhile pointing out that our approach is closely related to the learning framework based on divergence minimization (Altun & Smola, 2006).
However, unlike the divergence minimization framework, which is designed for the standard setup of supervised learning, our work focuses on learning from noisy side information.

3. Learning from Noisy Side Information

We start with the basic formulation for maximum entropy learning from perfect side information, followed by its generalization. We then extend the generalized maximum entropy learning to the case of noisy side information. For the purpose of presentation, we first introduce the notation used throughout this article.

3.1. Notations

Let $\mathcal{D} = \{x_i \in \mathcal{X}, i = 1, \cdots, N\}$ be a collection of data points and $\mathcal{P} = \{(x_i^1, x_i^2, \hat{y}_i) \mid x_i^1, x_i^2 \in \mathcal{D}, i = 1, \ldots, n, \hat{y}_i \in \{+1, -1\}\}$ be a collection of observed labeled pairs. We slightly abuse the terminology of labeled and unlabeled examples by referring to the examples in $\mathcal{D}$ that also occur in $\mathcal{P}$ as labeled examples, and to the remaining examples in $\mathcal{D}$ as unlabeled examples. We denote by $y_i$ the true label for the pair $(x_i^1, x_i^2)$. We refer to the pairs with $y_i = +1$ as perfect positive constraints and the pairs with $y_i = -1$ as perfect negative constraints. Similarly, we refer to the pairs with $\hat{y}_i = +1$ as noisy positive constraints and the pairs with $\hat{y}_i = -1$ as noisy negative constraints. We use $\bar{y} = -y$ for the complement of $y$. We use $K_j(x^1, x^2)$ for the $j$th ($j \in \{1, \cdots, m\}$) feature function defined on $\mathcal{X} \times \mathcal{X}$, and denote by $\mathbf{k}_i = (K_1(x_i^1, x_i^2), \cdots, K_m(x_i^1, x_i^2))^\top$ the feature vector for pair $(x_i^1, x_i^2)$. Throughout the paper, we use capital letters $X$, $Y$, $\hat{Y}$ for the corresponding random variables. We define $\Pr(\hat{Y} = y \mid Y = y) = c_y$, and use $c_+ = c_{+1}$, $c_- = c_{-1}$ for short. Also, in the sequel, we use the following notations:

$\hat{a}^j[y] = \frac{1}{n}\sum_{i=1}^n \delta(\hat{y}_i, y)\, K_j(x_i^1, x_i^2)$,
$a^j[y] = \frac{1}{n}\sum_{i=1}^n \delta(y_i, y)\, K_j(x_i^1, x_i^2)$,

where $\delta(y_i, y)$ is the Kronecker delta function that outputs 1 if $y_i = y$ and zero otherwise.

3.2. Learning from Perfect Side Information by Generalized Maximum Entropy Model

We first consider the maximum entropy model for learning from perfect side information. We cast the problem of learning from side information into a binary classification problem whose objective is to classify each pair $(x_i^1, x_i^2)$ into one of two categories, i.e., a positive pair ($y_i = +1$) or a negative pair ($y_i = -1$). Using the maximum entropy model, we aim to learn the conditional distribution $\Pr(Y = y \mid X^1, X^2)$, which leads to the following optimization problem:

$\max_{p} \ \frac{1}{n}\sum_{i=1}^n H(p|x_i^1, x_i^2) \quad \mathrm{s.t.} \quad \frac{1}{n}\sum_{i=1}^n p(y|x_i^1, x_i^2)\, K_j(x_i^1, x_i^2) = a^j[y], \ \forall y, j$,   (1)

where $H(p|x_i^1, x_i^2) = -\sum_y p(y|x_i^1, x_i^2) \ln p(y|x_i^1, x_i^2)$. The solution to (1) is given by

$p(y|x_i^1, x_i^2) = \frac{1}{1 + \exp(-y\, \lambda^\top \mathbf{k}_i)}$,

where $\lambda \in \mathbb{R}^m$ are the dual variables, obtained by solving the following optimization problem:

$\min_{\lambda \in \mathbb{R}^m} \ \sum_{i=1}^n \ln\left(1 + \exp(-y_i \lambda^\top \mathbf{k}_i)\right)$.

One major problem with the maximum entropy model in (1) is the equality constraint, which is unlikely to hold if, for each pair $(x_i^1, x_i^2)$, $y_i$ is a random sample from the distribution $p(y|x_i^1, x_i^2)$. We denote by $a_p^j[y]$ the left-hand side of the equality constraint in problem (1), i.e.,

$a_p^j[y] = \frac{1}{n}\sum_{i=1}^n p(y|x_i^1, x_i^2)\, K_j(x_i^1, x_i^2)$.

The following theorem shows that $a_p^j[y]$ and $a^j[y]$ could differ significantly if $n$ is small. The difference between the two quantities diminishes only when $n$ approaches infinity.

Theorem 1. Assume $(x_i^1, x_i^2, y_i)$ are i.i.d. samples from an unknown distribution $P(X^1, X^2, Y)$. The equality constraint in (1) for any $j$ and $y$ holds with probability 1 when the number of instances approaches infinity. In particular, for any $\epsilon > 0$ we have

$\Pr\left(\left|a_p^j[y] - a^j[y]\right| \geq \epsilon\right) \leq 4\exp\left(-\frac{n\epsilon^2}{8\kappa_j^2}\right)$,

where $\kappa_j = \max_{x^1, x^2} |K_j(x^1, x^2)|$.

The theorem can be proved by noting that $\mathrm{E}[a_p^j[y]] = \mathrm{E}[a^j[y]]$ and applying McDiarmid's inequality. Details are provided in the supplementary material.

To address the case that $a_p^j[y]$ and $a^j[y]$ could be different, we propose a generalization of the traditional maximum entropy model in (1). Given the finite number of training data, we relax the equality constraints in (1) into inequality ones, leading to the following formulation for learning from side information:

$\max_{p,\, \epsilon \geq 0} \ \frac{1}{n}\sum_{i=1}^n H(p|x_i^1, x_i^2) - \frac{1}{2\gamma}\sum_y \|\epsilon_y\|^2 \quad \mathrm{s.t.} \quad \frac{1}{n}\sum_{i=1}^n p(y|x_i^1, x_i^2)\, K_j(x_i^1, x_i^2) \geq a^j[y] - \epsilon_{yj}, \ \forall y, j$,   (2)

where $\epsilon_y = (\epsilon_{y1}, \ldots, \epsilon_{ym})^\top$ and $\|\cdot\|$ is a norm that measures the length of vector $\epsilon_y$. The key features of the generalized maximum entropy model in (2) are:

- Equality constraints are replaced with inequality ones. As a result, we have $a^j[y] - \epsilon_{yj} \leq a_p^j[y] \leq a^j[y] + \epsilon_{\bar{y}j}$. Note that although only a one-sided inequality is included in (2), the upper bound on $a_p^j[y]$ can be easily derived by using the relation $a_p^j[y] + a_p^j[\bar{y}] = a^j[y] + a^j[\bar{y}]$.

- The positive dummy variables $\epsilon_{yj}$ are introduced to account for the difference between the two empirical means $a_p^j[y]$ and $a^j[y]$. A regularization term $\|\epsilon_y\|^2/(2\gamma)$ is introduced into the objective in order to determine these variables automatically.

We further justify the generalized maximum entropy model by showing that it is equivalent to the regularized logistic regression model.

Proposition 2. When $\|\cdot\| = \|\cdot\|_2$, the dual problem of (2) is equivalent to the regularized logistic regression model, i.e.,

$\min_{\lambda \in \mathbb{R}^m} \ \frac{\gamma}{2}\|\lambda\|_2^2 + \frac{1}{n}\sum_{i=1}^n \ln\left(1 + \exp(-y_i \lambda^\top \mathbf{k}_i)\right)$.
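To make Proposition 2 concrete, here is a minimal sketch, not the authors' code, of solving the dual of (2) with the $\ell_2$ norm as off-the-shelf $\ell_2$-regularized logistic regression over the pair feature vectors $\mathbf{k}_i$. The per-attribute product features, all variable names, and the use of scikit-learn are assumptions made for illustration; the constant $C \approx 1/(n\gamma)$ comes from matching scikit-learn's objective to the one above.

```python
# A minimal sketch (not the authors' implementation) of Proposition 2: with the
# l2 norm, the dual of (2) amounts to l2-regularized logistic regression on the
# pair feature vectors k_i with the given pair labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(X1, X2):
    """k_i = (K_1(x_i^1, x_i^2), ..., K_m(x_i^1, x_i^2)); here each feature
    function is taken to be a per-attribute product, as in Section 4."""
    return X1 * X2  # shape (n_pairs, m)

def fit_maxent_dual(X1, X2, y, gamma=0.01):
    """Solve min_lambda gamma/2 ||lambda||^2 + 1/n sum_i ln(1 + exp(-y_i lambda^T k_i))."""
    K = pair_features(X1, X2)
    n = K.shape[0]
    # scikit-learn minimizes 0.5||w||^2 + C * sum_i log-loss, so C ~ 1/(n * gamma).
    clf = LogisticRegression(penalty="l2", C=1.0 / (n * gamma),
                             fit_intercept=False, solver="lbfgs")
    clf.fit(K, y)              # y in {+1, -1}
    return clf.coef_.ravel()   # dual variables lambda

# Toy usage with random data: 200 pairs of 50-dimensional points.
rng = np.random.default_rng(0)
X1, X2 = rng.random((200, 50)), rng.random((200, 50))
y = rng.choice([-1, 1], size=200)
lam = fit_maxent_dual(X1, X2, y)
p_pos = 1.0 / (1.0 + np.exp(-pair_features(X1, X2) @ lam))  # p(y=+1 | x^1, x^2)
```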
3.3. Learning from Noisy Side Information by Generalized Maximum Entropy Model

We extend the framework of generalized maximum entropy learning to the case when the pairwise constraints are noisy, i.e., $\hat{y}_i \neq y_i$ for some pairs. The advantage of extending the maximum entropy model to noisy side information is that the label information $y_i$ is only utilized in computing $a^j[y]$. Hence, if we have an alternative approach to estimate $a^j[y]$ without having to know which labels are incorrect, we can construct the maximum entropy model for noisy side information. In order to estimate $a^j[y]$ in the case of noisy side information, we make the following assumptions.

Assumption 1. We assume (1.a) $\Pr(\hat{Y} \mid X^1, X^2, Y) = \Pr(\hat{Y} \mid Y)$, (1.b) $\Pr(\hat{Y} = y \mid Y = y) = c_y$ is given, and (1.c) $c_y + c_{\bar{y}} - 1 > 0$.

In the above assumption, (1.a) assumes that $\hat{Y}$ is conditionally independent of $(X^1, X^2)$ given $Y$, (1.b) assumes group-level knowledge about the noise in the pairwise constraints, and (1.c) essentially assumes that the noise level of the side information is not too significant. With these three assumptions, the following theorem shows that it is possible to express the empirical mean $a^j[y]$ in terms of $\hat{a}^j[y]$, i.e., the empirical mean estimated from the noisy side information.

Theorem 3. Assume $(x_i^1, x_i^2, \hat{y}_i), i = 1, \ldots, n$ are i.i.d. samples. Then for any $\epsilon > 0$ we have

$\Pr\left(\left|a^j[y] - b^j[y]\right| \geq \epsilon\right) \leq 4\exp\left(-\frac{(c_y + c_{\bar{y}} - 1)^2\, n\, \epsilon^2}{8\kappa_j^2}\right)$,

where

$b^j[y] = \frac{\hat{a}^j[y]}{c_y + c_{\bar{y}} - 1} - \frac{1 - c_{\bar{y}}}{(c_y + c_{\bar{y}} - 1)\, n}\sum_{i=1}^n K_j(x_i^1, x_i^2)$.

The proof can be found in the supplementary material. As indicated by Theorem 3, under Assumption 1 we can approximate $a^j[y]$ by $b^j[y]$. It is interesting to note that the convergence rate is $O\left(1/[(c_y + c_{\bar{y}} - 1)\sqrt{n}]\right)$, not $O(1/\sqrt{n})$. Thus, when the noise level of the pairwise constraints is high, i.e., $c_y + c_{\bar{y}} - 1$ is small, the two quantities could still differ significantly even with a modest number of training pairs.

Similar to Theorem 3, we have the following corollary to bound the difference between $a_p^j[y]$ and $b^j[y]$.

Corollary 4. Assume $(x_i^1, x_i^2, \hat{y}_i), i = 1, \ldots, n$ are i.i.d. samples. Then for any $\epsilon > 0$ we have

$\Pr\left(\left|a_p^j[y] - b^j[y]\right| \geq \epsilon\right) \leq 4\exp\left(-\frac{\epsilon^2 (c_y + c_{\bar{y}} - 1)^2\, n}{8\kappa_j^2}\right)$.

With Theorem 3 and Corollary 4, we finally arrive at the following formulation for generalized maximum entropy learning from noisy side information:

$\max_{p,\, \epsilon \geq 0} \ \frac{1}{n}\sum_{i=1}^n H(p|x_i^1, x_i^2) - \frac{1}{2\gamma}\sum_y \|\epsilon_y\|^2 \quad \mathrm{s.t.} \quad \frac{1}{n}\sum_{i=1}^n p(y|x_i^1, x_i^2)\, K_j(x_i^1, x_i^2) \geq b^j[y] - \epsilon_{yj}, \ \forall y, j$.   (3)
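The correction in Theorem 3 is straightforward to compute. The sketch below (hypothetical variable names, not the paper's code) turns the noisy empirical means $\hat{a}^j[y]$ into the estimates $b^j[y]$ used in (3), given the noise parameters $c_+$ and $c_-$.

```python
# Minimal sketch of the correction in Theorem 3: estimate the clean empirical
# means a^j[y] from the noisy labels via b^j[y], given c_plus = Pr(yhat=+1|y=+1)
# and c_minus = Pr(yhat=-1|y=-1). Variable names are illustrative.
import numpy as np

def noisy_means(K, y_obs):
    """ahat^j[y] = (1/n) sum_i delta(yhat_i, y) K_j(x_i^1, x_i^2), y in {+1, -1}.
    K: (n_pairs, m) matrix of pair features; y_obs: noisy labels in {+1, -1}."""
    n = K.shape[0]
    return {+1: K[y_obs == +1].sum(axis=0) / n,
            -1: K[y_obs == -1].sum(axis=0) / n}

def corrected_means(K, y_obs, c_plus, c_minus):
    """b^j[y] = ahat^j[y] / (c_y + c_ybar - 1)
               - (1 - c_ybar) / ((c_y + c_ybar - 1) * n) * sum_i K_j(x_i^1, x_i^2)."""
    n = K.shape[0]
    a_hat = noisy_means(K, y_obs)
    col_sum = K.sum(axis=0)
    c = {+1: c_plus, -1: c_minus}
    b = {}
    for y in (+1, -1):
        denom = c[y] + c[-y] - 1.0          # positive by Assumption (1.c)
        b[y] = a_hat[y] / denom - (1.0 - c[-y]) * col_sum / (denom * n)
    return b   # b[+1] plays the role of b_1 and b[-1] of b_0 in the dual (4)
```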
3.4. Optimization, Analysis and Application to Kernel Learning

We now present the dual formulation of the generalized maximum entropy learning from noisy side information in (3). To simplify the presentation, we define

$\mathbf{b}_1 = (b^1[y], \ldots, b^m[y])^\top\big|_{y=+1}, \qquad \mathbf{b}_0 = (b^1[y], \ldots, b^m[y])^\top\big|_{y=-1}$.

The dual problem of (3) is given by

$\max_{\lambda_1, \lambda_0 \in \mathbb{R}_+^m} \ \mathbf{b}_1^\top \lambda_1 + \mathbf{b}_0^\top \lambda_0 - \frac{\gamma}{2}\left(\|\lambda_1\|_*^2 + \|\lambda_0\|_*^2\right) - \frac{1}{n}\sum_i \ln\left(\exp(\lambda_1^\top \mathbf{k}_i) + \exp(\lambda_0^\top \mathbf{k}_i)\right)$,   (4)

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. The resulting conditional distribution $p(y|x^1, x^2)$ is given by

$p(y = 1|x^1, x^2) = \frac{\exp\left((\lambda_1 - \lambda_0)^\top \mathbf{k}(x^1, x^2)\right)}{1 + \exp\left((\lambda_1 - \lambda_0)^\top \mathbf{k}(x^1, x^2)\right)}$.   (5)

Next, we show how the solution $\lambda_1$ and $\lambda_0$ is affected when $a^j[y]$ is replaced with $b^j[y]$, i.e., the empirical mean computed from the noisy pairwise constraints.

Theorem 5. Assume the $\ell_2$ norm is used in the generalized maximum entropy model, i.e., $\|\cdot\| = \|\cdot\|_2$. Let $\lambda = (\lambda_1, \lambda_0)$ be the solution to (4) with $\mathbf{b} = (\mathbf{b}_1, \mathbf{b}_0)$, and $\lambda^o = (\lambda_1^o, \lambda_0^o)$ be the solution with $\mathbf{b}^o = (\mathbf{b}_1^o, \mathbf{b}_0^o)$. We have

$\|\lambda - \lambda^o\|_F \leq \frac{2}{\gamma}\|\mathbf{b} - \mathbf{b}^o\|_F$.

The proof of this theorem, as well as of the following theorems, can be found in the supplementary material. Combining Theorem 5 and Theorem 3, we have the following theorem showing the impact of replacing $a^j[y]$ with $b^j[y]$.

Theorem 6. Let $\|\cdot\| = \|\cdot\|_2$. Let $p(y|x^1, x^2)$ be the conditional model derived from the noisy side information using (3), and $\tilde{p}(y|x^1, x^2)$ be the conditional model derived from the perfect side information using (2). Under Assumption 1, with probability $1 - \delta$, for any $x^1, x^2$ and $y$ we have

$\left|p(y|x^1, x^2) - \tilde{p}(y|x^1, x^2)\right| \leq \frac{4m\kappa^2}{\gamma c}\sqrt{\frac{8}{n}\ln\frac{8m}{\delta}}$,

where $\kappa = \max_{1 \leq j \leq m} \kappa_j$ and $c = c_+ + c_- - 1$.

As indicated by the above theorem, the difference between the two conditional models is reduced at the rate of $1/[(c_+ + c_- - 1)\sqrt{n}]$. Finally, since our algorithm depends on the knowledge of $c_+$ and $c_-$, we further analyze the behavior of the proposed algorithm with inaccurate estimates of $c_+$ and $c_-$. We denote by $\tilde{c}_+$ and $\tilde{c}_-$ the estimates of $c_+$ and $c_-$, respectively, and by $\hat{p}(y|x^1, x^2)$ the conditional model derived from the noisy side information using $\tilde{c}_+$ and $\tilde{c}_-$. We measure the difference between $(c_+, c_-)$ and their estimates by $\Delta = \max(|\tilde{c}_+ - c_+|, |\tilde{c}_- - c_-|)$. The next theorem bounds the difference between $\hat{p}(y|x^1, x^2)$ and $\tilde{p}(y|x^1, x^2)$.

Theorem 7. Let $\|\cdot\| = \|\cdot\|_2$. Let $\hat{p}(y|x^1, x^2)$ be the conditional model derived from the noisy side information with $\tilde{c}_+$ and $\tilde{c}_-$, and $\tilde{p}(y|x^1, x^2)$ be the conditional model derived from the perfect side information using (2). Assume $\tilde{c}_+ + \tilde{c}_- - 1 \geq \tau$ with $\tau \geq 0$ and $\Delta \leq \tau/4$. Under Assumption 1, with probability $1 - \delta$, for any $x^1, x^2$ and $y$ we have

$\left|\hat{p}(y|x^1, x^2) - \tilde{p}(y|x^1, x^2)\right| \leq \frac{4m\kappa^2}{\gamma \tau}\sqrt{\frac{8}{n}\ln\frac{8m}{\delta}} + \frac{32\, m\, \kappa^2\, \Delta}{\gamma \tau^2}$.

We finally discuss the application of generalized maximum entropy learning from side information to kernel learning. In this case, $K_j(x^1, x^2)$ is a candidate kernel function, and the solution in (5) provides a way to linearly combine the kernels, i.e., $\sum_j (\lambda_1^j - \lambda_0^j) K_j$. However, the combined kernel may not be positive semi-definite, because some weights $\lambda_1^j - \lambda_0^j$ may be negative. To ensure the combined kernel is valid, we introduce one more constraint, $\lambda_1^j \geq \lambda_0^j,\ j = 1, \ldots, m$, into the optimization problem in (3). The optimization problems are solved by Nesterov's method (Nemirovski, 1994).
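As a rough illustration of how the dual (4) and the conditional model (5) can be computed, the sketch below uses plain projected gradient ascent in place of the Nesterov-style method mentioned above, assumes the $\ell_2$ norm, and omits the extra constraint $\lambda_1^j \geq \lambda_0^j$ used for kernel learning; all names are illustrative.

```python
# Minimal sketch of solving the dual (4) by projected gradient ascent under the
# nonnegativity constraints lambda1, lambda0 >= 0 (a simple stand-in for the
# Nesterov-style solver used in the paper). Names are illustrative.
import numpy as np

def solve_dual(K, b1, b0, gamma=0.01, lr=0.1, n_iter=2000):
    """K: (n, m) matrix whose rows are the pair feature vectors k_i;
    b1, b0: corrected mean vectors, e.g. b[+1] and b[-1] from the sketch above."""
    n, m = K.shape
    lam1, lam0 = np.zeros(m), np.zeros(m)
    for _ in range(n_iter):
        s1, s0 = K @ lam1, K @ lam0
        z = np.maximum(s1, s0)                         # numerical stability
        p1 = np.exp(s1 - z) / (np.exp(s1 - z) + np.exp(s0 - z))
        g1 = b1 - gamma * lam1 - K.T @ p1 / n          # gradient of (4) w.r.t. lambda1
        g0 = b0 - gamma * lam0 - K.T @ (1.0 - p1) / n  # gradient of (4) w.r.t. lambda0
        lam1 = np.maximum(lam1 + lr * g1, 0.0)         # project onto R^m_+
        lam0 = np.maximum(lam0 + lr * g0, 0.0)
    return lam1, lam0

def p_positive(k, lam1, lam0):
    """Conditional model (5): p(y=+1 | x^1, x^2) for a pair feature vector k."""
    return 1.0 / (1.0 + np.exp(-(lam1 - lam0) @ k))
```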
4. Experiments

We evaluate the proposed algorithm by clustering linked documents. We first present the experiments on clustering linked documents with noisy pairwise constraints derived from the link information, and then examine the behavior of the proposed algorithm in more detail. Before presenting the experimental results, we first introduce the data sets, baselines, and evaluation metric.

Data Sets. We select three linked document data sets, Cora, Citeseer, and Terrorist Attacks (TeAt), for our evaluation. They were processed by the research group of Lise Getoor (www.cs.umd.edu/projects/linqs/projects/lbc/). We choose these data sets because they have relatively low noise (20%-35%) in the pairwise constraints derived from links, which satisfies the condition $c_+ + c_- - 1 > 0$ in Assumption 1. Each data set contains (1) a set of documents described by binary vectors indicating the presence and absence of words from a dictionary, (2) links among documents (e.g., citations between research articles), and (3) the class assignment for each document. The statistics of the three data sets are summarized in Table 1. In the experiments, the attributes of each document are normalized by first dividing by the sum of the attributes and then taking the square root (Jebara et al., 2004).

Table 1. Statistics of the data sets.

  name       #examples   #words   #links   #classes
  Cora          2708       1433     5429       7
  Citeseer      3312       3703     4732       6
  TeAt          1293        106      571       6

Evaluation. In order to evaluate the proposed algorithm, we apply it to kernel learning as described at the end of Section 3. In particular, for each attribute $j$, we construct a linear kernel matrix $K_j(x^1, x^2) = x^1[j]\, x^2[j]$ for paired documents $(x^1, x^2)$, where $x[j]$ is the $j$th normalized attribute of document $x$. The proposed algorithm is applied to learn the combination of multiple kernel matrices from the noisy pairwise constraints derived from links; the $\ell_2$ norm is used in the proposed algorithm. Given the learned kernel matrix, a spectral clustering algorithm (Shi & Malik, 1997) is applied for document clustering. We evaluate the clustering result by comparing it to the class assignment information provided in each data set, using normalized mutual information (NMI) (Yang et al., 2009) as the evaluation metric. For all the experiments, we set the regularization parameter $\gamma$ in the proposed algorithm to $0.01/c^2$, where $c = c_+ + c_- - 1$.
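For concreteness, here is a minimal sketch of the evaluation pipeline just described, with scikit-learn's spectral clustering and NMI implementations standing in for the originals; variable names are illustrative.

```python
# Minimal sketch of the evaluation pipeline: attribute normalization,
# per-attribute linear kernels combined with nonnegative weights, spectral
# clustering, and NMI against the provided class assignments.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

def normalize_docs(X):
    """Divide each document by the sum of its attributes, then take the square root."""
    return np.sqrt(X / X.sum(axis=1, keepdims=True))

def combined_kernel(X, weights):
    """K = sum_j w_j K_j with K_j(x^1, x^2) = x^1[j] x^2[j], i.e. the weighted
    linear kernel X diag(w) X^T; weights could be lambda1 - lambda0 (nonnegative
    under the kernel-learning constraint) from the dual sketch above."""
    return (X * weights) @ X.T

def cluster_and_score(X, weights, labels, n_clusters):
    K = combined_kernel(normalize_docs(X), weights)
    pred = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                              random_state=0).fit_predict(K)
    return normalized_mutual_info_score(labels, pred)
```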
Baseline. We compare the proposed algorithm to the following metric/kernel learning algorithms: (a) GDM, the global distance metric learning algorithm (Xing et al., 2003); (b) DCA, the discriminative component analysis algorithm (Hoi et al., 2006a); (c) ITML, the information theoretic metric learning algorithm (Davis et al., 2007); and (d) SKL, the spectral kernel learning algorithm (Hoi et al., 2006b). For fair comparison, the distance metric $A$ learned by the metric learning algorithms is used to construct a kernel matrix $K = XAX^\top$, where $X$ is the data matrix, and the same spectral clustering algorithm is applied to $K$ for document clustering. We also evaluate the proposed algorithm against the metric pairwise constrained K-means clustering algorithm (Basu et al., 2004b), referred to as MPCK. In order to improve the robustness of MPCK to noisy constraints, we follow (Liu et al., 2007) and weight the noisy positive constraints and the noisy negative constraints by $c_+$ and $c_-$, respectively, to reduce their impact on the clustering results. As a reference point, we compute a linear kernel for both labeled and unlabeled examples without using the provided pairwise constraints; we refer to this baseline as base. Finally, we refer to the proposed generalized maximum entropy model for learning from noisy side information as GMEns, and to the generalized maximum entropy model that does not consider the noise in side information as GMEs. All the experiments are run five times, and the clustering accuracy averaged over the five runs is reported.

4.1. Experiments on Real Noisy Constraints

We conduct experiments on document clustering with the noisy pairwise constraints derived from the links between documents. In particular, we use all the linked document pairs as positive constraints; the same number of document pairs without links are sampled to construct the negative constraints. To obtain the noise levels of the pairwise constraints, we sample a total of 100 pairwise constraints and estimate $c_+$ and $c_-$ based on the correctness of the sampled constraints; these validated pairwise constraints are also used by the other baseline methods for computing distance metrics and kernel matrices. Figures 1(a), 2(a) and 3(a) show the clustering accuracy measured in NMI for the three data sets; the mean values of the estimated $c_+$ and $c_-$ are listed under each figure. We observe that, given the noisy pairwise constraints, all the algorithms except ITML perform significantly worse than the reference method base on at least one data set. In contrast, the proposed algorithm for learning from noisy pairwise constraints outperforms the reference method significantly on all three data sets. We thus conclude that the proposed algorithm is overall more robust to noise in the side information.

4.2. Controlled Experiments with Synthetic Noisy Constraints

In this section, we examine the robustness of the proposed algorithm to (a) different noise levels in synthetically generated pairwise constraints, and (b) the estimated values of $c_+$ and $c_-$.

Robustness to the Noise. We first sample 10,000 pairwise constraints from each data set, with 5,000 positive constraints and 5,000 negative constraints. Random noise is introduced into the synthetic constraints by randomly flipping the label of a pair with probability $p\%$, where $p\%$ specifies the noise level. We set $c_+$ and $c_-$ to $1 - p\%$, assuming the knowledge of the noise level is perfect. To examine the impact of noisy positive constraints and noisy negative constraints separately, for each data set and a given noise level $p\%$ we conduct two experiments, one with corrupted positive constraints but perfect negative constraints, and the other with corrupted negative constraints but perfect positive constraints.
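A minimal sketch of this constraint-generation protocol follows: linked pairs serve as positive constraints, sampled unlinked pairs as negative constraints, and each label is flipped independently with probability p; corrupting only the positive or only the negative constraints amounts to applying the flip to the corresponding subset. Names are illustrative.

```python
# Minimal sketch of generating synthetic noisy pairwise constraints from links.
import numpy as np

def sample_constraints(links, n_docs, n_pos, n_neg, rng):
    """links: set of linked document pairs (i, j) with i < j."""
    all_links = list(links)
    idx = rng.choice(len(all_links), size=n_pos, replace=False)
    pos_pairs = [all_links[k] for k in idx]
    neg_pairs = []
    while len(neg_pairs) < n_neg:               # rejection-sample unlinked pairs
        i, j = sorted(rng.choice(n_docs, size=2, replace=False))
        if (i, j) not in links:
            neg_pairs.append((i, j))
    labels = np.array([+1] * n_pos + [-1] * n_neg)
    return pos_pairs + neg_pairs, labels

def add_label_noise(labels, p, rng):
    """Flip each pairwise label independently with probability p (the noise level)."""
    flip = rng.random(len(labels)) < p
    return np.where(flip, -labels, labels)
```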
Figures 1(b), 2(b) and 3(b) compare the clustering results of GMEns and GMEs with the noise level in the synthetic pairwise constraints varied from 10% to 90% on the three data sets. We observe that GMEns, the generalized maximum entropy model for noisy side information, is significantly more robust to the noise in the pairwise constraints than GMEs, which does not take the noise in side information into account. We also observe that the noisy positive constraints have a significantly higher adverse impact on the clustering results than the noisy negative constraints.

Sensitivity to $c_+$, $c_-$. We use the same set of 10,000 randomly sampled pairwise constraints for this study and add the same noise level to both positive and negative constraints. To investigate the sensitivity to $c_+$ and $c_-$, instead of setting them to $1 - p\%$, we perturb these parameters by setting them to $(1 - p\%)(1 \pm e\%)$. Figures 1(c), 2(c) and 3(c) show the results of GMEns on the three data sets with four noise levels $p\% = 10\%$ to $40\%$ for $e\% = 1\%, 10\%$. We observe that GMEns is overall robust to modest perturbation levels, making the proposed algorithm applicable even when the assumed noise levels are inaccurate.

Finally, we compare the proposed algorithm to the baselines on the synthetic noisy constraints by varying the level of noise. Because some of the baseline algorithms are time consuming, one thousand pairs are sampled for positive constraints and negative constraints, respectively. We show the results for noise added to the positive constraints, owing to its stronger effect on performance. Figures 1(d), 2(d) and 3(d) show the clustering results of all algorithms at three noise levels, low (10%), medium (40%), and high (70%), on the three data sets. We observe that the proposed algorithm outperforms all the baseline algorithms in all cases.

[Figure 1. Experimental results on the Cora data set: (a) comparison with baselines on real noisy constraints; (b) robustness to the noise in synthetic constraints; (c) sensitivity to $c_+$, $c_-$; (d) comparison with baselines on synthetic noisy constraints. The same placement of subfigures applies to Figure 2 and Figure 3. Mean estimated noise parameters for Cora: $\bar{c}_+ = 0.8192$, $\bar{c}_- = 0.8150$.]

[Figure 2. Experimental results on the Citeseer data set. Mean estimated noise parameters: $\bar{c}_+ = 0.8077$, $\bar{c}_- = 0.7596$.]
[Figure 3. Experimental results on the Terrorist Attack data set. Mean estimated noise parameters: $\bar{c}_+ = 0.6590$, $\bar{c}_- = 0.6643$.]

5. Conclusions

We have proposed a generalized maximum entropy model for learning from noisy side information and discussed its application to kernel learning. Our theoretical analysis shows that the model trained from noisy side information converges to the model trained from perfect side information. Extensive experimental results verify the efficacy of the proposed model. In the future, we plan to apply the proposed approach to problems in other domains, including visual object recognition and gene expression pattern prediction.

Acknowledgement

The work was supported in part by the National Science Foundation (IIS-0643494), the Office of Naval Research (N00014-09-1-0663), and the Army Research Office (W911NF-09-1-0421). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF, ONR, and ARO. Part of Anil Jain's research was supported by the WCU (World Class University) program through the National Research Foundation of Korea funded by the Ministry of Education, Science and Technology (R31-2008-000-10008-0).

References

Altun, Yasemin and Smola, Alexander J. Unifying divergence minimization and statistical inference via convex duality. In COLT, pp. 139–153, 2006.

Basu, Sugato, Banerjee, Arindam, and Mooney, Raymond J. Active semi-supervision for pairwise constrained clustering. In SDM, pp. 333–344, 2004a.

Basu, Sugato, Bilenko, Mikhail, and Mooney, Raymond J. A probabilistic framework for semi-supervised clustering. In KDD, pp. 59–68, 2004b.

Bilenko, Mikhail and Mooney, Raymond J. Adaptive duplicate detection using learnable string similarity measures. In KDD, pp. 39–48, 2003.

Cohn, D., Caruana, R., and McCallum, A. Semi-supervised clustering with user feedback. Technical report, 2003.

Davidson, Ian and Ravi, S. S. Hierarchical clustering with constraints: Theory and practice. In PKDD, pp. 59–70, 2005.

Davidson, Ian and Sugato, Basu. A survey of clustering with instance level constraints. Technical report, 2007.

Davis, Jason V., Kulis, Brian, Jain, Prateek, Sra, Suvrit, and Dhillon, Inderjit S. Information-theoretic metric learning. In ICML, pp. 209–216, 2007.

Goldberger, Jacob, Roweis, Sam, Hinton, Geoff, and Salakhutdinov, Ruslan. Neighbourhood components analysis. In NIPS, pp. 513–520, 2004.

Hoi, Steven C. H. and Jin, Rong. Active kernel learning. In ICML, pp. 400–407, 2008.

Hoi, Steven C. H., Liu, Wei, Lyu, Michael R., and Ma, Wei-Ying. Learning distance metrics with contextual constraints for image retrieval. In CVPR, pp. 2072–2078, 2006a.

Hoi, Steven C. H., Lyu, Michael R., and Chang, Edward Y. Learning the unified kernel machines for classification. In KDD, pp. 187–196, 2006b.
Hoi, Steven C. H., Jin, Rong, and Lyu, Michael R. Learning nonparametric kernel matrices from pairwise constraints. In ICML, pp. 361–368, 2007.

Jebara, Tony, Kondor, Risi, and Howard, Andrew. Probability product kernels. JMLR, 5:819–844, 2004.

Lawrence, Neil D. and Schölkopf, Bernhard. Estimating a kernel Fisher discriminant in the presence of label noise. In ICML, pp. 306–313, 2001.

Liu, Yi, Jin, Rong, and Jain, Anil K. BoostCluster: boosting clustering by pairwise constraints. In KDD, pp. 450–459, 2007.

Nemirovski, A. Efficient methods in convex programming. 1994.

Pal, Chris, Mann, Gideon, and Minerich, Richard. Putting semantic information extraction on the map: Noisy label models for fact extraction. In AAAI Workshop on Information Integration on the Web, 2007.

Pelleg, Dan and Baras, Dorit. K-means with large and noisy constraint sets. In ECML, pp. 674–682, 2007.

Shental, Noam, Hertz, Tomer, Weinshall, Daphna, and Pavel, Misha. Adjustment learning and relevant component analysis. In ECCV, pp. 776–792, 2002.

Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. PAMI, 22:888–905, 1997.

Weinberger, K., Blitzer, J., and Saul, L. Distance metric learning for large margin nearest neighbor classification. In NIPS, pp. 207–244, 2006.

Xing, Eric P., Ng, Andrew Y., Jordan, Michael I., and Russell, Stuart. Distance metric learning, with application to clustering with side-information. In NIPS, pp. 505–512, 2003.

Yang, Liu and Jin, Rong. Distance metric learning: A comprehensive survey. Technical report, 2006.

Yang, Liu, Jin, Rong, Sukthankar, Rahul, and Liu, Yi. An efficient algorithm for local distance metric learning. In AAAI, pp. 543–548, 2006.

Yang, Tianbao, Jin, Rong, Chi, Yun, and Zhu, Shenghuo. Combining link and content for community detection: a discriminative approach. In KDD, pp. 927–936, 2009.