Robust Formulations for Handling Uncertainty in Kernel Matrices

Sahely Bhadra (1), Sourangshu Bhattacharya (2), Chiranjib Bhattacharyya (1), Aharon Ben-Tal (3)
(1) Department of CSA, Indian Institute of Science, INDIA; (2) Yahoo! Labs, Bangalore, INDIA; (3) Faculty of Industrial Engg. and Management, Technion, Haifa, ISRAEL
sahely@csa.iisc.ernet.in, sourangb@yahoo-inc.com, chiru@csa.iisc.ernet.in, bental@ie.technion.ac.il

Abstract

We study the problem of uncertainty in the entries of the kernel matrix arising in the SVM formulation. Using chance constraint programming and a novel large deviation inequality, we derive a formulation which is robust to such noise. The resulting formulation applies when the noise is Gaussian or has finite support. The formulation is in general non-convex, but in several cases of interest it reduces to a convex program. The problem of uncertainty in the kernel matrix is motivated by the real-world problem of classifying proteins when the structures are provided with some uncertainty. The formulation derived here naturally incorporates such uncertainty in a principled manner, leading to significant improvements over the state of the art. To the best of our knowledge there is no study of this important problem in the existing literature.

1. Introduction

Given a dataset $D = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, the SVM dual formulation (Vapnik, 1998) can be written as:

$\max_{\alpha \in S_n,\, t} \; e^\top \alpha - \frac{1}{2} t \quad \text{s.t.} \quad \alpha^\top Y K Y \alpha \le t$   (1)

where $S_n = \{\alpha \mid 0 \le \alpha_i \le C,\ \sum_{i=1}^{n} \alpha_i y_i = 0\}$ and $Y = \mathrm{diag}(y_1, \ldots, y_n)$. The kernel matrix $K$ is an $n \times n$ matrix, where $K_{ij}$ can be understood as the dot product between implicitly defined feature maps over examples $x_i, x_j \in \mathcal{X}$. In this paper, we study the problem of designing robust classifiers when the entries of $K$ are uncertain.

We treat the impact of uncertainty in individual examples as an additive uncertainty $Z$, and consider the following chance constrained setting:

$\max_{\alpha \in S_n,\, t} \; e^\top \alpha - \frac{1}{2} t$   (2)

$\text{s.t.} \quad \mathrm{Prob}\left(\alpha^\top Y (K + Z) Y \alpha \le t\right) \ge 1 - \epsilon$   (3)

where $\epsilon < 0.5$. In this setting the inequality (3) ensures that the event $\alpha^\top Y (K + Z) Y \alpha \le t$ holds with high probability ($1 - \epsilon$) for any instantiation of the random variate $Z$. It is assumed that $K$ is a specified kernel matrix and is symmetric positive semidefinite; the random matrix $K + Z$ is not necessarily psd or symmetric. Optimization problems involving chance constraints often turn out to be NP-hard, and their tractable approximation is an active area of study (Nemirovski & Shapiro, 2006; Ben-Tal & Nemirovski, 2007). Chance constraints were previously used for handling uncertainty in the context of linear classifiers (Ghaoui et al., 2003; Bhattacharyya et al., 2004; Shivaswamy et al., 2006). Assuming full knowledge of the covariance structure of the data uncertainty and using the Chebyshev inequality, Bhattacharyya et al. (2004) and Shivaswamy et al. (2006) formulated the problem as a Second Order Cone Program (SOCP). Instead of using a full covariance matrix, which is difficult to estimate, an alternative based only on support information was proposed in (Ghaoui et al., 2003). However, the application of these methods to (3) is not straightforward and requires further investigation.
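As an aside (not part of the paper's derivation), the probability in (3) is easy to estimate for a fixed candidate pair $(\alpha, t)$ by Monte Carlo sampling of the noise. The sketch below assumes independent Gaussian entry-wise noise with known standard deviations `sigma` (an $n \times n$ array); the function and variable names are ours.

```python
import numpy as np

def chance_constraint_estimate(alpha, y, K, t, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of Prob(alpha' Y (K + Z) Y alpha <= t) in (3),
    assuming independent entry-wise noise Z_ij ~ N(0, sigma_ij^2)."""
    rng = np.random.default_rng(seed)
    a = alpha * y                                # the vector Y alpha
    hits = 0
    for _ in range(n_samples):
        Z = rng.normal(size=K.shape) * sigma     # one draw of the noise matrix
        hits += (a @ (K + Z) @ a <= t)
    return hits / n_samples
```

A solution satisfying (3) should return a value of at least $1 - \epsilon$ here; the quantity $\epsilon_{effective}$ reported in Section 5.1.3 is exactly one minus such an estimate, computed on sampled kernels.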
The problem studied here is motivated by the problem of classifying protein structures, where kernel methods have been highly successful (Qiu et al., 2007; Bhattacharya et al., 2007). The kernels they designed are based on similarity scores, like the root mean square deviation (RMSD) obtained from structural alignment algorithms, e.g. DALI (Holm & Sander, 1996). Existing methods assume that protein structures are determined exactly, without any uncertainty. In reality, however, the coordinates of the atoms of a protein structure are determined with uncertainty, governed by the resolution of the X-ray diffraction experiment (structures and resolutions are available from the PDB, http://www.rcsb.org/pdb/). When the uncertainty becomes comparable to the RMSD, the similarity scores become suspect. For example, consider the two SCOP domains d1biaa1 and d1repc1, belonging to different families but to the same superfamily (winged helix DNA-binding domain). The structures have been determined at resolutions 2.3 Å and 2.6 Å, and DALI gives a structural alignment with RMSD 2.2 Å between the domains. So the uncertainty in the kernel value for these structures is higher than the score itself, which could be detrimental to discriminating between the two classes.

We study the problem of solving (2) assuming that the $Z_{ij}$ are centered and independent, and consider two cases: (a) $Z_{ij}$ is Gaussian, and (b) $Z_{ij}$ has finite support. For the Gaussian case we derive a novel formulation which can be interpreted as a robust version of the SVM. A major contribution of this paper is a novel large deviation inequality which applies to (3) in the finite support case. Using this inequality, we derive a formulation analogous to the Gaussian one for solving (2). The optimization problem in the general setting turns out to be a non-convex program; however, under certain assumptions it can be solved as a convex conic quadratic program. It is interesting to note that under an i.i.d. assumption the formulation reduces to an SVM with a modified kernel function. Extensive experimentation on synthetic datasets shows that the proposed formulation is more robust than the standard SVM. The formulation developed here incorporates the resolution information available for protein structures in a principled way, yielding substantially better classifiers. Experimental results show significant improvements in classification accuracy over existing state-of-the-art methods and their obvious extensions. Also, as expected, the new classifiers are more robust than existing ones.

The paper is organized as follows. The main contributions are described in Section 2. Section 3 presents algorithms for solving the resulting problems, Section 4 discusses metrics for measuring the performance of the resulting classifiers, and Section 5 reports experimental results.

2. Robust formulations for handling uncertainty in Kernel matrices

In this section we study (3) when the $Z_{ij}$ are independent with zero mean. We begin by assuming that the entries are Gaussian distributed, and in subsection 2.2 we consider the more general case when the entries have finite support. To this end we derive a large deviation inequality on the inner product of a matrix with $Z$, which is later used in (3).

Notation: We denote the Hadamard product of $A, B \in \mathbb{R}^{n \times n}$ by $A \circ B$, the $n \times n$ matrix with entries $(A \circ B)_{ij} = a_{ij} b_{ij}$. The Frobenius norm of $C \in \mathbb{R}^{n \times n}$ is $\|C\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij}^2}$. The trace of a square matrix $A$ is denoted by $\mathrm{Tr}(A)$.
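For concreteness, a minimal numpy rendering of this notation (the helper names are ours, not the paper's; later sketches use the same conventions):

```python
import numpy as np

def hadamard(A, B):
    """(A o B)_ij = a_ij * b_ij."""
    return A * B

def frobenius_norm(C):
    """||C||_F = sqrt(sum_ij c_ij^2)."""
    return np.sqrt(np.sum(C ** 2))

def trace_prod(Z, A):
    """Tr(ZA); for symmetric A this is the sum of entry-wise products of Z and A."""
    return np.trace(Z @ A)
```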
2.1. Kernel matrix is Gaussian distributed

We begin with the following lemma.

Lemma 1. Let $Z$ be an $n \times n$ random matrix with entries $Z_{ij}$ independently distributed as $Z_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2)$. For every $W, A \in \mathbb{R}^{n \times n}$, the constraint

$\mathrm{Prob}\left(\mathrm{Tr}\{(Z + W) A\} \ge t\right) \le \epsilon$   (4)

is satisfied if the following holds:

$\mathrm{Tr}(W A) \le t + \Phi^{-1}(\epsilon)\, \|\Sigma \circ A\|_F$   (5)

where $\Sigma_{ij} = \sigma_{ij}$.

Proof. Observe that $\mathrm{Tr}(Z A) \sim \mathcal{N}(0, \|\Sigma \circ A\|_F^2)$. This is true because $\mathrm{Tr}(Z A)$ can be written as a weighted sum of independently distributed normal random variables. Using a standardized normal random variate $U \sim \mathcal{N}(0, 1)$, one can write $\mathrm{Tr}(Z A) = U\, \|\Sigma \circ A\|_F$. Using the CDF of $U$, defined as $\mathrm{Prob}(U \le u) = \Phi(u) = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}} e^{-s^2/2}\, ds$, we get

$\mathrm{Prob}\left(\mathrm{Tr}\{(Z + W) A\} \ge t\right) = \mathrm{Prob}\left(U\, \|\Sigma \circ A\|_F \ge t - \mathrm{Tr}(W A)\right) = \Phi(-u), \qquad u = \frac{t - \mathrm{Tr}(W A)}{\|\Sigma \circ A\|_F}$   (6)

The second equality follows because $U$ is standard normal. This derivation shows that (4) can be restated as $\Phi(-u) \le \epsilon$. Noting that $\Phi$ is an increasing function of its argument, one obtains $-u \le \Phi^{-1}(\epsilon)$, where $\Phi^{-1}$ is the inverse function of $\Phi$. Substituting the value of $u$ completes the proof.

A direct application of the above lemma leads to the following theorem, which is the first result of the paper.

Theorem 1. Let $Z$ be an $n \times n$ matrix whose entries are independently distributed as $Z_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2)$. Let $\tilde{K} = K + Z$ be a noise-corrupted matrix, where $K$ is the $n \times n$ known kernel matrix. For such a $\tilde{K}$, the constraint (3) in the formulation (2) is satisfied if the following holds:

$\sum_{i,j} y_i y_j \alpha_i \alpha_j K_{ij} - \Phi^{-1}(\epsilon)\, \|\Sigma \circ A\|_F \le t, \qquad A_{ij} = \alpha_i y_i \alpha_j y_j$   (7)

Proof. Substituting $W = K$ and $A_{ij} = \alpha_i y_i \alpha_j y_j$ in Lemma 1 proves the theorem.

This theorem points to a deterministic equivalent of the problem stated in (2). Note that for the cases of interest $\epsilon < 0.5$, which implies that $\Phi^{-1}(\epsilon) < 0$. In a later section we will discuss algorithmic approaches for solving such programs.

2.2. Uncertainty with finite support

In this section we study the case where the uncertainty in the kernel entries has finite support. We state and prove a novel large deviation inequality, and exploit it to obtain a deterministic constraint similar to the one obtained in the Gaussian case (see Theorem 1). We begin by proving the large deviation inequality.

Theorem 2. Let $Z$ be an $n \times n$ random matrix with entries $Z_{ij}$ independently distributed with finite support, $\mathrm{Prob}(a_{ij} \le Z_{ij} \le b_{ij}) = 1$, and $E(Z_{ij}) = 0$. For every $V \in \mathbb{R}^{n \times n}$ and $u > 0$,

$\mathrm{Prob}\left(\mathrm{Tr}(Z V) \ge u\right) \le \exp\left(-\frac{u^2}{2\, \|\tilde{\Sigma} \circ V\|_F^2}\right)$   (8)

where $l_{ij} = \frac{b_{ij} - a_{ij}}{2}$, $c_{ij} = \frac{b_{ij} + a_{ij}}{2}$, $\hat{\mu}_{ij} = -\frac{c_{ij}}{l_{ij}}$, $\tilde{\Sigma}_{ij} = \tilde{\sigma}_{ij} = l_{ij} \hat{\sigma}_{ij}$, and

$\hat{\sigma}_{ij}^2 = \min\left\{ \sigma^2 \ge 0 \;\middle|\; \frac{\sigma^2 z^2}{2} + \hat{\mu}_{ij} z - \log\left(\cosh(z) + \hat{\mu}_{ij} \sinh(z)\right) \ge 0 \ \ \forall z \in \mathbb{R} \right\}$   (9)

Proof. As a consequence of the Markov inequality and the independence of the entries of $Z$, the following holds for all $s \ge 0$:

$\mathrm{Prob}\left(\mathrm{Tr}(Z V) \ge u\right) \le e^{-s u} \prod_{i,j} E\left(e^{s V_{ij} Z_{ij}}\right)$   (10)

Exploiting the convexity of the function $e^x$, one can upper-bound the moment generating function of $Z_{ij}$: for any $r \in \mathbb{R}$,

$E\left(e^{r Z_{ij}}\right) \le \frac{b_{ij} e^{r a_{ij}} - a_{ij} e^{r b_{ij}}}{b_{ij} - a_{ij}} = e^{r c_{ij} + f_{ij}(r l_{ij})}$   (11)

where $f_{ij}(\kappa) = \log\left(\cosh(\kappa) + \hat{\mu}_{ij} \sinh(\kappa)\right)$. Using a Taylor expansion around $0$ we obtain the bound $f_{ij}(\kappa) \le \frac{\kappa^2}{2} + \hat{\mu}_{ij} \kappa$, which can be further tightened by considering $f_{ij}(\kappa) \le \frac{\hat{\sigma}_{ij}^2 \kappa^2}{2} + \hat{\mu}_{ij} \kappa$, where $\hat{\sigma}_{ij}^2$ is given in (9). Substituting $\kappa = r l_{ij}$, the bound (11) can be written as $\log E(e^{r Z_{ij}}) \le r (c_{ij} + l_{ij} \hat{\mu}_{ij}) + \frac{1}{2} r^2 \hat{\sigma}_{ij}^2 l_{ij}^2 = \frac{1}{2} r^2 \hat{\sigma}_{ij}^2 l_{ij}^2$, since $c_{ij} + l_{ij} \hat{\mu}_{ij} = 0$. This bound holds for all $r$; for the problem at hand, putting $r = s V_{ij}$ we get $E(e^{s V_{ij} Z_{ij}}) \le e^{\frac{1}{2} s^2 V_{ij}^2 \hat{\sigma}_{ij}^2 l_{ij}^2}$. In light of this result, the relation (10) can now be upper-bounded as follows:

$\log\left[\mathrm{Prob}\left(\mathrm{Tr}(Z V) \ge u\right)\right] \le \min_{s \ge 0}\left(-s u + \frac{1}{2} s^2 \|\tilde{\Sigma} \circ V\|_F^2\right) = -\frac{u^2}{2 \|\tilde{\Sigma} \circ V\|_F^2}$

The proof is completed by noting that the minimum is attained at $s = \frac{u}{\|\tilde{\Sigma} \circ V\|_F^2}$, obtained by solving a univariate quadratic optimization problem. The values $\hat{\sigma}_{ij}$ in (9) can be calculated numerically; the procedure is not presented here because of space constraints.
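As an illustration of how (9) can be evaluated numerically (the paper omits its procedure, so the following is an assumption of ours): the defining condition is equivalent to taking $\hat{\sigma}^2$ as the supremum of $2\left[\log(\cosh z + \hat{\mu} \sinh z) - \hat{\mu} z\right]/z^2$ over $z \ne 0$, which a simple grid search approximates well. All names below are ours.

```python
import numpy as np

def sigma_hat_sq(a, b, z_max=30.0, n_grid=20001):
    """Numerical evaluation of (9) for a zero-mean variable supported on [a, b]:
    the smallest sigma^2 >= 0 with
    sigma^2 z^2 / 2 + mu_hat z - log(cosh z + mu_hat sinh z) >= 0 for all z."""
    l = (b - a) / 2.0                      # half-width  l_ij
    c = (b + a) / 2.0                      # centre      c_ij
    mu = -c / l                            # mu_hat_ij = -c_ij / l_ij
    z = np.linspace(-z_max, z_max, n_grid)
    z = z[np.abs(z) > 1e-6]                # the ratio has a removable singularity at 0
    f = np.log(np.cosh(z) + mu * np.sinh(z))
    ratio = 2.0 * (f - mu * z) / z ** 2    # sigma^2 must dominate this for every z
    return float(max(ratio.max(), 1.0 - mu ** 2, 0.0))   # the z -> 0 limit is 1 - mu^2
```

For a symmetric support, e.g. `sigma_hat_sq(-0.5, 0.5)`, this returns 1.0 (the plain Taylor coefficient), while asymmetric supports give values below 1, which is exactly the tightening used in the proof above.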
The inequality is of independent interest, but we do not study it further here; instead we apply it to the problem at hand. Consider the following theorem.

Theorem 3. Let $Z$ be an $n \times n$ matrix whose entries are independently distributed with $\mathrm{Prob}(a_{ij} \le Z_{ij} \le b_{ij}) = 1$ and $E(Z_{ij}) = 0$. Let $\tilde{K} = K + Z$ be a noise-corrupted matrix, where $K$ is the $n \times n$ specified kernel matrix. For such a $\tilde{K}$, the constraint (3) in formulation (2) is satisfied if the following holds:

$\sum_{i,j} y_i y_j \alpha_i \alpha_j K_{ij} + \sqrt{2 \log(1/\epsilon)}\, \sqrt{\sum_{i,j} \tilde{\sigma}_{ij}^2 \alpha_i^2 \alpha_j^2} \le t$   (12)

where $\tilde{\sigma}_{ij}$ is defined as in Theorem 2.

Proof. We begin by noting that the constraint (3) can be stated as

$\mathrm{Prob}\left(\mathrm{Tr}\{(Z + K) V\} \ge t\right) \le \epsilon$   (13)

where $V_{ij} = \alpha_i y_i \alpha_j y_j$. A sufficient condition for satisfying the above inequality can be obtained by exploiting the large deviation inequality stated in Theorem 2. In particular, a direct application of the bound yields the following constraint:

$\sqrt{2 \log(1/\epsilon)}\, \|\tilde{\Sigma} \circ V\|_F \le t - \mathrm{Tr}(K V)$   (14)

Substituting $V$ in the above equation proves the theorem.

2.3. A deterministic optimization problem

In light of Theorem 1 and Theorem 3, one can motivate the following deterministic counterpart of (2):

$\min_{\alpha \in S_n,\, t} \; \frac{1}{2} t - e^\top \alpha \quad \text{s.t.} \quad \sum_{i,j} y_i y_j \alpha_i \alpha_j K_{ij} + \Omega \sqrt{\sum_{i,j} \Sigma_{ij}^2 \alpha_i^2 \alpha_j^2} \le t$   (15)

where $\Omega = \sqrt{2 \log(1/\epsilon)}$ and $\Sigma_{ij} = \tilde{\sigma}_{ij}$ when the uncertainty has finite support. This formulation is robust to uncertainty in the kernel entries and will be referred to as Robust SVM (RSVM). In the case of Gaussian uncertainty, $\Omega = -\Phi^{-1}(\epsilon)$ and $\Sigma_{ij} = \sigma_{ij}$, and the formulation will be referred to as Robust SVM for Gaussian distribution (RSVM(g)).

3. Algorithms for solving the robust formulation

In this section we consider algorithms for solving (15). In general these problems are instances of non-convex programs; however, we observe that in several cases of interest they can be reduced to convex conic quadratic programs.

Case 1 - $\Sigma$ is rank one: In this case the formulation is equivalent to solving an SVM. It is interesting to note that this case arises when the uncertainty is independent and identically distributed (i.i.d.).

Theorem 4. Let $\Sigma$ be a rank-one matrix, i.e. $\Sigma_{ij} = \sigma_i \sigma_j$, where $\sigma$ is a vector with non-negative components. The formulation (15) is equivalent to an SVM with kernel $K + \Omega\, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$.

Proof. Noting that $\sqrt{\sum_{i,j} \Sigma_{ij}^2 \alpha_i^2 \alpha_j^2} = \sum_i \sigma_i^2 \alpha_i^2$ when $\Sigma_{ij} = \sigma_i \sigma_j$, and eliminating $t$, the result follows.

Note that by design $\Sigma_{ij} \ge 0$, which implies that the maximal eigenvector of $\Sigma$ can be chosen with components $\sigma_i > 0$ (Minc, 1970). We will denote the corresponding formulation by RSVM_QP (RSVM(g)_QP for Gaussian uncertainty). As a corollary of the previous theorem one can prove:

Theorem 5. Let the entries of the random matrix $Z$ be i.i.d. and $\tilde{K} = K + Z$. In this case formulation (15) is equivalent to solving an SVM with the kernel matrix $K' = K + \Omega \sigma I$.

Proof. Note that for the i.i.d. case $\Sigma_{ij} = \sigma$ for all $i, j$, i.e. $\sigma_i = \sqrt{\sigma}$ in the previous theorem, and the proof follows.
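The modified kernels of Theorems 4 and 5 are straightforward to construct. The sketch below is illustrative (the names are ours); here `Omega` stands for $-\Phi^{-1}(\epsilon)$ in the Gaussian case or $\sqrt{2 \log(1/\epsilon)}$ in the finite-support case, and the resulting matrix can be handed to any SVM solver that accepts a precomputed kernel (for example `sklearn.svm.SVC(kernel="precomputed")`).

```python
import numpy as np

def rank_one_robust_kernel(K, sigma_vec, Omega):
    """Case 1 (Theorem 4): with Sigma_ij = sigma_i * sigma_j, the robust
    constraint collapses to an ordinary SVM constraint with the kernel
    K + Omega * diag(sigma_1^2, ..., sigma_n^2)."""
    return K + Omega * np.diag(np.asarray(sigma_vec) ** 2)

def iid_robust_kernel(K, sigma, Omega):
    """Theorem 5 (i.i.d. noise, Sigma_ij = sigma): K' = K + Omega * sigma * I."""
    return K + Omega * sigma * np.eye(K.shape[0])
```

In other words, the rank-one (and in particular the i.i.d.) case only makes the diagonal of the kernel heavier, which is why this variant can be solved with any off-the-shelf QP-based SVM solver.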
Case 2 - $\Sigma$ is psd: If the matrix $\Sigma$ is positive semidefinite, then the formulation can be posed as an SOCP. To this end consider the following theorem.

Theorem 6. If both $K$ and $\Sigma$ are symmetric psd matrices, then the following formulation is equivalent to formulation (15):

$\min_{\alpha \in S_n,\, t,\, \beta} \; \frac{1}{2} t - e^\top \alpha \quad \text{s.t.} \quad \|K^{\frac{1}{2}} Y \alpha\|_2^2 + \Omega\, \|(\Sigma \circ \Sigma)^{\frac{1}{2}} \beta\|_2 \le t, \qquad \alpha_i^2 \le \beta_i, \; i = 1, \ldots, n$   (16)

Proof. As $\Sigma$ and $K$ are psd matrices, the required matrix square roots exist ($\Sigma \circ \Sigma$ is psd by the Schur product theorem). At optimality $\beta_i = \alpha_i^2$, and the theorem follows.

Note that this formulation is second order cone representable and hence can be solved as a Second Order Cone Program (SOCP). It will be denoted by RSVM_SOCP (RSVM(g)_SOCP for Gaussian uncertainty).

Case 3 - general $\Sigma$: The case of a general $\Sigma$ ($\Sigma_{ij} \ge 0$) is an instance of a non-convex program. We do not study this setting in detail here, but propose a general descent algorithm to solve it. In particular, we use a modified Newton method (Luenberger & Ye, 2008) with a square penalty function, leading to the following unconstrained approximation of (15):

$\min_{\alpha} \; L(\alpha) := f(\alpha) + \frac{P}{2}\left[\left(\sum_i \alpha_i y_i\right)^2 + \sum_{i:\, \alpha_i > C} (\alpha_i - C)^2 + \sum_{i:\, \alpha_i < 0} \alpha_i^2\right]$   (17)

where $f(\alpha) = \frac{1}{2}\left(\sum_{i,j} y_i y_j \alpha_i \alpha_j K_{ij} + \Omega \sqrt{\sum_{i,j} \Sigma_{ij}^2 \alpha_i^2 \alpha_j^2}\right) - \sum_i \alpha_i$ and $P$ is a user-defined cost for the penalty function. We minimize $L$ with a quasi-Newton method with DFP updates:

$\alpha^{\tau+1} = \alpha^{\tau} - \eta^{\tau} H^{\tau} \nabla L$

where $\eta^{\tau}$ is the step size, $H^{\tau}$ is the approximate inverse Hessian of $L$ obtained by the DFP procedure, and $\nabla L$ is the gradient of $L$ with respect to $\alpha$. This variant will be referred to as RSVM_QN (RSVM(g)_QN for Gaussian uncertainty). The algorithm suffers from the problem of local minima; to alleviate this we use multiple starting points.

The decision function for the classifier can be expressed as $f(x) = \mathrm{sign}\left(\sum_{i \in SV} y_i \alpha_i K_{i\cdot} + b\right)$, where $SV$ is the index set of support vectors and $K_{i\cdot}$ denotes the kernel value between the $i$-th support vector and the test point. In order to get robust performance and to reduce the effect of uncertainty in the kernel, the bias $b$ is computed as $b = \frac{1}{\# SV}\left[\sum_{j \in SV} y_j - \sum_{i, j \in SV} y_i \alpha_i K_{ij}\right]$.

4. Error metrics

For each test data point $P_t$ and all training data points $P_i$, we have the mean $\bar{K}_{ti}$ of the kernel entry $K_{ti}$, and either the variance $\sigma_{ti}^2$ (in the Gaussian case) or the range $[a_{ti}, b_{ti}]$ (finite support case) of the uncertainty. In both cases, we test by generating multiple samples $K_{ti}^s$, $s = 1, \ldots, n_s$, for each kernel entry $K_{ti}$, where $n_s$ is the number of samples. For each of these test values, the decision function obtained from equation (15) can be written as $f_t^s = \mathrm{sign}\left(\sum_{j \in SV} \alpha_j y_j K_{tj}^s + b\right)$ (with $\Omega = 0$ and $K_{tj}^s = \bar{K}_{tj}$ this reduces to the nominal SVM prediction). One way to assign a label $y_t^{pr}$ to $P_t$ is by majority vote over $f_t^s$, $s = 1, \ldots, n_s$, i.e. $y_t^{pr} = \mathrm{sign}\left(\sum_{s=1}^{n_s} f_t^s\right)$. Let $D = \{P_t, \, t = 1, \ldots, n\}$ be a test dataset labeled according to the majority vote scheme. The overall classification error can be calculated as:

$\mathrm{MajErr}\ (ME): \quad \frac{\sum_{t=1}^{n} 1(y_t^{pr} \ne y_t)}{n} \times 100$   (18)

where $y_t$ is the true label of $P_t$. While the above scheme is intuitive for labeling an uncertain data point, a robust classifier is expected to classify all the $n_s$ samples generated for each uncertain data point $P_t$ correctly. To capture this notion of robustness, we propose another error measure (RobustErr), which counts the fraction of data points in $D$ for which at least one sample is misclassified:

$\mathrm{RobustErr}\ (RE): \quad \frac{\sum_{t=1}^{n} 1(\exists s : f_t^s \ne y_t)}{n} \times 100$   (19)

We can also treat each of the samples generated from the uncertain data points as an individual data point, and define a standard classification error (NomErr) as:

$\mathrm{NomErr}\ (NE): \quad \frac{\sum_{t, s} 1(f_t^s \ne y_t)}{n\, n_s} \times 100$   (20)

In the following section, we report experimental results for the techniques developed here and for state-of-the-art methods with respect to the above metrics.
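A compact sketch of the three metrics (18)-(20), assuming the per-sample predictions $f_t^s$ have already been collected (the names are ours):

```python
import numpy as np

def sample_predictions(K_rows, alpha, y_sv, b):
    """f_t^s = sign(sum_j alpha_j y_j K^s_tj + b) for every test point t and
    sample s; K_rows has shape (n_test, n_s, n_sv)."""
    return np.sign(K_rows @ (alpha * y_sv) + b)

def error_metrics(F, y_true):
    """MajErr (18), RobustErr (19) and NomErr (20), in percent.
    F: (n_test, n_s) array of per-sample predictions in {-1, +1}."""
    y = y_true.reshape(-1, 1)
    maj = np.sign(F.sum(axis=1, keepdims=True))         # majority vote; ties count as errors
    maj_err = 100.0 * np.mean(maj != y)                 # (18)
    robust_err = 100.0 * np.mean((F != y).any(axis=1))  # (19): any misclassified sample
    nom_err = 100.0 * np.mean(F != y)                   # (20): each sample counted separately
    return maj_err, robust_err, nom_err
```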
5. Experiments

This section presents experimental results comparing the proposed RSVM (15), RSVM(g), and Nominal-SVM (an SVM with the specified kernel) in terms of accuracy and robustness on the task of binary classification. All three algorithms for RSVM (RSVM_QP, RSVM_SOCP and RSVM_QN) were implemented in Matlab with the help of a standard QP solver and SeDuMi (http://sedumi.ie.lehigh.edu/). We report results both for synthetic data and for the resolution-aware protein structure classification problem. The results demonstrate that the proposed formulations outperform state-of-the-art techniques with respect to both traditional error measures and the new metrics defined in Section 4.

5.1. Experiment with Synthetic data

We performed a thorough experimental analysis of the proposed formulations, measuring generalization performance, robustness, and the quality of the bounds. For this, we created a synthetic dataset of 2 classes with 100 data points per class, using a Gaussian mixture. A linear kernel was computed for these data points. Let $K_{ij}^s = K_{ij} + Z_{ij}^s$, where $K_{ij}$ is computed from the original data points, and 100 uncertain samples $Z_{ij}^s$ for each kernel entry $K_{ij}$ were generated using (a) Gaussian(0, 1), (b) Uniform[-0.5, 0.5], and (c) centered Beta(0.5, 0.5) distributions, multiplied by a random $l_{ij}$ (with $l_{ij} = l_{ji}$). The support parameters are estimated as $a_{ij} = \min_s K_{ij}^s$ and $b_{ij} = \max_s K_{ij}^s$. For the Gaussian case the parameter is estimated as $\Sigma_{ij}^2 = \sigma_{ij}^2 = \mathrm{var}(K_{ij}^s)$; for the non-Gaussian case $\hat{\sigma}$ is calculated from (9). For RSVM_QP, we approximated $\Sigma$ by the rank-one surrogate $\sigma = \sqrt{v_{max}}\, \gamma_{max}$, where $v_{max}$ and $\gamma_{max}$ are the principal eigenvalue and eigenvector of $\Sigma$, respectively. For RSVM_SOCP we approximated $\Sigma$ by $\hat{\Sigma} = \sum_{r:\, v_r \ge 0} v_r e_r e_r^\top$, where $v_r$ and $e_r$ are the eigenvalues and eigenvectors of $\Sigma$ (a sketch of these estimation and approximation steps follows).
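The estimation and approximation steps just described can be sketched as follows (illustrative only; the names are ours; $\hat{\sigma}$ for the non-Gaussian case would come from (9)):

```python
import numpy as np

def estimate_uncertainty(K_samples):
    """Support and spread parameters of Section 5.1 from sampled kernels,
    K_samples of shape (n_s, n, n): a_ij, b_ij and the entry-wise std."""
    return K_samples.min(axis=0), K_samples.max(axis=0), K_samples.std(axis=0)

def rank_one_approx(Sigma):
    """Rank-one surrogate used for RSVM_QP: sigma = sqrt(v_max) * gamma_max,
    with (v_max, gamma_max) the principal eigenpair of Sigma."""
    vals, vecs = np.linalg.eigh((Sigma + Sigma.T) / 2.0)
    g = np.abs(vecs[:, -1])                 # Perron vector of a nonnegative matrix
    return np.sqrt(max(vals[-1], 0.0)) * g

def psd_approx(Sigma):
    """PSD surrogate used for RSVM_SOCP: drop the negative eigenvalues."""
    vals, vecs = np.linalg.eigh((Sigma + Sigma.T) / 2.0)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T
```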
5.1.1. Comparison of Generalization error

All six formulations proposed here are compared with the Nominal SVM using the three metrics described in Section 4. For all the metrics, we performed 5-fold cross-validation on 20 different datasets. The hyperparameters ($C$ and/or $\epsilon$) for each classifier were chosen using a grid search over $C \in \{0.1, 1, 5, 10, 50, 100\}$ and $\epsilon \in \{0.05, 0.10, \ldots, 0.50\}$. For each metric, the best cross-validation accuracy averaged over the 20 datasets is reported in Table 1. We observe that either RSVM_SOCP or RSVM(g)_SOCP performs best in terms of all the error measures, clearly demonstrating the power of the proposed methods. For generating the synthetic data we chose $l_{ij} = 0.25 K_{ij}$, in order to have a dataset where the noise is smaller than the actual kernel values; hence the matrix $\Sigma$ turns out to be psd most of the time, which leads to the better performance of RSVM_SOCP and RSVM(g)_SOCP. RSVM_QN and RSVM(g)_QN follow closely, because they can get stuck at local optima. RSVM_QP and RSVM(g)_QP show intermediate performance compared to the SVM. In terms of RobustErr, the SVM performs very badly, showing its inability to achieve robustness. Also, RSVM(g) is found to perform better than RSVM when the uncertainty is Gaussian. These observations are explored in detail below.

Table 1. Cross-validation accuracy (%) obtained with RSVM(g)_QP, RSVM(g)_SOCP, RSVM(g)_QN, RSVM_QP, RSVM_SOCP, RSVM_QN and Nominal SVM, using MajErr (ME, 18), RobustErr (RE, 19) and NomErr (NE, 20).

                 Uniform distribution      Gaussian distribution     Beta(0.5, 0.5)
                 ME      RE      NE        ME      RE      NE        ME      RE      NE
RSVM(g)_QP       94.60   52.60   80.30     95.45   55.15   74.27     95.20   47.45   79.22
RSVM(g)_SOCP     96.15   93.30   95.94     96.35   93.70   95.18     96.15   91.45   95.91
RSVM(g)_QN       95.55   92.85   95.50     95.95   92.70   94.92     95.60   91.35   95.48
RSVM_QP          95.30   74.20   83.20     94.50   64.30   80.45     95.35   75.50   85.35
RSVM_SOCP        96.60   95.20   95.60     95.60   84.10   94.80     96.20   94.30   96.20
RSVM_QN          95.60   94.60   95.60     95.70   84.10   94.65     95.95   93.85   95.95
Nominal SVM      87.70   18.65   58.79     71.95   24.75   52.69     86.75   5.20    57.25

5.1.2. Comparison of robustness

In the proposed RSVM (or RSVM(g)), the effect of uncertainty in the training data is controlled by $\Omega$ in (15): the higher the value of $\Omega$, the higher the assumed effect of uncertainty, and for $\Omega = 0$ RSVM ignores the uncertainty in the kernel values. A controlled variation of the uncertainty in the test data points is achieved by generating them as $K_{ij}^s = \bar{K}_{ij} + \Omega Z_{ij}^s$, where $|Z_{ij}^s| \le l_{ij}$ for uncertainty with finite support, and $Z_{ij}^s \sim \mathcal{N}(0, \sigma_{ij})$ for Gaussian uncertainty.

Figure 1 shows that, with the increase of uncertainty in the test examples, the RobustErr (19) for the SVM increases more rapidly than for RSVM_SOCP and RSVM_QN, for all types of uncertainty. This shows that non-robust classifiers, e.g. the SVM, are unable to handle uncertainty, compared to the proposed robust classifiers. RSVM_QP performs comparably with the SVM, since the assumption of $\Sigma$ being rank one does not hold for the current dataset; RSVM_QP is then theoretically equivalent to an SVM using a kernel with a heavier diagonal (see Theorem 5). Figure 2 shows that with increasing uncertainty which is i.i.d., RSVM_QP performs much better than the Nominal SVM. In Figure 1 and Figure 2, at $\Omega = 0$ the RobustErr for both RSVM and RSVM(g) is exactly the same as that of the SVM. This confirms the fact that at $\Omega = 0$, RSVM(g) and RSVM are equivalent to the SVM. Both RSVM_QN and RSVM(g)_QN sometimes give higher error than the SOCP variants due to the solver getting stuck at a local optimum.

Figure 1. Robustness for RSVM, RSVM(g) and Nominal-SVM using (from left) Gaussian, Uniform and Beta(0.5, 0.5) distributions for generating sample kernels. (Plots show average error over 20 classifiers, fixing C at 10.)

Figure 2. Robustness for RSVM_QP, RSVM(g)_QP and Nominal-SVM using (from left) i.i.d. Gaussian, Uniform and Beta(0.5, 0.5) distributions for generating sample kernels. (Plots show average error over 20 classifiers, fixing C at 10.)
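A sketch of this robustness experiment under the reading above (test noise scaled by $\Omega$, finite-support case; the names are ours):

```python
import numpy as np

def robustness_curve(K_bar, l, alpha, y_sv, b, y_true, omegas, n_s=100, seed=0):
    """Perturb the mean test-kernel rows as K^s_tj = Kbar_tj + omega * Z^s_tj,
    with |Z^s_tj| <= l_tj, and record RobustErr (19) as a function of omega."""
    rng = np.random.default_rng(seed)
    coef = alpha * y_sv
    errors = []
    for omega in omegas:
        any_wrong = np.zeros(len(y_true), dtype=bool)
        for _ in range(n_s):
            Z = rng.uniform(-l, l)                       # one entry-wise noise draw
            f = np.sign((K_bar + omega * Z) @ coef + b)
            any_wrong |= (f != y_true)
        errors.append(100.0 * any_wrong.mean())
    return errors
```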
5.1.3. Effectiveness of bound

In this section, using the same synthetic data as above, we experimentally verify the effectiveness of the bounds discussed in Lemma 1 and Theorem 2, which were used to derive the RSVM formulation (15) from the chance constraint (2). For a given $\epsilon$, we calculate $\epsilon_{effective} = \frac{\#\{K^s,\, s = 1, \ldots, n_s \mid \alpha^\top Y K^s Y \alpha > t\}}{n_s}$ for the optimal values of $\alpha$ and $t$ solving (15). Figure 3 plots $\epsilon_{effective}$ versus $\epsilon$. Ideally $\epsilon_{effective}$ should be equal to $\epsilon$ (shown as "Ideal-case" in the graph). The leftmost plot shows that the Gaussian bound (Lemma 1) is much tighter than the interval-based bound when the uncertainty is Gaussian. In the other plots, we show $\epsilon_{effective}$ obtained from RSVM for various values of $\epsilon$, thus testing the bound in Theorem 2. The bound is rather loose for general distributions. However, we observe that over the interesting range $\epsilon \in [0, 0.5]$ the bound is tighter than in the rest of the region, and it is very tight for small values of $\epsilon$.

Figure 3. Verification of the bound for RSVM and RSVM(g) using (from left) Gaussian, Uniform, Beta(0.5, 0.5) and Beta(5, 5) distributions for generating sample kernels. (Plots show averages over 20 classifiers, fixing C at 10.)

5.2. Resolution-aware protein structure classification

Here, we present experimental results which compare the accuracy and robustness of the proposed RSVM with state-of-the-art methods for protein structure classification.

Dataset: We use a dataset based on the SCOP (Murzin et al., 1995) 40% sequence non-redundant set, taken from (Bhattacharya et al., 2007). The dataset has 15 classes (SCOP superfamilies), with 10 structures each. The experimental methodology is also similar to that used in (Bhattacharya et al., 2007), e.g. using 15 "one versus all" binary classifiers, where the negative data contains 10 proteins (to keep the dataset balanced) randomly chosen from all other classes. We perform leave-one-out (LOO) cross-validation. Let $D = \{(P_i, r_i, y_i)\}$, where $P_i$ is the set of coordinates of the $i$-th protein structure obtained from the Astral database (http://astral.berkeley.edu), $r_i$ is the corresponding resolution information obtained from the PDB, and $y_i$ is the class label. Using the resolution, we generate a set of perturbed structures $Q_i = \{P_i^1, \ldots, P_i^{n_s}\}$ for each $P_i$ as follows: for each atom $p_{ia}$ of $P_i$, the perturbed structure $P_i^s$ has atom coordinates $p_{ia}^s = p_{ia} + u$ with $u \sim U(-\frac{r_i}{2}, \frac{r_i}{2})$. For any kernel $K$, the mean kernel is $\bar{K}_{ij} = E[K(p, p')]$ over $p \in Q_i$, $p' \in Q_j$; also $a_{ij} = \min_{p \in Q_i,\, p' \in Q_j} K(p, p')$ and $b_{ij} = \max_{p \in Q_i,\, p' \in Q_j} K(p, p')$. For the purpose of our comparison, we have used the weighted pairwise distance substructure kernel (Bhattacharya et al., 2007).

Existing techniques: Each protein structure can be viewed as a set of perturbations of the original structure. Hence, we compared RSVM with the normalized set multi-instance kernel (MI) (Gartner et al., 2002). For a given kernel $K$, the normalized set kernel is defined as

$K_{multi}(P_i, P_j) = \frac{\sum_{p \in Q_i,\, p' \in Q_j} K(p, p')}{\sqrt{\sum_{p, p' \in Q_i} K(p, p')}\, \sqrt{\sum_{p, p' \in Q_j} K(p, p')}}$

We have also used SVM_R (an SVM treating each perturbed structure as an individual data point), Nominal-SVM (an SVM with the kernel based on the protein structure reported in the PDB files), and SVM_M (an SVM using $\bar{K}$ as the kernel) for benchmarking our results.
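The dataset generation and the set-level kernel quantities above can be sketched as follows; `base_kernel` stands in for the weighted pairwise distance substructure kernel, which is not implemented here, and all names are ours.

```python
import numpy as np

def perturb_structure(P, resolution, n_s=10, seed=0):
    """Section 5.2 dataset generation: every atom coordinate of structure P
    (an n_atoms x 3 array) receives independent uniform noise in [-r/2, r/2]."""
    rng = np.random.default_rng(seed)
    return [P + rng.uniform(-resolution / 2.0, resolution / 2.0, size=P.shape)
            for _ in range(n_s)]

def set_kernel_stats(Qi, Qj, base_kernel):
    """Mean, min and max of the base kernel over two sets of perturbed structures,
    giving Kbar_ij, a_ij and b_ij of Section 5.2."""
    vals = np.array([[base_kernel(p, q) for q in Qj] for p in Qi])
    return vals.mean(), vals.min(), vals.max()

def normalized_set_kernel(Qi, Qj, base_kernel):
    """Normalized set (multi-instance) kernel of Gartner et al. (2002)."""
    cross = sum(base_kernel(p, q) for p in Qi for q in Qj)
    ii = sum(base_kernel(p, q) for p in Qi for q in Qi)
    jj = sum(base_kernel(p, q) for p in Qj for q in Qj)
    return cross / np.sqrt(ii * jj)
```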
5.2.1. Results on protein structures

Table 2 reports results for RSVM and the state-of-the-art methods, using both the standard and the robust error measures defined in Section 4 and the LOO procedure. Hyper-parameters ($C$ and/or $\epsilon$) for each classifier were tuned separately using a grid search. We report Total Accuracy (TA) and F1 score. All reported results are averaged over 10 different datasets, where the negative sets were selected randomly. Note that for the SVM with MI kernel one label is given to every set, following the method of (Gartner et al., 2002); we report this as the majority error, which will also be the nominal error.

Table 2. Comparison of RSVM_QP, RSVM_SOCP, RSVM_QN, Nominal-SVM, SVM_M, SVM_R and MI using the accuracy measures defined in Section 4.

               MajErr            RobustErr         NomErr
               TA      F1        TA      F1        TA      F1
RSVM_QP        72.67   73.49     27.11   26.81     66.50   65.13
RSVM_SOCP      73.56   74.35     50.33   50.28     66.65   65.16
RSVM_QN        82.78   82.95     66.44   66.36     76.00   75.80
Nominal-SVM    62.89   63.50     34.56   34.07     61.02   60.86
SVM_M          65.00   64.48     22.00   21.70     71.11   71.87
SVM_R          70.44   67.58     61.56   61.26     71.67   72.58
MI             ×       ×         20.11   19.63     72.11   72.17

It is clear that RSVM_QN performs significantly better than the rest of the methods, both in terms of nominal accuracy (measured by NomErr) and robustness (measured by RobustErr). This indicates that the use of resolution information improves the overall classification accuracy. Moreover, the very low accuracies corresponding to RobustErr for the SVM and the other competing methods suggest that SVM classification is not robust to perturbations of the atom coordinates within the resolution. The fact that the other RSVM formulations perform worse than RSVM_QN indicates that the assumptions used to derive them, e.g. rank-one or psd $\Sigma$, do not hold for this dataset. In terms of RobustErr, RSVM_QP performs worse than RSVM_SOCP, confirming that the psd assumption is much better for robustness than the rank-one assumption.
The simple heuristic of using all the perturbed samples (SVM_R) performs very well in terms of robustness, which is intuitive. However, the computational complexity of SVM_R is $O(n_s^2)$ higher than that of the other methods, which can be prohibitive in many cases.

6. Conclusion

We have presented an optimization problem (15) which is robust to uncertainty in the kernel matrix. The formulation applies to Gaussian uncertainty as well as to arbitrary distributions with finite support. For the finite support case the formulation is derived from a novel large deviation inequality, stated in Theorem 2. The large deviation inequality is of independent interest and applies more generally to problems involving traces of random matrices. An interesting result is that for i.i.d. uncertainty the formulation reduces to an SVM (Theorem 5). We show that for positive semidefinite $\Sigma$ the formulation is second-order-cone representable and can be solved as an SOCP. On the real-world problem of protein structure classification it yields significantly improved results.

Acknowledgments

Chiranjib Bhattacharyya and Sahely Bhadra are supported by grants from Yahoo! India. Sourangshu Bhattacharya is thankful to Yahoo! India for supporting this research. Aharon Ben-Tal's research is supported by the MINERVA Foundation.

References

Ben-Tal, A. and Nemirovski, A. Selected topics in robust convex optimization. Mathematical Programming, 112(1), 2007.

Bhattacharya, S., Bhattacharyya, C., and Chandra, N. Structural alignment based kernels for protein structure classification. ICML, 2007.

Bhattacharyya, C., Grate, L. R., Jordan, M. I., Ghaoui, L. El, and Mian, S. I. Robust sparse hyperplane classifiers: application to uncertain molecular profiling data. Journal of Computational Biology, 11(6):1073-1089, 2004.

Gartner, T., Flach, P. A., Kowalczyk, A., and Smola, A. J. Multi-instance kernels. ICML, 2002.

Ghaoui, L. E., Lanckriet, G. R. G., and Natsoulis, G. Robust classification with interval data. Technical Report UCB/CSD-03-1279, Computer Science Division, University of California, Berkeley, 2003.

Holm, L. and Sander, C. Mapping the protein universe. Science, 273(5275):595-602, 1996.

Luenberger, D. G. and Ye, Y. Linear and Nonlinear Programming. Springer, 2008.

Minc, H. On the maximal eigenvector of a positive matrix. SIAM Journal on Numerical Analysis, 7(3):424-427, 1970.

Murzin, A. G., Brenner, S. E., Hubbard, T., and Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247(4):536-540, April 1995.

Nemirovski, A. and Shapiro, A. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969-996, 2006.

Qiu, J., Hue, M., Ben-Hur, A., Vert, J.-P., and Noble, W. S. A structural alignment kernel for protein structures. Bioinformatics, 23(9):1090-1098, 2007.

Shivaswamy, P. K., Bhattacharyya, C., and Smola, A. J. Second order cone programming approaches for handling missing and uncertain data. JMLR, 7:1283-1314, 2006.

Vapnik, V. Statistical Learning Theory. John Wiley and Sons, New York, 1998.