Improved Guarantees for Learning via Similarity Functions

Maria-Florina Balcan, Avrim Blum
Computer Science Department, Carnegie Mellon University
{ninamf,avrim}@cs.cmu.edu

Nathan Srebro
Toyota Technological Institute at Chicago
nati@uchicago.edu

Abstract

We continue the investigation of natural conditions for a similarity function to allow learning, without requiring the similarity function to be a valid kernel, or referring to an implicit high-dimensional space. We provide a new notion of a "good similarity function" that builds upon the previous definition of Balcan and Blum (2006) but improves on it in two important ways. First, as with the previous definition, any large-margin kernel is also a good similarity function in our sense, but the translation now results in a much milder increase in the labeled sample complexity. Second, we prove that for distribution-specific PAC learning, our new notion is strictly more powerful than the traditional notion of a large-margin kernel. In particular, we show that for any hypothesis class C there exists a similarity function under our definition allowing learning with $O(\log|C|)$ labeled examples. However, in a lower bound which may be of independent interest, we show that for any class C of pairwise uncorrelated functions, there is no kernel with margin $8/\sqrt{|C|}$ for all $f \in C$, even if one allows average hinge-loss as large as 0.5. Thus, the margin-based sample complexity for learning such classes with SVMs is $\Omega(|C|)$. This extends work of Ben-David et al. (2003) and Forster and Simon (2006), who give hardness results with comparable margin bounds but at much lower error rates. Our new notion of similarity relies upon $L_1$-regularized learning, and our separation result is related to a separation between what is learnable with $L_1$ vs. $L_2$ regularization.

1 Introduction

Kernel functions have become an extremely popular tool in machine learning, with an attractive theory as well (Scholkopf & Smola, 2002; Herbrich, 2002; Shawe-Taylor & Cristianini, 2004; Scholkopf et al., 2004). This theory views a kernel as implicitly mapping data points into a possibly very high dimensional space, and describes a kernel function as being good for a given learning problem if data is separable by a large margin in that implicit space. However, while quite elegant, this theory does not necessarily correspond to the intuition of a good kernel as a good measure of similarity, and the underlying margin in the implicit space usually is not apparent in "natural" representations of the data. Therefore, it may be difficult for a domain expert to use the theory to help design an appropriate kernel for the learning task at hand. Moreover, the requirement of positive semi-definiteness may rule out the most natural pairwise similarity functions for the given problem domain.

In recent work, Balcan and Blum (2006) developed an alternative, more general theory of learning with pairwise similarity functions that need not be valid positive semi-definite kernels. Specifically, this work developed sufficient conditions for a similarity function to allow one to learn well that do not require reference to implicit spaces and do not require the function to be positive semi-definite (or even symmetric). While this theory provably generalizes the standard theory, in that any good kernel function in the usual sense can be shown to also be a good similarity function under this definition, the translation does incur a penalty.
Subsequently, Srebro (2007) tightly quantified the gap between the learning guarantees of kernel-based learning and those that can be obtained by using the kernel as a similarity function in this way. In particular, Srebro (2007) shows that a kernel with margin $\gamma$ is guaranteed to be a similarity function with margin $\Theta(\epsilon\gamma^2)$ at hinge-loss $\epsilon$, and furthermore there exist examples for which this is tight. To sum up, while the theory of Balcan and Blum (2006) applies to a wider class of pairwise functions than the standard notion of kernel learning, it might be quantitatively inferior in those cases where both notions apply.

In this work we develop a new notion of a good similarity function that broadens the definition of Balcan and Blum (2006) while still guaranteeing learnability. As with the previous definition, our notion talks in terms of natural similarity-based properties and does not require positive semi-definiteness or reference to implicit spaces. However, our new notion improves on the previous definition in two important respects.

First, our new notion provides a better kernel-to-similarity translation. Any large-margin kernel function is a good similarity function under our definition, and while we still incur some loss in the parameters, this loss is much smaller than under the prior definition, especially in terms of the final labeled sample-complexity bounds. In particular, when using a valid kernel function as a similarity function, a substantial portion of the previous sample-complexity bound can be transferred over to merely a need for unlabeled examples.

Second, we show that our new definition allows for good similarity functions to exist for concept classes for which there is no good kernel. In particular, for any concept class C and sufficiently unconcentrated distribution D, we show there exists a similarity function under our definition with parameters yielding a labeled sample complexity bound of $O(\frac{1}{\epsilon}\log|C|)$ to achieve error $\epsilon$, matching the ideal sample complexity for a generic hypothesis class. In fact, we also extend this result to classes of finite VC-dimension rather than finite cardinality. In contrast, we show there exist classes C such that under the uniform distribution over the instance space, there is no kernel with margin $8/\sqrt{|C|}$ for all $f \in C$, even if one allows 0.5 average hinge-loss. Thus, the margin-based guarantee on sample complexity for learning such classes with kernels is $\Omega(|C|)$. This extends work of Ben-David et al. (2003) and Forster and Simon (2006), who give hardness results with comparable margin bounds but at much lower error rates. Warmuth and Vishwanathan (2005) provide lower bounds for kernels with similar error rates, but their results hold only for regression (not hinge loss). Note that given access to unlabeled data, any similarity function under the definition of Balcan and Blum (2006) can be converted to a kernel function with approximately the same parameters. Thus, our lower bound for kernel functions applies to that definition as well. These results establish a gap in the representational power of similarity functions under our new definition relative to the representational power of either kernels or similarity functions under the old definition.
Both our new definition and that of Balcan and Blum (2006) are based on the idea of a similarity function being good for a learning problem if there exists a non-negligible subset R of "reasonable points" such that most examples x are on average more similar to the reasonable points of their own label than to the reasonable points of the other label. (Formally, the "reasonableness" of an example may be given by a weight between 0 and 1 and viewed as probabilistic or fractional.) However, the previous definition combined the two quantities of interest--the probability mass of reasonable points and the gap in average similarity to reasonable points of each label--into a single margin parameter. The new notion keeps these quantities distinct, which turns out to make a substantial difference both in terms of breadth of applicability and in terms of the labeled sample complexity bounds that result.

Note that we distinguish between labeled and unlabeled sample complexities: while the total number of examples needed depends polynomially on the two quantities of interest, the number of labeled examples depends only logarithmically on the probability mass of the reasonable set and therefore may be much smaller under the new definition. This is especially beneficial in situations in which unlabeled data is plentiful (or the distribution is known, so that unlabeled data is free) but labeled data is scarce.

Another way to view the distinction between the two notions of similarity is that we now require good predictions using a weight function with bounded expectation, rather than bounded supremum: compare the old Definition 4 and the variant of the new definition given as Definition 17. (We do in fact still have a bound on the supremum, but this bound only affects the labeled sample complexity logarithmically.) In Theorem 19 we make the connection between the two versions of the new definition explicit. Conditioning on a subset of reasonable points, or equivalently bounding the expectation of the weight function, allows us to base our learnability results on $L_1$-regularized linear learning. The actual learning rule we get, given in Equation (4.6), is very similar, and in some cases identical, to learning rules suggested by various authors and commonly used in practice as an alternative to Support Vector Machines (Bennett & Campbell, 2000; Roth, 2001; Guigue et al., 2005; Singer, 2000; Tipping, 2001). Here we give a firm theoretical basis to this learning rule, with explicit learning guarantees, and relate it to simple and intuitive properties of the similarity function or kernel used (see the discussion at the end of Section 4).

Structure of this paper: After presenting background on the previous definitions and their relation to kernels in Section 2, we present our new notion of a good similarity function in Section 3. In Section 4 we show that our new, broader notions still imply learnability. In Section 5 we give our separation results, showing that our new notion is strictly more general than the notion of a large-margin kernel. In Section 6 we show that any large-margin kernel is also a good similarity function in our sense, and finally in Section 7 we discuss learning with multiple similarity functions.

2 Background and Notation

We consider a learning problem specified as follows. We are given access to labeled examples $(x, \ell)$ drawn from some distribution P over $X \times \{-1, 1\}$, where X is an abstract instance space.
We will sometimes use D to denote the distribution over x, and for simplicity we will assume a deterministic target function, so that $(x, \ell) = (x, \ell(x))$. The goal of a learning algorithm is to produce a classification function $g : X \to \{-1, 1\}$ whose error rate $\Pr_{(x,\ell)\sim P}[g(x) \neq \ell]$ is low. We will consider learning algorithms whose only access to points x is through a pairwise similarity function $K(x, x')$ mapping pairs of points to values in the range $[-1, 1]$. Specifically,

Definition 1 A similarity function over X is any pairwise function $K : X \times X \to [-1, 1]$. We say that K is a symmetric similarity function if $K(x, x') = K(x', x)$ for all $x, x'$.

Our goal is to describe "goodness" properties that are sufficient for a similarity function to allow one to learn well, and that ideally are intuitive and subsume the usual notion of a good kernel function. A similarity function K is a valid kernel function if it is positive semi-definite, i.e., there exists a function $\phi$ from the instance space X into some (implicit) Hilbert "$\phi$-space" such that $K(x, x') = \langle\phi(x), \phi(x')\rangle$. See, e.g., Smola and Scholkopf (2002) for a discussion of conditions for a mapping being a kernel function. Throughout this work, and without loss of generality, we will only consider kernels such that $K(x, x) \leq 1$ for all $x \in X$ (any kernel K can be converted into this form by, for instance, defining $\tilde{K}(x, x') = K(x, x')/\sqrt{K(x, x)K(x', x')}$). We say that K is $(\epsilon, \gamma)$-kernel good for a given learning problem P if there exists a vector $\beta$ in the $\phi$-space that has error $\epsilon$ at margin $\gamma$; for simplicity we consider only separators through the origin. Specifically:

Definition 2 K is an $(\epsilon, \gamma)$-good kernel function if there exists a vector $\beta$, $\|\beta\| \leq 1$, such that
$$\Pr_{(x,\ell)\sim P}\big[\ell\langle\phi(x), \beta\rangle \geq \gamma\big] \geq 1 - \epsilon.$$
We say that K is $\gamma$-kernel good if it is $(\epsilon, \gamma)$-kernel good for $\epsilon = 0$; i.e., it has zero error at margin $\gamma$.

Given a kernel that is $(\epsilon, \gamma)$-kernel-good for some learning problem P, a predictor with error rate at most $\epsilon + \epsilon_{acc}$ can be learned (with high probability) from a sample of $\tilde{O}\big((\epsilon + \epsilon_{acc})/(\gamma^2\epsilon_{acc}^2)\big)$ random examples from P by minimizing the number of margin violations on the sample (McAllester, 2003). (The $\tilde{O}(\cdot)$ notation hides logarithmic factors in the arguments and in the failure probability.) However, minimizing the number of margin violations on the sample is a difficult optimization problem: it is NP-hard, and even NP-hard to approximate (Arora et al., 1997; Feldman et al., 2006; Guruswami & Raghavendra, 2006). Instead, it is common to minimize the so-called hinge loss relative to a margin.

Definition 3 We say that K is $(\epsilon, \gamma)$-kernel good in hinge loss if there exists a vector $\beta$, $\|\beta\| \leq 1$, such that
$$E_{(x,\ell)\sim P}\big[[1 - \ell\langle\beta, \phi(x)\rangle/\gamma]_+\big] \leq \epsilon,$$
where $[1 - z]_+ = \max(1 - z, 0)$ is the hinge loss.

Given a kernel that is $(\epsilon, \gamma)$-kernel-good in hinge loss, a predictor with error rate at most $\epsilon + \epsilon_{acc}$ can be efficiently learned (with high probability) from a sample of size $\tilde{O}\big((\epsilon + \epsilon_{acc})/(\gamma^2\epsilon_{acc}^2)\big)$ by minimizing the average hinge loss relative to margin $\gamma$ on the sample (Bartlett & Mendelson, 2003).

We now present the definition of a good similarity function from (Balcan & Blum, 2006; Srebro, 2007).

Definition 4 (Previous, Margin Violations) A pairwise function K is an $(\epsilon, \gamma)$-good similarity function for a learning problem P if there exists a weighting function $w : X \to [0, 1]$ such that at least a $1 - \epsilon$ probability mass of examples $(x, \ell)$ satisfy:
$$E_{(x',\ell')\sim P}[\ell\ell' w(x')K(x, x')] \geq \gamma. \tag{2.1}$$

That is, if the underlying distribution is 50/50 positive and negative, this is saying that the average weighted similarity of an example x to random examples x' of its own label should be $2\gamma$ larger than the average weighted similarity of x to random examples x' of the other label.

Balcan and Blum (2006) show how a predictor with error rate at most $\epsilon + \epsilon_{acc}$ can be learned from $\tilde{O}\big((\epsilon + \epsilon_{acc})/(\gamma^2\epsilon_{acc}^2)\big)$ examples using an $(\epsilon, \gamma)$-good similarity function K: First draw from P an (unlabeled) sample $S = \{x'_1, \ldots, x'_d\}$ of $d = (4/\gamma)^2\ln(4/(\delta\epsilon_{acc}))$ random "landmarks", and construct the mapping $\phi^S : X \to R^d$ defined as $\phi^S_i(x) = \frac{1}{\sqrt{d}}K(x, x'_i)$, $i \in \{1, \ldots, d\}$. With probability at least $1 - \delta$ over the random sample S, the induced distribution $\phi^S(P)$ in $R^d$ has a separator of error at most $\epsilon + \epsilon_{acc}/2$ at margin at least $\gamma/2$. Now, draw a fresh sample, map it into the transformed space using $\phi^S$, and then learn a good linear separator in the transformed space. The total sample complexity is dominated by the $\tilde{O}\big((\epsilon + \epsilon_{acc})d/\epsilon_{acc}^2\big) = \tilde{O}\big((\epsilon + \epsilon_{acc})/(\gamma^2\epsilon_{acc}^2)\big)$ sample complexity of learning in the transformed space, yielding the same overall sample complexity as with an $(\epsilon, \gamma)$-good kernel function.

The above bounds refer to learning a linear separator by minimizing the error over the training sample. As mentioned earlier, this minimization problem is NP-hard even to approximate. Again, we can instead consider the hinge loss rather than the number of margin violations. Balcan and Blum (2006) and Srebro (2007) therefore provide the following hinge-loss version of their definition:

Definition 5 (Previous, Hinge Loss) A similarity function K is an $(\epsilon, \gamma)$-good similarity function in hinge loss for a learning problem P if there exists a weighting function $w(x') \in [0, 1]$ for all $x' \in X$ such that
$$E_{(x,\ell)\sim P}\big[[1 - \ell g(x)/\gamma]_+\big] \leq \epsilon, \tag{2.2}$$
where $g(x) = E_{(x',\ell')\sim P}[\ell' w(x')K(x, x')]$ is the similarity-based prediction made using $w(\cdot)$, and recall that $[1 - z]_+ = \max(0, 1 - z)$ is the hinge loss.

The same algorithm as above, but now using an SVM to minimize hinge loss in the transformed space, allows one to efficiently use a similarity function satisfying this definition to find a predictor of error $\epsilon + \epsilon_{acc}$ using $\tilde{O}\big((\epsilon + \epsilon_{acc})/(\gamma^2\epsilon_{acc}^2)\big)$ examples.

3 New Notions of Good Similarity Functions

In this section we provide new notions of good similarity functions generalizing Definitions 4 and 5 that we prove have a number of important advantages. In the definitions of Balcan and Blum (2006), a weight $w(x') \in [0, 1]$ was used in defining the quantity of interest $E_{(x',\ell')\sim P}[\ell\ell' w(x')K(x, x')]$. Here, it will instead be more convenient to think of w as the expected value of an indicator random variable $R(x) \in \{0, 1\}$, where we will view the (probabilistic) set $\{x : R(x) = 1\}$ as a set of "reasonable points". Formally, we will then be sampling from the joint distribution $P(x, \ell(x), R(x)) = P(x, \ell(x))P(R(x) \mid x)$, but we will sometimes omit the explicit dependence on R when it is clear from context. Our new definition is now as follows.

Definition 6 (Main, Margin Violations) A similarity function K is an $(\epsilon, \gamma, \tau)$-good similarity function for a learning problem P if there exists a (random) indicator function $R(x)$ defining a (probabilistic) set of "reasonable points" such that the following conditions hold:
1. A $1 - \epsilon$ probability mass of examples $(x, \ell)$ satisfy
$$E_{(x',\ell')\sim P}[\ell\ell' K(x, x') \mid R(x')] \geq \gamma. \tag{3.1}$$
2. $\Pr_{x'}[R(x')] \geq \tau$.

If the reasonable set R is 50/50 positive and negative (i.e., $\Pr_{x'}[\ell(x') = 1 \mid R(x')] = 1/2$), we can interpret the first condition as stating that most examples x are on average $2\gamma$ more similar to random reasonable examples x' of their own label than to random reasonable examples x' of the other label.
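As a rough illustration of Definition 6, the sketch below (ours, not from the paper; the data, the similarity function, and the candidate reasonable set are illustrative placeholders) estimates, from a finite sample, the fraction of examples violating condition (3.1) for a target margin $\gamma$, together with the empirical value of $\tau$.

```python
# A minimal sketch (not from the paper) of checking Definition 6 empirically on a
# finite sample: estimate tau = Pr[R(x')] and, for each example x, the conditional
# margin E[l(x) l(x') K(x, x') | R(x')], then report the fraction of examples whose
# margin falls below a target gamma. All names here are illustrative.
import numpy as np

def empirical_goodness(K, X, y, R, gamma):
    """K: similarity function K(x, x') -> [-1, 1]; X: points; y: +/-1 labels;
    R: 0/1 indicators of 'reasonable' points; gamma: target margin."""
    X, y, R = list(X), np.asarray(y, dtype=float), np.asarray(R, dtype=bool)
    tau_hat = R.mean()                              # estimate of Pr[R(x')]
    reasonable = [(xp, yp) for xp, yp, r in zip(X, y, R) if r]
    margins = np.array([
        yi * np.mean([yp * K(xi, xp) for xp, yp in reasonable])
        for xi, yi in zip(X, y)
    ])
    eps_hat = np.mean(margins < gamma)              # fraction violating condition (3.1)
    return eps_hat, tau_hat

# Toy usage: points on a line, label = sign(x), similarity = sign agreement.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = np.sign(X) + (X == 0)                           # avoid zero labels
K = lambda a, b: np.sign(a) * np.sign(b)            # a legal similarity in [-1, 1]
R = np.abs(X) > 0.5                                 # far-from-boundary points as "reasonable"
print(empirical_goodness(K, X, y, R, gamma=0.9))    # small eps_hat, tau_hat ~ 0.5
```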
The second condition is that at least a $\tau$ fraction of the points should be reasonable. We also consider a hinge-loss version of the definition:

Definition 7 (Main, Hinge Loss) A similarity function K is an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss for a learning problem P if there exists a (probabilistic) set R of "reasonable points" such that the following conditions hold:
1. We have
$$E_{(x,\ell)\sim P}\big[[1 - \ell g(x)/\gamma]_+\big] \leq \epsilon, \tag{3.2}$$
where $g(x) = E_{(x',\ell',R(x'))}[\ell' K(x, x') \mid R(x')]$.
2. $\Pr_{x'}[R(x')] \geq \tau$.

It is not hard to see that an $(\epsilon, \gamma)$-good similarity function under Definitions 4 and 5 is also an $(\epsilon, \gamma, \gamma)$-good similarity function under Definitions 6 and 7, respectively. In the reverse direction, an $(\epsilon, \gamma, \tau)$-good similarity function under Definitions 6 and 7 is an $(\epsilon, \gamma\tau)$-good similarity function under Definitions 4 and 5 (respectively). For formal proofs, see Theorems 23 and 24 in Appendix A.

As we will see, under both the old and new definitions, the number of labeled samples required for learning grows as $1/\gamma^2$. The key distinction between them is that we introduce a new parameter, $\tau$, that primarily affects the number of unlabeled examples required. This decoupling of the number of labeled and unlabeled examples enables us to handle a wider variety of situations with an improved labeled sample complexity. In particular, in translating from a kernel to a similarity function, we will find that much of the loss can now be placed into the $\tau$ parameter.

In the following we prove three types of results about this new notion of similarity. The first is that similarity functions satisfying these conditions are sufficient for learning (in polynomial time in the case of Definition 7), with a sample size of $O\big(\frac{1}{\gamma^2}\log\frac{1}{\gamma\tau}\big)$ labeled examples and $O\big(\frac{1}{\tau\gamma^2}\big)$ unlabeled examples. This is particularly useful in settings where unlabeled data is plentiful and cheap--such settings are increasingly common in learning applications (Mitchell, 2006; Chapelle et al., 2006)--or for distribution-specific learning where unlabeled data may be viewed as free.

The second main theorem we prove is that any class C, over a sufficiently unconcentrated distribution on examples, has a $(0, 1, 1/(2|C|))$-good similarity function (under either Definition 6 or 7), whereas there exist classes C that have no $(0.5, 8/\sqrt{|C|})$-good kernel functions in hinge loss. This provides a clear separation between the similarity and kernel notions in terms of the parameters controlling labeled sample complexity.

The final main theorem we prove is that any large-margin kernel function also satisfies our similarity definitions, with substantially less loss in the parameters controlling labeled sample complexity compared to the definition of (Balcan & Blum, 2006). For example, if K is a $(0, \gamma)$-good kernel, then it is an $(\epsilon, \epsilon\gamma^2)$-good similarity function under Definitions 4 and 5, and this is tight (Srebro, 2007), resulting in a sample complexity of $\tilde{O}(1/(\gamma^4\epsilon^3))$ to achieve error $\epsilon$. However, we can show K is an $(\epsilon, \gamma^2, \epsilon)$-good similarity function under the new definition (formally, the translation produces an $(\epsilon, \gamma^2/c, c\epsilon)$-good similarity function for some $c \leq 1$; however, smaller values of c only improve the bounds), resulting in a sample complexity of only $\tilde{O}(1/(\gamma^4\epsilon))$.

4 Good Similarity Functions Allow Learning

The basic approach proposed for learning using a similarity function is similar to that of Balcan and Blum (2006). First, a feature space is constructed, consisting of similarities to randomly chosen landmarks. Then, a linear predictor is sought in this feature space.
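The sketch below is a schematic illustration of this two-stage approach (ours, not the authors' code): map every point to its vector of similarities to unlabeled landmarks, then fit a sparse linear predictor on a labeled sample in that space. The $L_1$-penalized logistic regression used here is only a convenient stand-in for the $L_1$-margin learners analyzed below; the data and similarity function are illustrative placeholders.

```python
# Two-stage learning with a similarity function (schematic):
# phi_S(x) = (K(x, x'_1), ..., K(x, x'_d)) built from unlabeled landmarks,
# followed by a sparse linear predictor trained on a labeled sample.
import numpy as np
from sklearn.linear_model import LogisticRegression

def landmark_map(K, landmarks):
    """Return phi_S: x -> [K(x, x'_1), ..., K(x, x'_d)] for fixed landmarks x'_i."""
    return lambda x: np.array([K(x, xp) for xp in landmarks])

# Toy data: x in R^2, label = sign of first coordinate, K = cosine similarity.
rng = np.random.default_rng(1)
def K(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

X_unlabeled = rng.normal(size=(50, 2))          # cheap unlabeled data -> landmarks
X_train = rng.normal(size=(100, 2))
y_train = np.sign(X_train[:, 0])
phi_S = landmark_map(K, X_unlabeled)

F_train = np.array([phi_S(x) for x in X_train])  # mapped labeled sample
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(F_train, y_train)

X_test = rng.normal(size=(100, 2))
acc = clf.score(np.array([phi_S(x) for x in X_test]), np.sign(X_test[:, 0]))
print(f"test accuracy: {acc:.2f}")
```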
However, under the previous definitions we were guaranteed a large $L_2$-margin in this feature space, whereas under the new definitions we are guaranteed a large $L_1$-margin in the feature space. After recalling the notion of an $L_1$-margin and its associated learning guarantee, we first establish that, for an $(\epsilon, \gamma, \tau)$-good similarity function, the feature map constructed using $\tilde{O}(1/(\gamma^2\tau))$ landmarks indeed has (with high probability) a large $L_1$-margin separator. Using this result, we then obtain a learning guarantee by following the strategy outlined above.

In speaking of $L_1$-margin $\gamma$, we refer to separation with margin $\gamma$ by a unit-$L_1$-norm linear separator, in a unit-$L_\infty$-bounded feature space. Formally, let $\phi : x \mapsto \phi(x)$, $\phi(x) \in R^d$, with $\|\phi(x)\|_\infty \leq 1$ be a mapping of the data to a d-dimensional feature space. We say that a linear predictor $\alpha \in R^d$ achieves error $\epsilon$ relative to $L_1$-margin $\gamma$ if $\Pr_{x,\ell(x)}(\ell(x)\langle\alpha, \phi(x)\rangle \geq \gamma) \geq 1 - \epsilon$ (this is the standard margin constraint) and $\|\alpha\|_1 = 1$.

Given a d-dimensional feature map under which there exists some (unknown) zero-error linear separator with $L_1$-margin $\gamma$, we can efficiently learn a predictor with error at most $\epsilon_{acc}$ using $O\big(\frac{\log d}{\epsilon_{acc}\gamma^2}\big)$ examples (with high probability). This can be done using the Winnow algorithm with a standard online-to-batch conversion (Littlestone, 1989). If we can only guarantee the existence of a separator with error $\epsilon > 0$ relative to $L_1$-margin $\gamma$, then a predictor with error $\epsilon + \epsilon_{acc}$ can be theoretically learned (with high probability) from a sample of $\tilde{O}\big(\log d/(\gamma^2\epsilon_{acc}^2)\big)$ examples by minimizing the number of $L_1$-margin violations on the sample (Zhang, 2002).

We are now ready to state the main result enabling learning using good similarity functions:

Theorem 8 Let K be an $(\epsilon, \gamma, \tau)$-good similarity function for a learning problem P. Let $S = \{x'_1, x'_2, \ldots, x'_d\}$ be a (potentially unlabeled) sample of
$$d = \frac{2}{\tau}\left(\log(2/\delta) + 8\frac{\log(2/\delta)}{\gamma^2}\right)$$
landmarks drawn from P. Consider the mapping $\phi^S : X \to R^d$ defined as follows: $\phi^S_i(x) = K(x, x'_i)$, $i \in \{1, \ldots, d\}$. Then, with probability at least $1 - \delta$ over the random sample S, the induced distribution $\phi^S(P)$ in $R^d$ has a separator of error at most $\epsilon + \delta$ relative to $L_1$-margin at least $\gamma/2$.

Proof: First, note that since $|K(x, x')| \leq 1$ for all $x, x'$, we have $\|\phi^S(x)\|_\infty \leq 1$. Consider the linear separator $\alpha \in R^d$ given by $\alpha_i = \ell(x'_i)R(x'_i)/d_1$, where $d_1 = \sum_i R(x'_i)$ is the number of landmarks with $R(x'_i) = 1$. This normalization ensures $\|\alpha\|_1 = 1$. Note that we take $R(x'_i)$ to be drawn jointly with $x'_i$; if it is random, then it is randomly instantiated to either zero or one.

We have, for any $x, \ell(x)$:
$$\ell(x)\langle\alpha, \phi^S(x)\rangle = \frac{1}{d_1}\sum_{i=1}^{d}\ell(x)\ell(x'_i)R(x'_i)K(x, x'_i). \tag{4.1}$$
This is an empirical average of $d_1$ terms $-1 \leq \ell(x)\ell(x'_i)K(x, x'_i) \leq 1$ for which $R(x'_i) = 1$. For any x we can apply Hoeffding's inequality, and obtain that with probability at least $1 - \delta^2/2$ over the choice of S, we have:
$$\ell(x)\langle\alpha, \phi^S(x)\rangle \geq E_{x'}[\ell(x)\ell(x')K(x, x') \mid R(x')] - \sqrt{\frac{2\log(2/\delta^2)}{d_1}}. \tag{4.2}$$
Since the above holds for any x with probability at least $1 - \delta^2/2$ over S, it also holds with probability at least $1 - \delta^2/2$ over the choice of x and S. We can write this as:
$$E_{S\sim P^d}\left[\Pr_{x\sim P}(\text{violation})\right] \leq \delta^2/2, \tag{4.3}$$
where "violation" refers to violating (4.2). Applying Markov's inequality we get that with probability at least $1 - \delta/2$ over the choice of S, at most a $\delta$ fraction of points violate (4.2). Recalling Definition 6, at most an additional $\epsilon$ fraction of the points violate (3.1). But for the remaining $1 - \epsilon - \delta$ fraction of the points, for which both (4.2) and (3.1) hold, we have:
$$\ell(x)\langle\alpha, \phi^S(x)\rangle \geq \gamma - \sqrt{\frac{2\log(2/\delta^2)}{d_1}}. \tag{4.4}$$
To bound the second term we need a lower bound on $d_1$, the number of reasonable landmarks. The probability of each of the d landmarks being reasonable is at least $\tau$, and so the number of reasonable landmarks follows a Binomial distribution, ensuring $d_1 \geq \tau d/2 \geq 8\log(2/\delta^2)/\gamma^2$ with probability at least $1 - \delta/2$. When this happens, the second term in (4.4) is at most $\gamma/2$. We get then, that with probability at least $1 - \delta$, for at least a $1 - \epsilon - \delta$ fraction of the points:
$$\ell(x)\langle\alpha, \phi^S(x)\rangle \geq \gamma/2. \tag{4.5}$$

For the realizable ($\epsilon = 0$) case, we obtain:

Corollary 9 If K is a $(0, \gamma, \tau)$-good similarity function then with high probability we can efficiently find a predictor with error at most $\epsilon_{acc}$ from an unlabeled sample of size $d_u = \tilde{O}\big(\frac{1}{\gamma^2\tau}\big)$ and from a labeled sample of size $d_l = \tilde{O}\big(\frac{\log d_u}{\gamma^2\epsilon_{acc}}\big)$.

Proof: We have proved in Theorem 8 that if K is a $(0, \gamma, \tau)$-good similarity function, then with high probability there exists a low-error (at most $\delta$) large-margin (at least $\gamma/2$) separator in the transformed space under the mapping $\phi^S$. Thus, all we need now to learn well is to draw a new fresh sample, map it into the transformed space using $\phi^S$, and then apply a good algorithm for learning linear separators in the new space that produces a hypothesis of error at most $\epsilon_{acc}$ with probability at least $1 - \delta$. In particular, remember that the vector $\alpha$ has error at most $\delta$ at $L_1$-margin $\gamma/2$ over $\phi^S(P)$, where the mapping $\phi^S$ produces examples of $L_\infty$ norm at most 1. In order to enjoy the better learning guarantees of the separable case, we will set $\delta$ small enough so that no bad points appear in the sample. The Corollary now follows from the $L_1$-margin learning guarantee in the separable case, discussed earlier in the Section.

For the general ($\epsilon > 0$) case, Theorem 8 implies that by following our two-stage approach, first using $d_u = \tilde{O}\big(\frac{1}{\gamma^2\tau}\big)$ unlabeled examples as landmarks in order to construct $\phi^S(\cdot)$, and then using a fresh labeled sample of size $d_l = \tilde{O}\big(\frac{\ln d_u}{\gamma^2\epsilon_{acc}^2}\big)$ to learn a low-error $L_1$-margin separator in $\phi^S(\cdot)$, we have:

Corollary 10 If K is an $(\epsilon, \gamma, \tau)$-good similarity function then by minimizing $L_1$-margin violations we can find a predictor with error at most $\epsilon + \epsilon_{acc}$ from an unlabeled sample of size $d_u = \tilde{O}\big(\frac{1}{\gamma^2\tau}\big)$ and from a labeled sample of size $d_l = \tilde{O}\big(\frac{\log d_u}{\gamma^2\epsilon_{acc}^2}\big)$.

The procedure described above, although well defined, involves a difficult optimization problem: minimizing the number of $L_1$-margin violations. In order to obtain a computationally tractable procedure, we consider the hinge loss instead of the margin error. In a feature space with $\|\phi(x)\|_\infty \leq 1$ as above, we say that a unit-$L_1$-norm predictor $\alpha$, $\|\alpha\|_1 = 1$, has expected hinge loss $E[[1 - \ell(x)\langle\alpha, \phi(x)\rangle/\gamma]_+]$ relative to $L_1$-margin $\gamma$. Now, if we know there is some (unknown) predictor with hinge loss $\epsilon$ relative to $L_1$-margin $\gamma$, then a predictor with error $\epsilon + \epsilon_{acc}$ can be learned (with high probability) from a sample of $\tilde{O}\big(\log d/(\gamma^2\epsilon_{acc}^2)\big)$ examples by minimizing the empirical average hinge loss relative to $L_1$-margin $\gamma$ on the sample (Zhang, 2002).

Before proceeding to discuss the optimization problem of minimizing the average hinge loss relative to a fixed $L_1$-margin, let us establish the analogue of Theorem 8 for the hinge loss:

Theorem 11 Let K be an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss for a learning problem P. For any $\epsilon_1 > 0$ and $0 < \delta < \gamma\epsilon_1/4$, let $S = \{x'_1, x'_2, \ldots, x'_d\}$ be a sample of size
$$d = \frac{2}{\tau}\left(\log(2/\delta) + 16\frac{\log(2/\delta)}{(\epsilon_1\gamma)^2}\right)$$
drawn from P. With probability at least $1 - \delta$ over the random sample S, the induced distribution $\phi^S(P)$ in $R^d$, for $\phi^S$ as defined in Theorem 8, has a separator achieving hinge loss at most $\epsilon + \epsilon_1$ at margin $\gamma$.

Proof: We use the same construction as in Theorem 8.

Corollary 12 If K is an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss then we can efficiently find a predictor with error at most $\epsilon + \epsilon_{acc}$ from an unlabeled sample of size $d_u = \tilde{O}\big(\frac{1}{\gamma^2\tau\epsilon_{acc}^2}\big)$ and from a labeled sample of size $d_l = \tilde{O}\big(\frac{\log d_u}{\gamma^2\epsilon_{acc}^2}\big)$.

For the hinge loss, our two-stage procedure boils down to solving the following optimization problem w.r.t. $\alpha$:
$$\text{minimize} \;\; \sum_{i=1}^{d_l}\Big[1 - \ell(x_i)\sum_{j=1}^{d_u}\alpha_j K(x_i, x'_j)\Big]_+ \qquad \text{s.t.} \;\; \sum_{j=1}^{d_u}|\alpha_j| \leq 1/\gamma. \tag{4.6}$$
This is a linear program and can thus be solved in polynomial time, establishing the efficiency in Corollary 12.

An optimization problem similar to (4.6), though usually with the same set of points used both as landmarks and as training examples, is actually fairly commonly used as a learning rule in practice (Bennett & Campbell, 2000; Roth, 2001; Guigue et al., 2005). Such a learning rule is typically discussed as an alternative to SVMs. In fact, Tipping (2001) suggests the Relevance Vector Machine (RVM) as a Bayesian alternative to SVMs. The MAP estimate of the RVM is given by an optimization problem similar to (4.6), though with a loss function different from the hinge loss (the hinge loss cannot be obtained as a log-likelihood). Similarly, Singer (2000) suggests Norm-Penalized Leveraging Procedures as a boosting-like approach that mimics SVMs. Again, although the specific loss functions studied by Singer are different from the hinge loss, the method (with a norm exponent of 1, as in Singer's experiments) otherwise corresponds to a coordinate-descent minimization of (4.6). In both cases, no learning guarantees are provided.

The motivation for using (4.6) as an alternative to SVMs is usually that the $L_1$-regularization on $\alpha$ leads to sparsity, and hence to "few support vectors" (although Vincent and Bengio (2002), who also discuss (4.6), argue for more direct ways of obtaining such sparsity), and also that the linear program (4.6) might be easier to solve than the SVM quadratic program. However, we are not aware of a previous discussion of how learning using (4.6) relates to learning using an SVM, or of learning guarantees for (4.6) in terms of properties of the similarity function K. Guarantees solely in terms of the feature space in which we seek a low-$L_1$-margin predictor ($\phi^S$ in our notation) are problematic, as this feature space is generated randomly from data. In fact, in order to enjoy the SVM guarantees while using $L_1$ regularization to obtain sparsity, some authors suggest regularizing both the $L_1$ norm $\|\alpha\|_1$ of the coefficient vector (as in (4.6)) and the norm of the corresponding predictor $\beta = \sum_j\alpha_j\phi(x_j)$ in the Hilbert space implied by K, where $K(x, x') = \langle\phi(x), \phi(x')\rangle$, as when using an SVM with K as a kernel (Osuna & Girosi, 1999; Gunn & Kandola, 2002).

Here, we provide a natural condition on the similarity function K (Definition 7) that justifies the learning rule (4.6). Furthermore, we show (in Section 6) that any similarity function that is good as a kernel, and can ensure SVM learning, is also good as a similarity function and can thus also ensure learning using the learning rule (4.6) (though possibly with some deterioration of the learning guarantees). These arguments can be used to justify (4.6) as an alternative to SVMs.
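The sketch below (ours, not the authors' code) writes the learning rule (4.6) as an explicit linear program: one nonnegative slack variable per labeled example, and the coefficient vector split as $\alpha = u - v$ with $u, v \geq 0$ so that the $L_1$ constraint becomes linear. The margin value and the toy data are illustrative placeholders.

```python
# Learning rule (4.6) as a linear program: minimize total hinge loss of a
# similarity-weighted predictor subject to an L1 constraint on the coefficients.
import numpy as np
from scipy.optimize import linprog

def solve_rule_4_6(K_mat, y, gamma):
    """K_mat[i, j] = K(x_i, x'_j) for labeled x_i and landmarks x'_j; y in {-1,+1}."""
    dl, du = K_mat.shape
    # variable order: u (du), v (du), xi (dl); all nonnegative by linprog's default bounds
    c = np.concatenate([np.zeros(2 * du), np.ones(dl)])
    # hinge constraints: -y_i K_i. u + y_i K_i. v - xi_i <= -1
    A_hinge = np.hstack([-y[:, None] * K_mat, y[:, None] * K_mat, -np.eye(dl)])
    b_hinge = -np.ones(dl)
    # L1 constraint: sum_j (u_j + v_j) <= 1/gamma
    A_l1 = np.concatenate([np.ones(2 * du), np.zeros(dl)])[None, :]
    b_l1 = np.array([1.0 / gamma])
    res = linprog(c, A_ub=np.vstack([A_hinge, A_l1]),
                  b_ub=np.concatenate([b_hinge, b_l1]), method="highs")
    u, v = res.x[:du], res.x[du:2 * du]
    return u - v                                    # the coefficient vector alpha

# Tiny usage example with a noisy rank-one similarity matrix.
rng = np.random.default_rng(2)
y = rng.choice([-1.0, 1.0], size=30)
K_mat = 0.9 * y[:, None] * rng.choice([-1.0, 1.0], size=20)[None, :] \
        + 0.1 * rng.uniform(-1, 1, size=(30, 20))
alpha = solve_rule_4_6(np.clip(K_mat, -1, 1), y, gamma=0.5)
print("predictions agree with labels:", np.all(np.sign(K_mat @ alpha) == y))
```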
Before concluding this discussion, we would like to mention that Girosi (1998) previously established a rather different connection between regularizing the $L_1$ norm $\|\alpha\|_1$ and regularizing the norm of the corresponding predictor in the implied Hilbert space. Girosi considered a hard-margin SVR (Support Vector Regression Machine, i.e., requiring each prediction to be within $(\ell(x) - \epsilon, \ell(x) + \epsilon)$), in the noiseless case where the target mapping $x \mapsto \ell(x)$ is in the Hilbert space. In this setting, Girosi showed that a hard-margin SVR is equivalent to minimizing the distance in the implied Hilbert space between the correct mapping $x \mapsto \ell(x)$ and the prediction function $x \mapsto \sum_j\alpha_j K(x, x_j)$, with an $L_1$ regularization term $\|\alpha\|_1$. However, this distance between prediction functions is very different from the objective in (4.6), and again refers back to the implied feature space, which we are trying to avoid.

5 Separation Results

In this section, we show an example of a finite concept class for which no kernel yields good learning guarantees when used as a kernel, but for which there does exist a good similarity function yielding the optimal sample complexity. That is, we show that some concept classes cannot be reasonably represented by kernels, but can be reasonably represented by similarity functions. Specifically, we consider a class C of n pairwise uncorrelated functions. This is a finite class of cardinality $|C| = n$, and so if the target belongs to C then $O(\frac{1}{\epsilon}\log n)$ samples are enough for learning a predictor with error $\epsilon$. Indeed, we show here that for any concept class C, so long as the distribution D is sufficiently unconcentrated, there exists a similarity function that is $(0, 1, \frac{1}{2|C|})$-good under our definition for every $f \in C$. This yields a (labeled) sample complexity $O(\frac{1}{\epsilon}\log|C|)$ to achieve error $\epsilon$, matching the ideal sample complexity. In other words, for distribution-specific learning (where unlabeled data may be viewed as free) and finite classes, there is no intrinsic loss in sample complexity incurred by choosing to learn via similarity functions. In fact, we also extend this result to classes of bounded VC-dimension rather than bounded cardinality.

In contrast, we show that if C is a class of n functions that are pairwise uncorrelated with respect to distribution D, then no kernel is $(\epsilon, \gamma)$-good in hinge loss for all $f \in C$, even for $\epsilon = 0.5$ and $\gamma = 8/\sqrt{n}$. This extends work of (Ben-David et al., 2003; Forster & Simon, 2006), who give hardness results with comparable margin bounds but at a much lower error rate. Thus, this shows there is an intrinsic loss incurred by using kernels together with margin bounds, since this results in a sample complexity bound of at least $\Omega(|C|)$, rather than the ideal $O(\log|C|)$. We thus demonstrate a gap between the kind of prior knowledge that can be represented with kernels as opposed to general similarity functions, and demonstrate that similarity functions are strictly more expressive (up to the degradation in parameters discussed earlier).

Definition 13 We say that a distribution D over X is $\alpha$-unconcentrated if the probability mass on any given $x \in X$ is at most $\alpha$.

Theorem 14 For any finite class of functions C and for any $1/|C|$-unconcentrated distribution D over the instance space X, there exists a similarity function K that is a $(0, 1, \frac{1}{2|C|})$-good similarity function for all $f \in C$.

Proof: Let $C = \{f_1, \ldots, f_n\}$. Now, let us partition X into n regions $R_i$ of at least $1/(2n)$ probability mass each, which we can do since D is $1/n$-unconcentrated.
Finally, define $K(x, x')$ for $x' \in R_i$ to be $f_i(x)f_i(x')$. We claim that for this similarity function, $R_i$ is a set of "reasonable points" establishing margin $\gamma = 1$ for target $f_i$. Specifically,
$$E[K(x, x')f_i(x)f_i(x') \mid x' \in R_i] = E[f_i(x)f_i(x')f_i(x)f_i(x')] = 1.$$
Since $\Pr(R_i) \geq \frac{1}{2n}$, this implies that under distribution D, K is a $(0, 1, \frac{1}{2n})$-good similarity function for all $f_i \in C$.

Note 1: We can extend this argument to any class C of small VC-dimension. In particular, for any distribution D, the class C has an $\epsilon$-cover $C_\epsilon$ of size $(1/\epsilon)^{O(d/\epsilon)}$, where d is the VC-dimension of C (Benedek & Itai, 1988). By Theorem 14, we can have a $(0, 1, 1/|C_\epsilon|)$-good similarity function for the cover $C_\epsilon$, which in turn implies an $(\epsilon, 1, 1/|C_\epsilon|)$-good similarity function for the original set (even in hinge loss, since $\gamma = 1$). Plugging in our bound on $|C_\epsilon|$, we get an $(\epsilon, 1, \epsilon^{O(d/\epsilon)})$-good similarity function for C. Thus, the labeled sample complexity we get for learning with similarity functions is only $O((d/\epsilon)\log(1/\epsilon))$, and again there is no intrinsic loss in sample complexity bounds due to learning with similarity functions.

Note 2: The need for the underlying distribution to be unconcentrated stems from our use of this distribution for both labeled and unlabeled data. We could further extend our definition of "good similarity function" to allow the unlabeled points x' to come from some other distribution D' given a priori, such as the uniform distribution over the instance space X. Now, the expectation over x' and the probability mass of R would both be with respect to D', and the generic learning algorithm would draw the landmark points $x'_i$ from D' rather than D. In this case, we would only need D' to be unconcentrated, rather than D.

We now prove our lower bound for margin-based learning with kernels.

Theorem 15 Let C be a class of n pairwise uncorrelated functions over distribution D. Then, there is no kernel that for all $f \in C$ is $(\epsilon, \gamma)$-good in hinge loss, even for $\epsilon = 0.5$ and $\gamma = 8/\sqrt{n}$.

Proof: Let $C = \{f_1, \ldots, f_n\}$. We begin with the basic Fourier setup (Linial et al., 1989; Mansour, 1994). Given two functions f and g, define $\langle f, g\rangle = E_x[f(x)g(x)]$ to be their correlation with respect to distribution D. (This is their inner product if we view f as a vector whose jth coordinate is $f(x_j)[D(x_j)]^{1/2}$.) Because the functions $f_i \in C$ are pairwise uncorrelated, we have $\langle f_i, f_j\rangle = 0$ for all $i \neq j$, and because the $f_i$ are boolean functions we have $\langle f_i, f_i\rangle = 1$ for all i. Thus they form at least part of an orthonormal basis, and for any hypothesis h (i.e., any mapping $X \to \{\pm 1\}$) we have $\sum_{f_i\in C}\langle h, f_i\rangle^2 \leq 1$. So, this implies $\sum_{f_i\in C}|\langle h, f_i\rangle| \leq \sqrt{n}$, or equivalently
$$E_{f_i\in C}|\langle h, f_i\rangle| \leq 1/\sqrt{n}. \tag{5.1}$$
In other words, for any hypothesis h, if we pick the target at random from C, the expected magnitude of the correlation between h and the target is at most $1/\sqrt{n}$.

We now consider the implications of having a good kernel. Suppose for contradiction that there exists a kernel K that is $(0.5, \gamma)$-good in hinge loss for every $f_i \in C$. What we will show is that this implies that for any $f_i \in C$, the expected value of $|\langle h, f_i\rangle|$ for a random linear separator h in the $\phi$-space is greater than $\gamma/8$. If we can prove this, then we are done, because it implies there must exist an h that has $E_{f\in C}|\langle h, f\rangle| > \gamma/8$, which contradicts equation (5.1) for $\gamma = 8/\sqrt{n}$.

So, we just have to prove the statement about random linear separators. Let $w^*$ denote the vector in the $\phi$-space that has hinge loss at most 0.5 at margin $\gamma$ for target function $f_i$.
For any example x, define $\gamma_x$ to be the margin of $\phi(x)$ with respect to $w^*$, and define $\theta_x = \sin^{-1}(\gamma_x)$ to be the angular margin of $\phi(x)$ with respect to $w^*$. Now, consider choosing a random vector h in the $\phi$-space, where we associate $h(x) = \mathrm{sign}(h\cdot\phi(x))$. Since we only care about the absolute value $|\langle h, f_i\rangle|$, and since $\langle -h, f_i\rangle = -\langle h, f_i\rangle$, it suffices to show that $E_h[\langle h, f_i\rangle \mid h\cdot w^* \geq 0] > \gamma/8$. We do this as follows.

First, for any example x, we claim that:
$$\Pr_h[h(x) \neq f_i(x) \mid h\cdot w^* \geq 0] = 1/2 - \theta_x/\pi. \tag{5.2}$$
This is because if we look at the 2-dimensional plane defined by $\phi(x)$ and $w^*$, and consider the half-circle of $\|h\| = 1$ such that $h\cdot w^* \geq 0$, then (5.2) is the portion of the half-circle that labels $\phi(x)$ incorrectly. Thus, we have
$$E_h[\mathrm{err}(h) \mid h\cdot w^* \geq 0] = E_x[1/2 - \theta_x/\pi],$$
and so, using $\langle h, f_i\rangle = 1 - 2\,\mathrm{err}(h)$, we have
$$E_h[\langle h, f_i\rangle \mid h\cdot w^* \geq 0] = 2E_x[\theta_x]/\pi.$$
Finally, we just need to relate the angular margin to the hinge loss: if $L_x$ is the hinge loss of $\phi(x)$, then a crude bound on $\theta_x$ is $\theta_x \geq \gamma(1 - (\pi/2)L_x)$. (Note that $\theta_x$ is a bit larger in magnitude than $\gamma_x$; this works in our favor when the margin is positive, and we just need to be careful when the margin is negative.) Since we assumed that $E_x[L_x] \leq 0.5$, we have $E_x[\theta_x] \geq \gamma(1 - \pi/4)$. Putting this together, the expected magnitude of the correlation of a random halfspace is at least $2\gamma(1 - \pi/4)/\pi > \gamma/8$, as desired, proving the theorem.

An example of a class C satisfying the above conditions is the class of parity functions over $\{0,1\}^{\lg n}$, which are pairwise uncorrelated with respect to the uniform distribution. Note that the uniform distribution is $1/|C|$-unconcentrated, and thus there is a good similarity function. (In particular, one could use $K(x_i, x_j) = f_j(x_i)f_j(x_j)$, where $f_j$ is the parity function associated with indicator vector $x_j$.)

We can extend Theorem 15 to classes of large Statistical Query dimension as well. In particular, the SQ-dimension of a class C with respect to distribution D is the size d of the largest set of functions $\{f_1, f_2, \ldots, f_d\} \subseteq C$ such that $|\langle f_i, f_j\rangle| \leq 1/d^3$ for all $i \neq j$ (Blum et al., 1994). In this case, we just need to adjust the Fourier-analysis part of the argument to handle the fact that the functions may not be completely uncorrelated.

Theorem 16 Let C be a class of functions of SQ-dimension d with respect to distribution D. Then, there is no kernel that for all $f \in C$ is $(\epsilon, \gamma)$-good in hinge loss, even for $\epsilon = 0.5$ and $\gamma = 16/\sqrt{d}$.

Proof: Let $f_1, \ldots, f_d$ be d functions in C such that $|\langle f_i, f_j\rangle| \leq 1/d^3$ for all $i \neq j$. We can define an orthogonal set of functions $\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_d$ as follows: let $\tilde{f}_1 = f_1$, $\tilde{f}_2 = f_2 - f_1\langle f_2, f_1\rangle$, and in general let $\tilde{f}_i$ be the portion of $f_i$ orthogonal to the space spanned by $f_1, \ldots, f_{i-1}$. (That is, $\tilde{f}_i = f_i - \mathrm{proj}(f_i, \mathrm{span}(f_1, \ldots, f_{i-1}))$, where "proj" is orthogonal projection.) Since the $\tilde{f}_i$ are orthogonal and have length at most 1, for any boolean function h we have $\sum_i\langle h, \tilde{f}_i\rangle^2 \leq 1$ and therefore $E_i|\langle h, \tilde{f}_i\rangle| \leq 1/\sqrt{d}$. Finally, since $\langle f_i, f_j\rangle \leq 1/d^3$ for all $i \neq j$, one can show this implies that $\|\tilde{f}_i - f_i\| \leq 1/d$ for all i. So, $E_i|\langle h, f_i\rangle| \leq 1/\sqrt{d} + 1/d \leq 2/\sqrt{d}$. The rest of the argument in the proof of Theorem 15 now applies with $\gamma = 16/\sqrt{d}$.

For example, the class of size-n decision trees over $\{0,1\}^n$ has $n^{\Omega(\log n)}$ pairwise uncorrelated functions over the uniform distribution (in particular, any parity of $\log n$ variables can be written as an n-node decision tree). So, this means we cannot have a kernel with margin $1/\mathrm{poly}(n)$ for all size-n decision trees over $\{0,1\}^n$. However, we can have a similarity function with margin 1, though the parameter $\tau$ (which controls the running time) will be exponentially small.
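The small numerical illustration below (ours) makes the parity example concrete: the parities over $\{0,1\}^m$ are pairwise uncorrelated under the uniform distribution, yet the similarity $K(x, x') = f_{x'}(x)f_{x'}(x')$, with $f_{x'}$ the parity indexed by $x'$, gives every target parity margin 1 with the single "reasonable" point equal to its own index, so $\tau = 1/n$ for $n = 2^m$. All names here are illustrative.

```python
# Parities are pairwise uncorrelated under the uniform distribution, yet the
# index-based similarity gives every parity conditional margin 1 (tau = 1/2^m).
import itertools
import numpy as np

m = 4
points = np.array(list(itertools.product([0, 1], repeat=m)))     # the n = 2^m inputs

def parity(r, x):
    return -1.0 if np.dot(r, x) % 2 else 1.0

def K(x, xp):
    return parity(xp, x) * parity(xp, xp)

# Pairwise uncorrelated: E_x[f_r(x) f_s(x)] = 0 for r != s under the uniform distribution.
corr = np.array([[np.mean([parity(r, x) * parity(s, x) for x in points])
                  for s in points] for r in points])
print("max |correlation| off-diagonal:", np.abs(corr - np.eye(len(points))).max())

# Margin 1 for every target f_r when the reasonable set is the single index point r.
margins = [parity(r, x) * parity(r, r) * K(x, r) for r in points for x in points]
print("min conditional margin:", min(margins))        # prints 1.0
```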
6 Relation between kernels and similarity functions

As is shown in the Appendix (Theorem 25), if a similarity function K is indeed a kernel, and it is $(\epsilon, \gamma, \tau)$-good as a similarity function (possibly in hinge loss), then it is also $(\epsilon, \gamma)$-good as a kernel (respectively, in hinge loss). That is, although the notion of a good similarity function is more widely applicable, for those similarity functions that are positive semi-definite, a good similarity function is also a good kernel. We now show the converse: if a kernel function is good in the kernel sense, it is also good in the similarity sense, though with some degradation of the margin. This degradation is much smaller than the one incurred previously by Balcan and Blum (2006) and Srebro (2007). Specifically, we can show that if K is a $(0, \gamma)$-good kernel, then K is an $(\epsilon, \gamma^2, \epsilon)$-good similarity function for any $\epsilon$ (formally, it is $(\epsilon, \gamma^2/c, c\epsilon)$-good for some $c \leq 1$). To prove this relationship, we introduce an intermediate notion of a good similarity function.

Definition 17 (Intermediate, Margin Violations) A similarity function K is a strongly $(\epsilon, \gamma, M)$-good similarity function for a learning problem P if there exists a bounded weighting function w over X, $w(x') \in [0, M]$ for all $x' \in X$, $E_{x'}[w(x')] \leq 1$, such that at least a $1 - \epsilon$ probability mass of examples x satisfy:
$$E_{x'\sim P}[\ell(x)\ell(x')w(x')K(x, x')] \geq \gamma. \tag{6.1}$$

Definition 18 (Intermediate, Hinge Loss) A similarity function K is a strongly $(\epsilon, \gamma, M)$-good similarity function in hinge loss for a learning problem P if there exists a weighting function $w(x') \in [0, M]$ for all $x' \in X$, $E_{x'}[w(x')] \leq 1$, such that
$$E_x\big[[1 - \ell(x)g(x)/\gamma]_+\big] \leq \epsilon, \tag{6.2}$$
where $g(x) = E_{x'\sim P}[\ell(x')w(x')K(x, x')]$ is the similarity-based prediction made using $w(\cdot)$.

These intermediate definitions are closely related to our main similarity function definitions: in particular, if K is a strongly $(\epsilon, \gamma, M)$-good similarity function for a learning problem P, then it is also an $(\epsilon, \gamma/c, c/M)$-good similarity function for some $c \leq 1$.

Theorem 19 If K is a strongly $(\epsilon, \gamma, M)$-good similarity function for a learning problem P, then there exists $c \leq 1$ such that K is an $(\epsilon, \gamma/c, c/M)$-good similarity function for P. If K is a strongly $(\epsilon, \gamma, M)$-good similarity function in hinge loss for P, then there exists $c \leq 1$ such that K is an $(\epsilon, \gamma/c, c/M)$-good similarity function in hinge loss for P.

Note that since our guarantees for $(\epsilon, \gamma, \tau)$-good similarity functions depend on $\tau$ only through $\tau\gamma^2$, a decrease in $\tau$ and a proportional increase in $\gamma^2$ (as when $c < 1$ in Theorem 19) only improve the guarantees. However, allowing flexibility in this tradeoff will make the kernel-to-similarity-function translation much easier.

Proof (of Theorem 19): First, divide w by M to scale its range to [0, 1], so $E[w] = c/M$ for some $c \leq 1$ and the margin in (6.1) is now $\gamma/M$. Define the random indicator $R(x')$ to equal 1 with probability $w(x')$ and 0 with probability $1 - w(x')$, so we have $\tau = \Pr_{x'}[R(x') = 1] = E[w] = c/M$, and we can rewrite (6.1) as
$$E_{x'\sim P, R}[\ell(x)\ell(x')R(x')K(x, x')] \geq \gamma/M. \tag{6.3}$$
Finally, divide both sides of (6.3) by $\tau = c/M$, producing the conditional $E_{x'}[\ell(x)\ell(x')K(x, x') \mid R(x')]$ on the left-hand side and a margin of $\gamma/c$ on the right-hand side. The case of hinge loss is identical.
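The quick numerical sanity check below (ours, not from the paper) mirrors the conversion in the proof of Theorem 19: a bounded weight $w(x') \in [0, M]$ with $E[w] \leq 1$ is turned into a probabilistic reasonable set $R(x') \sim \mathrm{Bernoulli}(w(x')/M)$, after which $\tau = E[w]/M$ and the conditional margin equals the weighted margin divided by $c = E[w]$. The data, weight function, and similarity below are illustrative placeholders.

```python
# Converting a bounded weight function into a probabilistic reasonable set (Theorem 19).
import numpy as np

rng = np.random.default_rng(3)
M = 4.0
xs = rng.uniform(-1, 1, size=200_000)
labels = np.sign(xs)
w = M * (np.abs(xs) > 0.5)                       # raw weight in {0, M}
w = w / max(1.0, w.mean())                       # rescale so E[w] <= 1 as in Definition 17
K_to_x = lambda x: np.sign(x) * np.sign(xs)      # similarities K(x, x') for a fixed x

x0, y0 = 0.7, 1.0                                # one query point x with label l(x) = +1
weighted_margin = np.mean(y0 * labels * w * K_to_x(x0))          # LHS of (6.1)

R = rng.random(xs.size) < (w / M)                # Bernoulli(w/M) instantiation of R
tau_hat = R.mean()                               # ~ E[w]/M
cond_margin = np.mean((y0 * labels * K_to_x(x0))[R])             # E[l l' K | R]
print(f"weighted margin / E[w] = {weighted_margin / w.mean():.3f},"
      f" conditional margin = {cond_margin:.3f}, tau = {tau_hat:.3f}")
```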
We will now establish that a similarity function K that is good as a kernel is also good as a similarity function in this intermediate sense, and hence, by Theorem 19, also in our original sense. We begin by considering goodness in hinge loss, and will return to margin violations at the end of the Section.

Theorem 20 If K is an $(\epsilon_0, \gamma)$-good kernel in hinge loss for a learning problem (with deterministic labels), then it is also a strongly $\big(\epsilon_0 + \epsilon_1,\; \frac{\gamma^2}{1 + \epsilon_0/\epsilon_1},\; \frac{1}{2(\epsilon_0 + \epsilon_1)}\big)$-good similarity function in hinge loss for the learning problem, for any $\epsilon_1 > 0$.

Proof: We initially consider only finite discrete distributions, where
$$\Pr(x_i, y_i) = p_i \tag{6.4}$$
for $i = 1, \ldots, n$, with $\sum_{i=1}^n p_i = 1$ and $x_i \neq x_j$ for $i \neq j$.

Let K be any kernel function that is $(\epsilon_0, \gamma)$-kernel good in hinge loss. Let $\phi$ be the implied feature mapping and denote $\phi_i = \phi(x_i)$. Consider the following weighted-SVM quadratic optimization problem with regularization parameter C:
$$\text{minimize} \;\; \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}p_i[1 - y_i\langle\beta, \phi_i\rangle]_+. \tag{6.5}$$
The dual of this problem, with dual variables $\alpha_i$, is:
$$\text{maximize} \;\; \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}y_iy_j\alpha_i\alpha_jK(x_i, x_j) \qquad \text{subject to} \;\; 0 \leq \alpha_i \leq Cp_i. \tag{6.6}$$
There is no duality gap, and furthermore the primal optimum $\beta^*$ can be expressed in terms of the dual optimum $\alpha^*$: $\beta^* = \sum_i\alpha^*_iy_i\phi_i$.

Since K is $(\epsilon_0, \gamma)$-kernel-good in hinge loss, there exists a predictor $\beta_0$, $\|\beta_0\| = 1$, with average hinge loss $\epsilon_0$ relative to margin $\gamma$. The primal optimum of (6.5), being the optimum solution, then satisfies:
$$\frac{1}{2}\|\beta^*\|^2 + C\sum_ip_i[1 - y_i\langle\beta^*, \phi_i\rangle]_+ \leq \frac{1}{2}\Big\|\frac{1}{\gamma}\beta_0\Big\|^2 + C\sum_ip_i\Big[1 - y_i\Big\langle\frac{1}{\gamma}\beta_0, \phi_i\Big\rangle\Big]_+ = \frac{1}{2\gamma^2} + C\,E\big[[1 - y\langle\beta_0, \phi(x)\rangle/\gamma]_+\big] \leq \frac{1}{2\gamma^2} + C\epsilon_0. \tag{6.7}$$
Since both terms on the left-hand side are non-negative, each of them is bounded by the right-hand side, and in particular:
$$C\sum_ip_i[1 - y_i\langle\beta^*, \phi_i\rangle]_+ \leq \frac{1}{2\gamma^2} + C\epsilon_0. \tag{6.8}$$
Dividing by C we get a bound on the average hinge loss of the predictor $\beta^*$, relative to a margin of one:
$$E\big[[1 - y\langle\beta^*, \phi(x)\rangle]_+\big] \leq \frac{1}{2C\gamma^2} + \epsilon_0. \tag{6.9}$$
We now use the fact that $\beta^*$ can be written as $\beta^* = \sum_i\alpha^*_iy_i\phi_i$ with $0 \leq \alpha^*_i \leq Cp_i$. Let us consider the weights
$$w_i = w(x_i) = \alpha^*_i/(Ap_i) \leq C/A, \tag{6.10}$$
for a normalizer A to be set below, so that $E[w] = \sum_ip_iw_i = \frac{1}{A}\sum_i\alpha^*_i$. Furthermore, since there is no duality gap, we also have $\sum_i\alpha^*_i - \frac{1}{2}\|\beta^*\|^2 = \frac{1}{2}\|\beta^*\|^2 + C\sum_ip_i[1 - y_i\langle\beta^*, \phi_i\rangle]_+$, so $\sum_i\alpha^*_i \leq 2\big(\frac{1}{2\gamma^2} + C\epsilon_0\big) = \frac{1}{\gamma^2} + 2C\epsilon_0$ by (6.7).

We have, for every x, y:
$$y\,E_{x',y'}[w(x')y'K(x, x')] = y\sum_ip_iw(x_i)y_iK(x, x_i) = y\sum_i\alpha^*_iy_iK(x, x_i)/A = y\sum_i\alpha^*_iy_i\langle\phi_i, \phi(x)\rangle/A = y\langle\beta^*, \phi(x)\rangle/A. \tag{6.11}$$
Multiplying by A and using (6.9):
$$E_{x,y}\big[[1 - A\,y\,E_{x',y'}[w(x')y'K(x, x')]]_+\big] = E_{x,y}\big[[1 - y\langle\beta^*, \phi(x)\rangle]_+\big] \leq \frac{1}{2C\gamma^2} + \epsilon_0.$$
This holds for any A and C such that $\frac{1}{\gamma^2} + 2C\epsilon_0 \leq A$ (which ensures $E[w] \leq 1$), and describes the average hinge loss relative to margin $1/A$. We also have the constraint $w_i \leq C/A \leq M$. Setting $C = \frac{1}{2\epsilon_1\gamma^2}$ and $A = \frac{1}{\gamma^2} + 2C\epsilon_0 = \frac{1 + \epsilon_0/\epsilon_1}{\gamma^2}$, so that $M = C/A = \frac{1}{2(\epsilon_0 + \epsilon_1)}$, we get an average hinge loss of $\epsilon_0 + \epsilon_1$:
$$E_{x,y}\Big[\Big[1 - y\,E_{x',y'}[w(x')y'K(x, x')]\Big/\tfrac{\gamma^2}{1 + \epsilon_0/\epsilon_1}\Big]_+\Big] \leq \epsilon_0 + \epsilon_1, \tag{6.12}$$
as desired. This establishes that if K is an $(\epsilon_0, \gamma)$-good kernel in hinge loss then it is also a strongly $\big(\epsilon_0 + \epsilon_1, \frac{\gamma^2}{1 + \epsilon_0/\epsilon_1}, \frac{1}{2(\epsilon_0 + \epsilon_1)}\big)$-good similarity in hinge loss, for any $\epsilon_1 > 0$, at least for finite discrete distributions. To extend the result also to non-discrete distributions, we can consider the variational "infinite SVM" problem and apply the same arguments, as in (Srebro, 2007).

We can now use the hinge-loss correspondence to get a similar result for the margin-violation definitions:

Theorem 21 If K is an $(\epsilon_0, \gamma)$-good kernel for a learning problem (with deterministic labels), then it is also a strongly $\big(\epsilon_0 + \epsilon_1,\; \gamma^2/2,\; \frac{1}{\epsilon_1(1 - \epsilon_0)}\big)$-good similarity function for the learning problem, for any $\epsilon_1 > 0$.

Proof: If K is $(0, \gamma)$-good as a kernel, it is also $(0, \gamma)$-good as a kernel in hinge loss, and we can apply Theorem 20 (with $\epsilon_1/2$ in place of $\epsilon_1$) to obtain that K is also strongly $(\epsilon_1/2, \gamma_1, M_1)$-good in hinge loss, where $\gamma_1 = \gamma^2$ and $M_1 = 1/\epsilon_1$. We can then bound the number of margin violations at margin $\gamma_2 = \gamma_1/2$ by twice the hinge loss at margin $\gamma_1$ (each such violation incurs hinge loss at least one half), obtaining the desired result.
If K is only $(\epsilon, \gamma)$-good as a kernel, we follow a similar procedure to that described in (Srebro, 2007), and consider a distribution conditioned only on those places where there is no error. Returning to the original distribution, we must scale the weights up by an amount proportional to the probability of the event we conditioned on (i.e., the probability of no margin violation). This yields the desired bound.

7 Learning with Multiple Similarity Functions

Suppose that rather than having a single similarity function, we were instead given n functions $K_1, \ldots, K_n$, and our hope is that some convex combination of them will satisfy Definition 6. Is this sufficient to be able to learn well? (Note that a convex combination of similarity functions is guaranteed to have range $[-1, 1]$ and so be a legal similarity function.) The following generalization of Theorem 8 shows that this is indeed the case. (The analogue of Theorem 11 can be derived similarly.)

Theorem 22 Suppose $K_1, \ldots, K_n$ are similarity functions such that some (unknown) convex combination of them is $(\epsilon, \gamma, \tau)$-good. For any $\delta > 0$, let $S = \{x'_1, x'_2, \ldots, x'_d\}$ be a sample of size $d = \frac{16\log(1/\delta)}{\tau\gamma^2}$ drawn from P. Consider the mapping $\phi^S : X \to R^{nd}$ defined as follows:
$$\phi^S(x) = \big(K_1(x, x'_1), \ldots, K_n(x, x'_1), \ldots, K_1(x, x'_d), \ldots, K_n(x, x'_d)\big).$$
With probability at least $1 - \delta$ over the random sample S, the induced distribution $\phi^S(P)$ in $R^{nd}$ has a separator of error at most $\epsilon + \delta$ at $L_1, L_\infty$ margin at least $\gamma/2$.

Proof: Let $K = \nu_1K_1 + \cdots + \nu_nK_n$ be an $(\epsilon, \gamma, \tau)$-good convex combination of the $K_i$. By Theorem 8, had we instead performed the mapping $\tilde{\phi}^S : X \to R^d$ defined as
$$\tilde{\phi}^S(x) = (K(x, x'_1), \ldots, K(x, x'_d)),$$
then with probability $1 - \delta$, the induced distribution $\tilde{\phi}^S(P)$ in $R^d$ would have a separator of error at most $\epsilon + \delta$ at margin at least $\gamma/2$. Let $\hat{\alpha}$ be the vector corresponding to such a separator in that space. Now, let us convert $\hat{\alpha}$ into a vector in $R^{nd}$ by replacing each coordinate $\hat{\alpha}_j$ with the n values $(\nu_1\hat{\alpha}_j, \ldots, \nu_n\hat{\alpha}_j)$. Call the resulting vector $\tilde{\alpha}$. Notice that by design, for any x we have $\langle\tilde{\alpha}, \phi^S(x)\rangle = \langle\hat{\alpha}, \tilde{\phi}^S(x)\rangle$. Furthermore, $\|\tilde{\alpha}\|_1 = \|\hat{\alpha}\|_1$. Thus, the vector $\tilde{\alpha}$ under distribution $\phi^S(P)$ has the same properties as the vector $\hat{\alpha}$ under $\tilde{\phi}^S(P)$. This implies the desired result. Note that we get significantly better bounds here than in (Balcan & Blum, 2006), since the margin does not drop by a factor of $\frac{1}{n}$.

8 Conclusions

We provide a new notion of a "good similarity function" that we prove is strictly more powerful than the traditional notion of a large-margin kernel. Our new notion relies upon $L_1$-regularized learning, and our separation result is related to a separation result between what is learnable with $L_1$ vs. $L_2$ regularization. In a lower bound of independent interest, we show that if C is a class of n pairwise uncorrelated functions, then no kernel is $(\epsilon, \gamma)$-good in hinge loss for all $f \in C$, even for $\epsilon = 0.5$ and $\gamma = 8/\sqrt{n}$.

It would be interesting to explore whether the lower bound could be extended to cover margin violations with a constant error rate $\epsilon > 0$ rather than only hinge loss. In addition, it would be particularly interesting to develop even broader natural notions of good similarity functions, that allow for functions that are not positive semi-definite and yet provide even better kernel-to-similarity translations (e.g., not squaring the margin parameter).

Acknowledgments: We would like to thank Manfred Warmuth and Hans-Ulrich Simon for helpful discussions. This work was supported in part by the National Science Foundation under grant CCF-0514922, by an IBM Graduate Fellowship, and by a Google Research Grant.

References

Arora, S., Babai, L., Stern, J., & Sweedyk, Z. (1997). The hardness of approximate optima in lattices, codes, and systems of linear equations.
Journal of Computer and System Sciences, 54, 317-331.

Balcan, M.-F., & Blum, A. (2006). On a theory of learning with similarity functions. Proceedings of the 23rd International Conference on Machine Learning.

Bartlett, P. L., & Mendelson, S. (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3, 463-482.

Ben-David, S., Eiron, N., & Simon, H.-U. (2003). Limitations of learning via embeddings in Euclidean half-spaces. The Journal of Machine Learning Research, 3, 441-461.

Benedek, G., & Itai, A. (1988). Learnability by fixed distributions. Proc. 1st Workshop on Computational Learning Theory (pp. 80-90).

Bennett, K. P., & Campbell, C. (2000). Support vector machines: Hype or hallelujah? SIGKDD Explor. Newsl., 2, 1-13.

Blum, A., Furst, M., Jackson, J., Kearns, M., Mansour, Y., & Rudich, S. (1994). Weakly learning DNF and characterizing statistical query learning using Fourier analysis. Proceedings of the 26th Annual ACM Symposium on Theory of Computing (pp. 253-262).

Chapelle, O., Scholkopf, B., & Zien, A. (2006). Semi-supervised learning. MIT Press.

Feldman, V., Gopalan, P., Khot, S., & Ponnuswami, A. (2006). New results for learning noisy parities and halfspaces. 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (pp. 563-574).

Forster, J., & Simon, H.-U. (2006). On the smallest possible dimension and the largest possible margin of linear arrangements representing given concept classes. Theoretical Computer Science, 350, 40-48.

Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Comput., 10, 1455-1480.

Guigue, V., Rakotomamonjy, A., & Canu, S. (2005). Kernel basis pursuit. Proceedings of the 16th European Conference on Machine Learning (ECML'05). Springer.

Gunn, S. R., & Kandola, J. S. (2002). Structural modelling with sparse kernels. Mach. Learn., 48, 137-163.

Guruswami, V., & Raghavendra, P. (2006). Hardness of learning halfspaces with noise. 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (pp. 543-552).

Herbrich, R. (2002). Learning kernel classifiers. MIT Press, Cambridge.

Linial, N., Mansour, Y., & Nisan, N. (1989). Constant depth circuits, Fourier transform, and learnability. Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science (pp. 574-579). Research Triangle Park, North Carolina.

Littlestone, N. (1989). From online to batch learning. Proc. 2nd Annual ACM Conference on Computational Learning Theory (pp. 269-284).

Mansour, Y. (1994). Learning boolean functions via the Fourier transform. In Theoretical advances in neural computation and learning, 391-424.

McAllester, D. (2003). Simplified PAC-Bayesian margin bounds. Proceedings of the 16th Conference on Computational Learning Theory.

Mitchell, T. (2006). The discipline of machine learning. CMU-ML-06-108.

Osuna, E. E., & Girosi, F. (1999). Reducing the run-time complexity in support vector machines. In Advances in kernel methods: Support vector learning, 271-283. Cambridge, MA, USA: MIT Press.

Roth, V. (2001). Sparse kernel regressors. ICANN '01: Proceedings of the International Conference on Artificial Neural Networks (pp. 339-346). London, UK: Springer-Verlag.

Scholkopf, B., & Smola, A. J. (2002).
Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge.

Scholkopf, B., Tsuda, K., & Vert, J.-P. (2004). Kernel methods in computational biology. MIT Press.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press.

Singer, Y. (2000). Leveraged vector machines. Advances in Neural Information Processing Systems 12.

Smola, A. J., & Scholkopf, B. (2002). Learning with kernels. MIT Press.

Srebro, N. (2007). How good is a kernel as a similarity function? COLT.

Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1, 211-244.

Vincent, P., & Bengio, Y. (2002). Kernel matching pursuit. Mach. Learn., 48, 165-187.

Warmuth, M. K., & Vishwanathan, S. V. N. (2005). Leaving the span. Proceedings of the Annual Conference on Learning Theory.

Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. J. Mach. Learn. Res., 2, 527-550.

A Kernels and Similarity Functions

Theorem 23 If K is an $(\epsilon, \gamma)$-good similarity function under Definitions 4 and 5, then K is also an $(\epsilon, \gamma, \gamma)$-good similarity function under Definitions 6 and 7, respectively.

Proof: If we set $\Pr(R(x) \mid x) = w(x)$, we get that in order for any point x to fulfill equation (2.1), we must have $\Pr(R(x')) = E[w(x')] \geq E[\ell\ell'w(x')K(x, x')] \geq \gamma$. Furthermore, for any x for which (2.1) is satisfied, we have
$$E[\ell\ell'K(x, x') \mid R(x')] = E[\ell\ell'K(x, x')w(x')]/\Pr(R(x')) \geq E[\ell\ell'K(x, x')w(x')] \geq \gamma. \tag{A.1}$$

Theorem 24 If K is an $(\epsilon, \gamma, \tau)$-good similarity function under Definitions 6 and 7, then K is an $(\epsilon, \gamma\tau)$-good similarity function under Definitions 4 and 5 (respectively).

Proof: Setting $w(x) = \Pr(R(x) \mid x)$, we have for any $x, \ell$ satisfying (3.1) that
$$E[\ell\ell'K(x, x')w(x')] = E[\ell\ell'K(x, x')R(x')] = E[\ell\ell'K(x, x') \mid R(x')]\Pr(R(x')) \geq \gamma\tau. \tag{A.2}$$
A similar calculation establishes the correspondence for the hinge loss.

We show in the following that a kernel that is good as a similarity function is also good as a kernel.

Theorem 25 If K is a valid kernel function, and is $(\epsilon, \gamma, \tau)$-good similarity for some learning problem, then it is also $(\epsilon, \gamma)$-kernel-good for the learning problem. If K is $(\epsilon, \gamma, \tau)$-good similarity in hinge loss, then it is also $(\epsilon, \gamma)$-kernel-good in hinge loss.

Proof: Consider a similarity function K that is a valid kernel, i.e., $K(x, x') = \langle\phi(x), \phi(x')\rangle$ for some mapping $\phi$ of x to a Hilbert space H. For any input distribution and any probabilistic set R of reasonable points of the input, we will construct a linear predictor $\beta_R \in H$, with $\|\beta_R\| \leq 1$, such that similarity-based predictions using R are the same as the linear predictions made with $\beta_R$.

Define the following linear predictor $\beta_R \in H$:
$$\beta_R = E_{x'}[\ell(x')\phi(x') \mid R(x')]. \tag{A.3}$$
The predictor $\beta_R$ has norm at most
$$\|\beta_R\| = \|E_{x'}[\ell(x')\phi(x') \mid R(x')]\| \leq \max_{x'}\|\ell(x')\phi(x')\| \leq \max_{x'}\|\phi(x')\| = \max_{x'}\sqrt{K(x', x')} \leq 1, \tag{A.4}$$
where the second inequality follows from $|\ell(x')| \leq 1$.

The predictions made by $\beta_R$ are:
$$\langle\beta_R, \phi(x)\rangle = \langle E_{x'}[\ell(x')\phi(x') \mid R(x')], \phi(x)\rangle = E_{x'}[\ell(x')\langle\phi(x'), \phi(x)\rangle \mid R(x')] = E_{x'}[\ell(x')K(x, x') \mid R(x')]. \tag{A.5}$$
That is, using $\beta_R$ is the same as using similarity-based prediction with R. In particular, the margin violation rate, as well as the hinge loss, with respect to any margin, is the same for predictions made using either R or $\beta_R$. This is enough to establish Theorem 25: If K is $(\epsilon, \gamma, \tau)$-good (perhaps for the hinge loss), there exists some valid R that yields margin violation error rate (resp. hinge loss) at most $\epsilon$ with respect to margin $\gamma$, and so $\beta_R$ yields the same margin violation rate (resp.
hinge loss) with respect to the same margin, establishing that K is $(\epsilon, \gamma)$-kernel-good (resp. in hinge loss).
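As a numerical illustration (ours, not part of the paper) of the construction in the proof of Theorem 25, the sketch below uses an explicit feature map to check that $\|\beta_R\| \leq 1$ and that the linear predictions $\langle\beta_R, \phi(x)\rangle$ coincide with the similarity-based predictions $E[\ell(x')K(x, x') \mid R(x')]$. The feature map, reasonable set, and data are illustrative placeholders.

```python
# Verifying (A.3)-(A.5) on a finite sample with an explicit feature map phi.
import numpy as np

rng = np.random.default_rng(4)

def phi(x):                                   # explicit feature map with ||phi(x)|| <= 1
    return x / np.linalg.norm(x)

def K(a, b):                                  # the induced kernel
    return float(phi(a) @ phi(b))

X = rng.normal(size=(5000, 2))
labels = np.sign(X[:, 0])
R = X[:, 1] > 0                               # an arbitrary "reasonable" subset

beta_R = np.mean([l * phi(x) for x, l, r in zip(X, labels, R) if r], axis=0)
print("||beta_R|| <= 1:", np.linalg.norm(beta_R) <= 1.0)

x_query = np.array([0.3, -0.8])
linear_pred = float(beta_R @ phi(x_query))
similarity_pred = np.mean([l * K(x_query, x) for x, l, r in zip(X, labels, R) if r])
print(f"linear: {linear_pred:.4f}  similarity-based: {similarity_pred:.4f}")  # agree
```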