Multi-Classification by Categorical Features via Clustering

Yevgeny Seldin (seldin@cs.huji.ac.il) and Naftali Tishby (tishby@cs.huji.ac.il)
School of Computer Science and Engineering, Center for Neural Computation, The Hebrew University of Jerusalem, Israel

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We derive a generalization bound for multi-classification schemes based on grid clustering in categorical parameter product spaces. Grid clustering partitions the parameter space in the form of a Cartesian product of partitions for each of the parameters. The derived bound provides a means to evaluate clustering solutions in terms of the generalization power of a built-on classifier. For classification based on a single feature the bound serves to find a globally optimal classification rule. Comparison of the generalization power of individual features can then be used for feature ranking. Our experiments show that in this role the bound is much more precise than mutual information or normalized correlation indices.

1. Introduction

Clustering is one of the basic tools for dimensionality reduction in categorical spaces. In this paper we study classifiers based on a soft grid clustering of categorical parameter product spaces. The grid clustering is defined by a set of stochastic mappings $\{q_i : \mathcal{X}_i \to \{1,..,m_i\}\}$, one for each parameter $i$, which map the possible values $\mathcal{X}_i$ of the parameter $X_i$ to a reduced set $\mathcal{C}_i$ of size $m_i$. A classifier based on the grid clustering then assigns a separate prediction strategy to each partition cell. For example, in collaborative filtering we can cluster a thousand-by-thousand space of viewers by movies into a five-by-five space of viewer clusters by movie clusters (here $X_1$ are the viewers and $X_2$ are the movies). Then we can predict a missing entry within some partition cell with an average of the ratings in that cell.

Grid clustering or some other form of dimensionality reduction can be helpful and even essential when the sample size is limited. However, an appropriate choice of clustering resolution is crucial for good results. A coarse clustering may be highly imprecise: think of the extreme of putting all the data into one big cluster. On the other hand, a fine clustering may be statistically unreliable: at the opposite extreme, if we put every parameter value into a separate cluster, some parameter combinations may not occur in the training set at all. Thus, unification of parameter values amplifies the statistical reliability, but reduces the precision. In this paper we relate this tradeoff to the generalization properties of a classifier based on the clustering.

Applications of grid clustering to data with intrinsically categorical features are abundant. Seldin, Slonim and Tishby (2007) consider grid clustering from an MDL perspective and demonstrate its success in predicting missing values in the context of collaborative filtering. The same work achieves state-of-the-art performance in terms of coherence of the obtained clusters with manual annotation in the context of gene expression and stock data analysis. Here we also suggest a new application of grid clustering for feature ranking.

We are not aware of any previous work on generalization properties of models based on grid clustering. A somewhat related work is (Srebro, 2004), which derives a generalization bound for matrix approximation with bounded norm factorization. However, matrix factorization is a different model and the proof is based on a different technique (Rademacher complexities).

The key point of this paper is the derivation of a generalization bound for classification based on grid clustering. The bound is derived by using the PAC-Bayesian technique (McAllester, 1999). The power of the PAC-Bayesian technique lies in its ability to handle heterogeneous hypothesis classes, so that the generalization bound for a specific hypothesis depends on the complexity of that hypothesis rather than on the complexity of the whole class. A classical example of an application of the PAC-Bayesian bound are SVMs (Langford, 2005). It is well known that the VC-dimension of separating hyperplanes in $\mathbb{R}^n$ is $n+1$. As well, the VC-dimension of separating hyperplanes with a margin $\gamma$, assuming all points are bounded in a unit sphere, is $\min\{1/\gamma^2, n\} + 1$. The ability to slice the hypothesis space into infinitely many subspaces characterized by a finer notion of complexity (the size of the margin) rather than the coarse VC-dimension of the whole class makes it possible to derive a better bound that remains meaningful even for infinite dimensional spaces.

In this paper we propose a fine measure of complexity of a grid partition of a categorical space. The proposed measure of complexity is related to the entropy of the partition along each dimension $i$. The bound enables us to consider all possible partitions of the product space and to choose one with better generalization properties. In the case of a single parameter it is easy to find a global optimum of the bound. The mapping rule achieving the optimum is shown to be the optimal classification rule from a generalization point of view. Although the bound is not perfectly tight, its shape follows the error on a validation set extremely well. In the experimental section we apply the bound to feature ranking, and it is shown to be much more precise than standard mutual information or normalized correlation rankings.
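To make the collaborative filtering example above concrete, the following minimal sketch (Python with NumPy; the ratings, the 5-by-5 resolution and the cluster assignments are entirely hypothetical) builds a deterministic grid clustering of a viewers-by-movies rating matrix and predicts a missing entry by the average rating of its partition cell. It illustrates the model only, not the learning procedure studied in this paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy ratings matrix: rows are viewers (X1), columns are movies (X2); unobserved entries are masked.
n_viewers, n_movies, m1, m2 = 1000, 1000, 5, 5
ratings = rng.integers(1, 6, size=(n_viewers, n_movies)).astype(float)
observed = rng.random((n_viewers, n_movies)) < 0.05

# Hypothetical deterministic grid clustering: q1 maps viewers to m1 clusters, q2 maps movies to m2 clusters.
q1 = rng.integers(0, m1, size=n_viewers)
q2 = rng.integers(0, m2, size=n_movies)

# Average observed rating per partition cell (the per-cell prediction strategy).
cell_sum = np.zeros((m1, m2))
cell_cnt = np.zeros((m1, m2))
rows, cols = np.nonzero(observed)
np.add.at(cell_sum, (q1[rows], q2[cols]), ratings[rows, cols])
np.add.at(cell_cnt, (q1[rows], q2[cols]), 1)
cell_mean = cell_sum / np.maximum(cell_cnt, 1)   # empty cells default to 0

def predict(viewer, movie):
    # Predict a missing rating by the mean rating of the (viewer cluster, movie cluster) cell.
    return cell_mean[q1[viewer], q2[movie]]

print(predict(3, 7))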
2. A Brief Review of the PAC-Bayesian Generalization Bound

To set the stage, we start with a simplified version of the PAC-Bayesian bound, called Occam's razor. Let $\mathcal{H}$ be a countable hypothesis space. For a hypothesis $h \in \mathcal{H}$ denote by $L(h)$ the expected and by $\hat{L}(h)$ the empirical loss of $h$. We assume the loss is bounded by $b$.

Theorem 1 (Occam's razor). For any data generating distribution and for any "prior distribution" $P(h)$ over $\mathcal{H}$, with a probability greater than $1-\delta$ over drawing an i.i.d. sample of size $N$, for all $h \in \mathcal{H}$:

$L(h) \le \hat{L}(h) + b\sqrt{\dfrac{-\ln P(h) - \ln\delta}{2N}}.$   (1)

Proof. The proof is fairly simple and provides a good illustration of what the "prior distribution" $P(h)$ is. By Hoeffding's inequality, $P\{L(h) - \hat{L}(h) \ge \epsilon(h)\} \le e^{-2N\epsilon(h)^2/b^2}$ for any given $h \in \mathcal{H}$. We require that $e^{-2N\epsilon(h)^2/b^2} \le P(h)\delta$ for some prior $P(h)$ that satisfies $\sum_{h\in\mathcal{H}} P(h) = 1$. Then, by the union bound, $L(h) \le \hat{L}(h) + \epsilon(h)$ for all $h \in \mathcal{H}$ with a probability of $1-\delta$. The minimal value of $\epsilon$ that satisfies the requirement is $\epsilon(h) = b\sqrt{\frac{-\ln P(h)-\ln\delta}{2N}}$, which completes the proof.

We now introduce the notion of a randomized classifier. Let $Q$ be any (posterior) distribution over $\mathcal{H}$. A randomized classifier associated with $Q$ works by choosing a new classifier $h$ from $\mathcal{H}$ according to $Q$ every time a classification is made. We denote the loss of a strategy $Q$ by $L(Q) = E_{h\sim Q} L(h)$ and similarly $\hat{L}(Q) = E_{h\sim Q}\hat{L}(h)$. By taking an expectation of (1) over the choice of $h$ and exploiting the concavity of the square root we obtain that with a probability greater than $1-\delta$:

$L(Q) \le \hat{L}(Q) + b\sqrt{\dfrac{-E_{h\sim Q}\ln P(h) - \ln\delta}{2N}}.$   (2)

The PAC-Bayesian bound (McAllester, 1999) was derived to allow uncountably infinite hypothesis spaces, though in our case the hypothesis space is finite. We cite a slightly tighter version of the bound proved in (Maurer, 2004).

Theorem 2 (PAC-Bayesian Bound). For any data distribution and for any "prior" $P$ over $\mathcal{H}$ fixed ahead of training, with a probability greater than $1-\delta$, for all distributions $Q$ over $\mathcal{H}$:

$L(Q) \le \hat{L}(Q) + b\sqrt{\dfrac{D(Q\|P) + \frac{1}{2}\ln(4N) - \ln\delta}{2N}},$   (3)

where $D(Q\|P) = E_{h\sim Q}\ln\frac{Q(h)}{P(h)}$ is the Kullback-Leibler (KL) divergence between the distributions $Q$ and $P$.

If the loss function is bounded by one, (1), (2), and (3) may be written in the form of a bound on the KL-divergence between $L(Q)$ and $\hat{L}(Q)$. For example, (1) may be obtained in the form $D(\hat{L}(h)\|L(h)) \le \frac{-\ln P(h) - \ln\delta}{N}$ if we start from the $P\{L(h) - \hat{L}(h) \ge \epsilon\} \le e^{-N D(\hat{L}(h)+\epsilon\,\|\,L(h))}$ form of Hoeffding's inequality. This provides a better bound in cases when $\hat{L}(Q)$ is sufficiently small (less than $\frac{1}{8}$). The choice of the square root form of the bounds is based on their easier analytical tractability for subsequent minimization.

By writing $D(Q\|P) = -H(Q) - E_{h\sim Q}\ln P(h)$ it is easy to see that (3) is an improvement over (2) when $H(Q) > \frac{1}{2}\ln(4N)$. In our experiments (2) is usually tighter. It is an open question whether $\frac{1}{2}\ln(4N)$ can be removed from (3), at least in the case of a countable $\mathcal{H}$. For example, (Blanchard & Fleuret, 2007) suggest a parameterized tradeoff $\frac{k+1}{k}D(Q\|P) + \ln(k+1) + 3.5 + \frac{1}{2k}$ instead of $D(Q\|P) + \frac{1}{2}\ln(4N)$. For our data the tradeoff does not improve the results and therefore we omit its discussion.
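As a small numerical illustration of the bounds above, the following sketch (Python; the hypothesis class, prior, posterior and empirical losses are hypothetical toy values) evaluates the right hand side of the Occam's razor bound (1) for a single hypothesis and of the PAC-Bayesian bound (3) for a posterior over a finite class, assuming a loss bounded by $b = 1$.

import numpy as np

def occam_bound(emp_loss, prior_h, N, delta=0.05, b=1.0):
    # Right hand side of (1) for one hypothesis with prior mass prior_h.
    return emp_loss + b * np.sqrt((-np.log(prior_h) - np.log(delta)) / (2 * N))

def pac_bayes_bound(emp_losses, Q, P, N, delta=0.05, b=1.0):
    # Right hand side of (3) for a posterior Q over a finite hypothesis class with prior P.
    kl = sum(q * np.log(q / p) for q, p in zip(Q, P) if q > 0)   # D(Q||P)
    exp_emp_loss = np.dot(Q, emp_losses)                         # empirical loss of the randomized classifier
    return exp_emp_loss + b * np.sqrt((kl + 0.5 * np.log(4 * N) - np.log(delta)) / (2 * N))

N = 1000
P = np.full(8, 1.0 / 8)                                          # uniform prior over 8 hypotheses
emp_losses = np.array([0.30, 0.25, 0.27, 0.40, 0.45, 0.50, 0.33, 0.29])
Q = np.array([0.0, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0, 0.2])           # posterior concentrated on good hypotheses

print(occam_bound(emp_losses[1], P[1], N))
print(pac_bayes_bound(emp_losses, Q, P, N))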
3. A Formal Definition of Grid Clustering

Before proceeding to the results we provide a formal definition of grid clustering as used in this paper.

Definition 1. A grid clustering of the parameter space $\mathcal{X}_1 \times .. \times \mathcal{X}_d$ is a set of distributions $q_i(C_i|X_i)$ defining the probability of mapping $X_i \in \mathcal{X}_i$ to $C_i \in \{1,..,m_i\}$. If each of the $q_i(C_i|X_i)$ is deterministic, we call the clustering a deterministic grid clustering. Otherwise it is a stochastic grid clustering.

In the following sections we assume that some unknown joint probability distribution $p(Y, X_1,..,X_d)$ of the parameters and the label exists. The set of all possible labels is denoted by $\mathcal{Y}$ and its size is denoted by $n_y$. The size of $\mathcal{X}_i$ is denoted by $n_i$. The cardinality of $C_i$ is $m_i$. The value of each $m_i$ can vary in the range $1 \le m_i \le n_i$ for different partitions.

A hypothesis $h$ in a deterministic grid clustering is comprised of a set of deterministic mappings $q_i(C_i|X_i)$, for simplicity denoted by $q_i(X_i): \mathcal{X}_i \to \{1,..,m_i\}$, and a set of $\prod_i m_i$ labels, one for each partition cell. We denote the hypothesis space by $\mathcal{H}$ and decompose it as $\mathcal{H} = \mathcal{H}|_1 \times .. \times \mathcal{H}|_d \times \mathcal{H}|_{Y|\bar{m}}$. Here $\mathcal{H}|_i$ is the space of all possible partitions of $\mathcal{X}_i$, or, in other words, a projection of $\mathcal{H}$ onto dimension $i$. $\bar{m} = (m_1,..,m_d)$ is a vector of cardinalities of the partitions along each dimension and $\mathcal{H}|_{Y|\bar{m}}$ is the space of all possible labelings of the $\prod_i m_i$ partition cells. Similarly, $h \in \mathcal{H}$ is decomposed as $h = h|_1 \times .. \times h|_d \times h|_{y|\bar{m}}$. It is assumed that a loss function $l: \mathcal{Y}\times\mathcal{Y} \to \mathbb{R}_+$ is given. The loss of $h$, denoted by $L(h)$, is defined as an expectation over $p$ of $l$: $L(h) = E_p\, l(h(X_1,..,X_d), Y(X_1,..,X_d))$. The empirical loss of $h$ on a sample $S$ of size $N$ is denoted by $\hat{L}(h)$ and equals the average loss on the sample.

For stochastic mappings $q_i(C_i|X_i)$ it is assumed that a random realization of the mapping is done prior to the prediction. In other words, we choose a hypothesis $h$ at random by determining the values of $q_i(X_i)$ according to $q_i(C_i|X_i)$ before we make a prediction. $Q = \{\{q_i(C_i|X_i)\}_{i=1}^d, q(Y|C_1,..,C_d)\}$ collectively denotes a distribution over $\mathcal{H}$ associated with a randomized classifier called $Q$. The loss of $Q$ is denoted by $L(Q)$ and equals $L(Q) = E_{h\sim Q} L(h)$. The empirical loss of $Q$ is denoted by $\hat{L}(Q)$.

We define $q_i(C_i) = \frac{1}{n_i}\sum_{x_i} q_i(C_i|x_i)$ to be the marginal distribution over $C_i$ corresponding to a uniform distribution over $\mathcal{X}_i$ and the conditional distribution $q_i(C_i|X_i)$ of our choice. The entropy of a partition along a dimension $i$ with respect to a uniform distribution over $\mathcal{X}_i$ is then $H_U(C_i) \equiv H_U(q_i) = -\sum_{c_i} q_i(c_i)\ln q_i(c_i)$. The mutual information between $X_i$ and $C_i$ with respect to a uniform distribution over $\mathcal{X}_i$ is $I_U(X_i;C_i) = \frac{1}{n_i}\sum_{x_i,c_i} q_i(c_i|x_i)\ln[q_i(c_i|x_i)/q_i(c_i)]$.
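The quantities $q_i(C_i)$, $H_U(C_i)$ and $I_U(X_i;C_i)$ defined above are straightforward to compute from the conditional distribution $q_i(C_i|X_i)$. The sketch below (Python; the stochastic mapping is a hypothetical random one) does exactly that and is only meant to fix the definitions.

import numpy as np

rng = np.random.default_rng(1)
n_i, m_i = 20, 4

# Hypothetical stochastic mapping q_i(C_i | X_i): one row per value of X_i, rows sum to one.
q_cond = rng.dirichlet(np.ones(m_i), size=n_i)

# Marginal q_i(C_i) under a uniform distribution over X_i.
q_marg = q_cond.mean(axis=0)

# Entropy of the partition, H_U(C_i).
H_U = -np.sum(q_marg * np.log(q_marg))

# Mutual information I_U(X_i; C_i) = (1/n_i) * sum_{x,c} q(c|x) ln[q(c|x)/q(c)].
I_U = np.sum(q_cond * np.log(q_cond / q_marg)) / n_i

print(H_U, I_U)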
4. Generalization Bound for Multi-Classification with Grid Clustering

In this section we state and prove a generalization bound for multi-classification with stochastic grid clustering.

Theorem 3. For any probability measure $p$ over instances and for any loss function $l$ bounded by $b$, with a probability of at least $1-\delta$ over the selection of an i.i.d. sample $S$ of size $N$ according to $p$, for all randomized classifiers $Q = \{\{q_i(C_i|X_i)\}_{i=1}^d, q(Y|C_1,..,C_d)\}$:

$L(Q) \le \hat{L}(Q) + b\sqrt{\dfrac{\sum_i n_i H_U(C_i) + K}{2N}},$   (4)

$K = \sum_i\left(m_i\ln n_i + \dfrac{(\ln(n_i)+1)^2}{4}\right) + \left(\prod_i m_i\right)\ln n_y - \ln\delta.$   (5)

It is also possible to replace (4) in the theorem with:

$L(Q) \le \hat{L}(Q) + b\sqrt{\dfrac{\sum_i n_i I_U(X_i;C_i) + \frac{1}{2}\ln(4N) + K}{2N}}.$   (6)
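Before turning to the proof, the following sketch (Python; the two-feature clustering and the empirical loss are hypothetical) shows how the right hand sides of (4) and (6) are evaluated for a concrete stochastic grid clustering, reusing the computation of $H_U(C_i)$ and $I_U(X_i;C_i)$ from the previous section.

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def bound_terms(q_conds):
    # For a list of conditional mappings q_i(C_i|X_i): sums of n_i*H_U and n_i*I_U, and the pieces of K.
    sum_nH, sum_nI, K_sum, prod_m = 0.0, 0.0, 0.0, 1.0
    for q in q_conds:
        n_i, m_i = q.shape
        q_marg = q.mean(axis=0)
        sum_nH += n_i * entropy(q_marg)
        sum_nI += np.sum(q * np.log(np.maximum(q, 1e-300) / q_marg))
        K_sum += m_i * np.log(n_i) + (np.log(n_i) + 1) ** 2 / 4
        prod_m *= m_i
    return sum_nH, sum_nI, K_sum, prod_m

def theorem3_bounds(emp_loss, q_conds, n_y, N, delta=0.05, b=1.0):
    sum_nH, sum_nI, K_sum, prod_m = bound_terms(q_conds)
    K = K_sum + prod_m * np.log(n_y) - np.log(delta)
    bound4 = emp_loss + b * np.sqrt((sum_nH + K) / (2 * N))
    bound6 = emp_loss + b * np.sqrt((sum_nI + 0.5 * np.log(4 * N) + K) / (2 * N))
    return bound4, bound6

rng = np.random.default_rng(2)
q_conds = [rng.dirichlet(np.ones(5), size=100), rng.dirichlet(np.ones(5), size=100)]  # two features
print(theorem3_bounds(emp_loss=0.2, q_conds=q_conds, n_y=3, N=5000))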
Proof. The bounds (4) and (6) are direct consequences of (2) and (3), respectively, for an appropriate choice of a prior $P$ over $\mathcal{H}$. The main part of the proof is to define a prior $P$ that provides a meaningful complexity-related slicing of $\mathcal{H}$ and then to calculate $-E_Q\ln P(h)$ for (4) and $D(Q\|P)$ for (6).

To define the prior $P$ over $\mathcal{H}$ we count the hypotheses in $\mathcal{H}$. For a fixed partition there are $n_y^{\prod_i m_i}$ possibilities to assign the labels to the partition cells. There are $n_i$ possibilities to choose the number of clusters along a dimension $i$. There are at most $\binom{n_i+m_i-1}{m_i-1} \le n_i^{m_i-1}$ possibilities to choose a cluster cardinality profile along a dimension $i$. (This is the number of possibilities to place $m_i-1$ ones in a sequence of $n_i+m_i-1$ ones and zeros, where the ones symbolize a partition of the zeros ("balls") into $m_i$ bins.) We take the $n_i^{m_i-1}$ bound for simplicity. For a fixed cardinality profile $|c_{i1}|,..,|c_{im_i}|$ (over a single dimension) there are $\binom{n_i}{|c_{i1}|,..,|c_{im_i}|}$ possibilities to assign the $X_i$-s to the clusters. This multinomial coefficient can be bounded from above by $e^{n_i H_U(C_i)}$, where for a deterministic partition $H_U(C_i)$ is the entropy of its normalized cluster cardinality profile (see (Cover & Thomas, 1991, page 284) for an elegant proof). Putting all the combinatorial calculations together, it is possible to define a distribution $P(h)$ over $\mathcal{H}$ that satisfies:

$P(h) \ge \exp\left[-\sum_i\left(n_i H_U(C_i) + m_i\ln n_i\right) - \left(\prod_i m_i\right)\ln n_y\right].$   (7)

We pause to stress that, unlike in most applications of the PAC-Bayesian bound, in our case the prior $P$ and the posterior $Q$ are defined over slightly different hypothesis spaces. The posterior $Q$ is defined for named clusterings: we explicitly specify for each $X_i$ the "name" of the $C_i$ it is mapped to. Whereas the prior $P$ is defined over unnamed partitions: we only check the cardinality profile of $C_i$, but we cannot recover which $X_i$-s are mapped to a given $C_i$. Nevertheless, the "named" distribution $Q$ induces a distribution over the "unnamed" space by summing up over all possible name permutations. This enables us to compute the $-E_Q\ln P(h)$ we need for the bound.

We now turn to bound $-E_Q\ln P(h)$. This is done by showing that $Q$ is concentrated around the hypotheses (hard partitions) $h$ for which the entropies of the partitions are close to the entropies $H_U(q_i)$. By the decomposition property we can write $P(h) = P(h|_1)\cdots P(h|_d)P(h|_{y|\bar{m}})$, and similarly for $Q$. Then $-E_Q\ln P(h) = -\sum_i E_Q\ln P(h|_i) - E_Q\ln P(h|_{y|\bar{m}})$, and similarly for $D(Q\|P)$. The last term is easy to compute since $P$ is uniform over $\mathcal{H}|_{Y|\bar{m}}$ and $Q$ is defined for a fixed $\bar{m}$. Therefore, $-E_Q\ln P(h|_{y|\bar{m}}) = (\prod_i m_i)\ln n_y$. For the first $d$ terms we need to compute, or at least to bound, $Q(h|_i)$.

Recall that $h|_i$ is obtained from $Q$ by drawing a cluster $C_i$ for each $X_i\in\mathcal{X}_i$ independently according to the distribution $q_i(C_i|X_i)$. Let $\hat{q}_i = \{\frac{|c_{i1}|}{n_i},..,\frac{|c_{im_i}|}{n_i}\}$ denote the empirical cluster cardinality profile along a dimension $i$ obtained by such an assignment. Then:

$E_{\hat{q}_i}\hat{H}(\hat{q}_i) = H_U(q_i) - E_{\hat{q}_i} D(\hat{q}_i\|q_i) \le H_U(q_i),$   (8)

where $\hat{H}(\hat{q}_i) = -\sum_{c_i}\hat{q}_i(c_i)\ln\hat{q}_i(c_i)$ and $D(\hat{q}_i\|q_i) = \sum_{c_i}\hat{q}_i(c_i)\ln\frac{\hat{q}_i(c_i)}{q_i(c_i)}$. And also:

$P_{\hat{q}_i}\left\{\hat{H}(\hat{q}_i) - E\hat{H}(\hat{q}_i) \ge \epsilon\right\} \le e^{-2n_i\epsilon^2/(\ln(n_i)+1)^2}.$   (9)

The latter inequality follows from the fact that the empirical entropy $\hat{H}(\hat{q}_i)$ satisfies a bounded differences property with a constant equal to $\frac{\ln(n_i)+1}{n_i}$. See (Paninski, 2003) for a more detailed proof of (8) and (9).

Now, if $\hat{q}_i$ is the cardinality profile of $h|_i$, denote by $Q(\hat{q}_i) = P_{q_i}\{\hat{q}_i\}$ the total probability under $Q$ of drawing a partition with the profile $\hat{q}_i$. Let $\epsilon(\hat{q}_i) = \max\{0, \hat{H}(\hat{q}_i) - H_U(q_i)\}$. Since $\hat{H}(\hat{q}_i) - H_U(q_i) \le \hat{H}(\hat{q}_i) - E\hat{H}(\hat{q}_i)$ by (8), from (9) we have $Q(\hat{q}_i) \le e^{-2n_i\epsilon(\hat{q}_i)^2/(\ln(n_i)+1)^2}$. Since $P(h|_i)$ depends on $h|_i$ only through its cardinality profile, we can group the hard partitions by their profiles and obtain:

$-E_Q\ln P(h|_i) = -\sum_{h|_i\in\mathcal{H}|_i} Q(h|_i)\ln P(h|_i)$
$\le \sum_{\hat{q}_i} Q(\hat{q}_i)\left(n_i\hat{H}(\hat{q}_i) + m_i\ln n_i\right)$
$= \sum_{\hat{q}_i} Q(\hat{q}_i)\left[n_i H_U(q_i) + m_i\ln n_i + n_i\left(\hat{H}(\hat{q}_i)-H_U(q_i)\right)\right]$
$\le n_i H_U(q_i) + m_i\ln n_i + \sum_{\hat{q}_i} Q(\hat{q}_i)\, n_i\epsilon(\hat{q}_i)$
$\le n_i H_U(q_i) + m_i\ln n_i + \sum_{\hat{q}_i} n_i\epsilon(\hat{q}_i)\, e^{-2n_i\epsilon(\hat{q}_i)^2/(\ln(n_i)+1)^2}$
$\le n_i H_U(q_i) + m_i\ln n_i + n_i\int_0^\infty \epsilon\, e^{-2n_i\epsilon^2/(\ln(n_i)+1)^2}\, d\epsilon$
$= n_i H_U(C_i) + m_i\ln n_i + \dfrac{(\ln(n_i)+1)^2}{4}.$

This completes the proof of (4). For (6) what remains is to compute $E_Q\ln Q(h|_i)$. To do so we bound $\ln Q(\hat{q}_i)$ from above. The bound follows from the fact that if we draw $n_i$ values of $C_i$ according to $q_i(C_i|X_i)$, the probability of the resulting type $\hat{q}_i$ is bounded from above by $e^{-n_i H_U(C_i|X_i)}$, where $H_U(C_i|X_i) = -\frac{1}{n_i}\sum_{x_i,c_i} q_i(c_i|x_i)\ln q_i(c_i|x_i)$ (see Theorem 12.1.2 in (Cover & Thomas, 1991)). Thus $E_Q\ln Q(h|_i) \le -n_i H_U(C_i|X_i)$, which together with the identity $I_U(X_i;C_i) = H_U(C_i) - H_U(C_i|X_i)$ completes the proof of (6).
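A quick way to get intuition for (8) and (9) is to simulate the random assignment: draw a hard partition from $q_i(C_i|X_i)$ many times and compare the profile entropy $\hat{H}(\hat{q}_i)$ with $H_U(q_i)$. The sketch below (Python; the mapping, the deviation level and the number of trials are hypothetical) is an illustration of the concentration statement, not part of the proof.

import numpy as np

rng = np.random.default_rng(3)
n_i, m_i, trials, eps = 200, 4, 1000, 0.1

# Hypothetical stochastic mapping q_i(C_i | X_i).
q_cond = rng.dirichlet(np.ones(m_i), size=n_i)
q_marg = q_cond.mean(axis=0)
H_U = -np.sum(q_marg * np.log(q_marg))

def profile_entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

entropies = []
for _ in range(trials):
    # Draw one cluster per value of X_i to obtain a hard partition h|_i, then take its cardinality profile.
    draws = np.array([rng.choice(m_i, p=row) for row in q_cond])
    entropies.append(profile_entropy(np.bincount(draws, minlength=m_i)))
entropies = np.array(entropies)

print("H_U(q_i)                :", H_U)
print("mean profile entropy    :", entropies.mean())                       # (8): at most H_U(q_i)
print("observed P[dev >= eps]  :", np.mean(entropies - entropies.mean() >= eps))
print("bound from (9)          :", np.exp(-2 * n_i * eps**2 / (np.log(n_i) + 1) ** 2))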
5. An Optimal Solution for a Single Feature and Feature Ranking

In this section we show that if there is only one parameter $X$ (i.e., $d = 1$) a globally optimal (from a generalization point of view) classification rule may be efficiently found by examining the "direct" mappings $q(Y|X)$. In other words, for a single parameter there is no need for intermediate clustering. The obtained result is used in the applications section for feature ranking. It is shown there that the bound follows the shape of the true error of a classifier based on a single feature extremely well and is much more precise than mutual information or normalized correlation indices.

To prove the optimality of direct mappings we start with the observation that for any clustering $C$ a classification rule $q(Y|X)$ defined as

$q(y|x) = \sum_c q(c|x)\, q(y|c)$   (10)

achieves the same loss as the loss of a hypothesis $h$ based on the clustering $C$. Therefore, the space of all direct mappings $q(Y|X)$ incorporates all possible solutions that may be achieved via intermediate clustering. It remains to show that the generalization power of the direct mappings is not worse than the generalization power of clustering-based solutions and that the global optimum may be efficiently found.

To analyze the generalization power of a direct mapping we define $n_y$ clusters $c_y$, one for each label $y\in\mathcal{Y}$, i.e., $C_y = \{c_y : y\in\mathcal{Y}\}$. All instances $x$ mapped to a cluster $c_y$ obtain the label $y$. Thus the clustering $C_y$ is identified with the labeling $Y$, in particular $q(C_y|X) = q(Y|X)$, and we can replace $H_U(C_y)$ in (4) with $H_U(Y)$ and $I_U(X;C_y)$ in (6) with $I_U(X;Y)$. Moreover, in our construction there are only $n_y!$ possibilities to assign the labels to the clusters and not $n_y^{n_y}$ as in the case of a general clustering. In addition, the cardinality of $C_y$ is fixed at $n_y$ and does not change from 1 to $n$, where $n$ is the cardinality of $\mathcal{X}$. This further reduces a $\ln(n)$ factor from the bound. Thus, the definition of $K$ in (5) is improved to:

$K_y = \ln\binom{n+n_y-1}{n_y-1} + \dfrac{(\ln n + 1)^2}{4} + \ln(n_y!) - \ln\delta.$   (11)

(We used the tighter bound $\binom{n+n_y-1}{n_y-1}$ instead of $n^{n_y-1}$ on the number of partitions.) And we get:

$L(Q) \le \hat{L}(Q) + b\sqrt{\dfrac{n H_U(Y) + K_y}{2N}}$   (12)

instead of (4) and

$L(Q) \le \hat{L}(Q) + b\sqrt{\dfrac{n I_U(X;Y) + K_y'}{2N}}$   (13)

instead of (6), for $K_y' = K_y + \frac{1}{2}\ln(4N)$.

For any other clustering $C$ the direct mapping $q(Y|X)$ defined by (10) satisfies $I_U(X;C) \ge I_U(X;Y)$ by the information processing inequality (Cover & Thomas, 1991). Furthermore, since in $\mathcal{H}$ every partition cell gets a single label, $H_U(Y|C) = 0$. Therefore, $H_U(Y) \le H_U(C)$ because $H_U(Y) = H_U(Y) - H_U(Y|C) = I_U(C;Y) = H_U(C) - H_U(C|Y) \le H_U(C)$. Adding the fact that the empirical losses are equal for the clustering-based classification and the associated direct mapping, we obtain that both (12) and (13) for the direct mapping are tighter than (4) and (6) for the corresponding clustering solution.

We can further optimize (13) by looking for an optimal classification rule $q(Y|X)$ that minimizes it. The minimum is achieved by iteration of the following self-consistent equations, where $\hat{p}(x,y)$ is the empirical joint distribution of $X$ and $Y$ (the derivation is done by taking a derivative of the bound with respect to $q(Y|X)$ and is omitted due to lack of space; a small implementation sketch of the iterations appears at the end of this section):

$q(y|x) = \dfrac{q(y)}{Z(x)}\, e^{-\frac{2}{b}\sqrt{2N\left(n I_U(X;Y)+K_y'\right)}\,\sum_{y'}\hat{p}(x,y')\,l(y',y)},$   (14)

$q(y) = \dfrac{1}{n}\sum_x q(y|x), \qquad I_U(X;Y) = \dfrac{1}{n}\sum_{x,y} q(y|x)\ln\dfrac{q(y|x)}{q(y)},$

and $Z(x)$ is a normalization factor. Although $\sqrt{n I_U(X;Y)+K_y'}$ is not necessarily convex, in our experiments the iterations always converged to a global optimum. It is also possible to optimize a parameterized tradeoff $\hat{L}(Q) + \beta I_U(X;Y)$, which is convex since both the mutual information $I_U(X;Y)$ and the empirical loss $\hat{L}(Q)$ are convex with respect to $q(Y|X)$. A linear search over $\beta$ then leads to a global optimum of (13).

Note that the direct mapping is no longer optimal when there is more than one parameter. For example, for two parameters $X_1, X_2$, each with a cardinality $n$, the conditional distribution $p(Y|X_1,X_2)$ is defined over a product space of size $n^2 n_y$. This requires at least an order of $n^2 n_y$ samples, a number quadratic in $n$, for direct inference to be possible. However, from (4) and (6) it follows that with grid clustering, for relatively small cluster cardinalities $m_i$, it may be possible to achieve reliable estimations when the sample size $N$ is only linear in $n$. This is further discussed in the next section.

A related bound for generalization in prediction by a single feature is suggested in (Sabato & Shalev-Shwartz, 2007). Sabato and Shalev-Shwartz designed an estimator for the loss of a prediction rule based on the empirical frequencies $q_{emp}(y|x) = \hat{p}(y|x)$. They prove that their estimate is at most $O\left(\sqrt{\frac{\ln(N/\delta)\ln(1/\delta)}{N}}\right)$ far from the generalization error of $q_{emp}$. Compared to their work, a strong advantage of the bounds (12) and (13) is that they hold for any prediction rule $q(Y|X)$. In particular, they hold for the maximum likelihood prediction $q_{ml}(x) = \arg\max_y \hat{p}(y|x)$, which performs much better than $q_{emp}$ in practice.
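The self-consistent iteration (14) is simple to implement. The following sketch (Python; zero-one loss, hypothetical toy counts, a fixed number of iterations) alternates between updating $q(y|x)$, $q(y)$ and $I_U(X;Y)$; it is a minimal illustration of the update, without convergence checks or the $\beta$-tradeoff variant mentioned above.

import numpy as np
from math import lgamma

def K_y_prime(n, n_y, N, delta=0.05):
    # K_y from (11) plus the 0.5*ln(4N) term used in (13).
    log_binom = lgamma(n + n_y) - lgamma(n_y) - lgamma(n + 1)   # ln C(n+n_y-1, n_y-1)
    return log_binom + (np.log(n) + 1) ** 2 / 4 + lgamma(n_y + 1) - np.log(delta) + 0.5 * np.log(4 * N)

def minimize_bound_13(p_hat_xy, N, b=1.0, delta=0.05, iters=200):
    # Iterate the self-consistent equations (14) for a single feature under the zero-one loss.
    n, n_y = p_hat_xy.shape
    loss = 1.0 - np.eye(n_y)                          # l(y', y)
    expected_loss = p_hat_xy @ loss                   # sum_{y'} p_hat(x, y') l(y', y)
    q_yx = np.full((n, n_y), 1.0 / n_y)               # initial q(y|x)
    Kp = K_y_prime(n, n_y, N, delta)
    for _ in range(iters):
        q_y = q_yx.mean(axis=0)                       # q(y) = (1/n) sum_x q(y|x)
        I_U = np.sum(q_yx * np.log(q_yx / q_y)) / n
        scale = (2.0 / b) * np.sqrt(2 * N * (n * I_U + Kp))
        logits = np.log(q_y) - scale * expected_loss
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        q_yx = np.exp(logits)
        q_yx /= q_yx.sum(axis=1, keepdims=True)       # division by Z(x)
        q_yx = np.maximum(q_yx, 1e-300)               # guard against underflow in the next log
    return q_yx

rng = np.random.default_rng(4)
counts = rng.integers(1, 20, size=(30, 3))            # hypothetical empirical counts over (X, Y)
p_hat_xy = counts / counts.sum()
print(minimize_bound_13(p_hat_xy, N=int(counts.sum()))[:3])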
6. A Bound for Estimation of a Joint Probability Distribution in Grid Clustering

For a fixed set of mappings $\{q_i(C_i|X_i)\}$ denote by $p(Y,\bar{C})$ the joint probability distribution of $Y$ and $\bar{C}$, where $\bar{C}$ stands for $C_1,..,C_d$ for brevity. Denote by $\hat{p}(Y,\bar{C})$ its empirical counterpart. Clearly, $p(Y,\bar{C})$ is determined by $p(Y,X_1,..,X_d)$ and the set $\{q_i(C_i|X_i)\}$. In this section we bound the deviation between $p(Y,\bar{C})$ and its empirical estimation.

Theorem 4. For any probability measure $p$ over instances and an i.i.d. sample $S$ of size $N$ according to $p$, with a probability of at least $1-\delta$, for all grid clusterings $Q = \{q_i(C_i|X_i)\}_{i=1}^d$ the following holds:

$D(\hat{p}(Y,\bar{C})\,\|\,p(Y,\bar{C})) \le \dfrac{\sum_i n_i H_U(C_i) + K_2}{N},$   (15)

$K_2 = \sum_i\left(m_i\ln n_i + \dfrac{(\ln(n_i)+1)^2}{4}\right) + \left(\prod_i m_i\right)\ln n_y + n_y\left(\prod_i m_i\right)\ln(N+1) - \ln\delta.$   (16)

Proof. The proof is based on the law of large numbers cited below (Cover & Thomas, 1991).

Theorem 5 (The Law of Large Numbers). Let $Z_1,..,Z_N$ be i.i.d. distributed by $p(Z)$. Then:

$P\{D(\hat{p}(Z)\,\|\,p(Z)) > \epsilon\} \le e^{-N\epsilon + |\mathcal{Z}|\ln(N+1)},$   (17)

where $|\mathcal{Z}|$ stands for the cardinality of $Z$.

Note that the cardinality of the random variable $(Y,\bar{C})$ is $n_y\prod_i m_i$. For the proof of Theorem 4 we require that the right hand side of (17) be smaller than $P(h)\delta$. Application of a union bound and reversion of the requirement on $\epsilon$ bounds $D(\hat{p}(Y,\bar{C})\,\|\,p(Y,\bar{C}))$ for the case of hard partitions by $\frac{n_y(\prod_i m_i)\ln(N+1) - \ln(P(h)\delta)}{N}$ for all $h$. Since $D(p\|q)$ is convex in the pair $(p,q)$ (Cover & Thomas, 1991, Theorem 2.7.2), for soft partitions $D(\hat{p}(Y,\bar{C})\,\|\,p(Y,\bar{C}))$ is bounded from above by $E_Q\frac{n_y(\prod_i m_i)\ln(N+1) - \ln(P(h)\delta)}{N}$. The calculation of $-E_Q\ln P(h)$ done earlier completes the proof.

Applying the inequality $\|P_1 - P_2\|_1 \le \sqrt{2 D(P_1\|P_2)}$ relating the $L_1$ norm and the KL divergence (see (Cover & Thomas, 1991)) we obtain a bound on the variational distance.

Corollary 1. Under the conditions of Theorem 4:

$\|\hat{p}(Y,\bar{C}) - p(Y,\bar{C})\|_1 \le \sqrt{\dfrac{2\left(\sum_i n_i H_U(C_i) + K_2\right)}{N}}.$   (18)
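The deviation bound is easy to evaluate numerically. The sketch below (Python; the two-feature deterministic grid clustering and the synthetic joint distribution are hypothetical) computes $D(\hat{p}(Y,\bar{C})\,\|\,p(Y,\bar{C}))$ on a simulated sample and compares it with the right hand side of (15).

import numpy as np

rng = np.random.default_rng(5)
n1, n2, m1, m2, n_y, N, delta = 50, 40, 3, 3, 2, 20000, 0.05

# Hypothetical deterministic grid clustering and a synthetic joint distribution p(Y, X1, X2).
q1 = rng.integers(0, m1, size=n1)
q2 = rng.integers(0, m2, size=n2)
p_joint = rng.dirichlet(np.ones(n_y * n1 * n2)).reshape(n_y, n1, n2)

def cluster_dist(p):
    # Marginalize a distribution over (Y, X1, X2) onto the partition cells (Y, C1, C2).
    out = np.zeros((n_y, m1, m2))
    for y in range(n_y):
        for x1 in range(n1):
            for x2 in range(n2):
                out[y, q1[x1], q2[x2]] += p[y, x1, x2]
    return out

p_c = cluster_dist(p_joint)
sample = rng.choice(n_y * n1 * n2, size=N, p=p_joint.ravel())
p_hat = cluster_dist(np.bincount(sample, minlength=n_y * n1 * n2).reshape(p_joint.shape) / N)
kl = np.sum(np.where(p_hat > 0, p_hat * np.log(np.maximum(p_hat, 1e-300) / np.maximum(p_c, 1e-300)), 0.0))

def H_U(assign, m):
    # For a deterministic clustering, H_U(C_i) is the entropy of the normalized cluster sizes.
    p = np.bincount(assign, minlength=m) / len(assign)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

K2 = (m1 * np.log(n1) + (np.log(n1) + 1) ** 2 / 4 + m2 * np.log(n2) + (np.log(n2) + 1) ** 2 / 4
      + m1 * m2 * np.log(n_y) + n_y * m1 * m2 * np.log(N + 1) - np.log(delta))
rhs = (n1 * H_U(q1, m1) + n2 * H_U(q2, m2) + K2) / N
print(kl, rhs)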
7. Generalization Bound for the Logarithmic Loss in Grid Clustering

The goal of this section is to provide a bound on the logarithmic loss $-E\ln\hat{p}(Y|\bar{C}) = -\sum_{y,\bar{c}} p(y,\bar{c})\ln\hat{p}(y|\bar{c})$. This loss corresponds to the prediction (and compression) power of the hypothesis. Since $\ln$ is an unbounded function and $\hat{p}(y|\bar{c})$ is not bounded away from zero, we define a smoothed distribution:

$p_\gamma(y|\bar{c}) = \dfrac{\hat{p}(y|\bar{c}) + \gamma}{\gamma n_y + 1},$

where $\gamma > 0$ is the smoothing parameter. To complete the definition: $p_\gamma(\bar{c}) = \hat{p}(\bar{c})$ and $p_\gamma(y,\bar{c}) = p_\gamma(\bar{c})\, p_\gamma(y|\bar{c})$. Instead of proving the bound for $\hat{p}$ it will be proved for $p_\gamma$:

$-E\ln p_\gamma(Y|\bar{C}) = -\sum_{y,\bar{c}} p(y,\bar{c})\ln p_\gamma(y|\bar{c})$
$= \sum_{y,\bar{c}}\left(\hat{p}(y,\bar{c}) - p(y,\bar{c})\right)\ln p_\gamma(y|\bar{c}) - \sum_{y,\bar{c}}\hat{p}(y,\bar{c})\ln p_\gamma(y|\bar{c})$
$\le \dfrac{1}{2}\|p(Y,\bar{C}) - \hat{p}(Y,\bar{C})\|_1\ln\dfrac{\gamma n_y + 1}{\gamma} - \sum_{y,\bar{c}}\hat{p}(y,\bar{c})\ln(\hat{p}(y|\bar{c})+\gamma) + \ln(\gamma n_y + 1)$   (19)
$\le \hat{H}(Y|\bar{C}) + \epsilon\ln\dfrac{1}{\gamma} + (\epsilon+1)\ln(\gamma n_y + 1),$   (20)

where inequality (19) is justified by (18) and $\epsilon$ is defined as half of its right hand side, and $\hat{H}(Y|\bar{C})$ stands for the empirical estimate of the entropy of $Y$ given $\bar{C}$. Equation (20) is minimized for $\gamma = \epsilon/n_y$, when we get:

$-E\ln p_\gamma(Y|\bar{C}) \le \hat{H}(Y|\bar{C}) + \epsilon\ln\dfrac{n_y}{\epsilon} + (\epsilon+1)\ln(\epsilon+1).$   (21)

One natural application of the bound (21), to be studied in future work, is to the broadly used "bag-of-words" models, where a decision is made based on multiple observations with a conditional independence assumption on the observations given the label. For example, in the bag-of-words model for document classification by topic we assume that the words are independent given a topic (the label $Y$). There is a single parameter $X$ coming from the space of all words, but the classification is based on multiple observations of this parameter: all the words in the document. Since we have a single parameter we can resort to the direct mappings $q(Y|X)$, as in Section 5. Usually, the topic that maximizes the log likelihood of all the words in a document is assigned. After simple algebraic manipulations this can be translated to maximization of a sum of $\ln[q(y|x)]$ over the document (up to correct normalization by $\ln[q(y)]$), which is directly related to the expectation bounded in (21).

A related work in this context (Shamir, Sabato & Tishby, 2008) uses different techniques to derive a bound for $|H(Y|C) - \hat{H}(Y|C)|$. We note that $H(Y|C) = -E\ln p(Y|C)$ is the minimal logarithmic loss that could be achieved if we knew the true joint distribution $p(Y,C)$. Thus, (Shamir et al., 2008) give a lower bound on the performance of any prediction model based on grid clustering, whereas (21) is an upper bound on the performance of the prediction strategy $p_\gamma(Y|\bar{C})$.
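A minimal sketch (Python) of how the pieces of this section fit together: take $\epsilon$ as half of the right hand side of (18), form the smoothed distribution $p_\gamma$ with $\gamma = \epsilon/n_y$, and evaluate the right hand side of (21). The clustered counts are hypothetical, and the quantities sum_nH_U and K2, which would come from (15) and (16) for an actual clustering, are passed in as placeholder constants.

import numpy as np

def log_loss_bound(counts_yc, sum_nH_U, K2, N, n_y):
    # counts_yc: empirical counts of (Y, cell). Returns gamma, p_gamma(y|c), and the bound (21).
    p_hat_yc = counts_yc / N
    p_hat_c = p_hat_yc.sum(axis=0, keepdims=True)
    p_hat_y_given_c = np.divide(p_hat_yc, p_hat_c,
                                out=np.full_like(p_hat_yc, 1.0 / n_y), where=p_hat_c > 0)
    # Empirical conditional entropy H_hat(Y|C).
    H_hat = -np.sum(np.where(p_hat_yc > 0,
                             p_hat_yc * np.log(np.maximum(p_hat_y_given_c, 1e-300)), 0.0))
    # epsilon is half of the right hand side of (18); gamma = epsilon / n_y minimizes (20).
    eps = 0.5 * np.sqrt(2 * (sum_nH_U + K2) / N)
    gamma = eps / n_y
    p_gamma = (p_hat_y_given_c + gamma) / (gamma * n_y + 1)
    bound21 = H_hat + eps * np.log(n_y / eps) + (eps + 1) * np.log(eps + 1)
    return gamma, p_gamma, bound21

rng = np.random.default_rng(6)
n_y, cells, N = 3, 9, 5000
counts = rng.multinomial(N, rng.dirichlet(np.ones(n_y * cells))).reshape(n_y, cells)
print(log_loss_bound(counts, sum_nH_U=120.0, K2=80.0, N=N, n_y=n_y)[2])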
8. Applications

In this section we provide a series of applications of the bounds (12) and (13) to prediction by a single feature and to feature ranking, as suggested in Section 5. We use (12) to bound the generalization error of the maximum likelihood classification rule. For the zero-one loss the maximum likelihood rule $q_{ml}(X)$ returns for each value of $x$ the most frequent value of $Y$ that appeared with that $x$ in the sample: $q_{ml}(x) = \arg\max_y\hat{p}(y|x)$. We also use the iterations (14) to find a classification rule $q^*(Y|X)$ that minimizes (13).

The experiments were conducted on four datasets obtained from the UCI Machine Learning Repository: Contraceptive Method Choice (CMC), Mushrooms, Letters and Nursery. In all the experiments we use 5 random partitions of the data into 80% train and 20% test subsets. Table 1 provides a short summary of the main parameters of the datasets. See (Asuncion & Newman, 2007) for a full description.

Table 1. Description of the datasets: for every set we give the number of features $d$, a list of cardinalities of the features $n_i$, the number of labels $n_y$, and the train set size $N$, which is 80% of the dataset size.

Data set   | d  | $n_i$-s                                                               | $n_y$ | N
CMC        | 9  | 34, 4, 4, 15, 2, 2, 4, 4, 2                                           | 3     | 1,178
Mushrooms  | 22 | 6, 4, 10, 2, 9, 2, 2, 2, 12, 2, 5, 4, 4, 9, 9, 1, 4, 3, 5, 9, 6, 7, 2 | 2     | 6,499
Letters    | 16 | 16 for all $n_i$-s                                                    | 26    | 16,000
Nursery    | 8  | 3, 5, 4, 4, 3, 2, 3, 3                                                | 5     | 10,368

Figure 1 shows the training loss and the test loss of the maximum likelihood classification rule $q_{ml}(Y|X)$ for the four datasets considered. We stress that the maximum likelihood rule is calculated per parameter; actually there are $d$ maximum likelihood rules $q_{ml}(Y|X_i)$, one for each parameter $i$ of a given problem. Along with the test loss we draw the value of the bound (12). Note that the bound is quite tight and follows the shape of the test loss remarkably well in all the cases. The gap between the bound and the test loss is less than 0.1.

The same figure includes an additional triplet of lines: the training loss, the test loss, and the value of the bound (13) corresponding to the classification rule $q^*(Y|X)$ that minimizes (13). The performance of $q^*$ is very close to the performance of $q_{ml}$, and the value of (13) is very close to the value of (12), with a small advantage to $q_{ml}$ and (12) on average.

Figure 1. Application of bounds (12) and (13). [Panels: (a) CMC Dataset, (b) Mushrooms Dataset, (c) Letters Dataset, (d) Nursery Dataset; in each panel the loss is plotted against the parameter ID.] This figure displays an application of bounds (12) and (13) to the four datasets discussed in the text. The legend in subfigure (d) corresponds to all the graphs. The graphs contain the training loss $\hat{L}(q_{ml})$, the test loss $L(q_{ml})$ and the value of the bound (12) for the maximum likelihood prediction rule $q_{ml}(x) = \arg\max_y\hat{p}(y|x)$. A second triplet on the graphs corresponds to $\hat{L}(q^*)$, $L(q^*)$, and the value of the bound (13) for the prediction rule $q^*(Y|X)$ that minimizes (13). Baseline corresponds to the performance level that can be achieved by predicting the test labels using the marginal distribution of $Y$ on the train set. All the calculations are done per parameter. For better visibility of the points they have been connected with lines, but the lines have no meaning.

We conclude this section by comparing the bound (12) applied to feature ranking with the standard empirical mutual information $\hat{I}(X;Y) = \sum_{x,y}\hat{p}(x)\hat{p}(y|x)\ln\frac{\hat{p}(y|x)}{\hat{p}(y)}$ and the normalized correlation coefficient $Corr(X;Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}$ indices. We compare the agreement between the top-1, top-2, and top-3 parameter subsets suggested by the indices with the corresponding test-based sets (Figure 2). For the top-1 choice (the best single parameter) our bound is clearly superior: it provides a significant level of success in two cases where the other two indices completely fail. For the top-2 choice there is a slight advantage over the mutual information and a clear advantage over the normalized correlation. In top-3 the bound performs similarly to the mutual information and is still superior to the normalized correlation.

Figure 2. Feature Ranking. Agreement of $Corr(X;Y)$, $\hat{I}(X;Y)$, and the bound (12) with the test set on the top-1, top-2, and top-3 feature subsets, for the CMC, Mushrooms, Letters and Nursery datasets.
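The per-feature ranking experiment above can be sketched as follows (Python; the categorical data are hypothetical, the zero-one loss is used with $b = 1$): for each feature compute the training loss of $q_{ml}$, the value of the bound (12) for the induced label clustering, and, for comparison, the empirical mutual information $\hat{I}(X;Y)$.

import numpy as np
from math import lgamma

def per_feature_ranking(X, y, n_y, delta=0.05):
    # X: integer array (N, d); y: integer array (N,). Returns (feature, train loss of q_ml, bound (12), I_hat).
    N, d = X.shape
    results = []
    for i in range(d):
        values, inv = np.unique(X[:, i], return_inverse=True)
        n = len(values)
        counts = np.zeros((n, n_y))
        np.add.at(counts, (inv, y), 1)
        train_loss = 1.0 - counts.max(axis=1).sum() / N          # loss of q_ml(x) = argmax_y p_hat(y|x)
        # H_U(Y): entropy of the predicted-label distribution under a uniform distribution over feature values.
        p_y_uniform = np.bincount(counts.argmax(axis=1), minlength=n_y) / n
        p_y_uniform = p_y_uniform[p_y_uniform > 0]
        H_U_Y = -np.sum(p_y_uniform * np.log(p_y_uniform))
        K_y = (lgamma(n + n_y) - lgamma(n_y) - lgamma(n + 1)
               + (np.log(n) + 1) ** 2 / 4 + lgamma(n_y + 1) - np.log(delta))
        bound12 = train_loss + np.sqrt((n * H_U_Y + K_y) / (2 * N))
        # Empirical mutual information I_hat(X; Y) for comparison.
        p_xy = counts / N
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        I_hat = np.sum(np.where(p_xy > 0,
                                p_xy * np.log(np.maximum(p_xy, 1e-300) / np.maximum(p_x * p_y, 1e-300)), 0.0))
        results.append((i, train_loss, bound12, I_hat))
    return results

rng = np.random.default_rng(7)
X = rng.integers(0, 6, size=(1000, 4))                            # hypothetical categorical data
y = (X[:, 0] + rng.integers(0, 2, size=1000)) % 3                 # label depends mainly on feature 0
for row in per_feature_ranking(X, y, n_y=3):
    print(row)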
9. Discussion

This paper derives generalization bounds for multi-classification based on grid clustering. The bounds enable the evaluation of clustering solutions based on the generalization properties of a built-on classifier. We acknowledge that the $(\prod_i m_i)\ln n_y$ term in the bounds limits their applicability to problems with relatively few dimensions. Nevertheless, this domain contains enough challenges, such as feature ranking, where our bounds are especially tight, collaborative filtering, and many more. An interesting direction for future work would be to extend the applicability of the approach to higher dimensions by utilizing dependencies between the parameters.

Acknowledgements: We thank Ohad Shamir for useful discussions. This work was partially supported by the Leibnitz Center for Research in Computer Science and the NATO SfP Programme.

References

Asuncion, A., & Newman, D. (2007). UCI machine learning repository. http://www.ics.uci.edu/mlearn/MLRepository.html.

Blanchard, G., & Fleuret, F. (2007). Occam's hammer. COLT.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. John Wiley & Sons.

Langford, J. (2005). Tutorial on practical prediction theory for classification. JMLR, 6.

Maurer, A. (2004). A note on the PAC-Bayesian theorem. www.arxiv.org.

McAllester, D. (1999). Some PAC-Bayesian theorems. Machine Learning, 37.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15.

Sabato, S., & Shalev-Shwartz, S. (2007). Prediction by categorical features: Generalization properties and application to feature ranking. COLT.

Seldin, Y., Slonim, N., & Tishby, N. (2007). Information bottleneck for non co-occurrence data. NIPS.

Shamir, O., Sabato, S., & Tishby, N. (2008). Learning and generalization with the information bottleneck method. Preprint.

Srebro, N. (2004). Learning with matrix factorizations. Doctoral dissertation, MIT.