Learning Monotonic Transformations for Classification

Andrew G. Howard
Department of Computer Science, Columbia University, New York, NY 10027
ahoward@cs.columbia.edu

Tony Jebara
Department of Computer Science, Columbia University, New York, NY 10027
jebara@cs.columbia.edu

Abstract

A discriminative method is proposed for learning monotonic transformations of the training data jointly while estimating a large-margin classifier. Fixed monotonic transformations can be useful as a preprocessing step for many domains such as document classification, image histogram classification and gene microarray experiments. However, most classifiers only explore transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain at hand. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first performs a convergent alternating sequence of quadratic and linear programs until it obtains a locally optimal solution. A second algorithm is also derived using a convex semidefinite relaxation that overcomes initialization issues in the greedy optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated.

1 Introduction

Many fields have developed heuristic methods for preprocessing data to improve performance. This often takes the form of applying a monotonic transformation prior to using a classification algorithm. For example, when the bag of words representation is used in document classification, it is common to take the square root of the term frequency [6, 5]. Monotonic transforms are also used when classifying image histograms. In [3], transforms of the form $x^a$ where $0 < a \le 1$ are demonstrated to improve performance. When classifying genes from various microarray experiments it is common to take the logarithm of the gene expression ratio [2].

In this paper, we propose to simultaneously learn a hyperplane classifier and a monotonic transformation. The solution produced by our algorithm is a piecewise linear transformation and a maximum margin hyperplane classifier similar to a support vector machine (SVM) [4]. By allowing for a richer class of transforms learned at training time (as opposed to a rule of thumb applied during preprocessing), we improve classification accuracy, since the transform is specifically tuned to the classification task. The main contributions of this paper are: a novel framework for estimating a monotonic transformation and a hyperplane classifier simultaneously at training time, an efficient method for finding a locally optimal solution for our formulation, and a convex relaxation of the problem.

Figure 1: Building blocks for piecewise linear functions. a) Ramp function $R_1(x)$. b) $T(x) = \sum_{j=1}^{5} m_j R_j(x)$.

The paper is organized as follows. In section 2, we present our formulation for learning a piecewise linear monotonic function and a hyperplane. We show how to learn this combined model through an iterative block coordinate optimization, interleaving quadratic programming and linear programming solvers to find a locally optimal solution. In section 3, we derive a convex relaxation based on Lasserre's method [8].
In section 4, synthetic experiments as well as document and image classification problems demonstrate the diverse utility of our method. We conclude with a discussion and future work.

2 Learning Monotonic Transformations

Given a set of data points $X = (x_1, \ldots, x_N)$ with $x_i \in \mathbb{R}^d$ and their associated labels $Y = (y_1, \ldots, y_N)$ with $y_i \in \{-1, 1\}$, assume there is an unknown nuisance monotonic transformation $T(x)$ and an unknown hyperplane parameterized by $w$ and $b$ such that predicting with $y = \mathrm{sign}(w^\top T(x) + b)$ yields high accuracy. We would like to recover $T(x)$, $w$, and $b$ from $X$ and $Y$. We propose to learn both a maximum margin hyperplane and the unknown transform $T(x)$ simultaneously. In our formulation, $T(x)$ is a piecewise linear function that we parameterize with a set of $M$ knots $Z = (z_1, \ldots, z_M)$, where $z_j \in \mathbb{R}$, and positive weights $(m_1, \ldots, m_M)$, where $m_j \in \mathbb{R}_+$. It can be written as $T(x) = \sum_{j=1}^{M} m_j R_j(x)$, where the $R_j(x)$ are ramp functions acting on vectors and matrices element-wise as follows:

$$R_j(x) = \begin{cases} 0 & x \le z_j \\ \frac{x - z_j}{z_{j+1} - z_j} & z_j < x < z_{j+1} \\ 1 & z_{j+1} \le x \end{cases} \qquad (1)$$

This is one of many ways to parameterize piecewise linear functions. Using ramp functions is preferable for numerous reasons. They can be precomputed and are sparse. Once precomputed, most calculations reduce to matrix multiplications. The positivity constraints on the weights $m$ will also yield a simpler formulation when we derive a convex relaxation in the next section. An alternative formulation of monotonicity requires order constraints on the function values; these constraints would be lost in the subsequent relaxation step, whereas the simple positivity constraints remain in effect in the relaxed problem. Figure 1a shows the ramp function associated with knot $z_1$. Figure 1b shows a conic combination of ramps that builds a piecewise linear monotonic function. Any nonnegative combination of ramp functions yields a monotonically increasing function. Combining this with the support vector machine formulation leads us to the following learning problem:

$$\min_{w, \xi, b, m} \quad \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{subject to} \quad y_i \Big( \Big\langle w, \sum_{j=1}^{M} m_j R_j(x_i) \Big\rangle + b \Big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad m_j \ge 0, \quad \sum_{j=1}^{M} m_j \le 1,$$

where the $\xi_i$ are the standard SVM slack variables, and $w$ and $b$ are the maximum margin solution for the training set that has been transformed via $T(x)$ with learned weights $m$. The knots $Z$ are chosen before training at the quantiles so that they are evenly spaced in the data. This problem is nonconvex due to the quadratic terms $w_i m_j$ in the classification constraints. Although it is difficult to find a globally optimal solution, the structure of the problem suggests a simple method for finding a locally optimal solution. We can divide the problem into two convex subproblems. This amounts to solving a support vector machine for $w$ and $b$ with a fixed $T(x)$, and alternately solving for $T(x)$ as a linear program with the SVM solution fixed. This yields an efficient convergent optimization method. However, it can get stuck in local minima. In practice we initialize it with a linear $T(x)$ and iterate from there; alternative initializations do not seem to help much in practice. This leads us to look for a method to efficiently find global solutions.

3 Convex Relaxation

When faced with a nonconvex quadratic problem, an increasingly popular technique is to relax it into a convex one. Lasserre [8] proposed a sequence of convex relaxations for these types of nonconvex quadratic programs. The method replaces all quadratic terms in the original optimization problem with entries in a matrix.
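To make the ramp-function parameterization of Section 2 concrete before the relaxation is derived, the following sketch evaluates the features of Eq. (1) and the transform $T(x)$ on a data matrix. It is a minimal NumPy illustration under our own conventions (in particular, $M+1$ sorted knots defining $M$ ramps between consecutive pairs), not the authors' code.

```python
import numpy as np

def ramp_features(X, knots):
    """Evaluate the ramp functions R_j of Eq. (1) element-wise on a data matrix.

    X     : array of shape (N, d), raw inputs.
    knots : sorted 1-D array of distinct values z_1 < ... < z_{M+1}; each
            consecutive pair (z_j, z_{j+1}) defines one ramp (a convention
            assumed here for handling the last ramp).
    Returns an array of shape (M, N, d) whose j-th slice is R_j(X).
    """
    z = np.asarray(knots, dtype=float)
    lo, hi = z[:-1], z[1:]                               # left/right knot of each ramp
    R = (X[None, :, :] - lo[:, None, None]) / (hi - lo)[:, None, None]
    return np.clip(R, 0.0, 1.0)                          # 0 below z_j, 1 above z_{j+1}

def transform(X, knots, m):
    """Piecewise linear monotone map T(X) = sum_j m_j R_j(X) with m_j >= 0."""
    return np.tensordot(m, ramp_features(X, knots), axes=1)

# Knots placed at quantiles of the training values, as described in Section 2:
# knots = np.quantile(X_train, np.linspace(0.0, 1.0, M + 1))
```

With these helpers the ramp features, and hence the inner products $R_i(X)^\top R_j(X)$ used below, can be precomputed once, as noted in Section 2.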
In its simplest form, the matrix introduced by the relaxation corresponds to the outer product of the original variables while imposing a rank one constraint. The relaxation replaces the rank one constraint on the outer product matrix with a positive semidefinite constraint. Lasserre proposes more elaborate relaxations using higher order moments of the variables. We mainly use the first moment relaxation along with a few of the second order moment constraints that do not require any additional variables beyond the outer product matrix.

A convex relaxation could be derived directly from the primal formulation of our problem. This would require that we relax both the $w$ and $m$ variables, as they form the nonconvex quadratic terms, and it would yield a semidefinite constraint that scales with both the number of knots and the dimensionality of the data. This is troublesome because we wish to work with high dimensional data such as the bag of words representation for text. However, if we first find the dual formulation over $w$, $b$, and $\xi$, we only have to relax the $m$ variables, making for both a tighter relaxation and a much less computationally intensive problem. Finding the dual leaves us with the following min-max saddle point problem that will subsequently be relaxed and transformed into a semidefinite program:

$$\min_{m} \max_{\alpha} \quad 2\alpha^\top 1 - \alpha^\top Y \Big( \sum_{i,j} m_i m_j R_i(X)^\top R_j(X) \Big) Y \alpha$$
$$\text{subject to} \quad 0 \le \alpha_i \le C, \quad \alpha^\top y = 0, \quad m_j \ge 0, \quad \sum_{j} m_j \le 1,$$

where $1$ is a vector of ones and $Y$ is a matrix with the labels $y$ on its diagonal and zeros elsewhere. We introduce the relaxation via the substitution $M = \tilde{m}\tilde{m}^\top$ and the constraint $M \succeq 0$, where $\tilde{m}$ is constructed by concatenating 1 with $m$. We can then transform the relaxed min-max problem into a semidefinite program similar to the multiple kernel framework [7] using duality and the Schur complement lemma [1]:

$$\min_{M, t, \lambda, \nu, \delta} \quad t$$
$$\text{subject to} \quad \begin{bmatrix} Y \big( \sum_{i,j} M_{i,j} R_i(X)^\top R_j(X) \big) Y & 1 + \nu - \delta + \lambda y \\ (1 + \nu - \delta + \lambda y)^\top & t - 2C\, \delta^\top 1 \end{bmatrix} \succeq 0,$$
$$M \succeq 0, \quad M 1 \le 1, \quad M_{0,0} = 1, \quad \nu \ge 0, \quad \delta \ge 0,$$

where $\lambda$, $\nu$, and $\delta$ arise from the dual transformation. This relaxation would be exact if $M$ were rank one. It can be seen as a generalization of the multiple kernel framework: instead of learning a kernel from a combination of other kernels, we are learning a combination of inner products of different functions applied to our data. In our case, these are ramp functions. The terms $R_i(X)^\top R_j(X)$ are not kernels except when $i = j$. This more general combination requires the stricter constraint that the mixing weights $M$ form a positive semidefinite matrix, which is a sufficient condition for the resulting kernel to also be positive semidefinite. When using this relaxation, we can recover the monotonic transform by using the first column (row) of $M$ as mixing weights $m$ for the ramp functions. In practice, however, we use the learned kernel in our predictions.

4 Experiments

4.1 Synthetic Experiment

In this experiment we demonstrate our method's ability to recover a monotonic transform from data. We sampled data near a linear decision boundary and generated labels based on this boundary. We then mapped the data through the inverse of a monotonic function. The training set is made up of the transformed points and the original labels. A linear algorithm will have difficulty because the mapped data is not linearly separable. However, if we recover the monotonic function that maps back to the original space, then a linear decision boundary performs well. Figure 2a shows the original data and decision boundary. Figure 2b shows the data and hyperplane transformed with a normalized logarithm. Figure 2c shows a quadratic transform.
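As a rough illustration of how the local method (L Mono) used in these experiments can be prototyped, the sketch below alternates an SVM solve for $(w, b)$ with the transform fixed and a linear program over the ramp weights $m$ with $(w, b)$ fixed, as described in Section 2. It reuses the ramp_features helper sketched earlier and assumes scikit-learn and SciPy; it is our own simplified sketch (fixed iteration count, no convergence test, constant terms dropped from the LP objective), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC

def fit_l_mono(X, y, knots, C=1.0, n_iters=10):
    """Alternating QP/LP heuristic: SVM step for (w, b), LP step for the weights m."""
    y = np.asarray(y, dtype=float)
    R = ramp_features(X, knots)             # (M, N, d), precomputed once
    M, N, _ = R.shape
    gaps = np.diff(np.asarray(knots, dtype=float))
    m = gaps / gaps.sum()                   # weights giving a linear initial T(x)
    for _ in range(n_iters):
        # QP step: fix m, train a linear SVM on the transformed data.
        Xt = np.tensordot(m, R, axes=1)     # T(X), shape (N, d)
        svm = SVC(kernel="linear", C=C).fit(Xt, y)
        w, b = svm.coef_.ravel(), float(svm.intercept_[0])

        # LP step: fix (w, b); minimize sum_i xi_i over [m_1..m_M, xi_1..xi_N]
        # (the ||w||^2/2 term and the factor C are constant with respect to m, xi).
        A = R @ w                           # A[j, i] = w^T R_j(x_i), shape (M, N)
        c = np.concatenate([np.zeros(M), np.ones(N)])
        # Margin constraints: -y_i sum_j m_j A[j, i] - xi_i <= -(1 - y_i b).
        A_ub = np.hstack([-(y[:, None] * A.T), -np.eye(N)])
        b_ub = -(1.0 - y * b)
        # Weight budget: sum_j m_j <= 1; m_j >= 0 and xi_i >= 0 come from the bounds.
        A_ub = np.vstack([A_ub, np.concatenate([np.ones(M), np.zeros(N)])])
        b_ub = np.append(b_ub, 1.0)
        m = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None)).x[:M]
    return m, svm
```

Predictions on held-out data would then use svm.predict(transform(X_new, knots, m)) with the transform helper sketched earlier.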
600 data points were sampled and then transformed: 200 were used for training, 200 for cross validation, and 200 for testing. We compared our locally optimal method (L Mono), our convex relaxation (C Mono), and a linear SVM (Linear). The linear SVM struggled on all of the transformed data while the other methods performed well, as reported in Figure 3. The learned transforms for L Mono are plotted in Figure 2(d-f). The solid blue line is the mean over 10 experiments, and the dashed blue lines show the standard deviation. The black line is the true target function. The learned functions for C Mono are in Figure 2(g-i). Both algorithms performed very well on the classification task and recovered nearly the exact monotonic transform. The local method outperformed the relaxation slightly because this was an easy problem with few local minima.

Figure 2: a) Original data. b) Data transformed by logarithm. c) Data transformed by quadratic. d-f) Functions learned by the nonconvex algorithm. g-i) Functions learned by the convex algorithm.

              Linear   L Mono   C Mono
linear        0.0005   0.0020   0.0025
exponential   0.0375   0.0005   0.0075
square root   0.0685   0.0020   0.0025
total         0.0355   0.0015   0.0042

Figure 3: Testing error rates for the synthetic experiments.

4.2 Document Classification

In this experiment we used the four-universities WebKB dataset. The data is made up of webpages from four universities plus a larger set from miscellaneous universities. These webpages are categorized, and we work with the four largest categories: student, faculty, course, and project. The task is to solve all six pairwise classification problems.

          Linear   TFIDF    Sqrt     Poly     RBF      L Mono   C Mono
1 vs 2    0.0509   0.0428   0.0363   0.0499   0.0514   0.0338   0.0322
1 vs 3    0.0879   0.0891   0.0667   0.0861   0.0836   0.0739   0.0776
1 vs 4    0.1381   0.1623   0.0996   0.1389   0.1356   0.0854   0.0812
2 vs 3    0.0653   0.0486   0.0456   0.0599   0.0641   0.0511   0.0501
2 vs 4    0.1755   0.1910   0.1153   0.1750   0.1755   0.1060   0.0973
3 vs 4    0.0941   0.1096   0.0674   0.0950   0.0981   0.0602   0.0584
total     0.1025   0.1059   0.0711   0.1009   0.1024   0.0683   0.0657

Figure 4: Testing error rates for WebKB.

[6, 5] showed that preprocessing the data with a square root yields good results. We compare our nonconvex method (L Mono) and our convex relaxation (C Mono) to a linear SVM with and without the square root, to a linear SVM with TFIDF features, and to kernelized SVMs with the polynomial and RBF kernels. We follow the setup of [6] by training on three universities plus the miscellaneous category and testing on the fourth university. We repeated this four-fold experiment five times. For each fold we use a subset of 200 points of the training set to train on, 200 to cross validate the parameter settings, and all of the held-out university's points for testing. Our two methods outperform the competition, as reported in Figure 4.
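For context, the fixed preprocessing baselines in this comparison (square root of the term frequency and a TFIDF weighting) amount to simple maps on the bag-of-words counts. The sketch below is our own rough illustration of such maps, not the exact feature pipeline of [6, 5], and the particular TFIDF variant shown is an assumption.

```python
import numpy as np

def sqrt_tf(counts):
    """Square root of term frequencies (the Sqrt baseline)."""
    return np.sqrt(counts)

def tfidf(counts):
    """One common TFIDF weighting: raw counts scaled by log inverse document frequency."""
    df = np.count_nonzero(counts, axis=0)               # documents containing each term
    idf = np.log(counts.shape[0] / np.maximum(df, 1))   # guard against unseen terms
    return counts * idf
```

The learned transforms reported in Figure 4, by contrast, are fit jointly with the classifier rather than chosen by such rules of thumb.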
The monotonic function that the convex relaxation nearly always learns is a step function that outputs 1 if a word is present and 0 if it is absent. The nonconvex greedy algorithm does not recover this solution as reliably and often gets stuck in local minima, which leads to worse performance than the convex version.

4.3 Histogram Classification

In this experiment we used the Corel image dataset. In [3] it was shown that monotonic transforms of the form $x^a$ for $0 \le a \le 1$ work well. The Corel image dataset is made up of sets of 100 images per category. We chose four categories of animals: 1) eagles, 2) elephants, 3) horses, and 4) tigers. Images were transformed into RGB histograms following the binning strategy of [3, 5]. We ran a series of six pairwise experiments where the data was randomly split into 80 percent training, 10 percent cross validation, and 10 percent testing. These six experiments were repeated 10 times. We compared our two methods to a linear support vector machine and to the RBF and polynomial kernels. We also compared to the set of transforms $x^a$ for $0 \le a \le 1$, cross validating over $a \in \{0, .125, .25, .5, .625, .75, .875, 1\}$. This set includes the linear transform ($a = 1$) at one end, a binary threshold ($a = 0$) at the other, and the square root transform in the middle.

The convex relaxation performed best or tied for best on 4 out of 6 of the experiments and was the best overall, as reported in Figure 5. The nonconvex version also performed well but ended up with a lower accuracy than the cross validated family of $x^a$ transforms. The key to this dataset is that most of the data is very close to zero because few pixels fall in any given bin. Cross validation over $x^a$ most often chose low values of $a$. Our method placed many knots at these extremely low values because that was where the data support was. Plots of our learned functions on these small values can be found in Figure 6(a-f). Solid blue is the mean for the nonconvex algorithm and dashed blue is its standard deviation; the convex relaxation is shown similarly in red.

          Linear   Sqrt     Poly     RBF      x^a      L Mono   C Mono
1 vs 2    0.08     0.03     0.07     0.06     0.08     0.05     0.04
1 vs 3    0.10     0.05     0.10     0.08     0.04     0.06     0.03
1 vs 4    0.28     0.09     0.28     0.22     0.03     0.04     0.03
2 vs 3    0.11     0.12     0.11     0.10     0.03     0.05     0.04
2 vs 4    0.14     0.08     0.15     0.13     0.09     0.13     0.06
3 vs 4    0.26     0.20     0.23     0.23     0.06     0.05     0.05
total     0.1617   0.0950   0.1567   0.1367   0.0550   0.0633   0.0417

Figure 5: Testing error rates on the Corel dataset.

Figure 6: Learned functions for the 6 Corel problems.

4.4 Gender Classification

In this experiment we try to differentiate between images of males and females. We have 1755 labelled images from the FERET dataset processed by [9]. Each processed image is a 21-by-12 pixel, 256-level grayscale image that is rasterized to form training vectors. There are 1044 male images and 711 female images. We randomly split the data into training, cross validation, and test sets using an 80%/10%/10% split. We then compare a linear SVM to our two methods on 5 random splits of the data. This results in 9.09% error for the SVM, 8.18% error for L Mono, and 6.48% for C Mono. The learned monotonic functions for L Mono and C Mono are similar to a logistic function.
Figure 7 shows examples of training images before and after they have been transformed by our learned function.

Figure 7: Original and transformed gender images.

5 Discussion

A data driven framework was presented for jointly learning monotonic transformations of input data and a discriminative linear classifier. The joint optimization improves classification accuracy and produces interesting transformations that would otherwise require a priori manual domain knowledge. Two implementations were discussed. The first is a fast greedy algorithm for finding a locally optimal solution. Subsequently, a semidefinite relaxation of the original problem was presented which does not suffer from local minima. The greedy algorithm has scaling properties similar to those of a support vector machine yet has local minima to contend with. The semidefinite relaxation scales quite poorly yet ensures a reliable global solution. Nevertheless, both implementations were helpful in synthetic and real experiments, including text and image classification, and improved over standard support vector machine tools. A natural next step is to explore faster (convex) algorithms that take advantage of the specific structure of the problem. We also hope to explore applications to other domains such as gene expression data, to refine the logarithmic transforms currently used to compensate for well-known saturation effects in expression level measurements. Other possible areas of interest include resource allocation and operations research, which also encounter saturation phenomena.

6 Acknowledgements

This research is funded in part by grant IIS-0347499 from the NSF.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[2] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, M. Ares Jr., and D. Haussler. Support vector machine classification of microarray gene expression data, 1999.

[3] O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for histogram-based classification. IEEE Transactions on Neural Networks, 10:1055-1064, 1999.

[4] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

[5] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings of Artificial Intelligence and Statistics, 2005.

[6] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, 5:819-844, 2004.

[7] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.

[8] J. B. Lasserre. Convergent LMI relaxations for nonconvex quadratic programs. In Proceedings of the 39th IEEE Conference on Decision and Control, 2000.

[9] B. Moghaddam and M.-H. Yang. Sex with support vector machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13, pages 960-966. MIT Press, 2000.