Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets

Mingkui Tan, School of Computer Engineering, Nanyang Technological University, Singapore (TANMINGKUI@GMAIL.COM)
Li Wang, Department of Mathematics, University of California, San Diego, USA (LIW022@UCSD.EDU)
Ivor W. Tsang, School of Computer Engineering, Nanyang Technological University, Singapore (IVORTSANG@NTU.EDU.SG)

Abstract

A sparse representation of Support Vector Machines (SVMs) with respect to input features is desirable for many applications. In this paper, by introducing a 0-1 control variable for each input feature, the $\ell_0$-norm Sparse SVM (SSVM) is converted into a mixed integer programming (MIP) problem. Rather than solving this MIP directly, we propose an efficient cutting plane algorithm, combined with multiple kernel learning, to solve its convex relaxation. A global convergence proof for our method is also presented. Comprehensive experimental results on one synthetic and 10 real-world datasets show that the proposed method obtains better or competitive performance compared with existing SVM-based feature selection methods, in terms of both sparsity and generalization performance. Moreover, the proposed method can effectively handle large-scale and extremely high dimensional problems.

1. Introduction

In many machine learning applications, sparsity with respect to input features is highly desirable, for several reasons.
Firstly, many real datasets, such as text and Microarray data, are represented as very high dimensional vectors, which poses great challenges for further processing. Secondly, most features in such high dimensional vectors are usually non-informative or noisy and may seriously degrade the generalization performance. Thirdly, a sparse classifier leads to a simpler decision rule and faster prediction in large-scale problems. Finally, in some applications such as Microarray data analysis, a small set of features is desirable for interpreting the results.

Recently, numerous feature selection methods based on the Support Vector Machine (SVM) have been proposed (Blum & Langley, 1997; Guyon & Elisseeff, 2003). To obtain a sparse decision rule for SVM, many researchers (Bradley & Mangasarian, 1998; Zhu et al., 2003; Fung & Mangasarian, 2004) proposed the $\ell_1$-norm Sparse SVM (SSVM), which uses $\|w\|_1$ as the regularizer. The resultant problem is convex and can be solved optimally by Linear Programming (LP) solvers or by a Newton method (Fung & Mangasarian, 2004). Apart from the $\ell_1$-norm SSVM, Weston et al. (2003) proposed an Approximation of the zero-norm Minimization (AROM) to solve SSVM with $\|w\|_0$ as the regularizer, namely the $\ell_0$-norm SSVM. However, the resultant optimization problem is non-convex and may suffer from local minima. More recently, Chan et al. (2007) proposed two direct convex relaxations of the $\ell_0$-norm SSVM, namely QCQP-SSVM and SDP-SSVM, which can be solved by Quadratically Constrained Quadratic Programming (QCQP) and Semi-Definite Programming (SDP), respectively. Although both relaxed problems are convex, they are computationally expensive, especially for high dimensional problems.

Besides sparse regularization, Guyon et al. (2002) proposed an effective Recursive Feature Elimination (RFE) scheme for feature selection. SVM-RFE obtains nested subsets of input features and has shown state-of-the-art performance on gene selection in Microarray data analysis (Guyon et al., 2002). However, as described by Xu et al. (2009), the nested "monotonic" feature selection scheme may be suboptimal in identifying the most informative subset of input features. Here, the "monotonic" property refers to the problem that, if an informative feature is wrongly eliminated from a subset S, it will never appear in the nested subsets of S (Xu et al., 2009). This issue becomes critical for problems with a large number of noise features, for which an accurate SVM model that correctly measures feature importance is hard to obtain. To overcome this problem, Xu et al. (2009) proposed a non-monotonic feature selection method, namely NMMKL. However, their method requires solving a QCQP problem with |S| quadratic constraints, where |S| denotes the number of input features. Hence, NMMKL is computationally infeasible for high dimensional problems.

In this paper, we propose to learn a solution to SVM that is sparse with respect to input features, using what we call the Feature Generating Machine (FGM). FGM iteratively generates a pool of violated sparse feature subsets and then combines them via an efficient Multiple Kernel Learning (MKL) algorithm. FGM scales well to non-monotonic feature selection on large-scale and very high dimensional datasets. We also provide a proof of global convergence for FGM.

The rest of this paper is organized as follows. Section 2 describes the sparse SVM problem and our proposed cutting plane algorithm. Experimental results are presented in Section 3. The last section gives concluding remarks.

2. Learning Sparse SVM

In the sequel, we denote the transpose of a vector/matrix by the superscript $^\top$ and a vector with all entries equal to one by $\mathbf{1} \in \mathbb{R}^n$. We denote by $\|v\|_p$ the $\ell_p$-norm of a vector $v$. Moreover, $A \odot B$ represents the elementwise product of two matrices $A$ and $B$.

Given a set of labeled patterns $\{x_i, y_i\}_{i=1}^n$, where $x_i \in \mathbb{R}^m$ is the input and $y_i \in \{\pm 1\}$ is the output label, we learn a linear decision hyperplane $f(x) = w^\top x$ that minimizes the following structural risk functional:
$$\min_{w}\ \Omega(\|w\|_p) + C \sum_{i=1}^n \ell(-y_i w^\top x_i),$$
where $w = [w_1, \ldots, w_m]^\top \in \mathbb{R}^m$ is the weight vector of the decision hyperplane, $\Omega(\|w\|_p)$ is the regularizer that defines the characteristics (e.g. sparsity) of the decision hyperplane, $\ell(\cdot)$ is a convex loss function, and $C > 0$ is a regularization parameter that trades off the model complexity against the fitness of the decision hyperplane. For standard SVMs, $\Omega(\|w\|_p)$ is set to $\frac{1}{2}\|w\|_2^2$, which is a non-sparse regularizer. Hence, the learned decision hyperplane is usually non-sparse.
In order to obtain a sparse solution of SVM, we first introduce a feature selection vector $d = [d_1, \ldots, d_m]^\top \in \mathcal{D}$ that controls the sparsity of the SVM decision hyperplane: $f(x) = \tilde{w}^\top x = (w \odot d)^\top x = w^\top (d \odot x)$ with $\tilde{w} = w \odot d$, where $\mathcal{D} = \{d \mid \sum_{j=1}^m d_j \le B,\ d_j \in \{0,1\},\ j = 1,\cdots,m\}$ is the domain of $d$, and $B$ controls the sparsity of $d$. For simplicity, we focus here on the square hinge loss, which can be optimized efficiently by LIBLINEAR; with this loss the positivity constraint $\xi_i \ge 0$ is automatically satisfied and can be omitted. The objective of the Sparse SVM (SSVM) can then be written as
$$\min_{d\in\mathcal{D}}\ \min_{w,\xi,\rho}\ \frac{1}{2}\|\tilde{w}\|_2^2 + \frac{C}{2}\sum_{i=1}^n \xi_i^2 - \rho \quad \text{s.t.}\quad y_i\,\tilde{w}^\top(x_i\odot d) \ge \rho - \xi_i,\ i = 1,\cdots,n. \qquad (1)$$
The inner minimization problem can be solved via its dual, and (1) can be rewritten as
$$\min_{d\in\mathcal{D}}\ \max_{\alpha\in\mathcal{A}}\ -\frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i (x_i\odot d)\Big\|^2 - \frac{\|\alpha\|^2}{2C}, \qquad (2)$$
where $\alpha = [\alpha_1,\ldots,\alpha_n]^\top$ is the vector of dual variables for the inequality constraints in the inner minimization problem of (1), and $\mathcal{A} = \{\alpha \mid \sum_{i=1}^n \alpha_i = 1,\ \alpha_i \ge 0,\ i = 1,\cdots,n\}$ is the domain of $\alpha$.

2.1. Convex Relaxation

Observe that (2) is still a mixed integer programming (MIP) problem, which is computationally expensive in general. Following (Li et al., 2009a;b), we introduce a mild convex relaxation of the SSVM formulation in (2). According to the minimax inequality (Kim & Boyd, 2008), when we interchange the order of $\min_{d\in\mathcal{D}}$ and $\max_{\alpha\in\mathcal{A}}$ in (2), the saddle-point problem (2) is lower-bounded by
$$\max_{\alpha\in\mathcal{A}}\ \min_{d\in\mathcal{D}}\ -\frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i (x_i\odot d)\Big\|^2 - \frac{\|\alpha\|^2}{2C}. \qquad (3)$$

Define $S(\alpha,d) = -\frac{1}{2}\|\sum_{i=1}^n \alpha_i y_i (x_i\odot d)\|^2 - \frac{\|\alpha\|^2}{2C}$ and introduce an additional variable $\theta \in \mathbb{R}$; then (3) becomes
$$\max_{\alpha\in\mathcal{A},\,\theta}\ -\theta \quad \text{s.t.}\quad \theta \ge -S(\alpha,d_t),\ \forall\, d_t \in \mathcal{D}, \qquad (4)$$
which is a convex QCQP problem. Let $\mu_t \ge 0$ be the dual variable for each constraint. The Lagrangian is $L(\theta,\alpha,\mu) = -\theta + \sum_{t:\,d_t\in\mathcal{D}} \mu_t (\theta + S(\alpha,d_t))$. Setting its derivative w.r.t. $\theta$ to zero gives $\sum_t \mu_t = 1$. Let $\mu$ be the vector of $\mu_t$'s and let $\mathcal{M} = \{\mu \mid \sum_t \mu_t = 1,\ \mu_t \ge 0\}$ be its domain. The Lagrangian can then be rewritten as
$$\max_{\alpha\in\mathcal{A}}\ \min_{\mu\in\mathcal{M}}\ \sum_{t:\,d_t\in\mathcal{D}} \mu_t S(\alpha,d_t) \;=\; \min_{\mu\in\mathcal{M}}\ \max_{\alpha\in\mathcal{A}}\ -\frac{1}{2}(\alpha\odot y)^\top \Big(\sum_{t:\,d_t\in\mathcal{D}} \mu_t X_t X_t^\top + \frac{1}{C}I\Big)(\alpha\odot y), \qquad (5)$$
where $X_t = [x_1\odot d_t, \cdots, x_n\odot d_t]^\top$, and the last equality holds because the objective function is concave in $\alpha$ and convex in $\mu$. Moreover, (5) can be regarded as an MKL problem (Rakotomamonjy et al., 2008), where the kernel matrix $\sum_{t:\,d_t\in\mathcal{D}} \mu_t X_t X_t^\top$ to be learned is a convex combination of $|\mathcal{D}|$ base kernel matrices $X_t X_t^\top$, each of which is constructed from a feasible feature selection vector $d_t \in \mathcal{D}$.

2.2. Cutting Plane Algorithm

Although (4) is convex, the huge number of base kernels makes it impractical to solve with existing MKL techniques. Fortunately, not all constraints in (4) are active at optimality. We can therefore solve this problem efficiently with a cutting plane algorithm (Kelley, 1960), which iteratively generates a pool of sparse feature subsets to construct the quadratic inequality constraints in (4). The overall FGM procedure is described in Algorithm 1.

Denote the subset of constraints by $\mathcal{C} \subseteq \mathcal{D}$. First, we initialize the vector of dual variables $\alpha$ to $\frac{1}{n}\mathbf{1}$, find the most violated $\hat{d} \in \mathcal{D}$, and initialize the working set $\mathcal{C} = \{\hat{d}\}$. Since the number of $d$'s in $\mathcal{C}$ (i.e. the number of base kernel matrices) is no longer large, one can perform MKL with the subset of kernel matrices in $\mathcal{C}$ and obtain a new $\alpha$ from (5). The most violated $d$ is then obtained and added to $\mathcal{C}$; this step is known as "feature generation". The whole process is repeated until a termination criterion is met.

Algorithm 1 The cutting plane algorithm for FGM.
1: Initialize $\alpha = \frac{1}{n}\mathbf{1}$. Find the most violated $\hat{d}$ and set $\mathcal{C} = \{\hat{d}\}$.
2: Run MKL on the subset of kernel matrices selected by $\mathcal{C}$ and obtain $\alpha$ and $\mu$ from (5).
3: Find the most violated $\hat{d}$ and set $\mathcal{C} = \mathcal{C} \cup \{\hat{d}\}$.
4: Repeat steps 2-3 until convergence.
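To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch of the outer cutting plane loop. It is not the authors' implementation: the helpers solve_mkl and most_violated_d are hypothetical placeholders for the subroutines described in Sections 2.3 and 2.4 below (a sketch of the latter is given after Section 2.5), and the stopping rule simply checks whether the newly generated d is already in the working set.

    import numpy as np

    def fgm(X, y, B, C, max_iter=10):
        """Sketch of Algorithm 1. X is n x m, y in {-1, +1}, B is the sparsity budget."""
        n, m = X.shape
        alpha = np.full(n, 1.0 / n)                  # initialize dual variables to (1/n) * 1
        d_hat = most_violated_d(X, y, alpha, B)      # hypothetical helper, see Section 2.4
        working_set = [d_hat]
        for _ in range(max_iter):
            # hypothetical helper: MKL over the base kernels built from the working set (Section 2.3)
            alpha, mu = solve_mkl(X, y, working_set, C)
            d_hat = most_violated_d(X, y, alpha, B)  # "feature generation" step
            if any(np.array_equal(d_hat, d) for d in working_set):
                break                                # no new violated constraint: converged
            working_set.append(d_hat)
        return alpha, mu, working_set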
2.3. MKL with a Subset of Kernel Matrices

Several efficient MKL approaches have been proposed in recent years. For simplicity, in this paper we apply SimpleMKL (Rakotomamonjy et al., 2008) to solve the MKL problem defined on the subset of kernel matrices selected by $\mathcal{C}$. More specifically, suppose the current working set is $\mathcal{C} = \{d_1,\cdots,d_T\}$; the MKL problem in (5) then corresponds to the following primal optimization problem:
$$\min_{\mu\in\mathcal{M},\,w,\,\xi,\,\rho}\ \frac{1}{2}\sum_{t=1}^{T} \frac{\|\tilde{w}_t\|_2^2}{\mu_t} + \frac{C}{2}\sum_{i=1}^n \xi_i^2 - \rho \quad \text{s.t.}\quad \sum_{t=1}^{T} \tilde{w}_t^\top (y_i\, x_i\odot d_t) \ge \rho - \xi_i,\ i = 1,\cdots,n. \qquad (6)$$
Following SimpleMKL, we solve (5) (or, equivalently, (6)) iteratively. First, we fix the coefficients $\mu$ of the base kernel matrices and solve the dual of SVM: $\max_{\alpha\in\mathcal{A}} -\frac{1}{2}(\alpha\odot y)^\top\big(\sum_{t=1}^{T}\mu_t X_t X_t^\top + \frac{1}{C}I\big)(\alpha\odot y)$. Then, we fix $\alpha$ and update $\mu$ by the reduced gradient method. These two steps are iterated until convergence.

2.4. Finding the Most Violated $\hat{d}$

To find the most violated $\hat{d}$ in (4), we solve the following equivalent optimization problem:
$$\max_{d\in\mathcal{D}}\ \frac{1}{2}\Big\|\sum_{i=1}^n \alpha_i y_i (x_i\odot d)\Big\|^2 = \frac{1}{2}\sum_{j=1}^m c_j^2\, d_j, \quad \text{with } c_j = \sum_{i=1}^n \alpha_i y_i x_{ij}.$$
This is a linear integer program with the single linear constraint $\sum_{j=1}^m d_j \le B$. Its globally optimal solution can be obtained without any numerical optimization solver: simply sort the $c_j^2$'s, set the $d_j$'s corresponding to the $B$ largest values to 1, and set the rest to 0.

2.5. Prediction

When the algorithm converges, we obtain $\alpha^*$ and $\mu^*$, and the decision function is
$$f(x) = \sum_{t=1}^{T} \mu_t^* \sum_{i=1}^n \alpha_i^* y_i (x_i\odot d_t)^\top x = \sum_{i=1}^n \alpha_i^* y_i (x_i\odot\tilde{d})^\top x, \quad \text{where } \tilde{d} = \sum_{t=1}^{T} \mu_t^* d_t.$$
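As a concrete illustration of Sections 2.4 and 2.5, the sketch below (our own non-authoritative reading of the closed-form solution, with hypothetical function names) selects the most violated d by ranking the scores $c_j^2$, and assembles the weighted selection vector $\tilde{d}$ used at prediction time.

    import numpy as np

    def most_violated_d(X, y, alpha, B):
        """Pick the B features with the largest c_j^2, where c_j = sum_i alpha_i y_i x_ij."""
        c = X.T @ (alpha * y)                  # vector of c_j, length m
        d = np.zeros(X.shape[1])
        top = np.argsort(c ** 2)[-B:]          # indices of the B largest c_j^2
        d[top] = 1.0
        return d

    def decision_values(X_train, y, alpha, mu, working_set, X_test):
        """f(x) = sum_i alpha_i y_i (x_i * d~)^T x, with d~ = sum_t mu_t d_t."""
        d_tilde = sum(m_t * d_t for m_t, d_t in zip(mu, working_set))
        w = ((alpha * y) @ X_train) * d_tilde  # effective sparse weight vector
        return X_test @ w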
2.6. Global Convergence

In this subsection, we study the convergence properties of FGM. Let $\mathcal{A}\times\mathcal{D}$ be the constraint domain of problem (4), where $\mathcal{A} = \{\alpha \mid \sum_{i=1}^n \alpha_i = 1,\ \alpha_i\ge 0,\ i=1,\cdots,n\}$ and $\mathcal{D} = \{d \mid \sum_{j=1}^m d_j \le B,\ d_j\in\{0,1\},\ j=1,\cdots,m\}$. In FGM, we iteratively find the most violated constraint and add it to the set $\mathcal{C}$, which is a subset of $\mathcal{D}$, i.e. $\mathcal{C}\subseteq\mathcal{D}$. Further denote by $\mathcal{C}_k$ the constraint set at the $k$th iteration; then $\mathcal{C}_k \subseteq \mathcal{C}_{k+1}$. At the $k$th iteration, we find a new constraint $d_{k+1}$ based on $\alpha_k$, i.e. $-S(\alpha_k,d_{k+1}) = \max_{d\in\mathcal{D}} -S(\alpha_k,d)$. Define
$$\underline{\theta}_k = \max_{1\le i\le k} -S(\alpha_k,d_i) = \min_{\alpha\in\mathcal{A}}\ \max_{1\le i\le k} -S(\alpha,d_i) \qquad (7)$$
and
$$\overline{\theta}_k = \min_{1\le j\le k} -S(\alpha_j,d_{j+1}), \qquad (8)$$
where $-S(\alpha_j,d_{j+1}) = \max_{d\in\mathcal{D}} -S(\alpha_j,d)$. Similar to (Chen & Ye, 2008), we have the following theorem, which shows that FGM gradually approaches the optimal solution.

Theorem 1. Let $(\alpha^*, \theta^*)$ be the globally optimal solution pair of (4), and let $\{\underline{\theta}_k\}$ and $\{\overline{\theta}_k\}$ be defined as above. Then
$$\underline{\theta}_k \le \theta^* \le \overline{\theta}_k. \qquad (9)$$
As the iteration number $k$ increases, $\{\underline{\theta}_k\}$ is monotonically increasing and $\{\overline{\theta}_k\}$ is monotonically decreasing.

Proof. Note that $\theta^* = \min_{\alpha\in\mathcal{A}} \max_{d\in\mathcal{D}} -S(\alpha,d)$. For a fixed feasible $\alpha$, we have $\max_{d\in\mathcal{C}_k} -S(\alpha,d) \le \max_{d\in\mathcal{D}} -S(\alpha,d)$, and therefore $\min_{\alpha\in\mathcal{A}}\max_{d\in\mathcal{C}_k} -S(\alpha,d) \le \min_{\alpha\in\mathcal{A}}\max_{d\in\mathcal{D}} -S(\alpha,d)$, i.e. $\underline{\theta}_k \le \theta^*$. On the other hand, for $j = 1,\cdots,k$, $-S(\alpha_j,d_{j+1}) = \max_{d\in\mathcal{D}} -S(\alpha_j,d)$, so $(\alpha_j, -S(\alpha_j,d_{j+1}))$ is a feasible solution pair of (4). Hence $-S(\alpha_j,d_{j+1}) \ge \theta^*$ for $j = 1,\cdots,k$, and thus $\overline{\theta}_k = \min_{1\le j\le k} -S(\alpha_j,d_{j+1}) \ge \theta^*$. As $k$ increases, the subset $\mathcal{C}_k$ grows monotonically, so $\underline{\theta}_k$ is monotonically increasing, while $\overline{\theta}_k$, being a minimum over a growing set of terms, is monotonically decreasing. The proof is completed.

The following theorem further shows that FGM obtains a global solution of (4) within a finite number of steps.

Theorem 2. Assume that at each iteration the subproblems in Sections 2.3 and 2.4 can be solved. Then FGM stops after a finite number of steps with a global solution of (4).

Proof. We can measure the convergence of FGM by the gap between the series $\{\underline{\theta}_k\}$ and $\{\overline{\theta}_k\}$. After a finite number of iterations, the objective value no longer improves. Suppose that at the $k$th iteration there is no update of $\mathcal{C}_k$, i.e. $d_{k+1} = \arg\max_{d\in\mathcal{D}} -S(\alpha_k,d) \in \mathcal{C}_k$, and hence $\mathcal{C}_k = \mathcal{C}_{k+1}$. We can show that in this case $(\alpha_k,\underline{\theta}_k)$ is the globally optimal solution pair of (4). First, since $\mathcal{C}_k = \mathcal{C}_{k+1}$, there is no update of $\alpha$ in Algorithm 1, i.e. $\alpha_{k+1} = \alpha_k$. Then $-S(\alpha_k,d_{k+1}) = \max_{d\in\mathcal{D}} -S(\alpha_k,d) = \max_{d\in\mathcal{C}_k} -S(\alpha_k,d) = \max_{1\le i\le k} -S(\alpha_k,d_i) = \underline{\theta}_k$, and $\overline{\theta}_k = \min_{1\le j\le k} -S(\alpha_j,d_{j+1}) \le \underline{\theta}_k$. From Theorem 1, $\underline{\theta}_k \le \theta^* \le \overline{\theta}_k$, so $\underline{\theta}_k = \theta^* = \overline{\theta}_k$, and $(\alpha_k,\underline{\theta}_k)$ is the globally optimal solution pair of (4). The proof is completed.

2.7. Computational Complexity

QCQP-SSVM, SDP-SSVM and LPSVM are convex optimization problems, but they are very expensive even on medium-sized datasets. NMMKL has to solve a QCQP problem with $m$ quadratic constraints; when $m$ is large, NMMKL is therefore very computationally expensive. For SVM-RFE, the computational complexity largely depends on the number of features eliminated at each step. Assuming the training of a linear SVM takes $O(nm)$ time, SVM-RFE takes $O(nm^2)$ time if only one feature is removed from the feature list at each elimination step. SVM-RFE can be sped up by removing chunks of features at each step, which, however, may lead to a significant decline in classification performance. In contrast, FGM only needs to solve a series of MKL problems and to find the most violated $d$. Empirically, a maximum of 10 iterations is enough for FGM to converge. Moreover, all base kernels are linear, so the linear SVM subproblem in Section 2.3 can be solved with the LIBLINEAR software, which scales linearly in $n$ and $m$ (Hsieh et al., 2008), and the time complexity of the MKL step is proportional to that of the linear SVM. Finding the most violated $d$ amounts to finding the $B$ largest of the $m$ coefficients $c_j^2$, which takes only $O(m\log B)$ time. The time complexity of each FGM iteration is therefore $O(nm + m\log B)$. Thus, FGM is computationally efficient even for large-scale and very high dimensional problems.
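For a rough sense of scale (our own back-of-the-envelope illustration, not a result reported in the paper), consider the real-sim dataset in Table 1, with $n = 32{,}309$ training points and $m = 20{,}958$ features, and a budget of $B = 100$:
$$nm + m\log_2 B \;\approx\; 32{,}309 \times 20{,}958 \;+\; 20{,}958 \times 6.6 \;\approx\; 6.8\times 10^{8} + 1.4\times 10^{5},$$
so the cost of each iteration is dominated by the $O(nm)$ linear SVM training term, and only about ten such iterations are needed for FGM to converge.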
3. Experiments

3.1. Datasets

In this section, we evaluate the performance of various methods on a synthetic dataset and a collection of real-world datasets. The synthetic dataset is a binary classification problem with three informative features f1, f2 and f3; the data generation follows the description in (Xu et al., 2009). The informative features are plotted in Figure 1, where f4 is a noise feature. Among the three informative features, f1 and f2 are composite features generated from two different multivariate Gaussian distributions: as a group they are the most informative features for classification, while f3 is the most informative single feature. Ideally, a good SSVM or "non-monotonic" feature selection method should identify f1 and f2 as a group; we refer to them as the ideal features.

[Figure 1. The synthetic dataset: panel (a) plots f1 against f2 and panel (b) plots f3 against the noise feature f4.]

The real-world datasets fall into two categories. The first category includes 2 small datasets (wdbc, from the UCI machine learning repository, and the binary usps dataset, from http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html) and 2 Microarray datasets (Breast cancer and Leukemia, from http://www.kyb.tuebingen.mpg.de/bs/people/weston/l0). For these small datasets, we use a cross validation scheme to assess performance because of the insufficient number of samples. The second category contains 6 large-scale and very high dimensional datasets. Among them, real-sim, rcv1.binary and news20.binary are from the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), and Arxiv astro-ph is described in (Joachims, 2006). Arxiv astro-ph and rcv1.binary have already been split into training and test sets; for real-sim and news20.binary, we split them into training and test sets manually. The last two are URL datasets derived from an anonymized 120-day subset of the ICML-09 URL data (Ma et al., 2009). The original URL dataset contains 120 independent subsets collected over 120 days (http://archive.ics.uci.edu/ml/datasets/url+reputation). Because of space limitations, we only use the data from the first three days: we train on the data collected on one day and predict on the data collected on the following day, resulting in two new datasets denoted URL0 and URL1. Detailed information about these datasets and their splits is summarized in Table 1. For all datasets, each dimension of the data is normalized to zero mean and unit variance.

Table 1. Datasets used in the experiments.
DATASET          # FEATURES    # TRAINING PTS.    # TEST PTS.
wdbc                     30                569              -
usps                    241              1,500              -
Breast cancer         7,129                 38              -
Leukemia              7,129                 72              -
real-sim             20,958             32,309         40,000
rcv1.binary          47,236             20,242        677,399
Arxiv astro-ph       99,757             62,369         32,487
news20.binary     1,355,191              9,996         10,000
URL0              3,231,961             16,000         20,000
URL1              3,231,961             20,000         20,000
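As a small illustration of the preprocessing step above (a generic sketch, not the authors' pipeline), each feature can be standardized using training-set statistics:

    import numpy as np

    def standardize(X_train, X_test):
        """Normalize each dimension to zero mean and unit variance using training statistics."""
        mean = X_train.mean(axis=0)
        std = X_train.std(axis=0)
        std[std == 0] = 1.0                    # guard against constant features
        return (X_train - mean) / std, (X_test - mean) / std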
3.2. Experimental Setup

In our experiments, we compare FGM with QCQP-SSVM, NMMKL, LPSVM and SVM-RFE. We do not include results for SDP-SSVM because of its high computational cost (Chan et al., 2007). Because SimpleMKL may select multiple kernels (i.e. multiple $d$'s may be selected), the number of features selected by FGM is self-determined and may be larger than $B$. Although this property may lead to a decline in sparsity, it is very important for "non-monotonic" feature selection, since we usually have no idea in advance of how many features should be selected. Alternatively, we can select exactly $B$ features by ranking the features according to the score $c_j$, as in SVM-RFE and NMMKL, and forming the final feature subset from the top-ranked ones; we call this strategy FGM-B.

To better illustrate the sparsity property, we define the sparsity ratio as $\varrho(w) = 1 - \mathrm{Card}(w)/m$, i.e. the fraction of zero entries in $w$, where $\mathrm{Card}(w)$ denotes the number of nonzeros in $w$. However, LPSVM and QCQP-SSVM rarely achieve completely sparse solutions on some problems. Following (Chan et al., 2007), we therefore define $\mathrm{Card}(w)$ for LPSVM and QCQP-SSVM as the number of weights $w_j$ with large relative magnitude, i.e. the number of elements with $|w_j|/\max_i |w_i| \ge 10^{-4}$.

Experiments are performed on a 2.27GHz Intel(R) Core(TM) Duo CPU running Windows Server 2003 with 24.0GB main memory. We use MOSEK (version 5.0.0.127) to solve QCQP-SSVM and LPSVM, and LIBLINEAR (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) to solve FGM, NMMKL and SVM-RFE. The dual coordinate descent algorithm for L2-SVM (DCDL2) is adopted as the baseline classification method.
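The following small Python helpers (our own illustration of the definitions above, not the authors' code) compute the sparsity ratio, using the relative-magnitude convention for methods such as LPSVM and QCQP-SSVM that rarely produce exact zeros, and rank features by $c_j^2$ as in the FGM-B strategy.

    import numpy as np

    def sparsity_ratio(w, tol=None):
        """rho(w) = 1 - Card(w)/m; with tol set, count |w_j|/max|w_i| >= tol as nonzero."""
        w = np.asarray(w, dtype=float)
        if tol is None:
            card = np.count_nonzero(w)
        else:
            mx = np.max(np.abs(w))
            card = 0 if mx == 0 else np.count_nonzero(np.abs(w) / mx >= tol)
        return 1.0 - card / w.size

    def fgm_b_ranking(X, y, alpha, B):
        """FGM-B style selection: keep the B features with the largest scores c_j^2."""
        c = X.T @ (alpha * y)
        return np.argsort(c ** 2)[-B:][::-1]   # feature indices, most important first

    # e.g. sparsity_ratio(w_lpsvm, tol=1e-4) follows the convention of (Chan et al., 2007)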
3.3. Experiments on Synthetic Data

To thoroughly study the performance of the different methods, two synthetic experiments are performed. In the first experiment, we generate 1,000 instances (500 for each class) and randomly choose 500 for training and the rest for testing. We then gradually add noise features to the dataset and test whether the considered methods can still identify the first two informative features. The initial number of noise features is set to 100; half of the noise features are uniformly distributed and the other half are generated from a Gaussian distribution. In this experiment, the parameter C for all algorithms is set to 0.1 and B to 2. The additional parameter in NMMKL is set to 1/C. For SVM-RFE, as suggested in (Guyon et al., 2002), we remove 10 features at a time when more than 200 features remain, and one feature at a time otherwise. To show whether the methods can identify the composite features f1 and f2, in Figure 2(a) we denote by SVM(IDEAL) the testing accuracy obtained using the ideal features, and we use SVM with all features, denoted SVM(ALL), as the baseline. All experiments are repeated 5 times.

[Figure 2. Results on the synthetic dataset with varying numbers of noise features: (a) testing accuracy, (b) sparsity ratio, (c) training time.]
[Figure 3. Results on the synthetic dataset with varying numbers of training samples: (a) testing accuracy, (b) sparsity ratio, (c) training time.]

In Figure 2(a), as the number of noise features increases, only FGM-B obtains the same accuracy as SVM(IDEAL), i.e. FGM-B successfully identifies the composite (ideal) features. The sparsity of QCQP-SSVM increases with the number of features, but its testing accuracy decreases as more noise features are added. Another problem of QCQP-SSVM is a rapid increase in memory usage as the number of features grows: in our experiments, the MOSEK solver for QCQP runs out of memory once the number of features exceeds 8,000. NMMKL can identify the composite features when the noise level is relatively small, but its performance declines as the noise increases, and, as Figure 2(c) shows, it cannot deal with very high dimensional problems. LPSVM shows competitive performance compared with FGM. SVM-RFE performs well when the number of noise features is not very large (fewer than 2,500). As for SVM(ALL), Figure 2(a) shows that SVM cannot learn an accurate classifier once the number of noise features exceeds 2,000. Figure 2(c) reports the training time of the various methods: compared with SVM-RFE, FGM is clearly more computationally efficient on higher dimensional problems.

In the second experiment, we fix the number of features to 2,000 and gradually increase the number of instances, starting from 1,000; half of the instances are used for training and the other half for testing. Because NMMKL cannot handle such a large number of dimensions, its results are not included in this experiment. Figure 3 shows the testing accuracy, sparsity ratio and training time of the various methods. Although the testing accuracy of FGM is lower than that of SVM(IDEAL), the testing accuracy of FGM-B matches SVM(IDEAL), which indicates that FGM also identifies the composite features. SVM-RFE fails to identify the first two composite features when the number of training samples is relatively small, but its performance improves as the number of training examples increases. Finally, from Figure 3(b), QCQP-SSVM and LPSVM do not provide good sparsity, and neither of them can handle large-sample problems.
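For reference, the SVM-RFE baseline used above follows the standard recursive elimination scheme of (Guyon et al., 2002). The sketch below is a generic reimplementation, with scikit-learn's LinearSVC standing in for the linear SVM (an assumption on our part; the paper uses LIBLINEAR through its own interface); it removes a chunk of the lowest-ranked features at each step, mirroring the chunked elimination schedule described above.

    import numpy as np
    from sklearn.svm import LinearSVC

    def svm_rfe(X, y, n_keep, chunk=10, C=0.1):
        """Recursively eliminate the features with the smallest squared weights."""
        active = np.arange(X.shape[1])          # surviving (original) feature indices
        while active.size > n_keep:
            clf = LinearSVC(C=C, loss="squared_hinge").fit(X[:, active], y)
            ranking = np.argsort(clf.coef_.ravel() ** 2)   # weakest features first
            n_drop = min(chunk, active.size - n_keep)
            active = np.delete(active, ranking[:n_drop])
        return active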
3.4. Experiments on Small Datasets

For the experiments on the small datasets, we randomly split each dataset into 60% for training and 40% for testing. For wdbc and usps, the parameter C is selected by 5-fold cross validation over the range {0.01, 0.05, 0.1, 0.5, 1, 5}. For the Microarray datasets, due to the small sample size, we simply set C = 0.01. The second parameter B is chosen from {0.1m, 0.2m, 0.3m, 0.4m} for the first two datasets and from {2, 5, 10, 20} for the Microarray datasets. For SVM-RFE, we follow the settings of the synthetic experiments. Each method is repeated 30 times and averaged performances are reported.

Table 2 summarizes the sparsity achieved by the various methods. For FGM-B, SVM-RFE and NMMKL, whose sparsity is determined directly by the parameter B, we do not include sparsity results in Table 2. We did not obtain results for NMMKL on the Microarray datasets because it cannot handle such high dimensional problems. As expected, FGM generally obtains the sparsest solutions on all datasets.

Table 2. Sparsity ratio (in %) of various methods on the small datasets (LPSVM does not depend on B, so one value is reported per dataset).
DATASET         B     FGM          QCQP-SSVM     LPSVM
wdbc            0.1m  84.00±2.38   86.67±0.00    17.56±16.00
wdbc            0.2m  77.78±2.20   65.56±15.50
wdbc            0.3m  69.11±1.94   0.33±1.02
wdbc            0.4m  59.22±2.09   0.00±0.00
usps            0.1m  87.74±3.34   6.03±4.29     48.23±14.91
usps            0.2m  80.98±2.59   7.98±5.82
usps            0.3m  73.44±5.26   7.95±5.78
usps            0.4m  73.99±1.71   8.49±6.46
Breast cancer   2     99.92±0.02   12.59±29.99   89.46±29.94
Breast cancer   5     99.86±0.03   79.90±40.57
Breast cancer   10    99.81±0.03   99.72±0.05
Breast cancer   20    99.71±0.03   99.44±0.05
Leukemia        2     99.90±0.02   1.26±1.19     99.79±0.03
Leukemia        5     99.85±0.03   49.46±50.35
Leukemia        10    99.71±0.04   99.53±0.07
Leukemia        20    99.62±0.03   99.25±0.08

The testing accuracies of the various methods are listed in Table 3. From this table, we observe that FGM obtains competitive or better results on all the small benchmark datasets.

Table 3. Testing accuracies (in %) of various methods on the small datasets (LPSVM does not depend on B; NMMKL results are unavailable on the Microarray datasets).
DATASET         B     FGM           FGM-B         QCQP-SSVM     NMMKL         SVM-RFE       LPSVM
wdbc            0.1m  91.11±2.52    90.96±1.50    86.77±2.40    92.15±1.17    92.15±1.29    88.71±1.57
wdbc            0.2m  91.81±2.37    92.44±1.10    90.85±2.44    92.37±1.14    92.25±1.20
wdbc            0.3m  92.46±1.13    92.43±0.99    90.70±1.42    92.30±1.45    92.51±1.10
wdbc            0.4m  92.47±1.13    92.37±1.08    90.57±1.46    92.18±1.24    92.51±1.22
usps            0.1m  90.78±1.36    79.69±1.89    84.32±9.06    80.38±1.20    80.62±1.33    89.05±1.79
usps            0.2m  91.47±1.27    81.97±2.27    84.76±8.91    83.47±2.02    85.78±1.24
usps            0.3m  91.23±1.88    87.49±1.05    83.08±8.14    87.90±1.26    88.82±1.30
usps            0.4m  91.99±1.43    89.60±1.07    83.08±8.30    89.47±1.06    90.08±1.40
Breast cancer   2     86.88±8.10    67.92±17.50   48.96±14.13   -             76.04±17.61   80.63±15.60
Breast cancer   5     88.75±8.90    87.50±11.72   72.29±18.62   -             85.00±10.46
Breast cancer   10    87.92±11.60   89.58±11.76   79.79±15.46   -             88.54±10.52
Breast cancer   20    87.71±10.57   88.13±11.17   81.04±16.29   -             89.17±10.63
Leukemia        2     91.61±5.85    85.75±5.71    62.87±22.10   -             88.39±4.75    87.01±8.53
Leukemia        5     93.91±5.56    95.06±4.31    80.80±15.09   -             93.91±4.85
Leukemia        10    94.37±4.39    94.48±5.17    93.91±6.38    -             95.63±4.87
Leukemia        20    95.52±4.26    95.75±3.91    91.26±7.56    -             95.75±3.81

Table 4 lists the training time of the various methods on the small datasets. From this table, QCQP-SSVM and LPSVM are more efficient on these small problems; however, neither of them can handle very large problems.

Table 4. Training time (in seconds) of various methods on the small datasets (LPSVM does not depend on B; NMMKL results are unavailable on the Microarray datasets).
DATASET         B     FGM           QCQP-SSVM     NMMKL         SVM-RFE       LPSVM
wdbc            0.1m  2.09±1.49     0.23±0.03     1.17±0.66     0.45±0.07     0.25±0.03
wdbc            0.2m  1.41±1.04     0.23±0.03     1.30±0.73     0.29±0.18
wdbc            0.3m  1.14±0.70     0.23±0.03     1.26±0.62     0.33±0.14
wdbc            0.4m  0.95±0.68     0.23±0.03     1.41±0.80     0.35±0.10
usps            0.1m  73.15±66.31   3.06±0.46     94.33±69.43   10.60±12.95   6.66±1.12
usps            0.2m  18.84±8.36    2.88±0.31     49.04±46.70   7.86±5.05
usps            0.3m  13.60±8.22    2.85±0.23     55.44±24.58   7.73±5.43
usps            0.4m  7.58±1.94     2.90±0.28     72.82±13.20   7.98±7.60
Breast cancer   2     9.24±3.54     8.29±0.62     -             9.35±0.78     0.67±0.10
Breast cancer   5     9.18±3.61     8.78±0.59     -             9.33±0.75
Breast cancer   10    7.84±2.50     9.02±0.69     -             9.32±0.86
Breast cancer   20    5.96±2.36     9.11±0.59     -             9.31±0.77
Leukemia        2     13.01±5.96    9.71±0.96     -             28.67±2.89    2.07±0.21
Leukemia        5     18.32±7.37    10.10±0.94    -             28.43±2.92
Leukemia        10    17.01±4.38    10.34±0.70    -             28.40±3.00
Leukemia        20    16.16±3.52    10.48±0.63    -             28.18±2.79

3.5. Experiments on Very High Dimensional Datasets

In this subsection, we verify the performance of FGM on the large-scale datasets listed in Table 1, which have both a large number of instances and a large number of features. Note that some methods, such as NMMKL, QCQP-SSVM and LPSVM, cannot be used here because of their high computational cost or memory requirements. We therefore only compare FGM, FGM-B and SVM-RFE. For the parameter settings, we fix C = 5. The parameter B for FGM and SVM-RFE is set in the range {2, 5, 10, 50, 100, 150, 200, 250} for the former four datasets and {2, 5, 10, 20, 30, 40, 50, 60} for the two URL datasets. For SVM-RFE, we remove 100 features per step on the first four large datasets. On the two URL datasets, however, SVM-RFE with 100 features eliminated per step is still very computationally expensive, so we remove 10,000 features per step whenever more than 20,000 features remain. Figure 4 reports the testing accuracy against the number of selected features, and Figure 5 reports the training time for different values of B.

From Figure 4, we have the following observations: (a) On the real-sim, rcv1.binary and news20.binary datasets, FGM clearly outperforms SVM-RFE and FGM-B in testing accuracy with the selected features; FGM-B also improves over SVM-RFE on these datasets. (b) On the Arxiv astro-ph dataset, although FGM does not show significant improvements over SVM-RFE, its counterpart FGM-B is slightly better than SVM-RFE. (c) On the two URL datasets, FGM is much better than SVM-RFE when identifying a small number of features. Finally, from Figure 5, we conclude that FGM is very efficient when dealing with very high dimensional problems.

[Figure 4. Testing accuracy versus the number of selected features for FGM, FGM-B and SVM-RFE on (a) real-sim, (b) rcv1.binary, (c) Arxiv astro-ph, (d) news20.binary, (e) URL0 and (f) URL1.]
[Figure 5. Training time (CPU time in seconds) versus B for FGM and SVM-RFE on (a) real-sim, (b) rcv1.binary, (c) Arxiv astro-ph, (d) news20.binary, (e) URL0 and (f) URL1.]
4. Conclusion

In this paper, we propose a novel SVM algorithm that learns a sparse feature subset for classification. In particular, a 0-1 vector is introduced into the SVM to control whether each feature is selected, resulting in a Mixed Integer Programming (MIP) problem. By introducing a convex relaxation, the MIP is further transformed into a convex Multiple Kernel Learning problem with an exponentially large number of base kernels. Finally, an efficient and scalable cutting plane algorithm, the "Feature Generating Machine" (FGM), is introduced to iteratively generate and learn a pool of informative and sparse feature subsets. Because FGM only requires solving a small number of MKL problems with very few linear kernels, and the internal MKL subproblem involves only a linear SVM that can be solved by the state-of-the-art LIBLINEAR software, FGM is well suited to large-scale and very high dimensional problems. Moreover, thanks to its global convergence property, the size of the final feature subset in FGM can be optimally determined, catering to the "non-monotonic" requirement in feature selection. Comprehensive experiments on both synthetic and real-world datasets verify the good classification performance and efficiency of FGM.

Acknowledgments

This research was in part supported by Singapore MOE AcRF Tier-1 Research Grant (RG15/08).

References

Blum, A. L. and Langley, P. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245-271, 1997.

Bradley, P. S. and Mangasarian, O. L. Feature selection via concave minimization and support vector machines. In ICML, 1998.

Chan, A. B., Vasconcelos, N., and Lanckriet, G. R. G. Direct convex relaxations of sparse SVM. In ICML, 2007.

Chen, J. and Ye, J. Training SVM with indefinite kernels. In ICML, 2008.

Fung, G. M. and Mangasarian, O. L. A feature selection Newton method for support vector machine classification. Computational Optimization and Applications, 28:185-202, 2004.

Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157-1182, 2003.

Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389-422, 2002.

Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S. S., and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.

Joachims, T. Training linear SVMs in linear time. In ACM KDD, 2006.

Kelley, J. E. The cutting plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703-712, 1960.

Kim, S.-J. and Boyd, S. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM Journal on Optimization, 2008.

Li, Y. F., Kwok, J. T., Tsang, I. W., and Zhou, Z. H. A convex method for locating regions of interest with multi-instance learning. In ECML, 2009a.

Li, Y. F., Tsang, I. W., Kwok, J. T., and Zhou, Z. H. Tighter and convex maximum margin clustering. In AISTATS, 2009b.

Ma, J., Saul, L. K., Savage, S., and Voelker, G. M. Identifying suspicious URLs: an application of large-scale online learning. In ICML, 2009.

Rakotomamonjy, A., Bach, F., Grandvalet, Y., and Canu, S. SimpleMKL. J. Mach. Learn. Res., 9:2491-2521, 2008.
Weston, J., Elisseeff, A., and Schölkopf, B. Use of the zero-norm with linear models and kernel methods. J. Mach. Learn. Res., 3:1439-1461, 2003.

Xu, Z., Jin, R., Ye, J., Lyu, M. R., and King, I. Non-monotonic feature selection. In ICML, 2009.

Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. 1-norm support vector machines. In NIPS, 2003.