Feature Selection as a One-Player Game

Romaric Gaudel   romaric.gaudel@lri.fr
Michèle Sebag   michele.sebag@lri.fr
LRI, CNRS UMR 8623 & INRIA-Saclay, Bât. 490, Université Paris-Sud 11, 91405 Orsay Cedex, France

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

This paper formalizes Feature Selection as a Reinforcement Learning problem, leading to a provably optimal though intractable selection policy. As a second contribution, this paper presents an approximation thereof, based on a one-player game approach and relying on the Monte-Carlo tree search UCT (Upper Confidence Tree) proposed by Kocsis and Szepesvári (2006). The Feature UCT SElection (FUSE) algorithm extends UCT to deal with i) a finite unknown horizon (the target number of relevant features) and ii) the huge branching factor of the search tree, reflecting the size of the feature set. Finally, a frugal reward function is proposed as a rough but unbiased estimate of the relevance of a feature subset. A proof of concept of FUSE is shown on benchmark data sets.

1. Introduction

Feature Selection (FS), one key issue in statistical learning, is a combinatorial optimization problem aimed at minimizing the generalization error of the learned hypothesis. FS is tackled using three main approaches: scoring, wrapper and embedded FS (more in section 2). Scoring approaches independently assess and rank the features w.r.t. the classification problem; as a counterpart, they poorly account for feature inter-dependencies. Wrapper methods tackle the whole combinatorial optimization problem, exploring the powerset of the feature set and computing, for each candidate subset, an estimate of the generalization error. Embedded approaches exploit the learned hypothesis and/or incorporate sparsity criteria during learning to achieve FS.

This paper formalizes Feature Selection as a Reinforcement Learning (RL) problem: find a good feature selection strategy, ultimately resulting in a minimal generalization error. While this formalization makes it possible to define a provably optimal policy, such a policy is clearly intractable. An approximation thereof is obtained by casting the RL problem as a one-player game; the Upper Confidence Tree (UCT) framework (Kocsis & Szepesvári, 2006) yields a robust approach to optimization under uncertainty.

The paper is organized as follows: section 2 briefly reviews related work, without aiming at exhaustiveness. The formalization of FS as a Reinforcement Learning problem is presented in section 3. For the sake of self-containedness, section 4 describes the UCT framework. Section 5 focuses on how to deal with the huge branching factor of the search tree, while section 6 presents the other UCT extensions needed to handle the Feature Selection problem and gives an overview of the FUSE (Feature UCT SElection) algorithm. A proof of principle of the approach on benchmark problems from the NIPS 2003 FS Challenge (Guyon et al., 2004) is discussed in section 7, and the paper concludes with some perspectives for further work.

2. State of the Art

Feature Selection mostly aims at reducing the generalization error of the learned hypothesis, through removing noise (filtering out irrelevant features) or considering simpler hypothesis spaces (filtering out redundant features). As already mentioned, three categories of FS algorithms are classically distinguished.
Filtering approaches proceed by independently ranking the features according to some score function and selecting the top-ranked ones. The score basically measures the correlation between the feature and the class values, using e.g. Student's t-test, its multi-class extension (aka ANOVA, the Analysis of Variance), or the Fisher test. The main limitation of the above score functions is that they discard feature interdependencies. A filtering approach accounting for feature correlations, RELIEF measures the feature impact on each example neighborhood and its purity (Kira & Rendell, 1992); the example neighborhood, defined from the Euclidean distance built on all features, might however be biased by feature redundancy.

Wrapper approaches tackle the whole combinatorial optimization problem of FS, exploring the powerset of the feature set F and measuring (an estimate of) the generalization error of all considered subsets. The crux of wrapper approaches is how to navigate in the feature powerset lattice and deal with the exploration versus exploitation (EvE) dilemma. On the one hand, exploration must be limited for the sake of scalability. On the other hand, exploitation, aka myopic search, is bound to end up in a local optimum. Hill-climbing approaches (Hall, 2000), mixing forward and backward selection (Zhang, 2008a), using look-ahead or making large steps in the lattice (Margaritis, 2009), and combining global and local search operators (Boullé, 2007), among many other heuristics, have been proposed to address the EvE dilemma.

Embedded approaches combine learning and feature selection through prior or posterior regularization. (Tibshirani, 1994) pioneered the use of L1 regularization, amenable to Linear Programming-based learning (Fung & Mangasarian, 2000) and paving the way for the so-called least absolute shrinkage and selection operator (Lasso) approach (Efron et al., 2004). Quite a few other regularization approaches, mostly considering linear hypothesis spaces, have been considered along the same lines (Guyon et al., 2002; Chan et al., 2007; Zhang, 2008b; Helleputte & Dupont, 2009). An extension of the above to non-linear spaces is given by Multiple Kernel Learning (MKL) (Bach, 2008), achieving a layer-wise exploration of the "kernel lattice" as opposed to the feature lattice. While the kernel lattice indeed offers a more compact and hence more tractable search space than the feature powerset, the approach clearly cannot be extended beyond certain limits. Other approaches use non-linear hypotheses (Shen et al., 2008; Varma & Babu, 2009) or random forests (Breiman et al., 1984; Rogers & Gunn, 2005) to compute feature scores. The main weakness of these scores, in an FS perspective, is to be hindered by feature redundancy.

In summary, filtering and embedded approaches alike consider each feature in a fixed context: either standalone (and they do not capture feature correlations) or together with all other features (and they are hindered by feature redundancy). Only wrapper approaches, considering feature subsets, have a chance to see features at their best; but the price to pay is that of considering sufficiently many subsets. Regarding wrapper approaches, the delicate issue is to control the Exploration versus Exploitation (EvE) tradeoff.
It thus comes naturally to wonder whether: i/ a feature subset can be provided with a cheap, unbiased assessment; ii/ these assessments can be used to control the EvE tradeoff in an efficient way, gradually focusing the search on the most promising regions of the feature lattice. Both questions are at the root of the proposed Multi-Armed Bandit-based approach to feature selection.

3. Feature Selection as Reinforcement Learning

The first contribution of the presented work is to formalize FS as a Markov Decision Process (MDP) on the feature set lattice, as follows. Let $\mathcal{F}$ denote the finite set of features, augmented with an additional stopping feature noted $f_s$. The state space $\mathcal{S}$ of the MDP is the powerset of $\mathcal{F}$; final states are all states $F \subseteq \mathcal{F}$ containing $f_s$. The action space, likewise, is the set of features. Within any non-final state $F$, a possible action is to select a feature in $\mathcal{F} \setminus F$: letting $p : \mathcal{S} \times \mathcal{F} \times \mathcal{S} \to \mathbb{R}^+$ be the transition function, $p(F, f, F')$ is non-zero iff $F' = F \cup \{f\}$. As the main goal of FS is to yield a minimal generalization error at the end of the learning process, the reward associated to a final state $F$ is naturally the generalization error of the learned hypothesis $\mathcal{A}(F \setminus \{f_s\})$, noted $\mathrm{Err}(\mathcal{A}(F))$.

Let $\pi$ be an MDP policy associating an action to each state, and let $F_\pi$ denote the final state (the subset of features) built by following $\pi$ starting from the empty state. Clearly the optimal policy $\pi^*$ is the one minimizing $\mathrm{Err}(\mathcal{A}(F_\pi))$:
\[
\pi^* = \operatorname*{argmin}_{\pi} \mathrm{Err}\big(\mathcal{A}(F_\pi)\big) \qquad (1)
\]
Following Bellman's optimality principle (Bellman, 1957) and defining the optimal value function $V^*$ as
\[
V^*(F) =
\begin{cases}
\mathrm{Err}\big(\mathcal{A}(F)\big) & \text{if } F \text{ is final} \\
\min_{f \in \mathcal{F} \setminus F} V^*\big(F \cup \{f\}\big) & \text{otherwise}
\end{cases}
\qquad (2)
\]
the optimal policy is defined from $V^*$ as follows:
\[
\pi^*(F) = \operatorname*{argmin}_{f \in \mathcal{F} \setminus F} V^*\big(F \cup \{f\}\big) \qquad (3)
\]
While the above policy is provably optimal, it does not lead to a tractable algorithm, as the state space is exponential in the number of features. The second contribution of the paper is to provide a tractable approximation of this optimal policy, taking inspiration from (Rolet et al., 2009). The proposed policy approximation relies on the Upper Confidence Tree (UCT) framework (Kocsis & Szepesvári, 2006), which extends the optimal exploration vs. exploitation tradeoff of the Upper Confidence Bound algorithm (UCB) (Auer et al., 2002) to sequential decision making.
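For illustration, the following minimal Python sketch spells out the exhaustive Bellman recursion of Eqs. (2)-(3); it is exponential in $|\mathcal{F}|$ and only serves to make the intractability concrete. The function err is a placeholder for the generalization-error estimate $\mathrm{Err}(\mathcal{A}(F))$; all names are illustrative and do not come from the paper.

```python
# Exhaustive Bellman recursion over the feature lattice (Eqs. (2)-(3)).
# `err(selected)` is an assumed oracle returning Err(A(F)); exponential cost,
# usable only on toy feature sets.
def optimal_value(selected, remaining, err):
    """V*(F): stopping now costs err(F); otherwise recurse on each extension."""
    best = err(selected)                 # action f_s: stop and learn from F
    for f in remaining:
        best = min(best, optimal_value(selected | {f}, remaining - {f}, err))
    return best

def optimal_action(selected, remaining, err):
    """pi*(F): the action (a feature, or 'stop' for f_s) minimizing V*(F u {f})."""
    best_f, best_v = "stop", err(selected)
    for f in remaining:
        v = optimal_value(selected | {f}, remaining - {f}, err)
        if v < best_v:
            best_f, best_v = f, v
    return best_f
```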
4. Upper Confidence Tree for Optimal Policy Approximation

For the sake of self-containedness, this section briefly summarizes UCT, a Monte-Carlo tree search algorithm which asymmetrically grows the search tree to explore its most promising regions (Fig. 1). UCT proceeds by successive iterations, where each iteration executes a sequence of actions divided into two phases: a bandit-based phase and a random phase.

[Figure 1. One iteration of FUSE (UCT approach for FS): a bandit-based phase within the search tree, the addition of a new node, and a random phase.]

Within the bandit-based phase, in each node $F$ one selects the action maximizing the Upper Confidence Bound (UCB) criterion (Auer et al., 2002):
\[
a^* = \operatorname*{argmax}_{a \in A} \; \hat\mu_{F,a} + \sqrt{\frac{c_e \ln T_F}{t_{F,a}}} \qquad (4)
\]
where $T_F$ stands for the number of times node $F$ has been visited, $t_{F,a}$ denotes the number of times action $a$ has been selected in node $F$, and $\hat\mu_{F,a}$ is the average reward collected when selecting action $a$ from node $F$. Parameter $c_e$ (set to 2 in the original UCB formula) was introduced in UCT to control the exploration strength (Kocsis & Szepesvári, 2006).

At some point, one arrives at a node $F$ which does not belong to the search tree. This node is added to the tree and the iteration switches to the random phase: new actions are selected, either randomly or following ad-hoc rules, until a final node is reached. At this point, the instant reward is computed and the $T_F$, $t_{F,a}$ and $\hat\mu_{F,a}$ indicators attached to each node and action of the visited branch are updated accordingly.

5. The Many-armed Issue

A general limitation of Multi-Armed Bandit algorithms, when the number of actions is large compared to the number of allowed iterations, is to be biased toward exploration (Wang et al., 2008). This limitation is even more severe for UCT, which must resist over-exploration at each level of the tree. This section recalls several heuristics proposed to control exploration and details how they are combined in the proposed approach.

5.1. Restricting the Number of Arms

A first modification of the original UCT algorithm is to limit the strength of the exploration term, taking into account the empirical variance $\hat\sigma^2_{F,a}$ of the rewards collected when selecting action $a$ from node $F$. Formally, the score of Eq. (4) is replaced by the so-called UCB1-tuned score (Auer et al., 2002):
\[
\hat\mu_{F,a} + \sqrt{\frac{c_e \ln T_F}{t_{F,a}} \, \min\!\left(\frac14,\; \hat\sigma^2_{F,a} + \sqrt{\frac{2 \ln T_F}{t_{F,a}}}\right)} \qquad (5)
\]
Two heuristics, a discrete and a continuous one, have been considered to further control the exploration. The continuous one, proposed by (Gelly & Silver, 2007), sets constant $c_e$ to a very small value; the discrete one, proposed by (Coulom, 2006), restricts the number of considered child nodes depending on the number of visits $T_F$ of the parent node. Formally, a new child node is added whenever the integer part of $T_F^{\,b}$ is incremented, where $b < 1$ is a parameter of the algorithm.

5.2. Rapid Action Value Estimation

As shown in earlier work, the selection of new nodes can benefit from any prior knowledge (Rolet et al., 2009) or from any knowledge gained during the search. Taking inspiration from MoGo (Gelly & Silver, 2007), the knowledge gradually acquired along the search is summarized through the so-called Rapid Action Value Estimation (RAVE) vector, associating to each feature $f$ the average reward of all final nodes $F_t$ containing $f$:
\[
\text{g-RAVE}_f = \operatorname{average}\{\, V(F_t) : f \in F_t \,\} \qquad (6)
\]
While g-RAVE provides a global indication of feature relevance, it also makes sense to consider a feature conditionally to those selected within the current node, yielding the $\ell$-RAVE vector:
\[
\ell\text{-RAVE}_{F,f} = \operatorname{average}\{\, V(F_t) : F \leadsto F_t,\ f \in F_t \,\} \qquad (7)
\]
where the average is taken over the iterations $t$ visiting node $F$ and ending up in final node $F_t$, noted $F \leadsto F_t$.

Regarding the stopping feature $f_s$, one RAVE score is associated to each size of feature set. Specifically, let $f_s^{(d)}$ denote the stopping feature at level $d$; $\text{g-RAVE}_{f_s^{(d)}}$ is defined as the average reward of all final nodes $F_t$ of size $d + 1$:
\[
\text{g-RAVE}_{f_s^{(d)}} = \operatorname{average}\{\, V(F_t) : |F_t| = d + 1 \,\} \qquad (8)
\]
Global and local scores provide estimates of the feature relevance (conditionally to the feature subset $F$ in the case of $\ell$-RAVE and $\hat\mu_{F,f}$), achieving different bias vs. variance trade-offs: g-RAVE is computed from more trials than $\ell$-RAVE, which in turn is computed from more trials than $\hat\mu_{F,f}$; but $\hat\mu_{F,f}$ (respectively $\ell$-RAVE) reflects the impact of $f$ conditionally to $F$ more precisely than $\ell$-RAVE (resp. g-RAVE).

5.3. Selection of New Nodes

RAVE scores are used to focus the search and avoid a (hopeless) uniform exploration of the feature space. How RAVE is used depends on whether the exploration heuristics (section 5.1) is discrete or continuous. The discrete one selects the feature top-ranked after RAVE whenever the integer part of $T_F^{\,b}$ is incremented. The continuous one replaces the UCB formula (Eq. (5)) by a weighted sum $(1-\beta)\cdot\text{UCB} + \beta\cdot\ell\text{-RAVE}$, where $\beta = \frac{c}{c + t_{F,f}}$ controls the impact of $\ell$-RAVE, the weight of which becomes small as soon as $f$ has been sufficiently selected from node $F$ ($t_{F,f} > c$) (Gelly & Silver, 2007). Along the same lines, we consider both g-RAVE and $\ell$-RAVE, selecting the feature $f$ maximizing:
\[
(1-\beta)\,\hat\mu_{F,f} + \beta\,\big[(1-\alpha)\,\ell\text{-RAVE}_{F,f} + \alpha\,\text{g-RAVE}_f\big]
+ \sqrt{\frac{c_e \ln T_F}{t_{F,f}} \, \min\!\left(\frac14,\; \hat\sigma^2_{F,f} + \sqrt{\frac{2 \ln T_F}{t_{F,f}}}\right)} \qquad (9)
\]
where $\alpha = \frac{c_l}{c_l + t_\ell}$ and $t_\ell$ denotes the number of iterations involved in the $\ell$-RAVE computation.

It must be noted that the FUSE tree and the RAVE score are tightly coupled. While the RAVE score guides the FUSE exploration, FUSE can inversely be viewed as a sophisticated way of building the RAVE score, providing an educated guess about feature relevance.
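As an illustration, the score of Eq. (9) can be computed as in the following Python sketch; the statistics ($\hat\mu_{F,f}$, $\hat\sigma^2_{F,f}$, $t_{F,f}$, $T_F$, the RAVE values) are assumed to be maintained by the search tree, and the names, default constants and handling of unvisited actions are illustrative assumptions rather than the paper's settings.

```python
import math

# A minimal sketch of the FUSE node-selection score (Eq. (9)): empirical mean,
# l-RAVE and g-RAVE blended by beta/alpha, plus a UCB1-tuned exploration term.
def selection_score(mu, var, t_Ff, T_F, l_rave, g_rave, t_l,
                    c_e=2.0, c=100.0, c_l=100.0):
    beta = c / (c + t_Ff)          # weight of the RAVE estimates vs. the empirical mean
    alpha = c_l / (c_l + t_l)      # weight of g-RAVE vs. l-RAVE
    exploitation = (1 - beta) * mu + beta * ((1 - alpha) * l_rave + alpha * g_rave)
    if t_Ff == 0:
        return float("inf")        # simplification: unvisited actions are tried first
    exploration = math.sqrt(
        (c_e * math.log(T_F) / t_Ff)
        * min(0.25, var + math.sqrt(2.0 * math.log(T_F) / t_Ff))
    )
    return exploitation + exploration
```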
6. UCT-based Feature Selection: FUSE

This section finally gives an overview of the FUSE algorithm, first detailing the reward function and the random-phase policy, which are specific to the Feature Selection task.

6.1. Instant Reward Function

A primary requirement for Monte-Carlo tree search is a computationally cheap reward that can be evaluated at each iteration. The FUSE reward function, aimed at estimating the generalization error, learns from a few examples only, for the sake of tractability, while accounting for non-linear concepts for the sake of accuracy.

Specifically, the reward attached to a feature subset $F$ is computed as follows. Let $d_F$ denote the Euclidean distance based on the features in $F$. Let $\mathcal{L}$ denote the training set and $\mathcal{V}$ an aggressive subsample of $\mathcal{L}$. For each labeled example $z = (x, y)$ in $\mathcal{V}$, let $N_{F,k}(x)$ denote the set of the $k$ nearest neighbors of $x$ in $\mathcal{L}$ according to $d_F$, and define $s_F(z)$ as the number of positive examples among these neighbors:
\[
s_F(z) = \big|\{\, z' = (x', y') \in N_{F,k}(x) : y' > 0 \,\}\big| \qquad (10)
\]
A reasonable tradeoff between a stable and a computationally frugal reward is given by the Area Under the ROC curve (AUC, aka Mann-Whitney-Wilcoxon criterion), defined as:
\[
V(F) = \frac{\big|\{\, (z, z') \in \mathcal{V}^2 : s_F(z) < s_F(z'),\ y < y' \,\}\big|}{\big|\{\, (z, z') \in \mathcal{V}^2 : y < y' \,\}\big|} \qquad (11)
\]
Algorithmically, for each positive example $z \in \mathcal{V}$, let $v(z)$ (respectively $w(z)$) denote the number of negative examples $z' \in \mathcal{V}$ such that $s_F(z') < s_F(z)$ (resp. $s_F(z') = s_F(z)$), and let $S(F)$ be the sum of $v(z) + \frac12 w(z)$ over all positive examples. Score $V(F)$ is set to $S(F)$ divided by the product of the number of positive and negative examples in $\mathcal{V}$.

Upon computing $V(F)$, the score of all nodes in the current path is updated. Although FUSE actually explores a graph (the lattice of the feature set), it does not update the reward of all nodes containing $F$, due to the high branching factor of the graph.

Computational complexity: $V(F)$ is computed with complexity linear in the size $n$ of $\mathcal{L}$, up to a logarithmic term. More precisely, the complexity is $\tilde{O}(mnd)$, where $m$ denotes the size of $\mathcal{V}$ and $d$ the size of the feature subset.
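A minimal Python sketch of this instant reward follows, computing the $k$-NN score of Eq. (10) on the subsample and the tie-corrected AUC of Eq. (11); the function and argument names are illustrative (the paper does not publish code), and excluding each example from its own neighborhood is an assumption.

```python
import numpy as np

# Sketch of the FUSE instant reward V(F) (Eqs. (10)-(11)): k-NN score on an
# aggressive subsample V of the training set L, then the Mann-Whitney AUC.
def instant_reward(X, y, subsample_idx, features, k=5):
    Xf = X[:, features]                             # restrict L to the feature subset F
    scores, positive = [], []
    for i in subsample_idx:                         # examples z = (x, y) in V
        d = np.linalg.norm(Xf - Xf[i], axis=1)      # Euclidean distance d_F over L
        d[i] = np.inf                               # assumption: exclude x itself
        nn = np.argsort(d)[:k]                      # k nearest neighbors in L
        scores.append(int(np.sum(y[nn] > 0)))       # s_F(z), Eq. (10)
        positive.append(y[i] > 0)
    scores, positive = np.array(scores), np.array(positive)
    pos, neg = scores[positive], scores[~positive]
    if len(pos) == 0 or len(neg) == 0:
        return 0.5                                  # degenerate subsample
    # AUC over (negative, positive) pairs, ties counted 1/2 (cf. section 6.1).
    wins = (neg[:, None] < pos[None, :]).sum() + 0.5 * (neg[:, None] == pos[None, :]).sum()
    return wins / (len(pos) * len(neg))
```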
6.2. Random Phase

A key specificity of the Feature Selection task, viewed as a Reinforcement Learning problem, is that it deals with a finite unknown horizon (the target number of features). This specificity is accounted for in the UCT random phase by enforcing the selection of the stopping feature at level $d$ with probability $1 - q^d$, where $q < 1$ is a parameter of the algorithm.

6.3. Overview of FUSE

The FUSE algorithm (Algorithm 1) constructs both a search tree and a RAVE score, yielding two different FS strategies. The first one, noted FUSE, retains the features on the most often visited path of the search tree (to be preferred to the path with maximal average reward, after the common UCT practice). The second one, denoted FUSE^R, uses the RAVE score to rank the features and proceeds as a standard filtering FS algorithm. In the following, the prefix C or D indicates whether FUSE and FUSE^R are used with the continuous or the discrete heuristics (section 5.1).

Algorithm 1 FUSE

FUSE
  Input: number of iterations T and many-armed behavior MA
  Output: search tree T and g-RAVE score
  Initialize T; for every f, g-RAVE(f) = 0
  for t = 1 to T do
    Iterate(T, g-RAVE, {})
  end for

Iterate
  Input: search tree T, score g-RAVE, subset F
  Output: reward V
  if F is final then
    V <- V(F \ {f_s}); update g-RAVE
  else
    if t_F != 0 then  {F already visited: bandit-based phase}
      if MA = progressive widening then
        f <- argmax over f in AllowedFeatures(F) of UCB1-tuned(F, f)
      else
        f <- argmax over f in F_all \ F of tradeoff-UCB/RAVE(F, f)
      end if
      V <- Iterate(T, g-RAVE, F u {f})
    else
      V <- Iterate-random(T, g-RAVE, F)
    end if
    Update T_F, t_{F,f}, mu_{F,f}, sigma^2_{F,f} and l-RAVE_{F,.}
  end if

Iterate-random
  Input: search tree T, score g-RAVE, subset F
  Output: reward V
  while rand() < q^{|F|} do
    f <- uniformly selected feature in F_all \ (F u {f_s})
    F <- F u {f}
  end while
  V <- V(F); update g-RAVE
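For concreteness, the random phase can be sketched as follows in Python; the while-loop test models drawing the stopping feature at depth $d$ with probability $1 - q^d$, and the names are illustrative.

```python
import random

# Sketch of Iterate-random (section 6.2): from the current subset F, keep adding
# uniformly drawn features; at depth d the walk continues with probability q**d,
# i.e. the stopping feature f_s is selected with probability 1 - q**d.
def random_phase(F, all_features, q=0.9):
    F = set(F)
    while random.random() < q ** len(F):
        remaining = [f for f in all_features if f not in F]
        if not remaining:            # no feature left besides f_s
            break
        F.add(random.choice(remaining))
    return F                         # a final subset, to be scored by V(F)
```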
7. Experimental Validation

This section aims at providing a proof of principle of the approach on standard benchmark data sets from the NIPS 2003 FS Challenge (Guyon et al., 2004). Due to space limitations, only binary classification problems and numerical features are considered.

7.1. Experimental Setting

The goal of the experiments is to answer three main questions. The first one concerns the overall FS performance of the proposed approach, depending on the complexity of the underlying target concept and on the feature correlations. The second one regards the comparative strengths and weaknesses of FUSE and FUSE^R, and the merits of the discrete and continuous heuristics used to control exploration. The third one regards the convergence speed of the approach depending on the computational effort.

Three benchmark data sets (Table 1) have been considered to answer the first question. The Madelon and Arcene data sets involve concepts with correlated features. Madelon is an artificial 500-feature dataset, where the target concept is the XOR of five relevant features; 15 further features are built as linear combinations of the relevant ones, and the remaining features are irrelevant. Arcene is the concatenation of several datasets relevant to related though different concepts (each dataset being related to a particular type of cancer); expectedly, the target concept in Arcene involves the disjunction of overlapping sub-concepts. The third dataset, Colon, is a microarray dataset introduced by (Alon et al., 1999), which is considered to be "simple".

Table 1. Dataset characteristics.

  Data set   Samples   Features   Properties
  Madelon    2,600     500        XOR-like
  Arcene     200       10,000     Disjunctive
  Colon      62        2,000      "Easy"

After (Guyon et al., 2007; Shen et al., 2008), Madelon and Colon are centered and normalized (based on the training set). For the sake of tractability, FUSE and CFS only consider the top 2,000 features of the Arcene dataset, ranked after their ANOVA score.

All reported results are averaged over ten independent runs. Each run considers a 5-fold stratified cross-validation (CV). For each fold, FUSE is launched with 200,000 Monte-Carlo iterations. The runtime is respectively 45 minutes, 5 minutes and 4 minutes on Madelon, Arcene and Colon, on an Intel Core 2 x 2.6GHz CPU with 2GB memory (only considering FS on the training set, i.e. 80% of the dataset). FUSE parameters include $k$ (used in the reward function, section 6.1), set to 5, and $q$ (used to control the average depth in the random phase, section 6.2), set to $1 - 10^i$ for $i = -1, -3, -5$. The exploration control heuristics involves parameter $c_e$, set to $10^i$ for $i = -4, -2, 0, 2$; the discrete heuristics (section 5.1) involves parameter $b$, set to 2 in all experiments; the continuous heuristics involves $c$ and $c_l$ (section 5.3), both set to $10^i$ for $i = $ -, 2, 4.

FUSE performance is assessed comparatively to three baseline approaches: Correlation-based Feature Selection (CFS) (Hall, 2000), the Random Forest-based Gini score (Gini-RF) (Breiman et al., 1984), and the Lasso (Tibshirani, 1994), respectively using the Weka implementation of CFS, the R implementation of Gini-RF (with 1,000 trees), and the R (glmnet) implementation of the Lasso. FUSE^R is further compared to its naive variant RAND^R, which uses the average feature score built from uniformly selected 20-feature subsets. For the sake of a fair comparison, all FS algorithms are combined with the same end learner, a Gaussian SVM (Collobert et al., 2002); its hyper-parameters are optimized using 5-fold CV on the training set unless otherwise specified.
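As an illustration of this evaluation protocol, the following sketch uses scikit-learn (rather than the Torch library of Collobert et al. (2002) used in the paper) to tune and assess the Gaussian-SVM end learner on a given feature subset with stratified 5-fold CV; the hyper-parameter grids and function names are assumptions, not the paper's settings.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Sketch of the end-learner evaluation: a Gaussian SVM whose hyper-parameters are
# tuned by 5-fold CV on the training folds, reported as a test error (1 - accuracy).
def evaluate_feature_subset(X, y, selected_features):
    Xs = X[:, selected_features]
    svm = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1.0]},
        cv=5,
    )
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    accuracy = cross_val_score(svm, Xs, y, cv=folds).mean()
    return 1.0 - accuracy
```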
7.2. Experimental Results

Some answers to the question of the performance of the FUSE algorithms are given in Fig. 2, which reports their generalization error comparatively to that of Gini-RF, CFS and the Lasso. The differences among the datasets are visible, as the Lasso is outperformed by the other approaches on Arcene and Madelon (which are not linear), whereas all algorithms perform similarly on the Colon dataset. On Arcene and Madelon, the comparison between FUSE, CFS and Gini-RF suggests that the FUSE algorithms take the best of both worlds. On the one hand, they detect features which are only relevant in combination with other features, like Gini-RF, and they improve on Gini-RF when few features are considered. On the other hand, they can filter out redundant features, like CFS, and they statistically significantly improve on CFS when considering many features. This good behavior is explained by the reward used in FUSE, which allows for a global assessment of feature subsets without making assumptions on the complexity of the underlying target concept. In the meanwhile, the assessment of many feature subsets is made possible because only a tiny subsample of the instance space is considered at each iteration.

[Figure 2. Performance versus the number of selected features: average test error over 50 splits of Madelon (left), Arcene (center) and Colon (right) for D-FUSE, D-FUSE^R, C-FUSE, C-FUSE^R, CFS, Gini-RF and Lasso. D-FUSE and C-FUSE are only visible at the beginning of the curves as they select from 5 to 15 features.]

Fig. 2 also illustrates the difference between FUSE and FUSE^R, and the comparative merits of the discrete (progressive-widening-based) and continuous (MoGo-like) heuristics used to control the exploration width. A first remark is that FUSE does not control the depth of the search tree in an efficient way; the best FUSE paths involve from 5 to 15 features. The filtering approach based on the RAVE score built by FUSE, FUSE^R, thus appears to be significantly more efficient than FUSE (although both algorithms almost always select the same top features, and thus coincide at the beginning of the curves reporting performance vs. the number of features). The discrete and continuous heuristics yield the same results at their best, i.e. when the $c$ and $c_l$ parameters of the continuous heuristics are well chosen. For this reason, the discrete heuristics is preferred as more robust: its single parameter $b$, controlling the branching factor depending on the number of visits of the node, was set to 2 in all experiments. These results are comparable to the state of the art in the NIPS FS challenge setting (using an additional separate test set): the best performance on Madelon, reached by (Shen et al., 2008), is 6.22%, vs 6.5% for FUSE^R.

Regarding the scalability question, some answers are graphically provided by Fig. 3, which reports the average test error versus the number of iterations (in log scale) on the Madelon dataset; the performances reported for D-FUSE^R, C-FUSE^R and RAND^R correspond to the best 20 features, for the sake of comparison with D-FUSE and C-FUSE. FUSE reaches its optimum after 10,000 iterations and thereafter stagnates, basically because the exploration does not allow for an in-depth exploration of the search tree and the size of the best path remains limited. Interestingly, FUSE^R converges more slowly than FUSE at the beginning, but it soon catches up and improves on FUSE after 10,000 iterations. In the meanwhile, FUSE^R is faster than RAND^R by an order of magnitude, i.e. it reaches the same level of performance with about 10 times fewer iterations.

[Figure 3. Average test error (a) and size of the D-FUSE and C-FUSE feature subsets (b) on Madelon, vs. the number N of iterations in log scale. D-FUSE^R and C-FUSE^R use the top 20 features after the RAVE score built from N iterations; RAND^R uses the top 20 features after the average score out of N uniformly selected 20-feature subsets.]

8. Discussion and Perspectives

The main merits of the presented work are i/ to formalize Feature Selection as a Reinforcement Learning problem aimed at minimizing the generalization error, and ii/ to use this formalization to revisit wrapper approaches, for these alone simultaneously account for feature interdependencies and filter out useless redundancies. Algorithmically, the FS combinatorial optimization problem is tackled by combining Monte-Carlo Tree Search with a fast approximation of the generalization error; several extensions of the UCT framework have been proposed to handle the specificities of the FS search space, resulting in the FUSE algorithms.

While a proof-of-principle experiment demonstrates the comparative performance of the approach, it raises several questions for further investigation. A main question is whether the proposed approach should be viewed as a wrapper or as a filtering FS approach. On the one hand, the RAVE vector summarizing the trials is a way of supporting a better-informed FUSE exploration, and thus a wrapper FS approach; on the other hand, FUSE can be viewed as a sophisticated way of building the RAVE score, supporting a filtering FS approach. Further research will extend the approach to multi-class problems and mixed (continuous and discrete) search spaces, combine FUSE with other end learners, and reconsider the instant reward criterion.
A longer-term perspective is to extend the presented approach to Feature Construction, by extending FUSE to explore a regular language as in (de Mesmay et al., 2009).

Acknowledgments

We thank Olivier Teytaud and Philippe Rolet for many fruitful discussions, and the anonymous reviewers for their suggestions and comments. We acknowledge the support of the PASCAL2 Network of Excellence, IST-2007-216886.

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine, A. J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. of the Nat. Ac. of Sc. of the USA, pp. 6745-6750, June 1999.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the Multiarmed Bandit problem. Mach. Learn., 47(2-3):235-256, 2002.

Bach, F. Exploring large feature spaces with Hierarchical Multiple Kernel Learning. In NIPS'08, pp. 105-112, 2008.

Bellman, R. Dynamic Programming. Princeton Univ. Press, 1957.

Boullé, M. Compression-based averaging of selective Naive Bayes classifiers. J. Mach. Learn. Res., 8:1659-1685, 2007.

Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. Classification and Regression Trees. Taylor & Francis, Inc., 1984.

Chan, A. B., Vasconcelos, N., and Lanckriet, G. R. G.
Direct convex relaxations of sparse SVM. In ICML'07, pp. 145-153, 2007.

Collobert, R., Bengio, S., and Mariéthoz, J. Torch: A modular machine learning software library. Technical report, IDIAP, 2002.

Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and Games, pp. 72-83, 2006.

de Mesmay, F., Rimmel, A., Voronenko, Y., and Püschel, M. Bandit-based optimization on graphs with application to library performance tuning. In ICML'09, pp. 729-736, 2009.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. Annals of Statistics, 32:407-499, 2004.

Fung, G. and Mangasarian, O. L. Data Selection for Support Vector Machine Classifiers. In KDD'00, pp. 64-70, 2000.

Gelly, S. and Silver, D. Combining online and offline knowledge in UCT. In ICML'07, pp. 273-280, 2007.

Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. Gene selection for cancer classification using Support Vector Machines. Mach. Learn., 46(1-3):389-422, 2002.

Guyon, I., Gunn, S. R., Ben-Hur, A., and Dror, G. Result analysis of the NIPS 2003 Feature Selection challenge. In NIPS'04, pp. 545-552, 2004.

Guyon, I., Li, J., Mader, T., Pletscher, P. A., Schneider, G., and Uhr, M. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recogn. Lett., 28(12):1438-1444, 2007.

Hall, M. A. Correlation-based Feature Selection for discrete and numeric class Machine Learning. In ICML'00, pp. 359-366, 2000.

Helleputte, T. and Dupont, P. Partially supervised Feature Selection with regularized linear models. In ICML'09, pp. 409-416, 2009.

Kira, K. and Rendell, L. A. A practical approach to feature selection. In ML'92, pp. 249-256, 1992.

Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In ECML'06, pp. 282-293, 2006.

Margaritis, D. Toward provably correct Feature Selection in arbitrary domains. In NIPS'09, pp. 1240-1248, 2009.

Rogers, J. and Gunn, S. R. Identifying feature relevance using a Random Forest. In SLSFS, pp. 173-184, 2005.

Rolet, P., Sebag, M., and Teytaud, O. Boosting Active Learning to optimality: a tractable Monte-Carlo, Billiard-based algorithm. In ECML'09, pp. 302-317, 2009.

Shen, K. Q., Ong, C. J., Li, X. P., and Wilder-Smith, E. P. V. Feature selection via sensitivity analysis of SVM probabilistic outputs. Mach. Learn., 70(1):1-20, 2008.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.

Varma, M. and Babu, B. R. More generality in efficient Multiple Kernel Learning. In ICML'09, pp. 1065-1072, 2009.

Wang, Y., Audibert, J.-Y., and Munos, R. Algorithms for infinitely Many-Armed Bandits. In NIPS'08, pp. 1729-1736, 2008.

Zhang, T. Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models. In NIPS'08, pp. 1921-1928, 2008a.

Zhang, T. Multi-stage Convex Relaxation for Learning with Sparse Regularization. In NIPS'08, pp. 1929-1936, 2008b.