Hierarchical Penalization

Marie Szafranski (1), Yves Grandvalet (1,2) and Pierre Morizet-Mahoudeaux (1)

(1) Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20529, 60205 Compiègne Cedex, France
(2) IDIAP, Rue du Simplon 4, Case Postale 592, CH-1920 Martigny, Switzerland
marie.szafranski@hds.utc.fr

Abstract

Hierarchical penalization is a generic framework for incorporating prior information in the fitting of statistical models, when the explicative variables are organized in a hierarchical structure. The penalizer is a convex functional that performs soft selection at the group level, and shrinks variables within each group. This favors solutions with few leading terms in the final combination. The framework, originally derived for taking prior knowledge into account, is shown to be useful in linear regression, when several parameters are used to model the influence of one feature, or in kernel regression, for learning multiple kernels.

Keywords: Optimization: constrained and convex optimization. Supervised learning: regression, kernel methods, sparsity and feature selection.

1 Introduction

In regression, we want to explain or to predict a response variable $y$ from a set of explanatory variables $x = (x_1, \dots, x_j, \dots, x_d)$, where $y \in \mathbb{R}$ and $x_j \in \mathbb{R}$ for all $j$. For this purpose, we use a model of the form $y = f(x) + \varepsilon$, where $f$ is a function able to characterize $y$ when $x$ is observed and $\varepsilon$ is a residual error. Supervised learning consists in estimating $f$ from the available training dataset $S = \{(x_i, y_i)\}_{i=1}^{n}$. It can be achieved in a predictive or a descriptive perspective: to predict accurate responses for future observations, or to exhibit the correlations that exist between the explanatory variables and the response variable, and thus give an interpretation to the model.

In the linear case, the function $f$ is defined by an estimate $\beta = (\beta_1, \dots, \beta_j, \dots, \beta_d)$ applied to $x$, that is to say $f(x) = \beta^{\mathsf T} x$. In a predictive perspective, $\beta^{\mathsf T} x$ produces an estimate of $y$ for any observation $x$. In a descriptive perspective, $|\beta_j|$ can be interpreted as a degree of relevance of variable $x_j$.

Ordinary Least Squares (OLS) minimizes the sum of squared residuals. When the explanatory variables are numerous and many of them are correlated, the variability of the OLS estimate tends to increase. This reduces prediction accuracy and makes the interpretation of the model tricky. Coefficient shrinkage is a major regularization approach for linear regression models. It overcomes the drawbacks described above by adding a constraint on the norm of the estimate $\beta$. Depending on the chosen norm, coefficients associated to variables with little predictive information may be shrunk, or even removed when variables are irrelevant. This latter case is referred to as variable selection. In particular, ridge regression shrinks coefficients with regard to the 2-norm, while the lasso (Least Absolute Shrinkage and Selection Operator) [1] and the lars (Least Angle Regression Stepwise) [2] both shrink and remove coefficients using the 1-norm.
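As a side illustration of the two shrinkage behaviours just mentioned (an example added here, not taken from the study itself), the following Python sketch fits ridge and lasso estimates on synthetic data containing several irrelevant variables; the data, the regularization levels and the use of scikit-learn are arbitrary choices made only for the illustration.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    n, d = 100, 8
    X = rng.randn(n, d)
    beta_true = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])  # last 5 variables are irrelevant
    y = X @ beta_true + 0.5 * rng.randn(n)

    ridge = Ridge(alpha=1.0).fit(X, y)   # 2-norm penalty: shrinks all coefficients
    lasso = Lasso(alpha=0.1).fit(X, y)   # 1-norm penalty: shrinks and removes coefficients

    print("ridge:", np.round(ridge.coef_, 3))
    print("lasso:", np.round(lasso.coef_, 3))

On such data, the ridge coefficients of the irrelevant variables are small but non-zero, whereas the lasso sets most of them exactly to zero; this selection behaviour is the starting point of the penalizers discussed below.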
Figure 1: left: toy example of the original structure of the variables; right: equivalent tree structure considered for the formalization of the scaling problem.

In some applications, explanatory variables that share a similar characteristic can be gathered into groups, or factors. Sometimes, they can be organized hierarchically. For instance, in genomics, where explanatory variables are (products of) genes, some factors can be identified from the prior information available in the hierarchies of Gene Ontology. Then, it becomes necessary to find methods that retain meaningful factors instead of individual variables.

Group-lasso and group-lars [3] can be considered as hierarchical penalization methods, with trees of height two defining the hierarchies. They perform variable selection by encouraging sparseness over predefined factors. These techniques seem perfectible, in the sense that hierarchies can be extended to more than two levels, and sparseness can be integrated within groups. This paper proposes a penalizer, derived from an adaptive penalization formulation [4], that highlights factors of interest by balancing constraints on each element, at each level of a hierarchy. It performs soft selection at the factor level, and shrinks variables within groups, to favor solutions with few leading terms.

Section 2 introduces the framework of hierarchical penalization, and the associated algorithm is presented in Section 3. Section 4 shows how this framework can be applied to linear and kernel regression. We conclude with an overview of future work.

2 Hierarchical Penalization

2.1 Formalization

We introduce hierarchical penalization by considering problems where the variables are organized in a tree structure of height two, such as the example displayed in figure 1. The nodes of height one are labelled in $\{1, \dots, K\}$. The set of children (that is, leaves) of node $k$ is denoted $J_k$ and its cardinality is $d_k$. As displayed on the right-hand side of figure 1, the branch stemming from the root and going to node $k$ is labelled by $\sigma_{1,k}$, and the branch reaching leaf $j$ is labelled by $\sigma_{2,j}$.

We consider the problem of minimizing a differentiable loss function $L(\cdot)$, subject to sparseness constraints on $\beta$ and on the subsets of $\beta$ defined in the tree hierarchy. This reads

$$\min_{\beta,\,\sigma_1,\,\sigma_2} \; L(\beta) + \lambda \sum_{k=1}^{K} \sum_{j \in J_k} \frac{\beta_j^2}{\sqrt{\sigma_{1,k}\,\sigma_{2,j}}} \quad (1a)$$

$$\text{subject to} \quad \sum_{k=1}^{K} d_k\,\sigma_{1,k} = 1 , \qquad \sum_{j=1}^{d} \sigma_{2,j} = 1 , \quad (1b)$$

$$\sigma_{1,k} \ge 0 \;\; k = 1,\dots,K , \qquad \sigma_{2,j} \ge 0 \;\; j = 1,\dots,d , \quad (1c)$$

where $\lambda > 0$ is a Lagrangian parameter that controls the amount of shrinkage, and $x/y$ is defined by continuation at zero as $x/0 = \infty$ if $x \neq 0$ and $0/0 = 0$.

The second term of expression (1a) penalizes $\beta$, according to the tree structure, via the scaling factors $\sigma_1$ and $\sigma_2$. The constraints (1b) shrink the coefficients at the group level and inside groups. In what follows, we show that problem (1) is convex and that this joint shrinkage encourages sparsity at the group level.

2.2 Two important properties

We first prove that the optimization problem (1) is tractable and moreover convex. Then, we show an equivalence with another optimization problem, which exhibits the exact nature of the constraints applied to the coefficients $\beta$.

Proposition 1 Provided $L(\cdot)$ is convex, problem (1) is convex.

Proof: A problem minimizing a convex criterion on a convex set is convex. Since $L(\cdot)$ is convex and $\lambda$ is positive, the criterion (1a) is convex provided $f(x, y, z) = x^2/\sqrt{yz}$ is convex. To show this, we compute the Hessian:

$$\nabla^2 f(x,y,z) = \frac{1}{4(yz)^{1/2}}
\begin{pmatrix}
8 & -4\frac{x}{y} & -4\frac{x}{z} \\
-4\frac{x}{y} & 3\frac{x^2}{y^2} & \frac{x^2}{yz} \\
-4\frac{x}{z} & \frac{x^2}{yz} & 3\frac{x^2}{z^2}
\end{pmatrix}
= \frac{1}{4(yz)^{1/2}}
\left[\,
2 \begin{pmatrix} 2 \\ -\frac{x}{y} \\ -\frac{x}{z} \end{pmatrix}
\begin{pmatrix} 2 \\ -\frac{x}{y} \\ -\frac{x}{z} \end{pmatrix}^{\!\mathsf T}
+ \begin{pmatrix} 0 \\ \frac{x}{y} \\ -\frac{x}{z} \end{pmatrix}
\begin{pmatrix} 0 \\ \frac{x}{y} \\ -\frac{x}{z} \end{pmatrix}^{\!\mathsf T}
\right] .$$

Hence, the Hessian is positive semi-definite, and criterion (1a) is convex. Next, constraints (1c) define half-spaces for $\sigma_1$ and $\sigma_2$, which are convex sets. Equality constraints (1b) define linear subspaces of dimension $K-1$ and $d-1$, which are also convex sets. The intersection of convex sets being a convex set, the constraints define a convex admissible set, and problem (1) is convex.
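As a numerical sanity check of this argument (an illustration added here, not part of the original proof), the sketch below evaluates the Hessian of $f(x, y, z) = x^2/\sqrt{yz}$ given above at random points with $y, z > 0$ and verifies that its eigenvalues are non-negative.

    import numpy as np

    def hessian_f(x, y, z):
        """Hessian of f(x, y, z) = x**2 / sqrt(y * z), for y, z > 0."""
        c = 1.0 / (4.0 * np.sqrt(y * z))
        return c * np.array([
            [8.0,           -4.0 * x / y,        -4.0 * x / z],
            [-4.0 * x / y,   3.0 * x**2 / y**2,   x**2 / (y * z)],
            [-4.0 * x / z,   x**2 / (y * z),      3.0 * x**2 / z**2],
        ])

    rng = np.random.RandomState(0)
    for _ in range(1000):
        x = rng.randn()
        y, z = rng.rand(2) + 0.5          # keep y and z strictly positive
        eigvals = np.linalg.eigvalsh(hessian_f(x, y, z))
        assert eigvals.min() >= -1e-10    # non-negative up to rounding errors
    print("Hessian numerically positive semi-definite at all sampled points")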
Proposition 2 Problem (1) is equivalent to

$$\min_{\beta} \; L(\beta) + \lambda \left( \sum_{k=1}^{K} d_k^{1/4} \Big( \sum_{j \in J_k} |\beta_j|^{4/3} \Big)^{3/4} \right)^{2} . \quad (2)$$

Sketch of proof: The Lagrangian of problem (1) is

$$\mathcal{L} = L(\beta) + \lambda \sum_{k=1}^{K} \sum_{j \in J_k} \frac{\beta_j^2}{\sqrt{\sigma_{1,k}\sigma_{2,j}}}
+ \nu_1 \Big( \sum_{k=1}^{K} d_k \sigma_{1,k} - 1 \Big)
+ \nu_2 \Big( \sum_{j=1}^{d} \sigma_{2,j} - 1 \Big)
- \sum_{k=1}^{K} \eta_{1,k} \sigma_{1,k}
- \sum_{j=1}^{d} \eta_{2,j} \sigma_{2,j} .$$

Hence, the optimality conditions for $\sigma_{1,k}$ and $\sigma_{2,j}$ are

$$\frac{\partial \mathcal{L}}{\partial \sigma_{1,k}} = -\frac{\lambda}{2}\, \sigma_{1,k}^{-3/2} \sum_{j \in J_k} \frac{\beta_j^2}{\sqrt{\sigma_{2,j}}} + \nu_1 d_k - \eta_{1,k} = 0 ,
\qquad
\frac{\partial \mathcal{L}}{\partial \sigma_{2,j}} = -\frac{\lambda}{2}\, \frac{\beta_j^2}{\sqrt{\sigma_{1,k}}}\, \sigma_{2,j}^{-3/2} + \nu_2 - \eta_{2,j} = 0 .$$

After some tedious algebra, the optimality conditions for $\sigma_{1,k}$ and $\sigma_{2,j}$ can be expressed as

$$\sigma_{1,k} = \frac{d_k^{-3/4}\, (s_k)^{3/4}}{\sum_{k=1}^{K} d_k^{1/4} (s_k)^{3/4}} ,
\qquad
\sigma_{2,j} = \frac{d_k^{1/4}\, |\beta_j|^{4/3}}{(s_k)^{1/4} \sum_{k=1}^{K} d_k^{1/4} (s_k)^{3/4}} \;\; \text{for } j \in J_k ,$$

where $s_k = \sum_{j \in J_k} |\beta_j|^{4/3}$. Plugging these conditions in criterion (1a) yields the claimed result.

2.3 Sparseness

Proposition 2 shows how the penalization influences the groups of variables and each variable in each group. Note that, thanks to the positivity of the squared term in (2), the expression can be further simplified to

$$\min_{\beta} \; L(\beta) + \lambda \sum_{k=1}^{K} d_k^{1/4} \Big( \sum_{j \in J_k} |\beta_j|^{4/3} \Big)^{3/4} , \quad (3)$$

where, for any $L(\beta)$, there is a one-to-one mapping from $\lambda$ in (2) to $\lambda$ in (3).

This expression can be interpreted as the Lagrangian formulation of a constrained optimization problem, where the admissible set for $\beta$ is defined by the term multiplying $\lambda$. We display the shape of this admissible set in figure 2, and compare it to ridge regression, which does not favor sparsity, to the lasso, which encourages sparsity for all variables but does not take into account the group structure, and to the group-lasso, which is invariant to rotations of within-group variables. One sees that hierarchical penalization combines some features of the lasso and the group-lasso.

Figure 2: Admissible sets for various penalties: ridge regression ($\beta_1^2+\beta_2^2+\beta_3^2 \le 1$), lasso ($|\beta_1|+|\beta_2|+|\beta_3| \le 1$), group-lasso ($\sqrt{2}\,(\beta_1^2+\beta_2^2)^{1/2}+|\beta_3| \le 1$) and hierarchical penalization ($2^{1/4}(|\beta_1|^{4/3}+|\beta_2|^{4/3})^{3/4}+|\beta_3| \le 1$). The two horizontal axes span the $(\beta_1, \beta_2)$ plane (first group) and the vertical axis is for $\beta_3$ (second group).

By looking at the curvature of these sets where they meet the axes, one gets a good intuition of why ridge regression does not suppress variables, why the lasso does, why the group-lasso suppresses groups of variables but not within-group variables, and why hierarchical penalization should do both. This intuition is however not entirely correct for hierarchical penalization, because the boundary of the admissible set is differentiable in the within-group hyper-plane $(\beta_1, \beta_2)$ at $\beta_1 = 0$ and $\beta_2 = 0$. However, as its curvature is very high there, solutions with few leading terms in the within-group variables are encouraged.

To go beyond the hints provided by these figures, we detail here the optimality conditions for minimizing (3). The first-order optimality conditions are

1. for $\beta_j = 0$, $j \in J_k$ and $\sum_{j' \in J_k} |\beta_{j'}| = 0$: $\dfrac{\partial L(\beta)}{\partial \beta_j} + \lambda\, d_k^{1/4}\, v_j = 0$, where $v_j \in [-1, 1]$;

2. for $\beta_j = 0$, $j \in J_k$ and $\sum_{j' \in J_k} |\beta_{j'}| \neq 0$: $\dfrac{\partial L(\beta)}{\partial \beta_j} = 0$;

3. for $\beta_j \neq 0$, $j \in J_k$: $\dfrac{\partial L(\beta)}{\partial \beta_j} + \lambda\, d_k^{1/4}\, \mathrm{sign}(\beta_j) \Big( 1 + \sum_{j' \in J_k,\, j' \neq j} \dfrac{|\beta_{j'}|^{4/3}}{|\beta_j|^{4/3}} \Big)^{-1/4} = 0 .$

These equations signify respectively that

1. the variables belonging to groups that are estimated to be irrelevant are penalized with the highest strength, thus limiting the number of groups influencing the solution;

2. when a group has some non-zero relevance, all its variables enter the set of active variables provided they influence the fitting criterion;

3. however, the penalization strength increases very rapidly (as a smooth step function) for small values of $|\beta_j|$, thus limiting the number of $\beta_j$ with large magnitude.

Overall, hierarchical penalization is thus expected to provide solutions with few active groups and few leading variables within each group.
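To make Proposition 2 and the penalty of (3) concrete (a snippet added here for illustration; the group structure and the coefficients are arbitrary), the sketch below computes the closed-form scaling factors given in the sketch of proof for a toy $\beta$, checks the constraints (1b), and verifies that the penalization term of (1a) equals the squared penalty of (2).

    import numpy as np

    def hierarchical_penalty(beta, groups):
        """Penalty of (3): sum_k d_k^(1/4) * (sum_{j in J_k} |beta_j|^(4/3))^(3/4)."""
        return sum(len(J) ** 0.25 * np.sum(np.abs(beta[J]) ** (4.0 / 3.0)) ** 0.75
                   for J in groups)

    def optimal_scales(beta, groups):
        """Scaling factors sigma_1, sigma_2 given in the sketch of proof of Proposition 2."""
        s = np.array([np.sum(np.abs(beta[J]) ** (4.0 / 3.0)) for J in groups])
        d = np.array([len(J) for J in groups], dtype=float)
        D = np.sum(d ** 0.25 * s ** 0.75)
        sigma1 = d ** (-0.75) * s ** 0.75 / D
        sigma2 = np.zeros_like(beta)
        for k, J in enumerate(groups):
            sigma2[J] = d[k] ** 0.25 * np.abs(beta[J]) ** (4.0 / 3.0) / (s[k] ** 0.25 * D)
        return sigma1, sigma2

    groups = [np.array([0, 1, 2]), np.array([3, 4, 5, 6])]   # toy tree with two groups of leaves
    beta = np.array([0.5, -1.2, 0.1, 2.0, 0.4, -0.3, 0.7])   # all non-zero, to avoid the 0/0 convention

    sigma1, sigma2 = optimal_scales(beta, groups)
    d = np.array([len(J) for J in groups], dtype=float)
    print(np.dot(d, sigma1), np.sum(sigma2))                 # both constraints (1b) equal 1
    term_1a = sum(np.sum(beta[J] ** 2 / np.sqrt(sigma1[k] * sigma2[J]))
                  for k, J in enumerate(groups))
    print(term_1a, hierarchical_penalty(beta, groups) ** 2)  # identical, as claimed by Proposition 2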
3 Algorithm

To solve problem (3), we use an active set algorithm, based on the approach proposed by Osborne et al. [5] for the lasso. This algorithm iterates two phases. First, the optimization problem is solved with a sub-optimal set of active variables, that is, non-zero variables: we define $A = \{j \,|\, \beta_j \neq 0\}$, the current active set of variables, $\tilde\beta = \{\beta_j\}_{j \in A}$, the vector of coefficients associated to $A$, and $G_k = J_k \cap A$, the subset of coefficients associated to group $k$. Then, at each iteration, we solve the problem

$$\min_{\tilde\beta} \; \tilde L(\tilde\beta) = L(\tilde\beta) + \lambda \sum_{k=1}^{K} d_k^{1/4} \Big( \sum_{j \in G_k} |\tilde\beta_j|^{4/3} \Big)^{3/4} , \quad (4)$$

by alternating steps A and B described below. Second, the set of active variables is incrementally updated as detailed in steps C and D.

A Compute a candidate update from an admissible vector
The goal is to solve $\min_h \tilde L(\tilde\beta + h)$, where $\tilde\beta$ is the current estimate of the solution and $h \in \mathbb{R}^{|A|}$. The difficulties in solving (4) stem from the discontinuities of the derivative due to the absolute values. These difficulties are circumvented by replacing $|\tilde\beta_j + h_j|$ by $\mathrm{sign}(\tilde\beta_j)(\tilde\beta_j + h_j)$. This enables the use of powerful continuous optimizers based either on Newton, quasi-Newton or conjugate gradient methods, according to the size of the problem.

B Obtain a new admissible vector
Let $\gamma = \tilde\beta + h$. If, for all $j$, $\mathrm{sign}(\gamma_j) = \mathrm{sign}(\tilde\beta_j)$, then $\gamma$ is sign-feasible, and we go to step C. Otherwise:
B.1 Let $S$ be the set of indices $m$ such that $\mathrm{sign}(\gamma_m) \neq \mathrm{sign}(\tilde\beta_m)$. Let $\mu = \min_{m \in S} -\tilde\beta_m / h_m$, that is, $\mu$ is the largest step in direction $h$ such that $\mathrm{sign}(\tilde\beta_m + \mu h_m) = \mathrm{sign}(\tilde\beta_m)$, except for one variable, $\ell = \arg\min_{m \in S} -\tilde\beta_m / h_m$, for which $\tilde\beta_\ell + \mu h_\ell = 0$.
B.2 Set $\tilde\beta = \tilde\beta + \mu h$ and $\mathrm{sign}(\tilde\beta_\ell) = -\mathrm{sign}(\tilde\beta_\ell)$, and compute a new direction $h$ as in step A. If, for the new solution, $\mathrm{sign}(\tilde\beta_\ell + h_\ell) \neq \mathrm{sign}(\tilde\beta_\ell)$, then $\ell$ is removed from $A$. Go to step A.
B.3 Iterate step B until $\gamma$ is sign-feasible.

C Test optimality of $\tilde\beta$
If the appropriate optimality condition holds for all inactive variables ($j \notin A$):
C.1 for $j \in J_k$ where $\sum_{j' \in G_k} |\tilde\beta_{j'}| = 0$: $\Big| \dfrac{\partial L(\tilde\beta)}{\partial \beta_j} \Big| \le \lambda\, d_k^{1/4}$,
C.2 for $j \in J_k$ where $\sum_{j' \in G_k} |\tilde\beta_{j'}| \neq 0$: $\dfrac{\partial L(\tilde\beta)}{\partial \beta_j} = 0$,
then $\tilde\beta$ is a solution. Else, go to step D.

D Select the variable that enters the active set
D.1 Select the variable $\nu \notin A$ (that is, $\beta_\nu = 0$) that maximizes $d_k^{-1/4} \Big| \dfrac{\partial L(\tilde\beta)}{\partial \beta_\nu} \Big|$, where $k$ is the group of variable $\nu$.
D.2 Update the active set: $A \leftarrow A \cup \{\nu\}$, with initial vector $\tilde\beta = [\tilde\beta^{\mathsf T}, 0]^{\mathsf T}$, where the sign of the new zero component is $-\mathrm{sign}\Big(\dfrac{\partial L(\tilde\beta)}{\partial \beta_\nu}\Big)$.
D.3 Go to step A.

The algorithm is initialized with $A = \emptyset$, and the first variable is selected with the process described at step D.
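The active set procedure above is the algorithm we actually use; the following sketch, added for illustration only, is a simpler and less efficient way of fitting criterion (3). It minimizes a smoothed version of the penalized least-squares problem with a general-purpose optimizer; the smoothing constant, the synthetic data and the value of $\lambda$ are arbitrary choices for the example.

    import numpy as np
    from scipy.optimize import minimize

    def smoothed_criterion(beta, X, y, groups, lam, eps=1e-8):
        """Squared loss plus the penalty of (3), with |b| replaced by sqrt(b**2 + eps)."""
        loss = 0.5 * np.sum((y - X @ beta) ** 2)
        pen = 0.0
        for J in groups:
            abs_b = np.sqrt(beta[J] ** 2 + eps)
            pen += len(J) ** 0.25 * np.sum(abs_b ** (4.0 / 3.0)) ** 0.75
        return loss + lam * pen

    rng = np.random.RandomState(0)
    n, d = 200, 7
    groups = [np.array([0, 1, 2]), np.array([3, 4, 5, 6])]
    X = rng.randn(n, d)
    beta_true = np.array([2.0, -1.5, 0.5, 0.0, 0.0, 0.0, 0.0])   # second group is irrelevant
    y = X @ beta_true + 0.1 * rng.randn(n)

    res = minimize(smoothed_criterion, x0=np.zeros(d),
                   args=(X, y, groups, 10.0), method="L-BFGS-B")
    print(np.round(res.x, 3))   # coefficients of the irrelevant group are driven close to zero

Replacing $|\beta_j|$ by $\sqrt{\beta_j^2 + \mathrm{eps}}$ removes the non-differentiability at zero, so this surrogate only drives irrelevant coefficients close to zero instead of exactly to zero; the active set algorithm described above handles the exact non-smooth problem.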
4 Experiments

We illustrate on two datasets how hierarchical penalization can be useful in exploratory analysis and in prediction. Then, we show how the algorithm can be applied to multiple kernel learning in kernel regression.

4.1 Abalone Database

The Abalone problem [6] consists in predicting the age of abalone from physical measurements. The dataset is composed of 8 attributes. One concerns the sex of the abalone, and has been encoded with dummy variables, that is, $x^{\mathrm{sex}}_i = (1\,0\,0)$ for male, $x^{\mathrm{sex}}_i = (0\,1\,0)$ for female, or $x^{\mathrm{sex}}_i = (0\,0\,1)$ for infant. This variable defines the first group. The second group is composed of the 3 attributes concerning size parameters (length, diameter and height), and the last group is composed of the weight parameters (whole, shucked, viscera and shell weight).

We randomly selected 2920 examples for training, including the tuning of $\lambda$ by 10-fold cross-validation, and left the remaining 1257 for testing. The mean squared test error (4.3) is on par with the lasso. The coefficients estimated on the training set are reported in table 1. Weight parameters are a main contributor to the estimation of the age of an abalone, while sex is not essential, except for infant.

    group     coefficients                         penalty
    sex        0.051    0.036   -0.360              0.516
    size      -0.044    1.134    0.358              1.7405
    weight     4.370   -4.499   -1.110    1.399    11.989

Table 1: Coefficients obtained on the Abalone dataset. The last column reports the value $d_k^{1/4} \big( \sum_{j \in J_k} |\beta_j|^{4/3} \big)^{3/4}$ for each group.

4.2 Delve Census Database

The Delve Census problem [7] consists in predicting the median price of a house in different survey regions. Each of the 22732 survey regions is represented by 134 demographic measurements. Several prototypes are available; we focused on the prototype "house-price-16L", composed of 16 variables. We derived this prototype by including all the other variables related to these 16 variables. The final dataset is then composed of 37 variables, split up into 10 groups (a description of the dataset is available at http://www.hds.utc.fr/~mszafran/nips07/).

We randomly selected 8000 observations for training and left the 14732 others for testing. We divided the training observations into 10 distinct datasets. For each dataset, the parameter $\lambda$ was selected by 10-fold cross-validation, and the mean squared error was computed on the testing set. Table 2 reports the mean squared test errors obtained with the hierarchical penalization (hp), group-lasso (gl) and lasso estimates.

    Dataset             1      2      3      4      5      6      7      8      9     10   mean error
    hp    (×10^9)     2.363  2.745  2.289  4.481  2.211  2.364  2.460  2.298  2.461  2.286    2.596
    gl    (×10^9)     2.463  2.849  2.285  4.206  2.230  2.363  2.464  2.308  2.458  2.289    2.591
    lasso (×10^9)     2.380  2.716  2.293  4.656  2.216  2.368  2.490  2.295  2.483  2.288    2.618

Table 2: Mean squared test errors obtained with the different methods on the 10 datasets.

Hierarchical penalization performs better than the lasso on 8 datasets, and better than the group-lasso on 7 datasets. However, the lowest overall mean error is achieved by the group-lasso.

4.3 Multiple Kernel Learning

Multiple kernel learning has drawn much interest in classification with support vector machines (SVMs), starting from the work of Lanckriet et al. [8]. The problem consists in learning a convex combination of kernels within the SVM optimization algorithm. Here, we show that hierarchical penalization is well suited to this purpose for other kernel predictors, and we illustrate its effect on kernel smoothing in the regression setup.

Kernel smoothing has been studied in nonparametric statistics since the 60's [9]. Here, we consider the model where the response variable $y$ is estimated by a sum of kernel functions:

$$y_i = \sum_{j=1}^{n} \beta_j K_h(x_i, x_j) + \varepsilon_i ,$$

where $K_h$ is the kernel with scale factor (or bandwidth) $h$, and $\varepsilon_i$ is a residual error. For the purpose of combining $m$ bandwidths, the general criterion (3) reads

$$\min_{\{\beta_k\}_{k=1}^{m}} \; \sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{m} \sum_{j=1}^{n} \beta_{k,j}\, K_{h_k}(x_i, x_j) \Big)^{2} + \lambda \sum_{k=1}^{m} n^{1/4} \Big( \sum_{j=1}^{n} |\beta_{k,j}|^{4/3} \Big)^{3/4} . \quad (5)$$

The penalized model (5) has been applied to the motorcycle dataset [9]. This one-dimensional problem makes it possible to display the contribution of each bandwidth to the solution. We used Gaussian kernels, with 7 bandwidths ranging from $10^{-1}$ to $10^{2}$.
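To make the setup of criterion (5) explicit (an illustrative sketch; the data and bandwidth grid below are placeholders standing in for the motorcycle dataset), the following snippet builds one block of Gaussian kernel columns per bandwidth, so that each bandwidth forms one group of the hierarchical penalty with $d_k = n$.

    import numpy as np

    def gaussian_kernel(x, centers, h):
        """Gaussian kernel K_h(x_i, x_j) = exp(-(x_i - x_j)**2 / (2 * h**2)) for 1-D inputs."""
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * h ** 2))

    rng = np.random.RandomState(0)
    n = 60
    x = np.sort(rng.uniform(0, 60, n))            # placeholder for the time axis
    y = np.sin(x / 8.0) + 0.2 * rng.randn(n)      # placeholder for the response

    bandwidths = np.logspace(-1, 2, 7)            # 7 bandwidths from 1e-1 to 1e2, as in the text
    blocks = [gaussian_kernel(x, x, h) for h in bandwidths]
    Phi = np.hstack(blocks)                       # n x (m * n) design matrix of criterion (5)
    groups = [np.arange(k * n, (k + 1) * n) for k in range(len(bandwidths))]
    # Each group (one bandwidth) contains the n kernels centered on the training points.
    print(Phi.shape, [len(J) for J in groups])

The design matrix Phi and the groups list can then be handed to any solver for (3), for instance the smoothed sketch given after Section 3.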
Figure 3 displays the results obtained for different penalization parameters: the estimated function obtained by the combination of the selected bandwidths, and the contribution of each bandwidth to the model. We display three settings of the penalization parameter $\lambda$, corresponding to slight overfitting, good fit and slight under-fitting. The coefficients of bandwidths $h_2$, $h_6$ and $h_7$ were always null and are thus not displayed. As expected, when the penalization parameter increases, the fit becomes smoother, and the number of contributing bandwidths decreases. We also observe that the effective contribution of some bandwidths is limited to a few kernels: there are few leading terms in the expansion.

Figure 3: Hierarchical penalization applied to kernel smoothing on the motorcycle data, for $\lambda = 10$, $\lambda = 25$ and $\lambda = 50$; panels: $h_5 = 10^2$, $h_4 = 10$, $h_3 = 1$, $h_1 = 10^{-1}$, combined. Combined: the points represent the data and the solid line the estimated response function. Isolated bandwidths: the points represent partial residuals and the solid line represents the contribution of the bandwidth to the model.

5 Conclusion and further works

Hierarchical penalization is a generic framework enabling usual statistical models to process hierarchically structured variables. The structure is provided to the model via constraints on the subgroups of variables defined at each level of the hierarchy. The fitted model is then biased towards statistical explanations that are "simple" with respect to this structure, that is, solutions that involve a small number of groups of variables, with a few leading components.

In this paper, we detailed the general framework of hierarchical penalization for tree structures of height two, and discussed its specific properties in terms of convexity and parsimony. Then, we proposed an efficient active set algorithm that incrementally builds an optimal solution to the problem. We illustrated how the approach can be used when features form predefined groups, or when discrete variables, encoded by several binary variables, result in groups of variables. Finally, we also showed how the algorithm can be used to learn from multiple kernels in regression.

We are now performing quantitative empirical evaluations, with applications to regression, classification and clustering, and comparisons with other regularization schemes, such as the group-lasso. We then plan to extend the formalization to hierarchies of arbitrary height, whose properties are currently under study. We will then be able to tackle new applications, such as genomics, where the available gene ontologies are hierarchical structures that can be faithfully approximated by trees.

References

[1] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[2] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

[3] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

[4] Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

[5] M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, June 2000.

[6] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
[7] Delve: Data for evaluating learning in valid experiments. URL http://www.cs.toronto.edu/~delve/.

[8] G. Lanckriet, T. De Bie, N. Cristianini, M. Jordan, and W. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626–2635, 2004.

[9] W. Härdle. Applied Nonparametric Regression, volume 19 of Econometric Society Monographs, 1990.