Efficient multiple hyperparameter learning for log-linear models

Chuong B. Do    Chuan-Sheng Foo    Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305
{chuongdo,csfoo,ang}@cs.stanford.edu

Abstract

Using multiple regularization hyperparameters is an effective method for managing model complexity in problems where input features have varying amounts of noise. While algorithms for choosing multiple hyperparameters are often used in neural networks and support vector machines, they are not common in structured prediction tasks, such as sequence labeling or parsing. In this paper, we consider the problem of learning regularization hyperparameters for log-linear models, a class of probabilistic models for structured prediction tasks which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning provides a significant boost in accuracy compared to models learned using only a single regularization hyperparameter.

1 Introduction

In many supervised learning methods, overfitting is controlled through the use of regularization penalties for limiting model complexity. The effectiveness of penalty-based regularization for a given learning task depends not only on the type of regularization penalty used (e.g., L1 vs. L2) [29] but also, and perhaps even more importantly, on the choice of hyperparameters governing the regularization penalty (e.g., the coefficient on ||w||^2 arising from an isotropic Gaussian parameter prior). When only a single hyperparameter must be tuned, cross-validation provides a simple yet reliable procedure for hyperparameter selection. For example, the regularization hyperparameter C in a support vector machine (SVM) is usually tuned by training the SVM with several different values of C and selecting the one that achieves the best performance on a holdout set. In many situations, using multiple hyperparameters gives the distinct advantage of allowing models with features of varying strength; for instance, in a natural language processing (NLP) task, features based on word bigrams are typically noisier than those based on individual word occurrences, and hence should be "more regularized" to prevent overfitting. Unfortunately, for sophisticated models with multiple hyperparameters [23], the naive "grid search" strategy of trying out many possible combinations of hyperparameter settings quickly grows infeasible as the number of hyperparameters becomes large.

Scalable strategies for cross-validation-based hyperparameter learning that rely on computing the gradient of the cross-validation loss with respect to the desired hyperparameters arose first in the neural network modeling community [20, 21, 1, 12]. More recently, similar cross-validation optimization techniques have been proposed for other supervised learning models [3], including support vector machines [4, 10, 16], Gaussian processes [35, 33], and related kernel learning methods [18, 17, 39]. Here, we consider the problem of hyperparameter learning for a specialized class of structured classification models known as conditional log-linear models (CLLMs), a generalization of conditional random fields (CRFs) [19].
Whereas standard binary classification involves mapping an object x ∈ X to some binary output y ∈ Y (where Y = {±1}), the input space X and output space Y in a structured classification task generally contain complex combinatorial objects (such as sequences, trees, or matchings). Designing hyperparameter learning algorithms for structured classification models thus yields a number of unique computational challenges not normally encountered in the flat classification setting. In this paper, we derive a gradient-based approach for optimizing the hyperparameters of a CLLM using the loss incurred on a holdout set. We describe the required algorithms specific to CLLMs which make the needed computations tractable. Finally, we demonstrate on both simulations and a real-world computational biology task that our hyperparameter learning method can give gains over learning flat unstructured regularization priors.

2 Preliminaries

Conditional log-linear models (CLLMs) are a probabilistic framework for sequence labeling or parsing problems, where X is an exponentially large space of possible input sequences and Y is an exponentially large space of candidate label sequences or parse trees.¹ Let F : X × Y → R^n be a fixed vector-valued mapping from input-output pairs to an n-dimensional feature space. CLLMs model the conditional probability of y given x as P_w(y | x) = exp(w^T F(x, y)) / Z(x), where Z(x) = Σ_{y' ∈ Y} exp(w^T F(x, y')). Given a training set T = {(x^(i), y^(i))}_{i=1}^m of i.i.d. labeled input-output pairs drawn from some unknown fixed distribution D over X × Y, the parameter learning problem is typically posed as maximum a posteriori (MAP) estimation (or, equivalently, regularized logloss minimization):

    w* = argmin_{w ∈ R^n} [ (1/2) w^T C w − Σ_{i=1}^m log P_w(y^(i) | x^(i)) ]          (OPT1)

where (1/2) w^T C w (for some prespecified positive definite matrix C) is a regularization penalty used to prevent overfitting. Here, C can be interpreted as the inverse covariance matrix of a Gaussian prior on the parameters w. Generally, C is parameterized using a small number of free variables, known as the hyperparameters of the model.

While a number of efficient procedures exist for solving the optimization problem OPT1 [34, 11], little attention is usually given to choosing an appropriate regularization matrix C. Given a holdout set H = {(x̃^(i), ỹ^(i))}_{i=1}^{m̃} of i.i.d. examples drawn from D, hyperparameter learning itself can be cast as an optimization problem:

    minimize_{C ≻ 0}   − Σ_{i=1}^{m̃} log P_{w*}(ỹ^(i) | x̃^(i))          (OPT2)

For many real-world applications, C is assumed to take a simple form, such as a scaled identity matrix, C = cI. While this parameterization may be partially motivated by concerns of hyperparameter overfitting [28], such a choice usually stems from the difficulty of hyperparameter inference. In practice, "grid-search" procedures provide a reliable method for determining hyperparameters to low precision: one trains the model using several candidate values of c (e.g., c ∈ {..., 2^-2, 2^-1, 2^0, 2^1, 2^2, ...}) and chooses the value that minimizes holdout logloss. While this strategy is suitable for tuning a single model hyperparameter, in the more general setting of OPT2 where multiple hyperparameters must be learned, more sophisticated strategies are necessary.

¹ Conditional random fields (CRFs) correspond to the case in which the factorization of this probability distribution has an explicit graphical model structure (which is generally the case for sequence labeling problems, but not parsing tasks).
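To make the single-hyperparameter baseline concrete, the following sketch (not from the paper) grid-searches the scalar c for an L2-regularized logistic regression model, used here as a flat stand-in for a CLLM, and selects the value that minimizes holdout logloss. The synthetic dataset and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Illustrative synthetic data standing in for a structured training/holdout split.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 40)), rng.integers(0, 2, size=200)
X_hold, y_hold = rng.normal(size=(100, 40)), rng.integers(0, 2, size=100)

best_c, best_loss = None, np.inf
for c in [2.0 ** k for k in range(-5, 6)]:                 # candidate values of c
    # scikit-learn's C is the *inverse* regularization strength, i.e. C = 1/c here.
    model = LogisticRegression(C=1.0 / c, max_iter=1000).fit(X_train, y_train)
    loss = log_loss(y_hold, model.predict_proba(X_hold))   # holdout logloss
    if loss < best_loss:
        best_c, best_loss = c, loss

print(f"selected c = {best_c}, holdout logloss = {best_loss:.3f}")
```

Each candidate requires retraining the model from scratch, which is exactly why this strategy does not scale once several hyperparameters must be tuned jointly.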
3 Learning multiple hyperparameters

In this section, we lay the framework for multiple hyperparameter learning by describing a simple yet flexible parameterization of C that arises quite naturally in many practical problems. We then describe a generic strategy for hyperparameter adaptation based on gradient-based optimization.

Consider a setting in which predefined subsets of parameter components (which we call regularization groups) are constrained to use the same hyperparameters [6]. For instance, in a typical NLP task, word occurrence features may be placed in a separate regularization group from word bigram features. Formally, let k be a fixed number of regularization groups, and let π : {1, ..., n} → {1, ..., k} be a prespecified mapping from a parameter component to its corresponding regularization group. Furthermore, for a vector x ∈ R^k, define its expansion x̄ ∈ R^n as x̄ = (x_{π(1)}, x_{π(2)}, ..., x_{π(n)}). In the sequel, we parameterize C ∈ R^{n×n} in terms of some hyperparameter vector d ∈ R^k as the diagonal matrix C = diag(exp(d̄)). Under this representation, C is necessarily positive definite, so OPT2 can be written as an unconstrained minimization over the variables d ∈ R^k. Specifically, if T(w) = − Σ_{i=1}^m log P_w(y^(i) | x^(i)) denotes the training logloss and H(w) = − Σ_{i=1}^{m̃} log P_w(ỹ^(i) | x̃^(i)) the holdout logloss for a parameter vector w, then we have the optimization problem

    minimize_{d ∈ R^k}  H(w*)    subject to    w* = argmin_{w ∈ R^n} [ (1/2) w^T C w + T(w) ]          (OPT2')

For any fixed setting of the hyperparameters, the objective function of OPT2' can be evaluated by (1) using the hyperparameters d to determine the regularization matrix C, (2) solving OPT1 using C to determine w*, and (3) computing the holdout logloss using the parameters w*. In the next section, we derive a method for computing the gradient of the objective function of OPT2' with respect to the hyperparameters. Given procedures for both function and gradient evaluation, we may apply standard gradient-based optimization (e.g., conjugate gradient or L-BFGS [30]) in order to find a local optimum of the objective. In general, we observe that only a few iterations (≈ 5) are usually sufficient to determine reasonable hyperparameters to low accuracy.
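As a concrete illustration of this parameterization (a sketch, not the authors' code), the following shows how the group-to-parameter expansion d̄ and the diagonal regularizer C = diag(exp(d̄)) can be formed, and how the OPT2' objective would be evaluated for a fixed d. The callables `train_map`, `holdout_logloss`, and `opt2_gradient`, and the example group map, are assumed placeholders.

```python
import numpy as np

def expand(d, pi):
    """Expand group-level hyperparameters d (length k) to parameter level,
    using the group-index map pi (length n, entries in 0..k-1)."""
    return d[pi]

def opt2_objective(d, pi, train_map, holdout_logloss):
    """Evaluate the OPT2' objective for a fixed d: holdout logloss at the
    MAP parameters w* trained with regularizer C = diag(exp(d_bar))."""
    c_diag = np.exp(expand(d, pi))     # diagonal of C, positive by construction
    w_star = train_map(c_diag)         # assumed black box: solves OPT1 given this C
    return holdout_logloss(w_star)     # assumed black box: -sum of holdout log-probs

# Example group map: parameters 0-9 share group 0 (e.g., transition features),
# 10-29 share group 1 (word features), 30-49 share group 2 (bigram features).
pi = np.array([0] * 10 + [1] * 20 + [2] * 20)

# The outer loop can hand this objective, together with the gradient derived in
# Section 4 (here called opt2_gradient), to a standard optimizer, e.g.:
#   from scipy.optimize import minimize
#   minimize(opt2_objective, x0=np.zeros(3), jac=opt2_gradient,
#            args=(pi, train_map, holdout_logloss), method="L-BFGS-B")
```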
4 The hyperparameter gradient

Note that the optimization objective H(w*) is a function of w*. In turn, w* is a function of the hyperparameters d, as implicitly defined by the gradient stationarity condition Cw* + ∇_w T(w*) = 0. To compute the hyperparameter gradient, we will use both of these facts.

4.1 Deriving the hyperparameter gradient

First, we apply the chain rule to the objective function of OPT2' to obtain

    ∇_d H(w*) = J_d^T ∇_w H(w*)          (1)

where J_d is the n × k Jacobian matrix whose (i, j)th entry is ∂w*_i / ∂d_j. The term ∇_w H(w*) is simply the gradient of the holdout logloss evaluated at w*. For decomposable models, this may be computed exactly via dynamic programming (e.g., the forward/backward algorithm for chain-structured models, or the inside/outside algorithm for grammar-based models).

Next, we show how to compute the Jacobian matrix J_d. Recall that at the optimum of the smooth unconstrained optimization problem OPT1, the partial derivative of the objective with respect to any parameter must vanish. In particular, the partial derivative of (1/2) w^T C w + T(w) with respect to w_i vanishes when w = w*, so

    0 = C_i^T w* + (∂/∂w_i) T(w*)          (2)

where C_i^T denotes the ith row of the C matrix. Since (2) uniquely defines w* (as OPT1 is a strictly convex optimization problem), we can use implicit differentiation to obtain the needed partial derivatives. Specifically, we differentiate both sides of (2) with respect to d_j to obtain

    0 = Σ_{p=1}^n [ (∂C_ip/∂d_j) w*_p + C_ip (∂w*_p/∂d_j) ] + Σ_{p=1}^n (∂²T(w*)/∂w_p ∂w_i) (∂w*_p/∂d_j)          (3)

      = I{π(i)=j} w*_i exp(d_j) + Σ_{p=1}^n [ C_ip + ∂²T(w*)/∂w_p ∂w_i ] (∂w*_p/∂d_j).          (4)

Stacking (4) for all i ∈ {1, ..., n} and j ∈ {1, ..., k}, we obtain the equivalent matrix equation

    0 = B + (C + ∇²_w T(w*)) J_d          (5)

where B is the n × k matrix whose (i, j)th element is I{π(i)=j} w*_i exp(d_j), and ∇²_w T(w*) is the Hessian of the training logloss evaluated at w*. Finally, solving these equations for J_d, we obtain

    J_d = −(C + ∇²_w T(w*))^{-1} B.          (6)

Algorithm 1: Gradient computation for hyperparameter selection.
  Input:  training set T = {(x^(i), y^(i))}_{i=1}^m, holdout set H = {(x̃^(i), ỹ^(i))}_{i=1}^{m̃},
          current hyperparameters d ∈ R^k
  Output: hyperparameter gradient ∇_d H(w*)
  1. Compute the solution w* of OPT1 using the regularization matrix C = diag(exp(d̄)).
  2. Form the matrix B ∈ R^{n×k} such that (B)_{ij} = I{π(i)=j} w*_i exp(d_j).
  3. Use the conjugate gradient algorithm to solve the linear system (C + ∇²_w T(w*)) x = ∇_w H(w*).
  4. Return −B^T x.

Figure 1: Pseudocode for gradient computation.
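A compact rendering of Algorithm 1 in Python (a sketch under stated assumptions, not the authors' implementation) might look as follows. Here `solve_opt1`, `holdout_grad`, and `hess_vec` are assumed black-box routines that solve OPT1, compute ∇_w H(w*), and form the Hessian-vector product ∇²_w T(w*)·v; Section 4.2 below discusses how such products can be obtained.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hyperparameter_gradient(d, pi, solve_opt1, holdout_grad, hess_vec):
    """Algorithm 1: gradient of the holdout logloss with respect to d.

    d            -- current hyperparameters, shape (k,)
    pi           -- group index of each parameter, shape (n,), entries in 0..k-1
    solve_opt1   -- assumed: c_diag -> w_star (MAP parameters for C = diag(c_diag))
    holdout_grad -- assumed: w_star -> gradient of holdout logloss at w_star, shape (n,)
    hess_vec     -- assumed: (w_star, v) -> Hessian of training logloss times v
    """
    c_diag = np.exp(d[pi])                       # step 1: C = diag(exp(d_bar))
    w_star = solve_opt1(c_diag)
    n, k = w_star.size, d.size

    # Step 2: B[i, j] = I{pi(i) = j} * w*_i * exp(d_j); one nonzero per row,
    # and exp(d_{pi(i)}) is exactly c_diag[i].
    B = np.zeros((n, k))
    B[np.arange(n), pi] = w_star * c_diag

    # Step 3: solve (C + Hessian) x = grad_holdout with matrix-free conjugate gradient.
    A = LinearOperator((n, n), matvec=lambda v: c_diag * v + hess_vec(w_star, v),
                       dtype=np.float64)
    x, info = cg(A, holdout_grad(w_star))

    # Step 4: the hyperparameter gradient.
    return -B.T @ x
```

The key design choice is that the (C + ∇²_w T(w*)) matrix never needs to be formed explicitly; only its action on a vector is required.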
4.2 Computing the hyperparameter gradient efficiently

In principle, one could simply use (6) to obtain the Jacobian matrix J_d directly. However, computing the n × n matrix (C + ∇²_w T(w*))^{-1} is difficult. Computing the Hessian matrix ∇²_w T(w*) in a typical conditional random field requires approximately n times the cost of a single logloss gradient evaluation, and once the Hessian has been computed, typical matrix inversion routines take O(n³) time. Even more problematic, the Θ(n²) memory needed to store the Hessian is prohibitive, as typical log-linear models (e.g., in NLP) may have thousands or even millions of features. To deal with these problems, we first explain why (C + ∇²_w T(w*))v can be computed in O(n) time for any arbitrary vector v ∈ R^n, even though forming (C + ∇²_w T(w*))^{-1} is expensive. Using this result, we then describe an efficient procedure for computing the holdout hyperparameter gradient which avoids the expensive Hessian computation and inversion steps of the direct method.

First, since C is diagonal, the product of C with any arbitrary vector v is trivially computable in O(n) time. Second, although direct computation of the Hessian is inefficient in a generic log-linear model, computing the product of the Hessian with v can be accomplished by any of the following three methods, in order of increasing implementation effort (and numerical precision):

1. We can approximate the Hessian-vector product via finite differences, based on the identity

       ∇²_w T(w*) · v = lim_{r→0} [ ∇_w T(w* + r v) − ∇_w T(w*) ] / r.          (7)

2. In the complex-step derivative method [24], we extend the domain of our analytically differentiable training loss function T(·) to the complex space C^n and apply the identity

       ∇²_w T(w*) · v = lim_{r→0} Im{ ∇_w T(w* + i · r v) } / r,          (8)

   where Im{·} denotes the imaginary part of its complex argument (in this case, a vector). Because there is no subtraction in the numerator of the right-hand expression, the complex-step derivative does not suffer from the numerical problems of the finite-differencing method that result from cancellation. As a consequence, much smaller step sizes can be used, allowing for greater accuracy.

3. Finally, Pearlmutter [31] provided an exact method for efficiently computing Hessian-vector products in O(n) time and space, given an O(n) algorithm for analytic gradient computation. In particular, Pearlmutter defined the differential operator

       R_v{f(w)} = lim_{r→0} [ f(w + r v) − f(w) ] / r = (∂/∂r) f(w + r v) |_{r=0},          (9)

   for which one can verify that R_v{∇_w T(w*)} = ∇²_w T(w*) · v. The standard rules for differential operators give

       R_v{c} = 0                                  R_v{w} = v
       R_v{c f(w) + d g(w)} = c R_v{f(w)} + d R_v{g(w)}
       R_v{f(w) g(w)} = R_v{f(w)} g(w) + f(w) R_v{g(w)}
       R_v{f(w)/g(w)} = [ R_v{f(w)} g(w) − f(w) R_v{g(w)} ] / g(w)²
       R_v{e^{f(w)}} = e^{f(w)} R_v{f(w)}

   We can apply these rules to an existing gradient computation routine to compute R_v{∇_w T(w*)}: for each entry z of some dynamic programming table in the gradient procedure, we maintain an auxiliary entry containing R_v{z}; at the end of the gradient computation, R_v{∇_w T(w*)} contains the desired Hessian-vector product. Hessian-vector products for graphical models were previously used in the context of online training [36].

In our experiments, we found that simple finite differencing provided sufficient accuracy for our application.

To exploit the fact that matrix-vector products are efficiently computable, we use the conjugate gradient (CG) method to solve the matrix equation (5) for J_d. Unlike direct methods for solving linear systems Ax = b, CG is an iterative method which relies on the matrix A only through matrix-vector products Av. In practice, few steps of the CG algorithm are generally needed to find an approximate solution of a linear system with acceptable accuracy. Using CG in this way amounts to solving k linear systems, one for each column of the J_d matrix; unlike the direct method of forming the matrix (C + ∇²_w T(w*)) and its inverse, solving the linear systems avoids the Θ(n²) cost of Hessian computation and matrix inversion.

Nevertheless, even this approach still requires the solution of multiple linear systems, which scales poorly when the number of hyperparameters k is large. We can do much better by reorganizing the computations in such a way that the Jacobian matrix J_d is never explicitly required. In particular, substituting (6) into (1),

    ∇_d H(w*) = −B^T (C + ∇²_w T(w*))^{-1} ∇_w H(w*),          (10)

we observe that it suffices to solve the single linear system

    (C + ∇²_w T(w*)) x = ∇_w H(w*)          (11)

and then form ∇_d H(w*) = −B^T x. By organizing the computations this way, the number of linear systems that must be solved is substantially reduced, from k to only one. A similar trick was previously used for hyperparameter adaptation in SVMs [16] and kernel logistic regression [33].
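As an illustrative sketch of options 1 and 2 above (not the paper's implementation), the following builds finite-difference and complex-step Hessian-vector products from a generic gradient routine. The toy regularized logistic loss and all variable names here are assumptions used only so the two approximations can be checked against each other.

```python
import numpy as np

def grad_toy_loss(w, X, c):
    """Gradient of a toy 'training logloss' stand-in:
    T(w) = sum_i log(1 + exp(-x_i . w)) + 0.5 * c * ||w||^2.
    Written so that it also accepts complex w (needed for the complex step)."""
    z = X @ w
    return -X.T @ (1.0 / (1.0 + np.exp(z))) + c * w

def hvp_finite_diff(grad, w, v, r=1e-6):
    # Option 1: (grad(w + r v) - grad(w)) / r
    return (grad(w + r * v) - grad(w)) / r

def hvp_complex_step(grad, w, v, r=1e-20):
    # Option 2: Im{grad(w + i r v)} / r; no cancellation, so r can be tiny.
    return np.imag(grad(w + 1j * r * v)) / r

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
w, v = rng.normal(size=8), rng.normal(size=8)
g = lambda u: grad_toy_loss(u, X, c=0.1)

# The two Hessian-vector products should agree closely.
print(np.allclose(hvp_finite_diff(g, w, v), hvp_complex_step(g, w, v), atol=1e-3))
```

Either routine can serve as the `hess_vec` callable used by the Algorithm 1 sketch above; the paper reports that plain finite differencing was accurate enough in its experiments.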
Figure 1 shows a summary of our algorithm for hyperparameter gradient computation.²

5 Experiments

To test the effectiveness of our hyperparameter learning algorithm, we applied it to two tasks: a simulated sequence labeling task involving noisy features, and a real-world application of conditional log-linear models to the biological problem of RNA secondary structure prediction.

Sequence labeling simulation. For our simulation test, we constructed a simple linear-chain hidden Markov model (HMM) with binary-valued hidden nodes, y_i ∈ {0, 1}.³ We associated 40 binary-valued features x_j, j ∈ {1, ..., 40}, with each hidden state y_i, including R "relevant" observed features whose values were chosen based on y_i, and (40 − R) "irrelevant" noise features whose values were chosen to be either 0 or 1 with equal probability, independent of y_i.⁴ Figure 2a shows the graphical model representing the HMM. For each run, we used the HMM to simulate training, holdout, and testing sets of M, 10, and 1000 sequences, respectively, each of length 10.

Figure 2: HMM simulation experiments. (a) State diagram of the HMM used in the simulations: hidden states y_1, ..., y_L, each emitting observed features x_j, j ∈ {1, ..., R}, and noise features x_j, j ∈ {R + 1, ..., 40}. (b) Testing set performance (proportion of incorrect labels) when varying the number of relevant features R, using M = 10; curves compare the grid, single, multiple, and grouped schemes. (c) Testing set performance when varying the training set size M, using R = 5. In both (b) and (c), each point represents an average over 10 independent runs of HMM training/holdout/testing set generation and CRF training and hyperparameter optimization.

² In practice, roughly 50-100 iterations of CG were sufficient to obtain hyperparameter gradients, meaning that the cost of running Algorithm 1 was approximately the same as the cost of solving OPT1 for a single fixed setting of the hyperparameters. Roughly 3-5 line searches were sufficient to identify good hyperparameter settings; assuming that each line search takes 2-4 times the cost of solving OPT1, the overall hyperparameter learning procedure takes approximately 20 times the cost of solving OPT1 once.
³ For our HMM, we set initial state probabilities to 0.5 each, and used self-transition probabilities of 0.6.
⁴ Specifically, we drew each x_j^(i) independently according to P(x_j^(i) = v | y_i = v) = 0.6, v ∈ {0, 1}.
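The generative process just described (initial state probability 0.5, self-transition probability 0.6, relevant features agreeing with the hidden state with probability 0.6, noise features uniform over {0, 1}) can be sketched as follows; this is an illustrative reconstruction, not the authors' simulation code.

```python
import numpy as np

def sample_hmm_dataset(num_seqs, length=10, R=5, num_features=40, seed=0):
    """Sample sequences from the linear-chain HMM described in the text.

    Returns (X, Y): X has shape (num_seqs, length, num_features) of binary
    features, Y has shape (num_seqs, length) of binary hidden labels.
    """
    rng = np.random.default_rng(seed)
    Y = np.zeros((num_seqs, length), dtype=int)
    X = np.zeros((num_seqs, length, num_features), dtype=int)
    for s in range(num_seqs):
        y = rng.integers(0, 2)                       # initial state ~ Bernoulli(0.5)
        for t in range(length):
            if t > 0:
                y = y if rng.random() < 0.6 else 1 - y   # self-transition prob. 0.6
            Y[s, t] = y
            agree = rng.random(R) < 0.6                  # relevant features match y w.p. 0.6
            X[s, t, :R] = np.where(agree, y, 1 - y)
            X[s, t, R:] = rng.integers(0, 2, size=num_features - R)  # pure noise
    return X, Y

# Training, holdout, and testing sets of M, 10, and 1000 sequences, respectively.
M = 10
train = sample_hmm_dataset(M, seed=1)
holdout = sample_hmm_dataset(10, seed=2)
test = sample_hmm_dataset(1000, seed=3)
```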
Next, we constructed a CRF based on an HMM model similar to that shown in Figure 2a, in which potentials were included for the initial node y_1, between each y_i and y_{i+1}, and between y_i and each x_j (including both the observed features and the noise features). We then performed gradient-based hyperparameter learning using three different parameter-tying schemes: (a) all hyperparameters constrained to be equal, (b) different hyperparameters for each parameter of the model, and (c) transitions, observed features, and noise features each grouped together. Figure 2b shows the performance of the CRF for each of the three parameter-tying gradient-based optimization schemes, as well as the performance of scheme (a) when using the standard grid-search strategy of trying regularization matrices C = cI for c ∈ {..., 2^-2, 2^-1, 2^0, 2^1, 2^2, ...}.

As seen in Figures 2b and 2c, using either a single hyperparameter or all separate hyperparameters generally gave similar results, with a slight tendency for the separate-hyperparameter model to overfit. Enforcing regularization groups, however, gave consistently lower error rates, achieving an absolute reduction in generalization error over the next-best model of 6.7%, corresponding to a relative reduction of 16.2%.

RNA secondary structure prediction. We also applied our framework to the problem of RNA secondary structure prediction. Ribonucleic acid (RNA) molecules are long nucleic acid polymers present in the cells of all living organisms. For many types of RNA, three-dimensional (or tertiary) structure plays an important role in determining the RNA's function. Here, we focus on the task of predicting RNA secondary structure, i.e., the pattern of nucleotide base pairings which form the two-dimensional scaffold upon which RNA tertiary structures assemble. As a starting point, we used CONTRAfold [7], a current state-of-the-art technique based on CLLMs which has outperformed alternative approaches based on empirical measurement of thermodynamic parameters.

In CONTRAfold, valid secondary structure configurations for an RNA molecule are assumed to involve base pairs that "nest" in such a way as to allow straightforward modeling of RNA sequences and structures using context-free grammars (see Figure 3a). The CONTRAfold log-linear model extends a simple discriminatively trained stochastic context-free grammar (SCFG) by incorporating features chosen to closely match the energetic terms found in standard physics-based models of RNA structure. For instance, many features in the CONTRAfold program are dedicated to accurately modeling the free energies of the various types of loops that occur in RNA secondary structures (e.g., hairpin loops, bulge loops, interior loops, etc.). To control overfitting, CONTRAfold uses a flat L2 regularization penalty. While somewhat effective, such a non-structured regularization term ignores the fact that different features in the model are "noisy" to varying degrees (i.e., features vary widely in their expected correlation with structure).

To explore the benefits of using multiple hyperparameters, we modified the existing CONTRAfold implementation to perform an "outer" optimization loop based on our algorithm, using as regularization groups the twelve classes of structural features used in the CONTRAfold program. We performed two-fold cross-validation over a subset of 151 known structures from the Rfam database [13].

Figure 3: RNA secondary structure prediction. (a) An example of the nested base-pairing structure in an RNA molecule, a simple example context-free grammar for parsing this structure (where S is a nonterminal and the terminal symbols range over {a, c, g, u}; not the same as the CONTRAfold grammar), and an example parse. (b) RNA folding performance using either a single hyperparameter (the "Original" CONTRAfold) or multiple grouped hyperparameters (our "Modified" version). Sensitivity/specificity numbers are based on maximum expected accuracy parsing with γ = 1024 (see [7]). (c) Learned RNA structure prediction hyperparameters using holdout cross-validation on all 151 sequences.

Figure 3(b): RNA folding performance.

               Sens. (Spec.)    ROC area    Logloss
    Original   73.9 (50.4)      0.6161      2585.7
    Modified   75.5 (52.4)      0.6337      2529.0

Figure 3(c): Learned hyperparameters, exp(d_i), by regularization group.

    hairpin loop lengths                       0.00
    multi-branch loop lengths                  0.14
    bulge loop lengths                         0.45
    hairpin loop length extensions (> 30)      0.99
    helix closing base pairs                   1.45
    helix base pairs                           2.30
    internal loop asymmetry                    4.88
    internal loop lengths                      9.29
    exterior loop lengths                     12.43
    helix stacking interactions              159.35
    terminal mismatch interactions           327.45
    single base pair stacking interactions   497.51
Following [7], we used the maximum expected accuracy algorithm for decoding, which returns a set of candidate parses reflecting different trade-offs between sensitivity (the proportion of true base pairs called) and specificity (the proportion of called base pairs which are correct). Introducing multiple hyperparameters gave both increased sensitivity and increased specificity compared to the original model (see Figure 3b), which was learned using a single regularization hyperparameter. Overall, the testing logloss decreased by roughly 2.2%, while the estimated testing ROC area increased by roughly 2.9%. To put these numbers in context, [7] observed a difference in testing ROC area between the full CONTRAfold model and an impaired model, which did not use single base-pair stacking features, terminal mismatch features, or helix stacking features, that was smaller than the ROC area difference we report here. As shown in Figure 3c, our learning algorithm assigned relatively high regularization penalties to each of these three regularization groups. By using multiple hyperparameters, our modified prediction program was able to effectively leverage these noisy features while still using the simpler features of RNA secondary structure, such as helix base pairs and loop length features. In contrast, the original CONTRAfold model, which assumed a uniform regularization penalty for all parameters, could not as effectively incorporate these complex features without compromising the modeling of the simpler features.

6 Discussion and related work

In this work, we presented a gradient-based approach for hyperparameter learning based on minimizing logloss on a holdout set. While the use of cross-validation loss as a proxy for generalization error is fairly natural, in many other supervised learning methods besides log-linear models, other objective functions have been proposed for hyperparameter optimization. In SVMs, approaches based on optimizing generalization bounds [4], such as the radius/margin bound [15] or the maximal discrepancy criterion [2], have been proposed. Comparable generalization bounds are not generally known for CRFs; even in SVMs, however, generalization-bound-based methods empirically do not outperform simpler methods based on optimizing five-fold cross-validation error [8].

A different method for dealing with hyperparameters, common in neural network modeling, is the Bayesian approach of treating the hyperparameters themselves as parameters in the model to be estimated. In an ideal Bayesian scheme, one does not perform hyperparameter or parameter inference, but rather integrates over all possible hyperparameters and parameters in order to obtain a posterior distribution over predicted outputs given the training data. This integration can be performed using a hybrid Monte Carlo strategy [27, 38]. For the types of large-scale log-linear models we consider in this paper, however, the computational expense of sampling-based strategies can be extremely high due to the slow convergence of MCMC techniques [26]. Empirical Bayesian (i.e., ML-II) strategies, such as automatic relevance determination (ARD) [22], take the intermediate approach of integrating over parameters to obtain the marginal likelihood (whose logarithm is known as the log evidence), which is then optimized with respect to the hyperparameters. Computing marginal likelihoods, however, can be quite costly, especially for log-linear models.
One method for doing this involves approximating the parameter posterior distribution as a Gaussian centered at the posterior mode [22, 37]. In this strategy, however, the "Occam factor" used for hyperparameter optimization still requires a Hessian computation, which does not scale well for log-linear models. An alternative approach, based on a modification of expectation propagation (EP) [25], was applied in the context of Bayesian CRFs [32] and later extended to graph-based semi-supervised learning [14]. As described, however, inference in these models relies on non-traditional "probit-style" potentials for efficiency reasons, and known algorithms for inference in Bayesian CRFs are limited to graphical models with fixed structure.

In contrast, our approach works broadly for a variety of log-linear models, including the grammar-based models common in computational biology and natural language processing. Furthermore, our algorithm is simple and efficient, both conceptually and in practice: one iteratively optimizes the parameters of a log-linear model using a fixed setting of the hyperparameters, and then changes the hyperparameters based on the holdout logloss gradient. The gradient computation relies primarily on a simple conjugate gradient solver for linear systems, coupled with the ability to compute Hessian-vector products (straightforward in any modern programming language that allows for operator overloading). As we demonstrated in the context of RNA secondary structure prediction, gradient-based hyperparameter learning is a practical and effective method for tuning hyperparameters when applied to large-scale log-linear models.

Finally, we note that for neural networks, [9] and [5] proposed techniques for simultaneous optimization of hyperparameters and parameters; these results suggest that similar procedures for faster hyperparameter learning, which do not require a doubly-nested optimization, may be possible.

References

[1] L. Andersen, J. Larsen, L. Hansen, and M. Hintz-Madsen. Adaptive regularization of neural classifiers, 1997.
[2] D. Anguita, S. Ridella, F. Rivieccio, and R. Zunino. Hyperparameter design criteria for support vector classifiers. Neurocomputing, 55:109–134, 2003.
[3] Y. Bengio. Gradient-based optimization of hyperparameters, 2000.
[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3):131–159, 2002.
[5] D. Chen and M. Hagan. Optimal use of regularization and cross-validation in neural network modeling. In IJCNN, 1999.
[6] C. B. Do, S. S. Gross, and S. Batzoglou. CONTRAlign: discriminative training for protein sequence alignment. In RECOMB, pages 160–174, 2006.
[7] C. B. Do, D. A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):e90–e98, 2006.
[8] K. Duan, S. S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51(4):41–59, 2003.
[9] R. Eigenmann and J. A. Nossek. Gradient based adaptive regularization. In NNSP, pages 87–94, 1999.
[10] T. Glasmachers and C. Igel. Gradient-based adaptation of general Gaussian kernels. Neural Comp., 17(10):2099–2105, 2005.
[11] A. Globerson, T. Y. Koo, X. Carreras, and M. Collins. Exponentiated gradient algorithms for log-linear structured prediction. In ICML, 2007.
[12] C. Goutte and J. Larsen. Adaptive regularization of neural networks using conjugate gradient. In ICASSP, 1998.
[13] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33:D121–D124, 2005.
[14] A. Kapoor, Y. Qi, H. Ahn, and R. Picard. Hyperparameter and kernel learning for graph based semi-supervised classification. In NIPS, pages 627–634, 2006.
[15] S. S. Keerthi. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13(5):1225, 2002.
[16] S. S. Keerthi, V. Sindhwani, and O. Chapelle. An efficient method for gradient-based adaptation of hyperparameters in SVM models. In NIPS, 2007.
[17] K. Kobayashi, D. Kitakoshi, and R. Nakano. Yet faster method to optimize SVR hyperparameters based on minimizing cross-validation error. IJCNN, 2, 2005.
[18] K. Kobayashi and R. Nakano. Faster optimization of SVR hyperparameters based on minimizing cross-validation error. In Cybernetics and Intelligent Systems, 2004 IEEE Conference on, 2, 2004.
[19] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML 18, pages 282–289, 2001.
[20] J. Larsen, L. K. Hansen, C. Svarer, and M. Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In NNSP, 1996.
[21] J. Larsen, C. Svarer, L. N. Andersen, and L. K. Hansen. Adaptive regularization in neural network modeling. In Neural Networks: Tricks of the Trade, pages 113–132, 1996.
[22] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
[23] D. J. C. MacKay and R. Takeuchi. Interpolation models with multiple hyperparameters. Statistics and Computing, 8:15–23, 1998.
[24] J. R. R. A. Martins, P. Sturdza, and J. J. Alonso. The complex-step derivative approximation. ACM Trans. Math. Softw., 29(3):245–262, 2003.
[25] T. P. Minka. Expectation propagation for approximate Bayesian inference. UAI, 17:362–369, 2001.
[26] I. Murray and Z. Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In UAI, pages 392–399, 2004.
[27] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
[28] A. Y. Ng. Preventing overfitting of cross-validation data. In ICML, pages 245–253, 1997.
[29] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, page 78, 2004.
[30] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
[31] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
[32] Y. Qi, M. Szummer, and T. P. Minka. Bayesian conditional random fields. In AISTATS, 2005.
[33] M. Seeger. Cross-validation optimization for large scale hierarchical classification kernel methods. In NIPS, 2007.
[34] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL, pages 134–141, 2003.
[35] S. Sundararajan and S. S. Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Comp., 13(5):1103–1118, 2001.
[36] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic meta-descent. In ICML, 2006.
[37] M. Welling and S. Parise. Bayesian random fields: the Bethe-Laplace approximation. In ICML, 2006.
[38] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[39] X. Zhang and W. S. Lee. Hyperparameter learning for graph-based semi-supervised learning algorithms. In NIPS, 2007.