Local Minima Embedding

Minyoung Kim, Fernando De la Torre
Robotics Institute, Carnegie Mellon University, PA 15213, USA
mikim@cs.rutgers.edu, ftorre@cs.cmu.edu

Abstract

Dimensionality reduction is a commonly used step in many algorithms for visualization, classification, clustering and modeling. Most dimensionality reduction algorithms find a low-dimensional embedding that preserves the structure of high-dimensional data points. This paper proposes Local Minima Embedding (LME), a technique to find a low-dimensional embedding that preserves the local minima structure of a given objective function. LME provides an embedding that is useful for visualizing and understanding the relation between the original variables that create local minima. Additionally, the embedding can potentially be used to sample the original function to discover new local minima. The function in the embedded space takes an analytic form, and hence its gradients can be computed analytically. We illustrate the benefits of LME on both synthetic data and real problems in the context of image alignment. To the best of our knowledge, this is the first paper that addresses the problem of finding an embedding that preserves the local minima properties of an objective function.

Figure 1. Illustration of local minima embedding (LME). LME finds both an embedding (z) of the input space (x) and a new target function g(z), defined on the embedded space, that preserves the local minima structure of f(x). The three local minima of the original function f(x) are preserved in g(z).

1. Introduction

Optimization algorithms occupy a central role within the arsenal of computational methods used for solving problems in the fields of machine learning, statistics, computer vision, and pattern recognition. In particular, many machine learning algorithms can be cast as optimization problems. A major challenge in optimization is to find the global minimum and to understand the structure of the local minima of a given problem.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Motivated by the need to visualize and understand the structure of local minima, in this paper we consider the problem of finding an embedding of an error function that preserves its local minima properties. Fig. 1 illustrates the main idea of the paper. The objective function f(x), defined on the 2D parameter space (left plot), contains three local minima. After performing our 1D embedding, the new objective function (right plot), now defined on the lower-dimensional embedded space z, preserves the local minima structure of the original function. We formulate the task of local minima preserving embedding (denoted by LME) as follows: given a real-valued function f(x) on the input space x ∈ R^p, we find a low-dimensional embedding of x, denoted by z ∈ R^q (q ≪ p), while simultaneously approximating f(x) by a real-valued function g(z) so that it preserves the local minima properties of f(x) as much as possible. LME is useful for visualization, as well as for performing a gradient search in the low-dimensional space (e.g., a 1D search), which can typically be done more efficiently than in the input space. Moreover, the embedding can potentially capture underlying redundancies or dependencies that reside in the original variables, providing a chance to explore previously unobserved local minima points.
The rest of the paper is organized as follows: after reviewing related work on high-dimensional data embedding in Sec. 2 and defining the notation used throughout the paper in Sec. 3, we propose our LME algorithm in Sec. 4. Experimental results on synthetic and real data are presented in Sec. 5.

2. Related Work

A large literature on dimensionality reduction (DR) has been devoted to the classical unsupervised and supervised settings. In the former, discovering a low-dimensional structure of the data can often be accomplished either by extracting global statistical information (e.g., the principal directions of PCA and kernel PCA (Schölkopf et al., 1998)) or by exploiting the geometric nature of the data (e.g., the local linear structures of LLE (Roweis & Saul, 2000), the geodesic distances of ISOMAP (Tenenbaum et al., 2000), or locality preserving projections (LPP) (He & Niyogi, 2003)). In supervised dimensionality reduction, each data point is marked with additional label information that guides the formation of the low-dimensional embedded space. When the label indicates a discrete class membership, such a class label can be exploited to enforce that data points are either grouped together or kept far apart from one another in the embedded space. The well-known Fisher discriminant analysis and the diverse forms of metric learning algorithms (Globerson & Roweis, 2005; Xing et al., 2002; Yan et al., 2007) fall into this category. On the other hand, in certain applications it is reasonable to regard the label as a smoothly varying real value. DR in this setting, often referred to as dimensionality reduction for regression, can be treated within a regression framework, where we regress a target (output label) from input data points. The embedding then tries to minimize the loss of information caused by dimensionality reduction in terms of the regression performance, often achieved by enforcing conditional independence between output and input given the embedding (Fukumizu et al., 2004; Kim & Pavlovic, 2008; Li, 1991; Nilsson et al., 2007).

A relatively unexplored problem in DR is finding embeddings that preserve the local minima of a given objective function. Closest to our work is the structure preserving embedding of (Shaw & Jebara, 2009), whose goal is to preserve the nearest-neighbor structure of an input graph. They explicitly enforce affinity (neighbor) and repulsive (non-neighbor) constraints to fully respect the original graph topology. Recent related work in computer vision aims to perform feature selection/weighting in generative (Nguyen & de la Torre, 2008) and discriminative (Wu et al., 2008) models that avoid local minima in image alignment. Unlike previous work on dimensionality reduction of data samples or graph structures, this paper proposes a dimensionality reduction technique that preserves the local minima structure of a given objective function.

Figure 2. Example illustrating the neighborhood topology (points x1, ..., x6; see Sec. 3).

3. Notation and Setup

We denote by f(x) the target function in the original space. For simplicity and tractability, we assume that we are given a set of n paired samples D = {(x_i, f(x_i))}_{i=1}^n, where x_i ∈ R^p. We use boldfaced matrix/vector notation, X = [x_1, ..., x_n]ᵀ and f = [f(x_1), ..., f(x_n)]ᵀ, which are of dimension (n × p) and (n × 1), respectively.
We assume that D contains three types of samples: (i) some of the local minima points of the target function f(·), (ii) their neighborhood points obtained by random sampling around the local minima, and (iii) some other non-neighbor points. These points are all labeled, meaning that we know which data points are local minima, neighbors of local minima, or non-neighbors. We define a neighbor as a point x which is close to a local minimum x* in the original space (i.e., ‖x − x*‖ < ε for some small ε). The set of neighbors of x* within the radius ε is denoted by N(x*). By definition, the function value of a neighbor is not smaller than that of the local minimum (i.e., f(x*) ≤ f(x) for x ∈ N(x*)). For simplicity we restrict ourselves to disjoint neighborhoods, meaning that N(x*) ∩ N(x′*) = ∅ for any two local minima x* and x′*.

Given D and the neighborhood threshold constant ε, we represent the neighborhood topology as the (n × n) matrix U, a 0/1 matrix with U_ji = 1 iff x_i is a local minimum and x_j ∈ N(x_i). We also define the diagonal matrix V whose diagonal entries are the row sums of U. For better understanding, consider the synthetic example in Fig. 2, where D contains 6 points with two local minima x_2 and x_5 (in blue/red), their neighbors x_1, x_4, x_6, and a non-neighbor x_3. In this case, it is easy to see that the neighborhood topology matrix U and the row-sum matrix V are:

    U = [ 0 1 0 0 0 0
          0 1 0 0 0 0
          0 0 0 0 0 0
          0 1 0 0 0 0
          0 0 0 0 1 0
          0 0 0 0 1 0 ],    V = diag(1, 1, 0, 1, 1, 1).        (1)

U and V facilitate expressing the local minima constraints (i.e., f(x*) ≤ f(x) for x ∈ N(x*)) compactly as a single linear inequality system, namely (U − V)f ≤ 0.
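To make the construction concrete, the short sketch below (ours, not the authors'; the indices follow the Fig. 2 example and the function values are invented) builds U and V and checks the linear inequality system (U − V)f ≤ 0 in NumPy.

```python
import numpy as np

# Minimal sketch (not from the paper): the neighborhood-topology matrix U and the
# row-sum matrix V for the six-point example of Fig. 2, where x2 and x5 are local
# minima, x1 and x4 are neighbors of x2, and x6 is a neighbor of x5 (1-based as in
# the text; each minimum lies in its own neighborhood). The function values below
# are made up purely to exercise the inequality system.
n = 6
local_minima = [1, 4]                      # 0-based indices of x2 and x5
neighborhoods = {1: [0, 1, 3],             # N(x2) = {x1, x2, x4}
                 4: [4, 5]}                # N(x5) = {x5, x6}

U = np.zeros((n, n))
for i in local_minima:                     # U_ji = 1 iff x_i is a local minimum and x_j in N(x_i)
    for j in neighborhoods[i]:
        U[j, i] = 1.0
V = np.diag(U.sum(axis=1))                 # diagonal matrix of the row sums of U

# The constraints f(x_i) <= f(x_j) for every x_j in N(x_i) read (U - V) f <= 0.
f = np.array([3.0, 1.0, 5.0, 2.5, 0.5, 1.5])
print(np.all((U - V) @ f <= 1e-12))        # True for these values
```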
4. Local Minima Embedding (LME)

LME learns two mappings: (i) a low-dimensional embedding z = Bᵀx, where B = [b_1, ..., b_q] is the (p × q) embedding matrix¹ and z ∈ R^q is the q-dimensional (q ≪ p) embedded point of x; and (ii) a new target function g(z) on the embedded space, which approximates f(x) while preserving the local minima properties of the original function f(·). We assume a linear form g(z) = cᵀz with parameter c ∈ R^q. Note that, as a linear function does not entail a local minimum, we should kernelize it (now with a kernel on the embedded space). Also notice that this roughly corresponds to performing a kernel regression with the data {(z_i, f(x_i))}_{i=1}^n. However, as shown in this section, we do not solve these two problems separately, but optimize them simultaneously in a principled manner.

¹ One can always consider a kernel extension, in which B becomes an operator composed of q (nonlinear) functions in the RKHS of x. For expositional convenience, we use the linear representation throughout the paper, which is then turned into a kernel version straightforwardly using the kernel trick.

4.1. Constraints for LME

Given a set of n paired samples D = {(x_i, f(x_i))}_{i=1}^n, where x_i ∈ R^p, LME finds a low-dimensional embedding that satisfies three criteria:

· Condition I (Local Structure Preservation): The embedding preserves the locality (or proximity) structure of the local minima points and their neighborhood points. That is, if a point is close to a local minimum in the original space, so is it in the embedded space.

· Condition II (Local Minima Preservation): Within each neighborhood around a local minimum, the target function g(·) has to satisfy the local minima property. More specifically, denoting by x* (z*) and x (z) a local minimum and its neighbor in the original (embedded) space, respectively, one can enforce that g(z*) ≤ g(z) as well as the stationary point condition ∂g(z)/∂z |_{z=z*} = 0. We also need to avoid that several local minima collapse onto a single point.

· Condition III (Function Approximation): Overall, g(z) should approximate f(x) well.

4.1.1. Enforcing Condition I

To enforce Condition I, namely preserving the neighborhood structures around the local minima points, LME enforces the following constraint: for every local minimum x*, x ∈ N(x*) ⇒ z ∈ N(z*), where z = Bᵀx and z* = Bᵀx*. Here, the neighborhood set in the embedded space, N(z*), is defined for a new radius constant ε′. A straightforward way to enforce this constraint would be the inequality constraints

    ‖Bᵀx − Bᵀx*‖² ≤ ‖Bᵀx̃ − Bᵀx*‖²,  for all x ∈ N(x*) and x̃ ∉ N(x*).        (2)

However, (2) has the form of differences of quadratic functions, which yields non-convex constraints in the optimization of B. Instead, one can consider a soft version of neighborhood preservation by penalizing pairs of data points that have high proximity in the original space but low proximity in the embedded space. This can be expressed as:

    min_B ∑_{i,j} w_ij ‖z_i − z_j‖²,        (3)

where w_ij is a measure of affinity between x_i and x_j in the original space, taking a larger value if x_i and x_j are closer. Note that (3) has the same form² as that of locality preserving projections (He & Niyogi, 2003) and the Laplacian eigenmap (Belkin & Niyogi, 2002). There are diverse ways to construct the affinity matrix W = (w_ij). Throughout the paper, we assume W = K_x, the kernel matrix in the original space. Using the matrix notation Z = [z_1, ..., z_n]ᵀ = XB, (3) is equivalent to:

    min_B tr(Zᵀ L Z) = tr(Bᵀ Xᵀ L X B),        (4)

where L = D − W is the graph Laplacian induced from W, and D is the diagonal matrix with the row sums of W on its diagonal.

² As is done in LPP, one typically needs to impose certain regularization on B to avoid the trivial solution of a constant Z (or B = 0). In our case, however, since this trivial case is automatically discarded by the least-squares objective (8), we do not consider it explicitly.

4.1.2. Enforcing Condition II

In Condition II, we enforce the local minima criteria for the new target function g(·) within the neighborhoods of the local minima points. We consider two conditions: (i) minimality within a neighborhood, and (ii) the stationary point condition. The former can be formally stated as: for every local minimum x* and x ∈ N(x*),

    g(z*) ≤ g(z).        (5)

As g(z) = cᵀz, the inequality constraints (5) can be equivalently expressed in matrix form as:

    (U − V)Zc ≤ 0.        (6)

Note that (6) forms a set of nonlinear and non-convex constraints as we optimize for both c and Z. The latter stationary point condition can be written as ∂g(z)/∂z |_{z=z*} = 0. As there always exists numerical error within machine precision, we introduce a small positive slack ξ. Then

    ‖∂g(z)/∂z |_{z=z*}‖² ≤ ξ.        (7)

Unfortunately, (7) is not a convex constraint in general due to the nonlinearity of g(z) in z. We have also sometimes observed that the local minima points (and their neighbors) collapse onto one another in the embedding. To address this issue, we additionally enforce the following constraint for every pair of local minima points (x*, x′*): ‖z* − z′*‖ ≥ δ, where δ is chosen appropriately (e.g., δ = 1).

4.1.3. Enforcing Condition III

Finally, the last condition forces the new target function g(·) to approximate the original function f(·) well. In the least-squares sense, this can be written as:

    min ∑_i ‖g(z_i) − f(x_i)‖² = ‖Zc − f‖².        (8)

It is worth noticing that in (8) we enforce g(·) to be close to f(·) for all the training data points, which has the auxiliary effect of imposing ordering constraints on the function values at the local minima points. For instance, consider two local minima points x* and x′*, where x* is the global minimum while x′* is not. Then it is desirable to have the function values ordered accordingly, namely g(z*) ≤ g(z′*). This can be enforced either explicitly by a set of inequalities similar to (6) or implicitly by (8). We take the latter approach.

4.2. Energy function for LME

Combining all the constraints from Sec. 4.1, namely (4), (6), (7), and (8), LME optimizes (with Z = XB):

    min_{B,c}  ‖XBc − f‖² + λ tr(Bᵀ Xᵀ L X B)
    s.t.       (U − V)XBc ≤ 0,
               ‖∂g(z)/∂z |_{z=z*}‖² ≤ ξ,
               ‖z* − z′*‖² ≥ δ.        (9)

Here λ is the trade-off parameter that balances the two cost terms.
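The smoothness term tr(BᵀXᵀLXB) in (9) is the matrix form (4) of the pairwise penalty (3). As a quick sanity check of that equivalence (our own sketch; the random data and kernel width are arbitrary), the following NumPy snippet verifies numerically that the two expressions agree up to a constant factor of 2, which does not affect the minimizer.

```python
import numpy as np

# Quick numerical sanity check (not from the paper; data and kernel width are arbitrary):
# sum_ij w_ij ||z_i - z_j||^2 = 2 * tr(Z^T L Z) with L = D - W, so the pairwise
# penalty (3) and its matrix form (4) agree up to a factor of 2 that is irrelevant
# for the minimization over B (here Z = X B).
rng = np.random.default_rng(0)
n, p, q = 8, 4, 2
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, q))
Z = X @ B

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-0.5 * sq_dists)                # affinity W = K_x with sigma_x = 1 (assumed)
D = np.diag(W.sum(axis=1))
L = D - W                                  # graph Laplacian

lhs = (W * ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)).sum()
rhs = 2.0 * np.trace(Z.T @ L @ Z)
print(np.isclose(lhs, rhs))                # True
```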
As mentioned earlier, it is necessary to kernelize the target function g(·). We consider two RBF kernels, one for the original space and one for the embedded space:

    k_x(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ_x²))   and   k_z(z_i, z_j) = exp(−‖z_i − z_j‖² / (2σ_z²)),

where the scale parameters σ_x and σ_z are assumed fixed. We denote their kernel matrices evaluated on the training data by K_x and K_z, respectively, both of dimension (n × n). Using the dual representation, the embedding operator B can be parameterized by the (n × q) matrix α = (α_ij), namely B = [b_1(·), ..., b_q(·)], where b_j(·) = ∑_{i=1}^n α_ij k_x(·, x_i) for j = 1, ..., q. By the kernel trick, this leads to the nonlinear embedding expressed as Z = K_x α, while the second term in the objective (9) is replaced by tr(αᵀ K_x L K_x α). Similarly, the target function g(·) can be represented in dual form in the embedded space as:

    g(z) = ∑_{i=1}^n β_i k_z(z, z_i),        (10)

parameterized by the (n × 1) vector β = (β_i). It should be noted that the functional form of (10) sets an upper bound on the number of local minima of the new function. That is, g(z) has at most n local minima points in the embedded space, which is reasonable given a finite-sample scenario. The kernel trick replaces Zc by K_z β, which yields the kernelized version of (9):

    min_{α,β}  ‖K_z β − f‖² + λ tr(αᵀ K_x L K_x α)
    s.t.       (U − V)K_z β ≤ 0,
               ‖∂g(z)/∂z |_{z=z*}‖² ≤ ξ,
               ‖z* − z′*‖² ≥ δ.        (11)

As K_z is nonlinearly related to α through the RBF function, (11) is an instance of non-convex optimization with non-convex inequality constraints. We solve this problem using the constrained optimization toolbox in Matlab with the function fmincon(). Once we have found the optimal α and β, the embedding of a new test point x can be obtained from:

    z = αᵀ k_x(x),        (12)

where k_x(x) = [k_x(x, x_1), ..., k_x(x, x_n)]ᵀ is the (n × 1) test kernel vector for x.
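The paper solves (11) with Matlab's fmincon(). As a rough, non-authoritative sketch of the same optimization, the problem can also be set up for a generic constrained solver such as SciPy's SLSQP; the 1D toy function, the sample layout, and all hyperparameters (σ_x = σ_z = 1, λ, ξ, δ) below are our own illustrative choices, with q = 1.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the kernelized LME problem (11) on a tiny 1-D toy function with q = 1.
# This is NOT the authors' implementation (they use Matlab's fmincon); the toy
# function, sample indices, and hyperparameters below are illustrative assumptions.
rng = np.random.default_rng(1)

def f(x):                                   # toy target with two local minima
    return np.cos(3 * x) + 0.1 * x ** 2

x_star = np.array([1.02, -1.02])            # approximate local minima (p = 1)
X = np.concatenate([x_star,
                    x_star[0] + 0.15 * rng.uniform(-1, 1, 3),   # neighbors of x*_1
                    x_star[1] + 0.15 * rng.uniform(-1, 1, 3),   # neighbors of x*_2
                    rng.uniform(-2.5, 2.5, 4)])[:, None]        # non-neighbors
fvals = f(X[:, 0])
n = len(X)

U = np.zeros((n, n))                        # neighborhood topology (Sec. 3)
U[[0, 2, 3, 4], 0] = 1.0                    # N(x*_1) = {x_0, x_2, x_3, x_4}
U[[1, 5, 6, 7], 1] = 1.0                    # N(x*_2) = {x_1, x_5, x_6, x_7}
V = np.diag(U.sum(1))

def rbf(A, B, s):                           # RBF kernel matrix between row sets
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * s ** 2))

Kx = rbf(X, X, 1.0)                         # sigma_x = 1 (assumed)
L = np.diag(Kx.sum(1)) - Kx                 # graph Laplacian with W = K_x
lam, xi, delta, sz = 0.1, 1e-3, 1.0, 1.0    # lambda, slack, separation, sigma_z

def unpack(theta):
    return theta[:n, None], theta[n:]       # alpha (n x 1), beta (n,)

def objective(theta):
    a, b = unpack(theta)
    Z = Kx @ a                              # embedded training points
    g = rbf(Z, Z, sz) @ b                   # g at the training embeddings, Eq. (10)
    return np.sum((g - fvals) ** 2) + lam * (a.T @ Kx @ L @ Kx @ a).item()

def constraints(theta):                     # SLSQP convention: all entries >= 0
    a, b = unpack(theta)
    Z = Kx @ a
    Kz = rbf(Z, Z, sz)
    cons = [-(U - V) @ (Kz @ b)]            # (U - V) K_z beta <= 0
    for i in (0, 1):                        # stationary-point slack at each z*
        grad = (((Z - Z[i]) / sz ** 2) * Kz[i][:, None] * b[:, None]).sum(0)
        cons.append(np.array([xi - grad @ grad]))
    cons.append(np.array([np.sum((Z[0] - Z[1]) ** 2) - delta]))   # no collapse
    return np.concatenate(cons)

theta0 = 0.1 * rng.standard_normal(2 * n)
res = minimize(objective, theta0, method="SLSQP",
               constraints={"type": "ineq", "fun": constraints},
               options={"maxiter": 500})
a_opt, b_opt = unpack(res.x)

# Embedding of a new test point, as in Eq. (12): z = alpha^T k_x(x).
z_new = rbf(np.array([[0.3]]), X, 1.0) @ a_opt
```

The last two lines also illustrate Eq. (12): embedding a test point only requires its kernel vector against the training inputs.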
5. Experiments

This section demonstrates the effectiveness of LME for visualization and local minima discovery. In the first experiment we tested the ability of LME to find an embedding that preserves the local minima of an analytic function. In the second experiment LME found an embedding of the error function used for template matching over translation, rotation, and scale.

5.1. Synthetic data

We performed a synthetic experiment on the three-hump camel function, defined as

    y = (1/6) x_1⁶ − 1.05 x_1⁴ + 2 x_1² + x_2² − x_1 x_2.        (13)

It has 3 local minima in the 2D space, one of which is the global minimum, taking a strictly smaller function value than the remaining two local minima. Fig. 3(a) and Fig. 3(b) depict the function and its contour plot, respectively. The training data D is composed of all local minima points (depicted as squares), 20 randomly sampled neighbor points (crosses) per local minimum within a ball of radius 0.4, and 30 non-neighbor samples (black dots with sample IDs). We used two RBF kernels for the nonlinear mappings B and g(·), and set the parameters to σ_x = σ_z = 1.0. Starting from random α and β as initial iterates, we optimized (11) using fmincon() in Matlab until convergence. We found the 1D embedding and the transformed target function shown in Fig. 3(c), where the green curve represents the embedded function g(z) obtained by LME (10). LME successfully preserves the local minima of the original function.

We also compared LME with two baseline approaches that perform an embedding of the function (KPCA and KLPP (He & Niyogi, 2003)) followed by a least-squares fitting of the objective function in the embedded space. As shown in Fig. 3(d) and Fig. 3(e), both independent embedding approaches fail to preserve the local minima structure of the original function.

Figure 3. Three-hump camel function: (a) original function, (b) contour plot, (c) LME, (d) kernel PCA, (e) kernel LPP. Blue square = global minimum, red squares = (non-global) local minima, crosses = neighbor samples, and black dots (with sample ID numbers) = non-neighbor samples.
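For reference, the following sketch (under our own assumptions; the paper only specifies the sample counts and the 0.4 radius) shows one way such a training set could be generated: locate the minima of (13) by gradient search from random starting points, then sample neighbor and non-neighbor points.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (under our own assumptions) of how the training set D of Sec. 5.1 could be
# generated: locate the minima of the three-hump camel function (13) by gradient
# search from random starts, then sample neighbors within a ball of radius 0.4 and
# some non-neighbor points. The authors' exact sampling procedure may differ.
def camel(x):
    x1, x2 = x
    return x1 ** 6 / 6.0 - 1.05 * x1 ** 4 + 2.0 * x1 ** 2 + x2 ** 2 - x1 * x2

rng = np.random.default_rng(0)
starts = rng.uniform(-2.5, 2.5, size=(50, 2))
solutions = [minimize(camel, s).x for s in starts]

minima = []
for m in solutions:                            # keep distinct minima only
    if all(np.linalg.norm(m - u) > 1e-2 for u in minima):
        minima.append(m)
print(len(minima), "local minima found")       # 3 expected for this function

def sample_ball(center, radius, k):            # uniform samples in a 2-D ball
    d = rng.normal(size=(k, 2))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    return center + radius * np.sqrt(rng.uniform(0, 1, (k, 1))) * d

neighbors = [sample_ball(m, 0.4, 20) for m in minima]
non_neighbors = rng.uniform(-2.5, 2.5, size=(30, 2))
# keep only points outside every neighborhood so the "non-neighbor" label holds
non_neighbors = non_neighbors[[all(np.linalg.norm(x - m) > 0.4 for m in minima)
                               for x in non_neighbors]]
X_train = np.vstack([np.array(minima)] + neighbors + [non_neighbors])
f_train = np.array([camel(x) for x in X_train])
```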
5.2. Template matching

This section tested LME on the error function for template matching. Given an image frame I and a template model I_0, template matching minimizes:

    min_p ‖I(φ(x; p)) − I_0‖²,        (14)

where φ(x; p) is a geometric transformation (e.g., affine) that warps the pixel coordinates x according to the parameters p. We used a Euclidean transformation composed of 4 parameters p: 2 translation parameters (x-pos and y-pos), 1 scale parameter (scale), and 1 in-plane rotation parameter (rotation). We used Fig. 4(a) as the image frame I of size (128 × 128), which contains a face template image patch I_0 (Fig. 4(b)) at its center. The template patch is of size (48 × 48). The image frame has a uniform gray background, whose intensity is set to the average intensity of the template patch. Note that the error function is defined as the sum-of-squared-distances (SSD) on the 4D input space x.

Figure 4. (a) Image frame (128 × 128 pixels). (b) Template patch (48 × 48 pixels).

To find some local minima, we performed a numerical gradient search around randomly chosen starting points. We found 4 local minima points (lm-1 to lm-4), depicted as red boxes in Fig. 5(a), as well as the global minimum, depicted as a yellow box. We also computed the SSD of these 4 local minima with respect to the template patch, as shown in Fig. 5(b). For each local (and global) minimum, we randomly sampled 8 neighbors, taking into account different granularities for the different parameters. We also randomly sampled 20 non-neighboring points. LME was applied to reduce the original 4D input to 1D. The resulting embedding and the transformed target error function are shown in Fig. 6. As shown in the figure, the embedding preserves all the local minima points of the given target function observed in the data samples.

Figure 5. (a) 4 local minima points depicted as red boxes with different positions, scales, and rotations, found by sampling and gradient search; the global minimum is depicted as a yellow box. (b) Images and sum-of-squared errors of the 4 local minima (lm-1: 44.02, lm-2: 67.05, lm-3: 71.27, lm-4: 81.08).

Figure 6. LME for the template matching example. The four local minima points are represented by blue squares, and the global minimum in red. The newly found local minimum point is depicted as a black cross.

Interestingly, the embedding reveals a new local minimum point (originally not observable in the available samples). The new point³ has a lower SSD error than the other local minima points, and the corresponding image can be obtained by solving a pre-image problem. This image is shown in Fig. 8(b), under the title New LM. As shown, the image corresponding to the new local minimum is better aligned than the best local minimum (lm-1). This is an example where LME is able to discover the structure of the input with respect to the error function, potentially yielding new local minima.

³ We note that this point does not necessarily correspond to a local minimum in the original space; it turned out to be a local minimum along a particular direction. These types of important points (e.g., saddle points) can be useful for understanding the impact of the input variables on the error function, although more research needs to be done to exactly establish the relation between the minima in the embedded space and those in the original space.

Furthermore, to see the effectiveness of LME in finding the dependencies among the original parameters, we performed a one-dimensional line search in the embedded space around the regions adjoining the global minimum. More specifically, we took the region from the best local minimum in the training samples (i.e., lm-1) to the global minimum (i.e., gm). This region is zoomed in and shown in Fig. 7. We split the region into four segments knotted at the local/global minima/maxima points (I to IV). For each of these four segments, we took 5 points (the two end knots and three points sampled uniformly between them). We solved the pre-image problem (with the initial points chosen from the previous solutions) to retrieve the corresponding parameters in the original 4D input space. The pre-image provides the visual image for these parameters. The results (parameters and images) retrieved by the line search for each of the four segments are shown in Fig. 8.

Figure 7. Zoom of Fig. 6 from lm 1 to gm. We split the line search into four regions: lm 1 → local max 1 (I), local max 1 → new local minimum (II), new local minimum → local max 2 (III), and local max 2 → gm (IV). The line search results are shown in Fig. 8.
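Both the visualization of the new minimum and the line search above rely on a pre-image step that maps an embedded coordinate back to the original parameters. The paper does not detail its pre-image solver; the following sketch shows one plausible formulation, where every name and quantity is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of one plausible pre-image step (the paper does not detail its solver):
# given a target embedded coordinate z_t, recover an input x whose embedding
# alpha^T k_x(x) from Eq. (12) matches it, warm-started from a nearby point.
# X_train, alpha, and sigma_x would come from a trained LME model; the quantities
# below are made up only to show the call pattern.
def preimage(z_t, X_train, alpha, sigma_x, x0):
    def embed(x):                               # Eq. (12): z = alpha^T k_x(x)
        k = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * sigma_x ** 2))
        return alpha.T @ k

    return minimize(lambda x: np.sum((embed(x) - z_t) ** 2), x0, method="BFGS").x

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 4))              # e.g., 4-D warp parameters
alpha = rng.normal(size=(10, 1))                # dual coefficients of the embedding
x_hat = preimage(z_t=np.array([0.2]), X_train=X_train, alpha=alpha,
                 sigma_x=1.0, x0=X_train[0].copy())
```

As the conclusion notes, this inverse map is itself prone to local minima, which is why a warm start from a nearby training point (or the previous segment's solution) is used in the line search.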
Observe that most local minima are produced by rotation and scale, which are well known to be worse conditioned for registration than the translation parameters. For example, for region I, we rotated the lm-1 image counter-clockwise to reach local max 1. Region II, which goes from this local maximum to the new local minimum, corresponds to a clockwise rotation plus a decrease in the scale parameter. This yields a new local minimum point, better aligned than lm-1. Continuing these rotation/scale changes (region III), however, led to increasing errors, and we denote the resulting local maximum point, shown in Fig. 8(c), as local max 2. In region IV, we again reversed the trend in both scale and rotation parameters, finally landing at the global minimum, as shown in Fig. 8(d). This example shows how LME allows us to visualize the structure of the local minima and to understand which parameters in the original space create them.

6. Conclusion

This paper proposes LME, an embedding technique that preserves the local minima structure of a given error function. We have shown on synthetic and real examples that LME is a useful technique for understanding the structure of high-dimensional error functions. Despite being a promising technique, LME has a few issues that need to be further explored. Firstly, given an embedded point we need to solve a pre-image problem to find the corresponding parameters in the original space. The pre-image problem is itself sensitive to local minima because there is no one-to-one mapping between the original input space and the embedded space. We plan to convexify it by formulating a learning problem directly in the kernel (Hilbert) feature space. Secondly, throughout the paper we have assumed that the locations of the local minima are known.
It is left as future work to address the unsupervised setting, in which the locations of the local minima are not known.

Figure 8. Line search along the 4 regions (from I to IV): (a) lm 1 → local max 1 (I), (b) local max 1 → new local minimum (II), (c) new local minimum → local max 2 (III), and (d) local max 2 → gm (IV).

References

Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. NIPS, 2002.
Fukumizu, K., Bach, F., and Jordan, M. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. JMLR, 2004.
Globerson, A. and Roweis, S. Metric learning by collapsing classes. NIPS, 2005.
He, X. and Niyogi, P. Locality preserving projections. NIPS, 2003.
Kim, M. and Pavlovic, V. Dimensionality reduction using covariance operator inverse regression. CVPR, 2008.
Li, K.-C. Sliced inverse regression for dimension reduction. JASA, 1991.
Nguyen, M. and de la Torre, F. Local minima free parameterized appearance models. CVPR, 2008.
Nilsson, J., Sha, F., and Jordan, M. Regression on manifolds using kernel dimension reduction. ICML, 2007.
Roweis, S. and Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
Schölkopf, B., Smola, A. J., and Müller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
Shaw, B. and Jebara, T. Structure preserving embedding. ICML, 2009.
Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.
Wu, H., Liu, X., and Doretto, G. Face alignment via boosted ranking model. CVPR, 2008.
Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. Distance metric learning with application to clustering with side-information. NIPS, 2002.
Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q., and Lin, S. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Trans. on PAMI, 29(1):40-51, 2007.