The Elastic Embedding Algorithm for Dimensionality Reduction

Miguel Á. Carreira-Perpiñán    mcarreira-perpinan@ucmerced.edu
Electrical Engineering and Computer Science, School of Engineering, University of California, Merced

Abstract

We propose a new dimensionality reduction method, the elastic embedding (EE), that optimises an intuitive, nonlinear objective function of the low-dimensional coordinates of the data. The method reveals a fundamental relation between a spectral method, Laplacian eigenmaps, and a nonlinear method, stochastic neighbour embedding; and shows that EE can be seen as learning both the coordinates and the affinities between data points. We give a homotopy method to train EE, characterise the critical value of the homotopy parameter, and study the method's behaviour. For a fixed homotopy parameter, we give a globally convergent iterative algorithm that is very effective and requires no user parameters. Finally, we give an extension to out-of-sample points. In standard datasets, EE obtains results as good as or better than those of SNE, but more efficiently and robustly.

1. Introduction

We consider the problem of dimensionality reduction, also called manifold learning, where we seek to explain an observed high-dimensional data set Y = (y_1, …, y_N) of D × N in terms of a much smaller number of dimensions L ≪ D. We will focus on algorithms that take as input Y and estimate only the coordinates of the corresponding low-dimensional (latent) points X = (x_1, …, x_N) of L × N. Within this framework, there has been an enormous amount of recent work on spectral dimensionality reduction, mainly algorithmic but also theoretical.
Spectral methods such as Isomap (Tenenbaum et al., 2000), LLE (Roweis & Saul, 2000) or Laplacian eigenmaps (Belkin & Niyogi, 2003) formulate an objective function of X based on pairwise affinities defined on a neighbourhood graph of the data, whose solution is given by the nontrivial extremal eigenvectors of a certain N × N matrix, often sparse. Their success is due to their lack of local optima, to the availability of excellent numerical eigensolvers, and to the remarkably good results (often capturing the global structure of the underlying manifold) that can be obtained with simple affinity functions such as the Gaussian. That said, such simple affinities can only do so much, and the maps obtained typically collapse non-similar points in small latent-space regions or, conversely, leave large gaps in it; they also show significant boundary effects. These problems are particularly clear when the data consists of several separate manifolds. Using more sophisticated affinities that encode non-local information (as in Isomap or in maximum variance unfolding; Weinberger & Saul, 2006) can improve this at a higher computational cost, but such affinities are still sensitive to noise in the data and in the graph.

Nonlinear methods such as stochastic neighbour embedding (SNE) can go beyond spectral methods and find better optima, representing the global and local structure as well as dealing with multiple manifolds (Hinton & Roweis, 2003; van der Maaten & Hinton, 2008). However, nonlinear methods are far less developed. Only a few such methods have been proposed, and while their results are very encouraging, their optimisation is costly and prone to local optima, and our understanding of the algorithms is limited to an intuitive interpretation of their objective function. The objective of this paper is (1) to propose a new nonlinear algorithm of this type, with results as good as those of SNE but more efficient and robust; and (2) to further our understanding of this type of algorithms.

2. Related work

Metric MDS (Borg & Groenen, 2005) preserves data-space distances in latent space by minimising an objective function (stress) of the latent coordinates. The linear version (classical scaling) results in an eigenvalue problem with a unique solution in generic cases. Several nonlinear versions exist, such as the Sammon mapping (Sammon, 1969), which are generally difficult to optimise and prone to bad local optima. Stochastic neighbour embedding (SNE) (Hinton & Roweis, 2003) preserves probabilities instead of distances, and earlier papers (Hinton & Roweis, 2003; van der Maaten & Hinton, 2008) have shown its superiority over other MDS-like methods when dealing with data that lies in nonlinear, clustered manifolds (though the optimisation is still difficult). We will focus on SNE-type methods.

SNE defines the following normalised, non-symmetric affinities p_nm and q_nm for each data point n in the data and latent spaces, respectively:

  p_nm = exp(−d²_nm) / Σ_{m′≠n} exp(−d²_{nm′}),   p_nn = 0   (1)
  q_nm = exp(−‖x_n − x_m‖²) / Σ_{m′≠n} exp(−‖x_n − x_{m′}‖²),   q_nn = 0   (2)

We will take d_nm = ‖(y_n − y_m)/σ_n‖, that is, Gaussian affinities, though other types of affinity may be used. Each width σ_n is chosen by a binary search so that the entropy of the distribution P_n over neighbours is roughly log k (for a user-provided k ≪ N, which is then the perplexity, or effective number of neighbours).
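As an illustration of the entropy calibration just described, the following NumPy sketch computes such per-point Gaussian affinities by a bisection search on each bandwidth; the function name, tolerances and iteration limits are illustrative choices of this sketch, not values specified by the paper.

```python
import numpy as np

def gaussian_affinities(Y, perplexity=15, tol=1e-4, max_iter=50):
    """Per-point Gaussian affinities p_nm (eq. 1) whose entropy matches
    log(perplexity), with each bandwidth sigma_n found by bisection."""
    N = Y.shape[0]                                  # points stored as rows (N x D)
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    P = np.zeros((N, N))
    target = np.log(perplexity)
    for n in range(N):
        d2 = np.delete(D2[n], n)                    # distances to the other points
        beta, lo, hi = 1.0, 0.0, np.inf             # beta = 1 / (2 sigma_n^2)
        for _ in range(max_iter):
            w = np.exp(-beta * d2)
            p = w / w.sum()
            entropy = -np.sum(p * np.log(p + 1e-300))
            if abs(entropy - target) < tol:
                break
            if entropy > target:                    # too flat: shrink sigma_n (raise beta)
                lo = beta
                beta = beta * 2.0 if np.isinf(hi) else 0.5 * (lo + hi)
            else:                                   # too peaked: widen sigma_n (lower beta)
                hi = beta
                beta = 0.5 * (lo + hi)
        P[n, np.arange(N) != n] = p
    return P
```

With k-nearest-neighbour graphs one would restrict the sums to each point's neighbours; the resulting P (or a symmetrised version of it) can also serve later as the attractive weights of EE.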
SNE minimises the following objective function:

  E_SNE(X) = Σ_{n=1}^N D(P_n ‖ Q_n) = Σ_{n,m=1}^N p_nm log(p_nm / q_nm)   (3)

and so tries to match the latent-space distributions over neighbours to the data-space ones. One important disadvantage of SNE is that its gradient-based optimisation is slow and requires care to find good optima. The user must tune several parameters for each dataset (learning rate, momentum rate, amount of gradient jitter, etc., and all these must be adapted by hand as the optimisation proceeds). Some versions of SNE have been proposed that slightly simplify the gradient by symmetrising the objective function (Venna & Kaski, 2007) or the probabilities (Cook et al., 2007), but the normalisation term in the q_nm still makes it very nonlinear. When the dimensionality L of the latent space is smaller than the intrinsic dimensionality of the data, the resulting map is unavoidably distorted. For purposes of visualisation (rather than of faithful dimensionality reduction), Cook et al. (2007) and van der Maaten & Hinton (2008) have proposed two ways of improving the separation of clusters in this case, UNI-SNE and t-SNE, respectively. UNI-SNE biases each q_nm by a fixed constant, while t-SNE defines Q as a t-distribution with one degree of freedom. In both cases Q has longer tails, which allows x-points corresponding to y-pairs at a moderate distance to separate more.

SNE and our EE (described later) are related to the elastic net (EN) (Durbin et al., 1989), originally proposed to approximate the travelling salesman problem. The EN minimises the sum of a log-sum term (Gaussian-mixture likelihood) that moves Y-space centroids towards the data, and a data-independent quadratic prior on the centroids (a graph Laplacian prior), using a homotopy method. In the EN, the centroids move in the data, rather than the latent, space; and the quadratic prior on them enforces a predetermined topology rather than being based on data affinities. An interesting connection with EE (see section 6) is that the EN prior is provably equivalent to a certain Mexican-hat interaction term (Carreira-Perpiñán & Goodhill, 2004).

EE (and SNE) can be seen as symmetrising the constraints of Laplacian eigenmaps, where both types of mistakes are penalised: placing far apart latent points that correspond to similar data points, and placing close together latent points that correspond to dissimilar data points. A related phenomenon occurs with principal curves (Hastie & Stuetzle, 1989) and Dimensionality Reduction by Unsupervised Regression (Carreira-Perpiñán & Lu, 2008); the latter may be seen as a symmetrised version of the former that penalises errors in the data and the latent space. Closing the loop in this way seems to lead to better embeddings.

In the rest of the paper, we show a relation between SNE and Laplacian eigenmaps (sec. 3) that immediately suggests our EE algorithm, which we then study (sec. 4) and apply in practice (sec. 5).

3. A Relation between SNE and LE

Laplacian eigenmaps (LE) (Belkin & Niyogi, 2003) is a spectral method that optimises

  E_LE(X) = Σ_{n,m=1}^N w_nm ‖x_n − x_m‖²   (4)

subject to quadratic and linear constraints, and has a unique solution given by the nontrivial trailing eigenvectors of the normalised version of the graph Laplacian matrix L = D − W, where W (of N × N) is the symmetric affinity matrix (typically Gaussian) and D = diag(Σ_{m=1}^N w_nm) the degree matrix. LE discourages placing far apart latent points that correspond to similar data points, but places no direct constraint on pairs associated with distant data points. This often leads to distorted maps where large clusters of points collapse (as happens with related methods such as LLE).

There is a fundamental relation between LE and SNE. Expanding (3) and ignoring terms that do not depend on X, we have that

  E_SNE(X) = Σ_{n,m=1}^N p_nm ‖x_n − x_m‖² + Σ_{n=1}^N log Σ_{m≠n} exp(−‖x_n − x_m‖²)   (5)

since Σ_{m=1}^N p_nm = 1 for each n. The first term in the RHS is identical to the LE objective if using normalised affinities as in diffusion maps (i.e., taking w_nm = p_nm). It is a local distance term, and also a data-dependent term (since it depends on the data Y through the p_nm). The second term encourages latent points to separate from each other as much as possible (or until the exponentials become negligible). It is a global distance term, symmetric wrt x_n and x_m, which pushes apart all point pairs equally, irrespective of whether their high-dimensional counterparts are close or far in data space. It is also a data-independent term, since it does not depend on the data Y.

Therefore, SNE may be seen as LE with a data-independent prior that blows points apart from each other. It is thus more accurate to say that the SNE objective function enforces keeping the images of nearby objects nearby while pushing all images apart from each other, rather than to say that it enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart. However, this prior does cause the result from SNE to be radically different from that of LE, improving the spacing of points and clusters, and better representing the manifold structure. We are now ready to introduce our algorithm.
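The decomposition (5) is easy to verify numerically. The following sketch (an illustration added here, not code from the paper; points are stored as rows, and σ_n = 1 for simplicity) checks that the definition (3) and the right-hand side of (5) differ only by the X-independent constant Σ p_nm log p_nm.

```python
import numpy as np

def sq_dists(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    return np.square(Z[:, None, :] - Z[None, :, :]).sum(-1)

rng = np.random.default_rng(0)
N, D, L = 20, 5, 2
Y, X1, X2 = rng.normal(size=(N, D)), rng.normal(size=(N, L)), rng.normal(size=(N, L))

# Normalised data-space affinities p_nm (fixed sigma_n = 1 for simplicity), p_nn = 0.
EY = np.exp(-sq_dists(Y)); np.fill_diagonal(EY, 0.0)
P = EY / EY.sum(axis=1, keepdims=True)

def esne(X):
    """E_SNE from definition (3): sum_nm p_nm log(p_nm / q_nm)."""
    EX = np.exp(-sq_dists(X)); np.fill_diagonal(EX, 0.0)
    Q = EX / EX.sum(axis=1, keepdims=True)
    off = P > 0
    return np.sum(P[off] * np.log(P[off] / Q[off]))

def esne_split(X):
    """Right-hand side of (5): LE-like attractive term plus the log-sum repulsion."""
    d2 = sq_dists(X)
    EX = np.exp(-d2); np.fill_diagonal(EX, 0.0)
    return np.sum(P * d2) + np.sum(np.log(EX.sum(axis=1)))

# The two expressions differ by sum_nm p_nm log p_nm, which does not depend on X,
# so both differences below print the same value.
print(esne(X1) - esne_split(X1), esne(X2) - esne_split(X2))
```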
4. The Elastic Embedding (EE)

We define the objective function

  E(X; λ) = Σ_{n,m=1}^N w⁺_nm ‖x_n − x_m‖² + λ Σ_{n,m=1}^N w⁻_nm exp(−‖x_n − x_m‖²)   (6)

where the repulsive weights may absorb the data distances (e.g. w⁻_nm = ‖y_n − y_m‖²), and we have two graphs: one with attractive weights W⁺ = (w⁺_nm) and the other with repulsive weights W⁻ = (w⁻_nm), both nonnegative. The left (+) term is the LE term and preserves local distances, where w⁺_nm could be (normalised) Gaussian affinities, geodesic distances, commuting times or other affinities, possibly nonsymmetric or sparse. The right (−) term preserves global distances, or separates latent points as in SNE but in a simpler way. This repulsion becomes negligible once neighbouring x's are farther apart than a characteristic, λ-dependent scale, so the map remains somewhat compact. The regularisation parameter λ ≥ 0 trades off both terms. For simplicity, consider full graphs w⁺_nm = exp(−½‖(y_n − y_m)/σ‖²) and w⁻_nm = 1 for n ≠ m, with w⁺_nn = w⁻_nn = 0 for all n, although some of our results use sparse graphs. Note that the X resulting from EE are equivalent up to rigid motions, and that globally rescaling the data simply rescales λ: E(X; λ; Y, σ) = E(X; λ/σ²; Y/σ, 1).

We can then obtain the gradient of E from eq. (6):

  ∂E/∂x_n = 4 Σ_{m≠n} w_nm (x_n − x_m)   (7)
  G(X; λ) = ∂E/∂X = 4X(L⁺ − λL̃⁻) = 4XL   (8)

where we define the affinities

  w̃⁻_nm = w⁻_nm exp(−‖x_n − x_m‖²)   (9)
  w_nm = w⁺_nm − λ w̃⁻_nm   (10)

and their graph Laplacians L⁺ = D⁺ − W⁺, L̃⁻ = D̃⁻ − W̃⁻ and L = D − W in the usual way. Note that L⁺ is the usual (unnormalised) graph Laplacian that appears in Laplacian eigenmaps. W can be considered a learned affinity matrix and contains negative weights for λ > 0. Both the objective function and the gradient of EE are much less nonlinear than those of SNE and its variations because we have eliminated the cumbersome log-sum term. This results in an easier optimisation and presumably fewer local optima.

At a minimiser (for each λ) we have G(X; λ) = 0, so the embedding X(λ) satisfies XL = 0 and therefore consists of eigenvectors of the nullspace of the graph Laplacian L for the learned graph affinity matrix W. In minimising E at each λ, we both construct this graph and find its nullspace eigenvectors (a spectral problem). Note that this does not mean that EE at a given λ is equivalent to LE using W as affinity matrix, as LE would find the eigenvectors associated with the algebraically smallest eigenvalues, which for large enough λ are negative.
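For concreteness, here is a NumPy sketch of the objective (6) and gradient (8), written for the full-graph choices above (Gaussian w⁺_nm and w⁻_nm = 1); the rows-as-points layout, variable names and the finite-difference check are choices of this sketch rather than of the paper.

```python
import numpy as np

def laplacian(W):
    """Unnormalised graph Laplacian L = D - W of a symmetric weight matrix."""
    return np.diag(W.sum(axis=1)) - W

def ee_objective_grad(X, Wp, Wm, lam):
    """EE objective (6) and gradient (8). X is N x L (rows are points, i.e. the
    transpose of the paper's L x N layout); Wp, Wm are the attractive and
    repulsive weight matrices (symmetric, zero diagonal)."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    K = np.exp(-d2); np.fill_diagonal(K, 0.0)
    Wm_t = Wm * K                                            # w~-_nm of eq. (9)
    E = np.sum(Wp * d2) + lam * np.sum(Wm_t)
    G = 4.0 * (laplacian(Wp) - lam * laplacian(Wm_t)) @ X    # eq. (8), transposed layout
    return E, G

# Quick finite-difference check of the gradient on random data (setup is illustrative).
rng = np.random.default_rng(1)
N, D, L, lam, sigma = 30, 3, 2, 10.0, 1.0
Y = rng.normal(size=(N, D))
dY2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
Wp = np.exp(-0.5 * dY2 / sigma**2); np.fill_diagonal(Wp, 0.0)
Wm = np.ones((N, N)) - np.eye(N)                             # simple choice w-_nm = 1, n != m
X = rng.normal(size=(N, L))
E, G = ee_objective_grad(X, Wp, Wm, lam)
eps, i, j = 1e-6, 0, 0
Xp = X.copy(); Xp[i, j] += eps
print(G[i, j], (ee_objective_grad(Xp, Wp, Wm, lam)[0] - E) / eps)  # should roughly agree
```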
4.1. Study of the case N = 2

The simple case of N = 2 points in 1D is surprisingly informative. Take w.l.o.g. one point at the origin and call x ≥ 0 the position of the other. The objective function for λ ≥ 0 is E(x; λ) = 2(w⁺x² + λw⁻e^(−x²)). Define the critical value λ₁ = w⁺/w⁻, at which a bifurcation occurs. For λ < λ₁, E has a minimum at x = 0 and nothing interesting happens: the points are coincident. For λ > λ₁ it has a maximum at x = 0 and a minimum at x = √(log(λ/λ₁)); as λ grows, the points separate very slowly (square-root-log). Fig. 1 (left) shows the squared scale of the map (x(λ))² = max(0, log(λ/λ₁)), which grows logarithmically after the bifurcation. When N > 2, X(λ) follows the same behaviour for small or large λ (fig. 1, right), but shows multiple bifurcations and local minima for intermediate λ.

Figure 1. Left: plot of (x(λ))² when the dataset has N = 2 points in 1D, for w⁺ = w⁻ = 1. Right: plot of the squared diameter of X(λ) for the Swiss roll example (blue vertical lines bound λ₁).

4.2. Study of the critical λ₁ for N > 2

For λ ≤ λ₁, the EE objective function E(X; λ) is minimised by the point X = 0, corresponding to an embedding where all latent points are coincident. For λ > λ₁, the embedding unfolds. We want to locate this bifurcation and study the evolution of the point X = 0 at it. X = 0 is a stationary point for all λ, and the Hessian of E(X; λ) at it equals H(0; λ) = 4 diag(L⁺ − λL⁻, …, L⁺ − λL⁻), since L̃⁻ = L⁻ at X = 0, and assuming X in row-major order. Using Taylor's theorem, we can approximate E(X; λ) near X = 0 to second order. Since the Hessian is block-diagonal with N × N blocks corresponding to the L dimensions of X, we can study the behaviour of each dimension separately by studying the function e(x; λ) = xᵀLx, where L = L⁺ − λL⁻ and x is an N × 1 vector containing the coordinates of all N points in dimension 1 (i.e., the first row of X transposed). The function e is the sum of two quadratic functions of opposite sign, where the negative one is weighted by λ. For λ = 0, e equals the positive quadratic; for λ → ∞ it tends to the negative one; and for intermediate λ, e is a hyperbolic paraboloid. For λ ≤ λ₁, e is positive semidefinite and has global minima along the line x = αu₀ for α ∈ ℝ and u₀ = 1; this is the eigenvector of L associated with a null eigenvalue, and it represents the embedding where all points are coincident; we will refer to this embedding as x = 0 for simplicity. Eigenvector u₀ exists for all λ and represents EE's invariance to global translations of X. We are interested in the critical value λ₁ at which L stops being positive semidefinite and the hyperbolic paraboloid first arises. At that point, x = 0 is about to stop being a minimum and become a saddle point (or a maximum in particular cases), and a second eigenvector u₁ exists with null eigenvalue and orthogonal to u₀. For λ > λ₁, u₁ is associated with a negative eigenvalue, and e decreases fastest along u₁. Thus, when λ is just larger than λ₁, each dimension of the embedding expands along the negative eigenvector u₁ of L. Note that u₁ is the second trailing eigenvector of L at λ = λ₁ and thus corresponds to the 1D embedding that LE would produce with an affinity matrix W⁺ − λ₁W⁻ and constraints XXᵀ = I and X1 = 0; in practice this is close to the first nontrivial trailing eigenvector of L⁺.

In summary, at the bifurcation λ = λ₁, the latent points separate and an embedding X arises that is 1D and very similar to the 1D LE embedding. This embedding X(λ) keeps evolving as λ is increased. We do not have an explicit form for λ₁ or u₁, but we have upper and lower bounds λ₁^l ≤ λ₁ ≤ λ₁^u that are quite tight in practice (proof omitted for lack of space):

  λ₁^l = min_{n≠m} ( w⁺_nm / w⁻_nm )   (11)
  λ₁^u = min ( L⁺₁₁/L⁻₁₁, …, L⁺_NN/L⁻_NN )   (12)
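The critical value λ₁ can also be computed numerically as the smallest generalised eigenvalue of (L⁺, L⁻) on the subspace orthogonal to 1, which gives a way to check the bounds (11)–(12) as reconstructed above. The sketch below does this with SciPy; the whole setup (random data, full graphs, symbol names) is illustrative and not taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def critical_lambda(Wp, Wm):
    """lambda_1 = min over x orthogonal to 1 of (x' L+ x) / (x' L- x): the value at
    which L+ - lambda L- stops being positive semidefinite (the bifurcation of
    sec. 4.2), computed as a generalised eigenproblem restricted to 1-perp."""
    N = Wp.shape[0]
    Lp = np.diag(Wp.sum(axis=1)) - Wp
    Lm = np.diag(Wm.sum(axis=1)) - Wm
    # Orthonormal basis V of the complement of span{1} (the translation direction).
    V = np.linalg.svd(np.eye(N) - np.ones((N, N)) / N)[0][:, :N - 1]
    return eigh(V.T @ Lp @ V, V.T @ Lm @ V, eigvals_only=True)[0]

# Bounds (11)-(12), as reconstructed above, compared with the exact lambda_1.
rng = np.random.default_rng(2)
N, sigma = 40, 1.0
Y = rng.normal(size=(N, 3))
dY2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
Wp = np.exp(-0.5 * dY2 / sigma**2); np.fill_diagonal(Wp, 0.0)
Wm = np.ones((N, N)) - np.eye(N)                       # full repulsive graph, w-_nm = 1
off = ~np.eye(N, dtype=bool)
lower = np.min(Wp[off] / Wm[off])                      # eq. (11)
upper = np.min(Wp.sum(axis=1) / Wm.sum(axis=1))        # eq. (12): min_n L+_nn / L-_nn
print(lower, critical_lambda(Wp, Wm), upper)           # lower <= lambda_1 <= upper
```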
4.3. Minimising E(X; λ) for fixed λ

We have noticed that minimising E with gradient descent or conjugate gradients is very slow and requires tiny steps (this also applies to SNE). Using search directions derived from a fixed-point iteration works much better. Rearranging the stationary point equation (a matrix of L × N)

  G(X; λ) = ∂E/∂X = 4X(D⁺ − W⁺ − λD̃⁻ + λW̃⁻) = 0

as a splitting G = X(A + B) = 0, where A is symmetric positive definite and B is symmetric, we obtain a fixed-point iteration X ← −XBA⁻¹. Although this iteration does not always converge, it does suggest using a search direction ∆ = −XBA⁻¹ − X = −X(BA⁻¹ + I) = −GA⁻¹ along which we can decrease E with a line search X ← X + η∆ for η ≥ 0. The direction ∆ is a descent direction and, if cond(A) is upper bounded, it never becomes too close to being orthogonal to the gradient (proof omitted). If the line search satisfies e.g. the Wolfe conditions, then, by Zoutendijk's theorem (th. 3.2 in Nocedal & Wright, 2006), the algorithm converges to a stationary point of E from any initial X₀ ∈ ℝ^{L×N}. We have tried several splittings and found them to improve greatly over the gradient (A = I), in particular A = 4D⁺ (computable in O(NL)). In practice, we find this requires no line search at all (η = 1) except when λ is close to a bifurcation, most notably near λ₁. The cost of each iteration is thus O(LN²) (or O(LN) with sparse W⁺ and W⁻), dominated by the gradient computation (8).

4.4. Algorithms to find the optimal embedding

We now have two ways of finding the optimal embedding. We can run the homotopy method by increasing λ (relatively slowly) from above the critical value λ₁ (in practice, from above its upper bound λ₁^u), minimising E(X; λ) over X at each step, tracking the path X(λ), and stopping when the scale grows logarithmically. Other than not increasing λ too fast, no special care or ad hoc user parameters are needed for the optimisation. This makes the algorithm almost deterministic, in that the early exploration at small scales can find and track the same, deep minimum; in our experience it produces good results, but is slower. The second, much faster way is to select a large enough λ, fix it, and optimise there. The result then does depend on the initial X, which can be taken random, or from the embedding of a spectral method (rescaled to match the EE scale at λ, which can be estimated in a quick, pilot run from a random X). Note that SNE implicitly sets λ = 1. A third approach, not explored here, is to minimise E subject to quadratic constraints on X (as in LE, LLE, etc.). Then, the solution for λ = 0 is no longer X = 0, but the solution of the corresponding spectral problem, which is a better initial point. However, the now constrained optimisation is more difficult.
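Below is a minimal sketch of the fixed-λ optimisation of sec. 4.3, using the splitting A = 4D⁺ and a simple backtracking (Armijo) line search in place of a full Wolfe line search; the stopping rule and backtracking constants are assumptions of this sketch, not values given in the paper.

```python
import numpy as np

def ee_objective_grad(X, Wp, Wm, lam):
    """EE objective (6) and gradient (8); same computation as the earlier sketch."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    K = np.exp(-d2); np.fill_diagonal(K, 0.0)
    Wm_t = Wm * K
    Lp = np.diag(Wp.sum(axis=1)) - Wp
    Lm_t = np.diag(Wm_t.sum(axis=1)) - Wm_t
    return np.sum(Wp * d2) + lam * np.sum(Wm_t), 4.0 * (Lp - lam * Lm_t) @ X

def ee_minimise(X, Wp, Wm, lam, max_iter=200, tol=1e-5):
    """Fixed-lambda EE optimisation with the search direction Delta = -G A^{-1}
    of sec. 4.3 and the diagonal splitting A = 4 D+ (so A^{-1} is cheap)."""
    Dp = Wp.sum(axis=1)                               # attractive degrees (assumed > 0)
    E, G = ee_objective_grad(X, Wp, Wm, lam)
    for _ in range(max_iter):
        Delta = -G / (4.0 * Dp[:, None])              # -G A^{-1} in the N x L layout
        slope = np.sum(G * Delta)                     # directional derivative (< 0)
        eta = 1.0                                     # eta = 1 is accepted almost always
        while True:
            E_new, G_new = ee_objective_grad(X + eta * Delta, Wp, Wm, lam)
            if E_new <= E + 1e-4 * eta * slope or eta < 1e-10:
                break
            eta *= 0.5                                # backtrack (Armijo; constants are mine)
        X = X + eta * Delta
        if abs(E - E_new) <= tol * abs(E):
            break
        E, G = E_new, G_new
    return X
```

For the homotopy method of sec. 4.4 one would call this routine in a loop over increasing λ, warm-starting X from the minimiser found at the previous λ.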
4.5. Out-of-sample extension

EE (like SNE and spectral methods) returns low-dimensional projections X only for points in the training set Y. One way to define mappings F and f that apply to new points y or x, respectively, is to fit them to (Y, X) or (X, Y), respectively. Although this can be made to work in practice, the result does depend on the choice of mappings, which is left to the user. Here we follow instead the more natural approach proposed in the LELVM model of Carreira-Perpiñán & Lu (2007) for LE, which returns a nonparametric mapping. The idea is, given a new point y, to solve the original problem (i.e., to minimise the EE error E) over the unknown projection x, keeping all other projections X fixed (so as not to disturb the embedding we have already obtained). The same idea applies to map a new x to the y-space. Then, the error function augmented with y and x consists of the old error function E(X) (applied to X and Y) plus the new term

  E′(x, y) = 2 Σ_{n=1}^N ( w⁺(y, y_n) ‖x − x_n‖² + λ w⁻(y, y_n) exp(−‖x − x_n‖²) )

with kernels w⁺(y, y_n) = exp(−½‖(y − y_n)/σ‖²) and w⁻(y, y_n) = ‖y − y_n‖², induced from the affinity kernels that were used in the EE training (using the same neighbourhood structure). Then, we define the dimensionality reduction and reconstruction mappings as follows:

  F(y) = arg min_x E′(x, y)    f(x) = arg min_y E′(x, y)   (13)

initialising x to the x_n whose y_n is closest to y in Euclidean distance, and analogously for f. Unlike in the LELVM model, where these problems had a closed-form solution, in our case they are nonlinear optimisation problems that can be solved e.g. with gradient descent. By equating the gradient to zero we see that the minima of eq. (13) have the form of a linear combination of X or Y, e.g.

  x = Σ_{n=1}^N ( w(y, y_n) / Σ_{n′=1}^N w(y, y_{n′}) ) x_n = F(y)
  w(y, y_n) = w⁺(y, y_n) − λ w⁻(y, y_n) exp(−‖x − x_n‖²)

but, unlike in the LELVM model, this linear combination is not necessarily convex because the weights can be negative. Thus, the out-of-sample mappings can result in values beyond the convex hull of the data, which allows them to extrapolate to some extent.
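Here is a sketch of the out-of-sample map F of eq. (13): starting from the nearest training projection, it iterates the weighted-average form derived above. The plain fixed-point iteration, the kernel choices (Gaussian w⁺ and w⁻(y, y_n) = ‖y − y_n‖²) and all names are assumptions of this sketch; the paper only requires minimising E′(x, y), e.g. by gradient descent.

```python
import numpy as np

def ee_out_of_sample(y, Y, X, lam, sigma, n_iter=100, tol=1e-8):
    """Dimensionality-reduction map F(y) of eq. (13): minimise E'(x, y) over x by
    iterating the weighted average obtained from setting its gradient to zero,
        x = sum_n w(y, y_n) x_n / sum_n w(y, y_n),
        w(y, y_n) = w+(y, y_n) - lam * w-(y, y_n) * exp(-||x - x_n||^2).
    Y is N x D (training data), X is N x L (its embedding)."""
    dy2 = np.square(Y - y).sum(axis=1)                 # ||y - y_n||^2
    wp = np.exp(-0.5 * dy2 / sigma**2)                 # attractive kernel w+(y, y_n)
    wm = dy2                                           # repulsive kernel w-(y, y_n)
    x = X[np.argmin(dy2)].copy()                       # start at the nearest training x_n
    for _ in range(n_iter):
        w = wp - lam * wm * np.exp(-np.square(X - x).sum(axis=1))
        x_new = w @ X / w.sum()                        # weights may be negative, so x can
        if np.max(np.abs(x_new - x)) < tol:            # lie outside the convex hull of X
            break
        x = x_new
    return x
```

The reconstruction map f(x) is analogous, with the roles of x and y swapped; a safeguarded gradient or line-search method is more robust than this bare fixed point when the weights change sign.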
5. Experimental Results

Fig. 2 shows the result of EE with a 2D spiral with N = 200 points and full-graph Gaussian affinities with σ = 0.05. We show all the lower and upper bounds, which imply λ₁ ∈ [4·10⁻⁵, 10⁻⁴] (vertical blue lines in the right plot). We used the homotopy method with 80 values of λ from 10⁻² to 10². For each λ we ran the optimisation until the relative function change was less than 10⁻³ or we reached 50 iterations. The step size was η = 1 nearly always, 0.8 occasionally. The right plot shows that more iterations are required shortly after the bifurcation λ₁; occasional spikes in that plot indicate subsequent bifurcations as new minima arise and the map changes. The initial X do not unfold the spiral correctly, but eventually they do, and this deep minimum is tracked henceforth. As λ increases, the local clustering and boundary effects typically associated with an LE embedding are removed and the result is a perfectly spaced sequence matching the data spacing. The initial affinities w_nm of eq. (10) are Gaussian, but as λ increases they develop negative lobes and adopt a Mexican-hat form (the plot shows w_nm for two interior and two extreme points). As λ further increases (enlarging the map and forcing points to be equidistant) the w_nm become much more negative.

Figure 2. EE trained with homotopy with a 2D spiral. (Panels: spiral Y; embedding X; affinities w_nm; lower and upper bounds for λ₁; number of iterations per λ.)

Fig. 3 shows the result of EE with a 3D Swiss roll with N = 2 000 points, w⁺_nm as k-nearest-neighbour Gaussian affinities and w⁻_nm = 1 for all n, m. We set k = 12, σ = 15 for all methods. The bounds indicate λ₁ ∈ [5·10⁻⁹, 10⁻⁸], so we varied λ from 10⁻⁷ to 10⁷. After the critical λ₁, X expands along the 1D LE solution and later on the 2D map unfolds. This small-λ solution globally unfolds the Swiss roll but shows defects similar to those of spectral methods (local clusters and gaps, boundary effects; see the LE plot). But these disappear as λ increases; X for λ ∈ [10⁻¹, 10¹] is extremely similar to the true X (see also the result of Isomap, ideally suited to this problem). For very large λ, in the region of log-growth of the scale (see fig. 1, right), the point-separating prior dominates and the 2D arrangement tends to a round hexagonal grid (that still preserves the global structure, though). SNE attains a good map, better than LE's but worse than EE's. However, t-SNE does poorly, grouping points in local clusters that push away from each other. As noted in the introduction, t-SNE was designed to correct the map when its dimension does not match the intrinsic one (not the case here). Initialising X from the true X produces similar results for SNE and t-SNE, indicating this is not just a bad local optimum. For SNE, perhaps better results would be obtained by introducing a weight λ as in EE. Fig. 4 shows that the affinities w_nm again evolve from initially Gaussian to a 2D Mexican-hat shape, with a central positive region, negative intermediate regions and zero farther away.

Figure 3. Swiss roll. Top: EE with homotopy; we show X for different λ (2·10⁻⁷, 10⁻⁶, 10⁻², 10¹, 10⁷). Bottom: true X and results with Laplacian eigenmaps, Isomap, SNE and t-SNE.

Figure 4. Affinities w_nm = w⁺_nm − λ w⁻_nm exp(−‖x_n − x_m‖²) learned for a point x_n near the centre of the Swiss roll for λ = 10¹ (right plot: zoom view).

Fig. 5 shows the result with the COIL-20 dataset, containing rotation sequences of 10 objects every 5 degrees, each a greyscale image of 128 × 128 pixels (total N = 720 points in D = 16 384 dimensions). Thus, this dataset contains ten 1D manifolds. We did not apply PCA to the images, and used SNE affinities with perplexity k = 12. We ran EE for a fixed λ = 1 from a random initial X. The results shown for EE, SNE and t-SNE are quite robust (e.g. initialising one method from the result of another produces very similar maps). They again indicate that EE (and SNE) do a good job at both separating different clusters and capturing each sequence's 1D order. A self-intersection, or a sequence that folds over itself (e.g. sequence 5), is mostly caused by quasi-symmetric COIL-20 objects that look very similar from the back or the front. t-SNE is very good at separating clusters but unfortunately it also separates parts of a cluster; most sequences appear in several pieces and folded over themselves. A further advantage of EE is that it trains faster than SNE or t-SNE.

Figure 5. Results of EE, SNE and t-SNE with the COIL-20 dataset, all randomly initialised. Right plot: result of EE when trained on half the data, and points x = F(y) (· marks) predicted for the other half of the data (only 5 of the sequences are shown to avoid clutter). Below: images y = f(x) predicted for out-of-sample points in x-space along sequence 1.

Fig. 5 (rightmost) shows the result of training EE with half of the data (the even-numbered images in each sequence).
We computed the 2D projection x_m = F(y_m) of each of the test y_m (the odd-numbered images) with EE's out-of-sample extension. They project to their expected locations: between each pair of even images, and in pairs for sequence 6, which folded over itself. We then mapped these x_m back to image space as f(x_m) = f(F(y_m)) and obtained the reconstructions shown. Although blurred (remember we are using only 2 latent dimensions), they perfectly capture the viewpoint and general structure of the object.

6. Discussion

The intuition of Hinton & Roweis (2003) in proposing SNE was to emphasise both local and global distances through the matching of data and latent probabilities P and Q, so that e.g. making q_nm large when p_nm was small would waste some of the probability mass in Q. In our view, it is not the (normalised) probabilities that make SNE do well. We can use other (non-normalised) affinities for p_nm and get good results, and we have shown that the normalisation term in q_nm, which gives rise to the cumbersome log-sum term in E_SNE, is unnecessary. Rather, what makes SNE, t-SNE and EE do well is the use of Gaussian or other decaying functions to modulate the relative contributions of local vs global distances, and this is more easily achieved by EE.

EE reveals an important relation between nonlinear and spectral methods in that, at a fixed λ, EE both learns the embedding X and the pairwise affinities W. We already know that putting some effort into learning good affinities can give better results than a simple functional form (e.g. Gaussian), as in the MVU method (Weinberger & Saul, 2006). However, most work on spectral methods for dimensionality reduction and clustering and on kernel methods has focused on nonnegative, positive semidefinite affinity matrices (although nonpositive affinities do arise naturally in some constrained clustering algorithms, e.g. Lu & Carreira-Perpiñán, 2008). Our results suggest that some eigenvectors (not necessarily the extremal ones) resulting from affinity matrices that are not positive definite and that have negative entries may contain far more useful information. Remarkably, the affinities learned by EE look like Mexican-hat functions (that adapt to each point) in the region where the best maps arise. It is intriguing that similar Mexican-hat functions are a fundamental component in the majority of the models proposed to explain pattern formation in cortical maps, in particular the elastic net (Carreira-Perpiñán & Goodhill, 2004).
7. Conclusion

Our paper has proposed an algorithm that we think improves over SNE, producing results of similar or better quality more quickly and robustly. We have given a theoretical study of its homotopy parameter and of efficient, parameter-free, globally convergent optimisation algorithms, and an out-of-sample extension. All these ideas are directly applicable to SNE, t-SNE and earlier algorithms. Beyond this, our work has explored a new direction that we hope will spur further research: the relation of nonlinear methods such as SNE or EE with spectral methods, and the learning of affinities.

Acknowledgements

I thank Jianwu Zeng for help with the figures. Work supported by NSF CAREER award IIS-0754089.

References

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, June 2003.

Borg, Ingwer and Groenen, Patrick. Modern Multidimensional Scaling: Theory and Application. Springer-Verlag, second edition, 2005.

Carreira-Perpiñán, Miguel Á. and Goodhill, Geoffrey J. Influence of lateral connections on the structure of cortical maps. J. Neurophysiol., 92(5):2947-2959, November 2004.

Carreira-Perpiñán, Miguel Á. and Lu, Zhengdong. The Laplacian Eigenmaps Latent Variable Model. In Proc. AISTATS, pp. 59-66, San Juan, Puerto Rico, March 21-24 2007.

Carreira-Perpiñán, Miguel Á. and Lu, Zhengdong. Dimensionality reduction by unsupervised regression. In Proc. CVPR, Anchorage, AK, June 23-28 2008.

Cook, James, Sutskever, Ilya, Mnih, Andriy, and Hinton, Geoffrey. Visualizing similarity data with a mixture of maps. In Proc. AISTATS, San Juan, Puerto Rico, March 21-24 2007.

Durbin, Richard, Szeliski, Richard, and Yuille, Alan. An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1(3):348-358, Fall 1989.

Hastie, Trevor J. and Stuetzle, Werner. Principal curves. J. Amer. Stat. Assoc., 84(406):502-516, June 1989.

Hinton, Geoffrey and Roweis, Sam T. Stochastic neighbor embedding. In Proc. NIPS 2002, pp. 857-864. MIT Press, Cambridge, MA, 2003.

Lu, Zhengdong and Carreira-Perpiñán, Miguel Á. Constrained spectral clustering through affinity propagation. In Proc. CVPR, Anchorage, AK, June 23-28 2008.

Nocedal, Jorge and Wright, Stephen J. Numerical Optimization. Springer-Verlag, second edition, 2006.

Roweis, Sam T. and Saul, Lawrence K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, December 22 2000.

Sammon, Jr., John W. A nonlinear mapping for data structure analysis. IEEE Trans. Computers, C-18(5):401-409, May 1969.

Tenenbaum, Joshua B., de Silva, Vin, and Langford, John C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 22 2000.

van der Maaten, Laurens J. P. and Hinton, Geoffrey E. t-distributed stochastic neighbor embedding. Journal of Machine Learning Research, 9:2579-2605, November 2008.

Venna, Jarkko and Kaski, Samuel. Nonlinear dimensionality reduction as information retrieval. In Proc. AISTATS, San Juan, Puerto Rico, March 21-24 2007.

Weinberger, Kilian Q. and Saul, Lawrence K. Unsupervised learning of image manifolds by semidefinite programming. Int. J. Computer Vision, 70(1):77-90, October 2006.