Deep Supervised t-Distributed Embedding

Renqiang Min (minrq@cs.toronto.edu), Dep. of Computer Science, University of Toronto, 10 King's College Rd, Toronto ON, M5S 3G4 CANADA
Laurens van der Maaten (lvdmaaten@gmail.com), Dep. of Computer Science and Eng., UC San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA; Faculty of EEMCS, Delft University of Technology, Mekelweg 4, 2628 CN Delft THE NETHERLANDS
Zineng Yuan (zineng.yuan@utoronto.ca), Dep. of Molecular Genetics, University of Toronto, 160 College St, Toronto ON, M5S 3E1 CANADA
Anthony Bonner (bonner@cs.toronto.edu), Dep. of Computer Science, University of Toronto, 10 King's College Rd, Toronto ON, M5S 3G4 CANADA
Zhaolei Zhang (zhaolei.zhang@utoronto.ca), Dep. of Med. Research, University of Toronto, 160 College St, Toronto ON, M5S 3E1 CANADA

Abstract

Deep learning has been successfully applied to perform non-linear embedding. In this paper, we present supervised embedding techniques that use a deep network to collapse classes. The network is pre-trained using a stack of RBMs, and finetuned using approaches that try to collapse classes. The finetuning is inspired by ideas from NCA, but it uses a Student t-distribution to model the similarities of data points belonging to the same class in the embedding. We investigate two types of objective functions: deep t-distributed MCML (dt-MCML) and deep t-distributed NCA (dt-NCA). Our experiments on two handwritten digit data sets reveal the strong performance of dt-MCML in supervised parametric data visualization, whereas dt-NCA outperforms alternative techniques when embeddings with more than two or three dimensions are constructed, e.g., to obtain good classification performances. Overall, our results demonstrate the advantage of using a deep architecture and a heavy-tailed t-distribution for measuring pairwise similarities in supervised embedding.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Introduction

Given the class information of training data points, linear feature transformations and linear dimensionality reductions have been widely applied to perform high-dimensional data embedding, either for data visualization or for k-NN classification. Various techniques have been proposed that learn a linear transformation or Mahalanobis metric that tries to decrease the pairwise distances between data points of the same class, while increasing the separation between data points of dissimilar classes (Globerson & Roweis, 2006; Goldberger et al., 2005; Weinberger et al., 2006; Xing et al., 2003). In other words, the techniques aim to collapse classes in a low-dimensional embedding¹. However, making data points that correspond to the same class collapse cannot always be achieved by simple linear transformations, especially when the data consists of one or more complex nonlinear (sub)manifolds. Hence, we need to resort to more powerful non-linear transformations.

Recent advances in the training of deep networks provide a way to model non-linear transformations of data. Such deep networks are typically pre-trained using, e.g., a stack of Restricted Boltzmann Machines (RBMs; Hinton et al. (2006)) or denoising autoencoders (Larochelle et al., 2007). Subsequently, the pre-trained networks are finetuned so as to, e.g., construct a generative model of the data (Hinton et al., 2006), learn a classifier (Larochelle et al., 2007), increase the separation between classes (Min et al., 2009; Salakhutdinov & Hinton, 2007; Weston et al., 2008), or preserve pairwise similarities between data points (van der Maaten, 2009).

¹ In techniques that learn a Mahalanobis metric, this embedding can be computed by projecting the data onto the eigenvectors of the metric M (weighted by the square root of the corresponding eigenvalues).
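To make the footnote above concrete, the following minimal NumPy sketch shows one way to turn a learned Mahalanobis metric M into an explicit linear embedding; the function name and the optional truncation to d dimensions are our own illustrative choices, not part of the paper.

```python
import numpy as np

def embedding_from_metric(M, d=None):
    """Turn a (symmetric, PSD) Mahalanobis metric M into a linear map A such that
    ||A x_i - A x_j||^2 = (x_i - x_j)^T M (x_i - x_j); optionally keep only the
    top-d eigenvectors to obtain a d-dimensional embedding."""
    eigvals, eigvecs = np.linalg.eigh(M)               # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if d is not None:
        eigvals, eigvecs = eigvals[:d], eigvecs[:, :d]
    # weight the eigenvectors by the square roots of their eigenvalues
    return np.sqrt(np.clip(eigvals, 0.0, None))[:, None] * eigvecs.T   # d x D

# Projecting the data as X @ A.T reproduces the Mahalanobis distances induced by M
# (exactly when no truncation is applied).
```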
In this paper, we present two supervised embedding techniques, called deep t-distributed MCML (dt-MCML) and deep t-distributed NCA (dt-NCA), that use a deep feedforward neural network to model the transformation from the high-dimensional to the low-dimensional space. The networks are pre-trained using a stack of RBMs, and finetuned by minimizing objective functions that aim to collapse classes. The objective functions are inspired by the objective functions of Maximally Collapsing Metric Learning (MCML; Globerson & Roweis (2006)) and Neighborhood Components Analysis (NCA; Goldberger et al. (2005)), but they use a Student t-distribution to measure the pairwise similarities of data points in the low-dimensional space. The advantage of using a t-distribution to measure these pairwise similarities is four-fold: (1) due to the heavier tail of the distribution, it works better on data sets in which (some of) the class distributions are multimodal, (2) due to the more peaked mode of the distribution, it leads to tighter clusters, i.e., it collapses classes better, (3) the heavier tails lead to more separation between classes, and (4) due to the steeper gradient along the tail of the t-distribution, the optimization using gradient descent is easier.

We performed experiments with dt-MCML and dt-NCA on the USPS and MNIST handwritten digit data sets. The results of the experiments reveal the strong performance of both techniques. In particular, dt-MCML performs very well in learning settings in which the dimensionality of the latent space is very low, e.g., when performing data visualization. In contrast, if the dimensionality of the latent space is higher than two, dt-NCA obtains a superior performance.

The outline of this paper is as follows. In Section 2, we introduce the two new embedding techniques, called dt-MCML and dt-NCA. In Section 3, we present the results of our experiments with the new techniques on two data sets. Section 4 concludes the paper.

Figure 1. The deep network used in dt-MCML and dt-NCA (left) and an illustration of the pre-training of the deep neural network (right). Adopted from Min et al. (2009).

2. Deep Supervised Embedding

In this section, we present the two new techniques for supervised parametric embedding. We present dt-MCML in Section 2.1, and dt-NCA in Section 2.2.

2.1. Deep t-distributed MCML

Suppose we are given a set of high-dimensional data points and their corresponding labels $\mathcal{D} = \{x^{(i)}, y^{(i)} : i = 1, \dots, n\}$, where $x^{(i)} \in \mathbb{R}^D$ and $y^{(i)} \in \{1, \dots, c\}$, where c is the total number of classes. MCML learns a Mahalanobis metric M that aims to simultaneously achieve two main goals: (1) it tries to maximize the sum of the pairwise similarities of data points with the same class, and (2) it tries to minimize the sum of the pairwise similarities of data points with dissimilar classes. In fact, learning a Mahalanobis metric is identical to learning a linear mapping A of the data: the linear mapping A is given by the eigenvectors of the Mahalanobis metric M (weighted by the square root of their corresponding eigenvalues). MCML can thus be thought of as learning a function $f(x^{(i)}) = Ax^{(i)}$ that transforms the high-dimensional data points to a latent space with d dimensions (i.e., A is a d × D matrix, where typically, d < D).
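For concreteness, here is a minimal NumPy sketch of the kind of feedforward mapping f(x; W) that dt-MCML and dt-NCA substitute for the linear map A. The sigmoid hidden layers, the linear output layer, and the random initialization are our own illustrative assumptions; in the paper the weights come from RBM pre-training (Figure 1).

```python
import numpy as np

def init_network(layer_sizes, seed=0):
    """Illustrative random weights for a feedforward encoder f(x; W); in the
    paper these would be initialized by RBM pre-training instead."""
    rng = np.random.default_rng(seed)
    return [(0.01 * rng.standard_normal((n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def f(X, W):
    """Map an n x D data matrix to an n x d latent matrix using sigmoid hidden
    layers and a linear output layer (an assumption on our part)."""
    H = X
    for i, (w, b) in enumerate(W):
        A = H @ w + b
        H = A if i == len(W) - 1 else 1.0 / (1.0 + np.exp(-A))
    return H

# Example: the D-500-500-2000-d architecture used later in the experiments.
W = init_network([256, 500, 500, 2000, 2])
Z = f(np.zeros((10, 256)), W)   # a 10 x 2 embedding
```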
MCML measures the pairwise similarity of the transformed data point $f(x^{(i)})$ and the transformed data point $f(x^{(j)})$ using a stochastic neighborhood criterion. In other words, it centers a Gaussian distribution over $f(x^{(i)})$, measures the density of $f(x^{(j)})$ under this Gaussian, and renormalizes:

$$q_{j|i} = \frac{\exp(-d_{ij}^2)}{\sum_{k:k \neq i} \exp(-d_{ik}^2)}, \qquad q_{i|i} = 0, \qquad (1)$$

where we defined $d_{ij}^2 = \|f(x^{(i)}) - f(x^{(j)})\|^2$. The probabilities $q_{j|i}$ can be viewed as the probability that the point $f(x^{(i)})$ picks point $f(x^{(j)})$ as its nearest neighbor in the latent space. MCML tries to minimize the sum of the Kullback-Leibler divergences between the conditional probabilities $q_{j|i}$ and "ground-truth" probabilities $p_{j|i}$ that are defined based on the class labels of the data. Specifically, MCML defines $p_{j|i} \propto 1$ iff $y^{(i)} = y^{(j)}$, and $p_{j|i} = 0$ iff $y^{(i)} \neq y^{(j)}$. The cost function that MCML minimizes is given by:

$$\ell_{\mathrm{MCML}} = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_{j:j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$
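A minimal NumPy sketch of Equation (1) and of the MCML cost above; it assumes every class contains at least two points, and the function names are ours.

```python
import numpy as np

def gaussian_conditional_q(Z):
    """Stochastic-neighbor probabilities q_{j|i} of Equation (1), computed from
    squared Euclidean distances between the rows of the n x d embedding Z."""
    D2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(D2, np.inf)                      # enforces q_{i|i} = 0
    E = np.exp(-D2)
    return E / E.sum(axis=1, keepdims=True)

def mcml_cost(Z, y):
    """Sum over i of KL(P_i || Q_i), with p_{j|i} uniform over same-class points."""
    Q = gaussian_conditional_q(Z)
    same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    P = same / same.sum(axis=1, keepdims=True)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```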
In deep t-distributed MCML, we follow a similar approach as in MCML; however, we introduce three major changes. First, instead of parametrizing the function f by means of a linear mapping A, we define the function f to be a feedforward neural network mapping that is parametrized by the weights of the network W. Second, instead of measuring the similarities of the transformed data points by means of a Gaussian density, we measure densities under a Student t-distribution. Third, we change the normalization of the pairwise similarities, as this leads to significant simplifications in the gradient. Mathematically, we define:

$$q_{ij} = \frac{\left(1 + d_{ij}^2/\alpha\right)^{-\frac{1+\alpha}{2}}}{\sum_{kl:k \neq l} \left(1 + d_{kl}^2/\alpha\right)^{-\frac{1+\alpha}{2}}}, \qquad q_{ii} = 0, \qquad (2)$$

where $\alpha$ represents the number of degrees of freedom of the Student t-distribution. The reader should note the following special cases of the t-distribution: when $\alpha = 1$, the t-distribution is a Cauchy distribution, whereas when $\alpha = \infty$, the t-distribution is a Gaussian distribution. The cost function is now defined to be:

$$\ell_{\text{dt-MCML}} = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j:j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}},$$

where $p_{ij} \propto 1$ iff $y^{(i)} = y^{(j)}$, $p_{ij} = 0$ iff $y^{(i)} \neq y^{(j)}$, and $\sum_{ij} p_{ij} = 1$.

In the definition of $q_{ij}$ above, the pairwise distance $d_{ij}^2$ is defined similarly as in NCA, except that the transformation function f is not linear anymore. In particular, the function $f: \mathbb{R}^D \rightarrow \mathbb{R}^d$ is a nonlinear function that is defined by a feed-forward neural network with weights W. The advantage of using a deep network to parametrize the function f instead of a linear mapping is that a deep network is better at learning the complex nonlinear function that is presumably required to collapse classes in the latent space, in particular, when the data consists of very complex nonlinear (sub)manifolds.

The use of a Student t-distribution in the pairwise similarities in Equation 2 has four main advantages over the use of a Gaussian distribution (as in Equation 1). First, the heavy tails of the Student t-distribution make the cost function somewhat more forgiving when groups of dissimilar points with the same class are modeled far apart in the latent space. This is beneficial in cases in which the distribution of one or more classes is bimodal or multimodal: collapsing data points from the different modes onto a single mode in the latent space is bound to lead to severe overfitting. Second, the peaked mode of the t-distribution (compared to a Gaussian distribution) leads the cost function to favor solutions in which similar points with the same class are modeled closer together, i.e., it leads to tighter clusters in the embedding. Third, the use of the t-distribution forces points with different classes to be further apart in the latent space in order for their similarity to become infinitesimal. Fourth, the use of Student t-distributions makes the gradient optimization easier, because the gradient of the tail of a Student t-distribution is much steeper than that of a Gaussian distribution. As a result, the t-distribution provides more "long-range forces", which makes it easier to collapse groups of points with the same class that are separated at some point in the optimization.

The gradient of the cost function of dt-MCML with respect to the location of a latent point $f(x^{(i)})$ is given by:

$$\frac{\partial \ell_{\text{dt-MCML}}}{\partial f(x^{(i)})} = \frac{2(\alpha+1)}{\alpha} \sum_{j:j \neq i} \left(1 + \frac{d_{ij}^2}{\alpha}\right)^{-1} (p_{ij} - q_{ij}) \left(f(x^{(i)}) - f(x^{(j)})\right).$$

Using the gradient above, the gradients of the cost function with respect to the weights W of the neural network can be computed using standard backpropagation. Although it is possible to treat the number of degrees of freedom $\alpha$ as a fixed parameter, we can also try to learn its optimal value using gradient descent. The required gradient is given by:

$$\frac{\partial \ell_{\text{dt-MCML}}}{\partial \alpha} = \sum_{ij:i \neq j} (p_{ij} - q_{ij}) \left[ \frac{1}{2} \log\!\left(1 + \frac{d_{ij}^2}{\alpha}\right) - \frac{(1+\alpha)\, d_{ij}^2}{2\alpha^2 \left(1 + \frac{d_{ij}^2}{\alpha}\right)} \right].$$

We should note that for visualization purposes, a value of $\alpha = 1$ usually performs very well (van der Maaten, 2009), and learning $\alpha$ is superfluous.
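The following NumPy sketch puts Equation (2), the dt-MCML cost, and the latent-space gradient above together. The function name is ours, and backpropagation of the returned gradient through f(x; W) is omitted.

```python
import numpy as np

def dt_mcml_cost_and_grad(Z, y, alpha=1.0):
    """dt-MCML cost (KL(P||Q) with the t-distributed q_ij of Equation (2)) and
    its gradient with respect to the rows of the n x d latent matrix Z."""
    n = Z.shape[0]
    diff = Z[:, None, :] - Z[None, :, :]                    # f(x_i) - f(x_j)
    D2 = np.sum(diff ** 2, axis=-1)
    K = (1.0 + D2 / alpha) ** (-(1.0 + alpha) / 2.0)        # unnormalized t-kernel
    np.fill_diagonal(K, 0.0)                                # q_ii = 0
    Q = K / K.sum()                                         # normalized over all pairs

    same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
    P = same / same.sum()                                   # sum_ij p_ij = 1

    mask = P > 0
    cost = float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

    # 2(alpha+1)/alpha * sum_j (1 + d_ij^2/alpha)^(-1) (p_ij - q_ij)(f(x_i) - f(x_j))
    coef = (2.0 * (alpha + 1.0) / alpha) * (P - Q) / (1.0 + D2 / alpha)
    grad = np.sum(coef[:, :, None] * diff, axis=1)
    return cost, grad
```

A quick finite-difference check of the returned gradient on a few random points is a cheap way to validate this kind of implementation.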
2.2. Deep t-distributed NCA

Collapsing classes typically works well for data visualization tasks (Globerson & Roweis, 2006), but collapsing classes is unnecessary to obtain low nearest neighbor errors after the high-dimensional data has been embedded in a space with a dimensionality that is larger than, say, two. If the dimensionality of the data is reduced to, say, 30 dimensions, the volume difference between the data space and the latent space is exponentially smaller than when the dimensionality of the data is reduced to two dimensions. More importantly, the dimensionality of the latent space can be selected so as to match the intrinsic dimensionality of the input data. In that case, it is not necessary to completely collapse classes to obtain low nearest neighbor errors, in particular, since collapsing classes may lead to overfitting. Hence, in learning settings in which the dimensionality of the latent space is relatively large, directly minimizing a smooth approximation to the nearest neighbor error, as is done in NCA, may work much better. To this end, we investigate a variant of NCA, called deep t-distributed NCA (dt-NCA), that also parametrizes the mapping from the data to the latent space by means of a deep neural network² and that measures similarities in the latent space using a t-distribution. The key difference between dt-NCA and dt-MCML, analogous to the key difference between NCA and MCML, is in the cost function that is minimized.

In particular, dt-NCA tries to minimize the expected nearest neighbor error by minimizing a smooth approximation of the nearest neighbor error:

$$\ell_{\text{dt-NCA}} = -\sum_{ij:i \neq j} \delta_{ij}\, q_{j|i},$$

where $\delta_{ij}$ represents an indicator function, i.e., $\delta_{ij}$ equals 1 if $y^{(i)} = y^{(j)}$ and 0 otherwise, and where we used an asymmetric definition for the similarities $q_{j|i}$:

$$q_{j|i} = \frac{\left(1 + d_{ij}^2/\alpha\right)^{-\frac{1+\alpha}{2}}}{\sum_{k:k \neq i} \left(1 + d_{ik}^2/\alpha\right)^{-\frac{1+\alpha}{2}}}, \qquad q_{i|i} = 0.$$

The motivation behind this definition of the similarities in the latent space is identical to the motivation for the similarities in dt-MCML: it helps to prevent overfitting, it constructs tighter natural clusters of points with the same class, it improves the separation between points with different classes, and it makes the gradient optimization easier due to the presence of more long-range forces.

If we define $u_{ij} = \left(1 + d_{ij}^2/\alpha\right)^{-1}$ and $v_{ij} = f(x^{(i)}) - f(x^{(j)})$, the gradient of $\ell_{\text{dt-NCA}}$ with respect to the data representation in the latent space $f(x^{(i)})$ is given by:

$$\frac{\partial \ell_{\text{dt-NCA}}}{\partial f(x^{(i)})} = \frac{1+\alpha}{\alpha} \left[ \sum_{j:j \neq i} \delta_{ij}\, q_{j|i}\, u_{ij}\, v_{ij} - \sum_{j:j \neq i} \delta_{ij}\, q_{j|i} \sum_{k:k \neq i} q_{k|i}\, u_{ik}\, v_{ik} - \sum_{j:j \neq i} \delta_{ji}\, q_{i|j}\, u_{ji}\, v_{ji} + \sum_{j:j \neq i} q_{i|j}\, u_{ji}\, v_{ji} \sum_{k:k \neq j} \delta_{jk}\, q_{k|j} \right].$$

The gradient of $\ell_{\text{dt-NCA}}$ with respect to $\alpha$ can be calculated as follows:

$$\frac{\partial \ell_{\text{dt-NCA}}}{\partial \alpha} = \sum_i \sum_{j:j \neq i} \delta_{ij}\, q_{j|i} \left[ -\frac{1}{2} \log u_{ij} - \frac{(1+\alpha)\, u_{ij}\, d_{ij}^2}{2\alpha^2} \right] - \sum_i \sum_{j:j \neq i} \delta_{ij}\, q_{j|i} \sum_{k:k \neq i} q_{k|i} \left[ -\frac{1}{2} \log u_{ik} - \frac{(1+\alpha)\, u_{ik}\, d_{ik}^2}{2\alpha^2} \right].$$

² The reader should note that another nonlinear variant of NCA was previously investigated by Salakhutdinov & Hinton (2007).
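A NumPy sketch of the dt-NCA objective and of the latent-space gradient above, using the $u_{ij}$, $v_{ij}$ notation; the matrix formulation and function name are ours, and backpropagation through f(x; W) is again omitted.

```python
import numpy as np

def dt_nca_cost_and_grad(Z, y, alpha=1.0):
    """dt-NCA cost  -sum_{ij} delta_ij q_{j|i}  and its gradient with respect
    to the rows of the n x d latent matrix Z."""
    n = Z.shape[0]
    diff = Z[:, None, :] - Z[None, :, :]                   # v_ij = f(x_i) - f(x_j)
    D2 = np.sum(diff ** 2, axis=-1)
    U = 1.0 / (1.0 + D2 / alpha)                           # u_ij
    K = U ** ((1.0 + alpha) / 2.0)                         # (1 + d_ij^2/alpha)^{-(1+alpha)/2}
    np.fill_diagonal(K, 0.0)
    Q = K / K.sum(axis=1, keepdims=True)                   # q_{j|i}, one row per i

    delta = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)
    cost = -float(np.sum(delta * Q))

    s = (delta * Q).sum(axis=1)                            # same-class neighbor mass of each point
    # per-pair coefficient of v_ij obtained by collecting the four sums above
    C = U * (delta * (Q + Q.T) - s[:, None] * Q - s[None, :] * Q.T)
    grad = ((1.0 + alpha) / alpha) * np.sum(C[:, :, None] * diff, axis=1)
    return cost, grad
```

Checking this against a finite-difference approximation of the cost on a handful of random points is an easy way to catch sign errors in such a four-term gradient.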
3. Experiments

We performed experiments with dt-MCML and dt-NCA on the USPS and MNIST handwritten digit data sets in order to evaluate their performance. In order to investigate the effect of using a t-distribution instead of a Gaussian distribution to measure similarities in the latent space, we compare dt-MCML and dt-NCA to their counterparts that use a Gaussian distribution to measure the similarities (but that use the same deep neural network as parametrization of the function f). We refer to these variants as dG-MCML and dG-NCA, respectively. Both the pre-training and the finetuning for dG-MCML and dG-NCA are identical to the pre-training and the finetuning of dt-MCML and dt-NCA, except that the Student t-distributions in the definitions of $q_{ij}$ and $q_{j|i}$ are replaced by Gaussian distributions (with variance $\frac{1}{2}$), respectively. Note that this replacement also leads to different gradients of the respective cost functions (the gradients for dG-NCA are derived by Salakhutdinov & Hinton (2007)).

In our experiments, we used the same network structure that was proposed by Salakhutdinov & Hinton (2007), i.e., we use a D-500-500-2000-d network. Our selection of this architecture facilitates comparisons with methods used in other papers. We pre-trained all neural networks using the pre-training procedure described by Hinton & Salakhutdinov (2006). Following Hinton & Salakhutdinov (2006), we trained the RBMs using contrastive divergence with one Gibbs sweep (CD-1) for 50 iterations using mini-batches of 100 instances, a learning rate of 0.1, and L2 regularization of the weights with a regularization parameter of $2 \times 10^{-4}$. We use a momentum of 0.5 in the first 5 iterations, and a momentum of 0.9 in subsequent iterations. The finetuning is performed by running conjugate gradients (without weight decay) until convergence. In the finetuning phase, we continue the training of dG-MCML, dt-MCML, dG-NCA, and dt-NCA until the value of the cost function measured on the training set does not decrease anymore. We performed classification experiments on the latent representation of the data using a k-nearest neighbor classifier with k = 5. In the experiments with dt-MCML and dt-NCA in which we learn $\alpha$, we train using $\alpha = d - 1$ for one epoch first, and we update W and $\alpha$ simultaneously in the subsequent iterations.

3.1. Experimental Results on USPS Data Set

The USPS data set contains 11,000 images of handwritten digits. In the data set, each digit is represented by a 16×16 pixel gray-level image, i.e., the dimensionality of the data set is 256. We constructed six random splits of the USPS data set into a training set of 8,000 images and a test set of 3,000 images, and we measured generalization performances that were averaged over these six splits.

Table 1. Mean and standard deviation of test error (in %) on 2-dimensional and 30-dimensional embeddings for various techniques on the 6 splits of the USPS data set.

                           2D               30D
  MCML                     35.63 ± 0.44     5.53 ± 0.39
  dG-MCML                   3.37 ± 0.18     1.67 ± 0.21
  dt-MCML (α = d - 1)       2.46 ± 0.35 *   1.73 ± 0.47
  dt-MCML (learned α)       2.80 ± 0.36     1.61 ± 0.36
  dG-NCA                   10.22 ± 0.76     1.91 ± 0.22
  dt-NCA (α = d - 1)        5.11 ± 0.28     1.15 ± 0.21 *
  dt-NCA (learned α)        6.69 ± 0.92     1.17 ± 0.07

Table 1 presents the mean classification performances of 5-nearest neighbor classifiers that were trained on embeddings constructed by various techniques, as well as the corresponding standard deviations. The best performance in each column is marked with an asterisk. The results show that dt-MCML outperforms all other methods in terms of classification errors measured on a two-dimensional embedding. On a 30-dimensional embedding, the table reveals that dt-NCA (with learned $\alpha$) outperforms the other techniques. The results also show that dt-MCML always outperforms dG-MCML, and that dt-NCA always outperforms dG-NCA. Both dt-MCML and dt-NCA perform much better than standard linear MCML. We did not run linear NCA on the USPS data set, but in (Goldberger et al., 2005), the authors report a classification error on a two-dimensional embedding constructed by NCA of approximately 30%, which is much higher than our best classification error of 2.46% on a two-dimensional embedding.

Figure 2. Two-dimensional embeddings of 3,000 USPS test data points constructed by dG-MCML (top left), dt-MCML (top right), dG-NCA (bottom left), and dt-NCA (bottom right).

Figure 2 shows embeddings of 3,000 test data points in the USPS-fixed split that were constructed by, respectively, dG-MCML, dt-MCML, dG-NCA, and dt-NCA. The embedding constructed by linear MCML is included in the supplemental material. The plots in Figure 2 suggest that dt-MCML produced the best embedding. It puts almost all the data points in the same class close to each other, and it creates large separations between class clusters. The embeddings reveal that both dt-NCA and dt-MCML allow large gaps to form between classes, whereas dG-NCA and dG-MCML must allow for more overlap between classes. Comparing dt-MCML and dt-NCA, we find that dt-NCA allows data points in the same class to be placed at multiple locations, whereas dt-MCML only maintains one large cluster. This is because dt-NCA maximizes the sum of the probabilities $q_{j|i}$, whereas dt-MCML maximizes the product of the $q_{ij}$'s.
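For completeness, a minimal sketch of the 5-nearest-neighbor evaluation protocol used throughout this section; the function name is ours, and Z_train/Z_test stand for embeddings produced by a trained network.

```python
import numpy as np

def knn_error(Z_train, y_train, Z_test, y_test, k=5):
    """Classification error (fraction of mistakes) of a k-NN classifier that
    operates on the learned embeddings, mirroring the k = 5 setting above."""
    errors = 0
    for z, label in zip(Z_test, y_test):
        d2 = np.sum((Z_train - z) ** 2, axis=1)       # squared distances to training points
        nearest = np.argsort(d2)[:k]                  # indices of the k nearest neighbors
        votes = np.bincount(y_train[nearest])         # assumes non-negative integer labels
        errors += int(np.argmax(votes) != label)
    return errors / len(y_test)
```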
3.2. Experimental Results on MNIST Data Set

We also performed experiments on the MNIST data set. The MNIST data set contains gray-level images with 28×28 = 784 pixels. The data set contains 60,000 training samples and 10,000 test samples. Because of the large number of training samples in the MNIST data set, we were forced to use batch training using batches of 10,000 training samples. In addition, we could not perform experiments using MCML (the projected gradient procedure used in MCML requires the eigendecomposition of n × n matrices).

Table 2. Test error (in %) on 2-dimensional and 30-dimensional embeddings for various techniques on the MNIST data set.

                           2D       30D
  dG-MCML                  2.13     1.49
  dt-MCML (α = d - 1)      2.03     1.63
  dt-MCML (learned α)      2.14     1.49
  dG-NCA                   7.95     1.11
  dt-NCA (α = d - 1)       3.48     0.92
  dt-NCA (learned α)       3.79     0.93

Table 2 presents the classification performance of nearest neighbor classifiers on embeddings constructed by the various techniques (computed on the standard MNIST test set). Again, dt-MCML outperforms the other techniques when embedding into two dimensions. In comparison, the test errors of linear NCA (Goldberger et al., 2005), autoencoders (Hinton & Salakhutdinov, 2006), parametric t-SNE with $\alpha = 1$ (van der Maaten, 2009), parametric t-SNE with learned $\alpha$, and DNet-kNN (Min et al., 2009) on two-dimensional embeddings of the MNIST data set are, respectively, 56.84%, 24.7%, 9.9%, 12.68%, and 2.65%. By contrast, the error obtained by dt-MCML is 2.03%. The results in Table 2 also reveal that, when we are embedding into 30 dimensions, dt-NCA (learned $\alpha$) outperforms all other techniques. The best performance of dt-NCA of 0.92% is on par with state-of-the-art results on the MNIST data set (if adding perturbed digits to the training data is not allowed).

Figure 3 shows embeddings of the 10,000 test data points in the MNIST data set constructed by the various techniques. The embeddings reveal that both dt-NCA and dt-MCML form large separations between classes, whereas dG-NCA and dG-MCML produce overlaps between many of the classes in the data.

Figure 3. Two-dimensional embeddings of 10,000 MNIST test data points constructed by dG-MCML (top left), dt-MCML (top right), dG-NCA (bottom left), and dt-NCA (bottom right).
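To illustrate the batch training mentioned at the beginning of this subsection, the sketch below computes the pairwise dt-MCML terms within one batch at a time, reusing the f and dt_mcml_cost_and_grad sketches from the earlier sections. The generator structure is our own simplification: the conjugate-gradient finetuning and the backpropagation step are not shown, and the dense pairwise arrays at a batch size of 10,000 are memory-hungry, so a real run would likely process them in smaller blocks.

```python
import numpy as np

def batchwise_dt_mcml_gradients(X, y, W, batch_size=10000, alpha=1.0, seed=0):
    """Compute the dt-MCML cost and latent-space gradients within each batch;
    the pairwise q_ij terms never cross batch boundaries.  Backpropagation
    through f(x; W) and the parameter update are omitted here."""
    order = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Z = f(X[idx], W)                                  # forward pass (see the earlier sketch)
        cost, grad_Z = dt_mcml_cost_and_grad(Z, y[idx], alpha)
        yield idx, cost, grad_Z                           # gradients w.r.t. the latent points
```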
4. Concluding Remarks

In this paper, we presented two techniques for supervised parametric dimensionality reduction that use deep networks. The results presented in the previous section reveal that dt-MCML outperforms its linear (MCML) and its deep Gaussian (dG-MCML) counterparts when it is used to construct two-dimensional embeddings, whereas dt-NCA outperforms its linear and its deep Gaussian counterparts (dG-NCA; Salakhutdinov & Hinton (2007)) when it is used to embed data into a space with a dimensionality larger than two. Taken together, our results demonstrate the merits of (1) using a deep architecture to parametrize the mapping and of (2) using a heavy-tailed distribution to measure the pairwise similarities between the low-dimensional representations of the data points.

We did not yet explain in detail why dt-MCML performs better than dt-NCA when constructing two-dimensional embeddings, whereas dt-NCA performs better than dt-MCML when constructing embeddings of a higher dimensionality. This is because dt-MCML tries to collapse all the data points in each class to one point by maximizing the product of the probabilities $q_{ij}$, whereas dt-NCA tries to bring data points in the same class together by maximizing the sum of the asymmetric probabilities $q_{j|i}$. As a result, when embedding into two dimensions, the objective of dt-NCA can be roughly maximized by setting some of the $q_{i|j}$'s very large and others very small on the training data, to compensate for the limited amount of space available in a two-dimensional space. By contrast, such a setting of the $q_{ij}$'s is not favored by dt-MCML. In fact, the dt-MCML cost function typically does not allow some of the $q_{ij}$'s to be set very small, thus prohibiting data points in the same class from spreading out. This property of dt-MCML has advantages and disadvantages. When embedding in a two-dimensional space, collapsing classes is often good because the available space to accommodate multiple clusters corresponding to the same class is very limited. However, in latent spaces with a higher dimensionality, there is enough space to accommodate dissimilar data points, as a result of which the collapsing of classes becomes superfluous and leads to overfitting. In contrast, dt-NCA does not require all the data points in the same class to stay very close to each other in the latent space, as a result of which it performs better when embedding in a, say, 30-dimensional latent space.

The success of dt-MCML is closely related to the recent success of t-SNE (van der Maaten & Hinton, 2008) in unsupervised dimensionality reduction. However, the reader should note that the reason for the success of t-SNE is a different one. In t-SNE, the use of the Student t-distribution helps to avoid the crowding problem. The crowding problem occurs when one tries to preserve the local similarity structure of high-dimensional data in a low-dimensional representation, which leaves too little room to model dissimilar points far apart. The use of the Student t-distribution resolves this problem, because it allows dissimilar points to be modeled far apart in the embedding. In dt-MCML, the aim of the use of a Student t-distribution is to force data points with the same class to collapse better, i.e., to form tight clusters, while increasing the gaps between points with dissimilar labels. The latter characteristic of dt-MCML and dt-NCA is shared with t-SNE: they all encourage larger separations to form between different natural clusters.

An additional advantage of dt-MCML over MCML that we did not discuss yet is its computational complexity. Even though dt-MCML is the deep nonlinear extension of MCML, dt-MCML scales very well to massive high-dimensional data sets (although this does require a batch training procedure). By contrast, MCML is slow when it is applied to large data sets (even in batch training procedures), because it uses a projected gradient descent optimizer that performs many eigendecompositions of n × n matrices.

In future work, we aim to investigate extensions of dt-MCML and dt-NCA that include an additional term penalizing the reconstruction error of a deep autoencoder, in order to improve the performance of our approach in semi-supervised learning settings. Moreover, we aim to investigate pre-training approaches that use denoising autoencoders instead of RBMs.

Acknowledgements

Laurens van der Maaten is supported by the Netherlands Organisation for Scientific Research (NWO; grant no. 680.50.0908), and by EU-FP7 Social Signal Processing. The authors thank Geoffrey Hinton, Amir Globerson, and Xiaojian Shao for helpful discussions.
References

Globerson, A. and Roweis, S. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems, pp. 451-458, 2006.

Goldberger, J., Roweis, S., Hinton, G., and Salakhutdinov, R. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pp. 513-520, 2005.

Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hinton, G., Osindero, S., and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the International Conference on Machine Learning, pp. 473-480, 2007.

Min, R., Stanley, D., Yuan, Z., Bonner, A., and Zhang, Z. A deep non-linear feature mapping for large-margin kNN classification. In Proceedings of the International Conference on Data Mining, pp. 357-366, 2009.

Salakhutdinov, R. and Hinton, G. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the International Conference on AI and Statistics, pp. 412-419, 2007.

van der Maaten, L. Learning a parametric embedding by preserving local structure. In Proceedings of the International Conference on AI and Statistics, pp. 384-391, 2009.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.

Weinberger, K., Blitzer, J., and Saul, L. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 1473-1480, 2006.

Weston, J., Ratle, F., and Collobert, R. Deep learning via semi-supervised embedding. In Proceedings of the International Conference on Machine Learning, pp. 1168-1175, 2008.

Xing, E., Ng, A., Jordan, M., and Russell, S. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pp. 505-512, 2003.