Grassmann Discriminant Analysis: a Unifying View on Subspace-Based Learning

Jihun Hamm  jhham@seas.upenn.edu
Daniel D. Lee  ddlee@seas.upenn.edu
GRASP Laboratory, University of Pennsylvania, Philadelphia, PA 19104 USA

Abstract

In this paper we propose a discriminant learning framework for problems in which data consist of linear subspaces instead of vectors. By treating subspaces as basic elements, we can make learning algorithms adapt naturally to problems with linear invariant structures. We propose a unifying view on subspace-based learning by formulating the problems on the Grassmann manifold, which is the set of fixed-dimensional linear subspaces of a Euclidean space. Previous methods typically adopt an inconsistent strategy: feature extraction is performed in the Euclidean space while non-Euclidean distances are used. In our approach, we treat each subspace as a point on the Grassmann manifold, and perform feature extraction and classification in the same space. We show the feasibility of the approach by using Grassmann kernel functions such as the Projection kernel and the Binet-Cauchy kernel. Experiments with real image databases show that the proposed method performs well compared with state-of-the-art algorithms.

1. Introduction

We often encounter learning problems in which the basic elements of the data are sets of vectors instead of vectors. Suppose we want to recognize a person from multiple pictures of the individual, taken from different angles, under different illumination, or at different places. When comparing such sets of image vectors, we are free to define the similarity between sets based on the similarity between image vectors (Shakhnarovich et al., 2002; Kondor & Jebara, 2003; Zhou & Chellappa, 2006).

In this paper, we specifically focus on data that can be modeled as a collection of linear subspaces. In the example above, let us assume that the set of images of a single person is well approximated by a low-dimensional subspace (Turk & Pentland, 1991), and that the whole dataset is a collection of such subspaces. The benefits of using subspaces are two-fold: 1) comparing two subspaces is cheaper than comparing two sets directly when those sets are very large, and 2) the representation is more robust to missing data, since the subspace can "fill in" the missing pictures. However, these advantages come with the challenge of representing and handling the subspaces appropriately.

We approach subspace-based learning problems by formulating them on the Grassmann manifold, the set of fixed-dimensional linear subspaces of a Euclidean space. With this unifying framework we can make analytic comparisons of the various distances between subspaces. In particular, we single out those distances that are induced from the Grassmann kernels, which are positive definite kernel functions on the Grassmann space. The Grassmann kernels allow us to use the usual kernel-based algorithms on this unconventional space and to avoid ad hoc approaches to the problem. We demonstrate the proposed framework by using the Projection metric and the Binet-Cauchy metric and by applying kernel Linear Discriminant Analysis to classification problems with real image databases.

1.1. Contributions of the Paper

Although the Projection metric and the Binet-Cauchy metric were previously used (Chang et al., 2006; Wolf & Shashua, 2003), their potential for subspace-based learning has not been fully explored. In this work, we provide an analytic exposition of the two metrics as examples of the Grassmann kernels, and contrast the two metrics with other metrics used in the literature. Several subspace-based classification methods have been previously proposed (Yamaguchi et al., 1998; Sakano, 2000; Fukui & Yamaguchi, 2003; Kim et al., 2007). However, these methods adopt an inconsistent strategy: feature extraction is performed in the Euclidean space while non-Euclidean distances are used. This inconsistency can result in complications and weak guarantees. In our approach, feature extraction and distance measurement are integrated around the Grassmann kernel, resulting in a simpler and better-understood formulation.

The rest of the paper is organized as follows. In Secs. 2 and 3 we introduce the Grassmann manifold and derive various distances on the space. In Sec. 4 we present a kernel view of the problem and emphasize the advantages of using positive definite metrics. In Sec. 5 we propose the Grassmann Discriminant Analysis and compare it with other subspace-based discrimination methods. In Sec. 6 we test the proposed algorithm on face recognition and object categorization tasks. We conclude in Sec. 7 with a discussion.
Figure 1. Principal angles and Grassmann distances. Let $\mathrm{span}(Y_i)$ and $\mathrm{span}(Y_j)$ be two subspaces in the Euclidean space $\mathbb{R}^D$ (left). The distance between the two subspaces can be measured by the principal angles $\theta = [\theta_1, \dots, \theta_m]$ using the usual inner product of vectors. In the Grassmann manifold viewpoint (right), the subspaces $\mathrm{span}(Y_i)$ and $\mathrm{span}(Y_j)$ are considered as two points on the manifold $\mathcal{G}(m, D)$, whose Riemannian distance is related to the principal angles by $d(Y_i, Y_j) = \|\theta\|_2$. Various distances can be defined based on the principal angles.

2. Grassmann Manifold and Principal Angles

In this section we briefly review the Grassmann manifold and the principal angles.

Definition 1 The Grassmann manifold $\mathcal{G}(m, D)$ is the set of $m$-dimensional linear subspaces of $\mathbb{R}^D$.

$\mathcal{G}(m, D)$ is an $m(D-m)$-dimensional compact Riemannian manifold.¹ An element of $\mathcal{G}(m, D)$ can be represented by an orthonormal matrix $Y$ of size $D$ by $m$ such that $Y^\top Y = I_m$, where $I_m$ is the $m$ by $m$ identity matrix. For example, $Y$ can be the $m$ basis vectors of a set of pictures in $\mathbb{R}^D$. However, the matrix representation of a point in $\mathcal{G}(m, D)$ is not unique: two matrices $Y_1$ and $Y_2$ are considered the same if and only if $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$, where $\mathrm{span}(Y)$ denotes the subspace spanned by the column vectors of $Y$. Equivalently, $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$ if and only if $Y_1 R_1 = Y_2 R_2$ for some $R_1, R_2 \in O(m)$. With this understanding, we will often use the notation $Y$ when we actually mean its equivalence class $\mathrm{span}(Y)$, and use $Y_1 = Y_2$ when we mean $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$, for simplicity.

¹ $\mathcal{G}(m, D)$ can be derived as a quotient space of orthogonal groups, $\mathcal{G}(m, D) = O(D)/O(m) \times O(D-m)$, where $O(m)$ is the group of $m$ by $m$ orthonormal matrices. We refer the readers to (Wong, 1967; Absil et al., 2004) for details on the Riemannian geometry of the space.

Formally, the Riemannian distance between two subspaces is the length of the shortest geodesic connecting the two points on the Grassmann manifold. However, there is a more intuitive and computationally efficient way of defining distances using the principal angles (Golub & Loan, 1996).

Definition 2 Let $Y_1$ and $Y_2$ be two orthonormal matrices of size $D$ by $m$. The principal angles $0 \le \theta_1 \le \cdots \le \theta_m \le \pi/2$ between the two subspaces $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$ are defined recursively by

$\cos\theta_k = \max_{u_k \in \mathrm{span}(Y_1)} \ \max_{v_k \in \mathrm{span}(Y_2)} \ u_k^\top v_k$, subject to $\|u_k\| = 1$, $\|v_k\| = 1$, $u_k^\top u_i = 0$, $v_k^\top v_i = 0$ $(i = 1, \dots, k-1)$.

In other words, the first principal angle $\theta_1$ is the smallest angle between all pairs of unit vectors in the first and the second subspaces, and the rest of the principal angles are defined similarly. It is known (Wong, 1967; Edelman et al., 1999) that the principal angles are related to the geodesic distance by $d_G^2(Y_1, Y_2) = \sum_i \theta_i^2$ (refer to Fig. 1).

The principal angles can be computed from the Singular Value Decomposition (SVD) of $Y_1^\top Y_2$,

$Y_1^\top Y_2 = U (\cos\Theta) V^\top$,   (1)

where $U = [u_1 \dots u_m]$, $V = [v_1 \dots v_m]$, and $\cos\Theta$ is the diagonal matrix $\cos\Theta = \mathrm{diag}(\cos\theta_1, \dots, \cos\theta_m)$. The cosines of the principal angles, $\cos\theta_1, \dots, \cos\theta_m$, are also known as canonical correlations.

Although the definition can be extended to the case where $Y_1$ and $Y_2$ have different numbers of columns, we will assume $Y_1$ and $Y_2$ have the same size $D$ by $m$ throughout this paper. We will also occasionally use $\mathcal{G}$ instead of $\mathcal{G}(m, D)$ for simplicity.
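The computation in Eq. (1) is straightforward to carry out numerically. The following NumPy sketch (ours, not part of the original paper; the function name is illustrative) recovers the principal angles of two orthonormal bases from the singular values of $Y_1^\top Y_2$; the clipping only guards against round-off pushing a singular value slightly above 1.

import numpy as np

def principal_angles(Y1, Y2):
    # Y1, Y2: D x m matrices with orthonormal columns (Y.T @ Y = I_m).
    # The singular values of Y1.T @ Y2 are the canonical correlations
    # cos(theta_1) >= ... >= cos(theta_m), as in Eq. (1).
    cos_theta = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

# Example: two random 3-dimensional subspaces of R^10.
rng = np.random.default_rng(0)
Y1 = np.linalg.qr(rng.standard_normal((10, 3)))[0]
Y2 = np.linalg.qr(rng.standard_normal((10, 3)))[0]
print(principal_angles(Y1, Y2))   # angles in [0, pi/2], in ascending order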
3. Distances for Subspaces

In this paper we use the term distance for any assignment of nonnegative values to pairs of points in a space $X$. A valid metric is, however, a distance that satisfies additional axioms:

Definition 3 A real-valued function $d: X \times X \to \mathbb{R}$ is called a metric if
1. $d(x_1, x_2) \ge 0$,
2. $d(x_1, x_2) = 0$ if and only if $x_1 = x_2$,
3. $d(x_1, x_2) = d(x_2, x_1)$,
4. $d(x_1, x_2) + d(x_2, x_3) \ge d(x_1, x_3)$,
for all $x_1, x_2, x_3 \in X$.

A distance (or a metric) between subspaces $d(Y_1, Y_2)$ has to be invariant under different representations: $d(Y_1, Y_2) = d(Y_1 R_1, Y_2 R_2)$ for all $R_1, R_2 \in O(m)$. In this section we introduce various distances for subspaces derivable from the principal angles.

3.1. Projection Metric and Binet-Cauchy Metric

We first underline the two main distances of this paper.

1. Projection metric:

$d_P(Y_1, Y_2) = \left( \sum_{i=1}^m \sin^2\theta_i \right)^{1/2}$.   (2)

The Projection metric is the 2-norm of the sines of the principal angles (Edelman et al., 1999; Wang et al., 2006).

2. Binet-Cauchy metric:

$d_{BC}(Y_1, Y_2) = \left( 1 - \prod_i \cos^2\theta_i \right)^{1/2}$.   (3)

The Binet-Cauchy metric is defined with the product of the canonical correlations (Wolf & Shashua, 2003; Vishwanathan & Smola, 2004).

As the names hint, these two distances are in fact valid metrics satisfying Def. 3. The proofs are deferred until Sec. 4.

3.2. Other Distances in the Literature

We describe a few other distances used in the literature. The principal angles are the keys that relate these distances.

1. Max correlation:

$d_{Max}(Y_1, Y_2) = \left( 1 - \cos^2\theta_1 \right)^{1/2} = \sin\theta_1$.   (4)

The max correlation is a distance based only on the largest canonical correlation $\cos\theta_1$ (or, equivalently, the smallest principal angle $\theta_1$). It was used in previous works (Yamaguchi et al., 1998; Sakano, 2000; Fukui & Yamaguchi, 2003).

2. Min correlation:

$d_{Min}(Y_1, Y_2) = \left( 1 - \cos^2\theta_m \right)^{1/2} = \sin\theta_m$.   (5)

The min correlation is defined similarly to the max correlation. However, the min correlation is more closely related to the Projection metric: we can rewrite the Projection metric as $d_P = 2^{-1/2} \| Y_1 Y_1^\top - Y_2 Y_2^\top \|_F$ and the min correlation as $d_{Min} = \| Y_1 Y_1^\top - Y_2 Y_2^\top \|_2$.

3. Procrustes metric:

$d_{CF}(Y_1, Y_2) = 2 \left( \sum_{i=1}^m \sin^2(\theta_i/2) \right)^{1/2}$.   (6)

The Procrustes metric is the minimum distance between different representations of the two subspaces $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$ (Chikuse, 2003):

$d_{CF} = \min_{R_1, R_2 \in O(m)} \| Y_1 R_1 - Y_2 R_2 \|_F = \| Y_1 U - Y_2 V \|_F = \left[ 2 \left( m - \sum_{i=1}^m \cos\theta_i \right) \right]^{1/2}$,

where $U$ and $V$ are from (1). By definition, the distance is invariant to the choice of the bases of $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$. The Procrustes metric is also called the chordal distance (Edelman et al., 1999). We can similarly define the minimum distance using other matrix norms, such as $d_{C2}(Y_1, Y_2) = \| Y_1 U - Y_2 V \|_2 = 2 \sin(\theta_m/2)$.
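All of the distances above are simple functions of the principal angles, so they can be evaluated together from a single SVD. The following sketch (ours; the function and dictionary key names are illustrative) computes Eqs. (2)-(6) and $d_{C2}$ for a pair of orthonormal bases.

import numpy as np

def subspace_distances(Y1, Y2):
    # Distances of Sec. 3, all computed from the principal angles of
    # span(Y1) and span(Y2); Y1, Y2 are D x m orthonormal matrices.
    s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    cos_t = np.clip(s, -1.0, 1.0)               # descending: cos(theta_1), ..., cos(theta_m)
    theta = np.arccos(cos_t)                    # ascending principal angles
    sin_t = np.sin(theta)
    return {
        'projection':   np.sqrt(np.sum(sin_t ** 2)),                    # Eq. (2)
        'binet_cauchy': np.sqrt(max(1.0 - np.prod(cos_t ** 2), 0.0)),   # Eq. (3)
        'max_corr':     sin_t[0],                                       # Eq. (4)
        'min_corr':     sin_t[-1],                                      # Eq. (5)
        'procrustes_F': 2.0 * np.sqrt(np.sum(np.sin(theta / 2.0) ** 2)),# Eq. (6)
        'procrustes_2': 2.0 * np.sin(theta[-1] / 2.0),                  # d_C2
    }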
3.3. Which Distance to Use?

The choice of the best distance for a classification task depends on a few factors. The first factor is the distribution of the data. Since the distances are defined by particular combinations of the principal angles, the best distance depends highly on the probability distribution of the principal angles of the given data. For example, $d_{Max}$ uses only the smallest principal angle $\theta_1$ and may be robust when the data are noisy. On the other hand, when all subspaces are sharply concentrated around one point, $d_{Max}$ will be close to zero for most of the data; in this case, $d_{Min}$ may be more discriminative. The Projection metric $d_P$, which uses all the principal angles, will show intermediate characteristics between the two. Similar arguments can be made for the Procrustes metrics $d_{CF}$ and $d_{C2}$, which use all the angles and the largest angle only, respectively.

The second criterion for choosing a distance is the degree of structure in the distance. Without any structure, a distance can be used only with a simple K-Nearest Neighbor (K-NN) algorithm for classification. When a distance has extra structure such as the triangle inequality, we can, for example, speed up nearest-neighbor searches by estimating lower and upper bounds on unknown distances (Faragó et al., 1993). From this point of view, the max correlation is not a metric and may not be used with more sophisticated algorithms. On the other hand, the min correlation and the Procrustes metrics are valid metrics.²

² The metric properties follow from the properties of the matrix 2-norm and F-norm. To check the conditions of Def. 3 for the Procrustes metrics, we use the equality $\min_{R_1, R_2} \|Y_1 R_1 - Y_2 R_2\|_{2,F} = \min_{R_3} \|Y_1 - Y_2 R_3\|_{2,F}$ for $R_1, R_2, R_3 \in O(m)$.

The most structured metrics are those induced from a positive definite kernel. Among the metrics mentioned so far, only the Projection metric and the Binet-Cauchy metric belong to this class. The proof and the consequences of positive definiteness are the main topics of the next section.

4. Kernel Functions for Subspaces

We have defined valid metrics on the Grassmann manifold. The next question is whether we can define kernel functions compatible with these metrics. For this purpose let us recall a few definitions. Let $X$ be any set, and let $k: X \times X \to \mathbb{R}$ be a symmetric real-valued function, $k(x_i, x_j) = k(x_j, x_i)$ for all $x_i, x_j \in X$.

Definition 4 A real symmetric function is a (resp. conditionally) positive definite kernel function if $\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0$ for all $x_1, \dots, x_n$ ($x_i \in X$) and $c_1, \dots, c_n$ ($c_i \in \mathbb{R}$) for any $n \in \mathbb{N}$ (resp. for all $c_1, \dots, c_n$ ($c_i \in \mathbb{R}$) such that $\sum_{i=1}^n c_i = 0$).

In this paper we are interested in kernel functions on the Grassmann space.

Definition 5 A Grassmann kernel function is a positive definite kernel function on $\mathcal{G}$.

In the following we show that the Projection metric and the Binet-Cauchy metric are induced from Grassmann kernels.

4.1. Projection Metric

The Projection metric can be understood by associating a point $\mathrm{span}(Y) \in \mathcal{G}$ with its projection matrix $Y Y^\top$ through the embedding

$\Psi_P: \mathcal{G}(m, D) \to \mathbb{R}^{D \times D}, \quad \mathrm{span}(Y) \mapsto Y Y^\top$.   (7)

The image $\Psi_P(\mathcal{G}(m, D))$ is the set of rank-$m$ orthogonal projection matrices. This map is in fact an isometric embedding (Chikuse, 2003), and the Projection metric is simply a Euclidean distance in $\mathbb{R}^{D \times D}$. The corresponding inner product of the space is $\mathrm{tr}[(Y_1 Y_1^\top)(Y_2 Y_2^\top)] = \|Y_1^\top Y_2\|_F^2$, and therefore

Proposition 1 The Projection kernel $k_P(Y_1, Y_2) = \|Y_1^\top Y_2\|_F^2$ is a Grassmann kernel.

Proof The kernel is well-defined because $k_P(Y_1, Y_2) = k_P(Y_1 R_1, Y_2 R_2)$ for any $R_1, R_2 \in O(m)$. The positive definiteness follows from the properties of the Frobenius norm: for all $Y_1, \dots, Y_n$ ($Y_i \in \mathcal{G}$) and $c_1, \dots, c_n$ ($c_i \in \mathbb{R}$) for any $n \in \mathbb{N}$, we have

$\sum_{i,j} c_i c_j \|Y_i^\top Y_j\|_F^2 = \sum_{i,j} c_i c_j \, \mathrm{tr}(Y_i Y_i^\top Y_j Y_j^\top) = \mathrm{tr}\Big( \sum_i c_i Y_i Y_i^\top \Big)^2 = \Big\| \sum_i c_i Y_i Y_i^\top \Big\|_F^2 \ge 0$.   (8)

We can generate a family of kernels from the Projection kernel. For example, the square root $\|Y_i^\top Y_j\|_F$ is also a positive definite kernel.
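For concreteness, here is a small sketch of the Projection kernel (ours; the helper names are illustrative). It also reflects the relation $d_P^2(Y_1, Y_2) = m - k_P(Y_1, Y_2)$, which follows from Eq. (2) together with $\|Y_1^\top Y_2\|_F^2 = \sum_i \cos^2\theta_i$.

import numpy as np

def projection_kernel_gram(bases_a, bases_b):
    # Gram matrix of the Projection kernel k_P(Y1, Y2) = ||Y1.T @ Y2||_F^2
    # between two lists of D x m orthonormal bases.
    K = np.empty((len(bases_a), len(bases_b)))
    for i, Yi in enumerate(bases_a):
        for j, Yj in enumerate(bases_b):
            K[i, j] = np.linalg.norm(Yi.T @ Yj, 'fro') ** 2
    return K

def projection_distance(Y1, Y2):
    # Induced distance: d_P(Y1, Y2)^2 = m - k_P(Y1, Y2).
    m = Y1.shape[1]
    return np.sqrt(max(m - np.linalg.norm(Y1.T @ Y2, 'fro') ** 2, 0.0))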
4.2. Binet-Cauchy Metric

The Binet-Cauchy metric can also be understood from an embedding. Let $s$ be a subset of $\{1, \dots, D\}$ with $m$ elements, $s = \{r_1, \dots, r_m\}$, and let $Y^{(s)}$ be the $m \times m$ matrix whose rows are the $r_1, \dots, r_m$-th rows of $Y$. If $s_1, s_2, \dots, s_n$ are all such choices of the subset $s$, ordered lexicographically, then the Binet-Cauchy embedding is defined as

$\Psi_{BC}: \mathcal{G}(m, D) \to \mathbb{R}^n, \quad Y \mapsto \left( \det Y^{(s_1)}, \dots, \det Y^{(s_n)} \right)$,   (9)

where $n = \binom{D}{m}$ is the number of ways of choosing $m$ rows out of $D$ rows. The natural inner product in this case is

$\sum_{i=1}^n \det Y_1^{(s_i)} \det Y_2^{(s_i)}$.   (10)

Proposition 2 The Binet-Cauchy kernel $k_{BC}(Y_1, Y_2) = (\det Y_1^\top Y_2)^2 = \det(Y_1^\top Y_2 Y_2^\top Y_1)$ is a Grassmann kernel.

Proof First, the kernel is well-defined because $k_{BC}(Y_1, Y_2) = k_{BC}(Y_1 R_1, Y_2 R_2)$ for any $R_1, R_2 \in O(m)$. To show that $k_{BC}$ is positive definite, it suffices to show that $k(Y_1, Y_2) = \det Y_1^\top Y_2$ is positive definite. From the Binet-Cauchy identity, we have $\det Y_1^\top Y_2 = \sum_s \det Y_1^{(s)} \det Y_2^{(s)}$. Therefore, for all $Y_1, \dots, Y_n$ ($Y_i \in \mathcal{G}$) and $c_1, \dots, c_n$ ($c_i \in \mathbb{R}$) for any $n \in \mathbb{N}$, we have

$\sum_{i,j} c_i c_j \det Y_i^\top Y_j = \sum_{i,j} c_i c_j \sum_s \det Y_i^{(s)} \det Y_j^{(s)} = \sum_s \Big( \sum_i c_i \det Y_i^{(s)} \Big)^2 \ge 0$.

We can also generate another family of kernels from the Binet-Cauchy kernel. Note that although $\det Y_1^\top Y_2$ is a Grassmann kernel, we prefer using $k_{BC}(Y_1, Y_2) = (\det Y_1^\top Y_2)^2$, since it is directly related to the principal angles by $(\det Y_1^\top Y_2)^2 = \prod_i \cos^2\theta_i$, whereas $\det Y_1^\top Y_2 \ne \prod_i \cos\theta_i$ in general.³ Another variant, $\arcsin k_{BC}(Y_1, Y_2)$, is also a positive definite kernel,⁴ and its induced metric $d = (\arccos(\det Y_1^\top Y_2))^{1/2}$ is a conditionally positive definite metric.

³ $\det Y_1^\top Y_2$ can be negative, whereas $\prod_i \cos\theta_i$, the product of singular values, is nonnegative by definition.
⁴ Theorems 4.18 and 4.19 of (Schölkopf & Smola, 2001).

4.3. Indefinite Kernels from Other Metrics

Since the Projection metric and the Binet-Cauchy metric are derived from positive definite kernels, all the kernel-based algorithms for Hilbert spaces are at our disposal. In contrast, the other metrics of the previous sections are not associated with any Grassmann kernel. To show this we can use the following result (Schoenberg, 1938; Hein et al., 2005):

Proposition 3 A metric $d$ is induced from a positive definite kernel if and only if

$\hat{k}(x_1, x_2) = -d^2(x_1, x_2)/2, \quad x_1, x_2 \in X$,   (11)

is conditionally positive definite.

The proposition allows us to show that a metric is not induced from any positive definite kernel by constructing an indefinite kernel matrix from (11) as a counterexample. There have been efforts to use indefinite kernels for learning (Ong et al., 2004; Haasdonk, 2005), and several heuristics have been proposed to turn an indefinite kernel matrix into a positive definite matrix (Pekalska et al., 2002). However, we do not advocate the use of these heuristics, since they change the geometry of the original data.
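Below is a minimal sketch of the Binet-Cauchy kernel, together with an empirical check of positive definiteness on random subspaces (ours; the numerical check is only illustrative, since Proposition 2 already provides the proof).

import numpy as np

def binet_cauchy_kernel(Y1, Y2):
    # k_BC(Y1, Y2) = det(Y1.T @ Y2)^2, as in Proposition 2.
    return np.linalg.det(Y1.T @ Y2) ** 2

# Empirical sanity check: the Gram matrix of k_BC on random subspaces is
# (numerically) positive semidefinite.
rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((20, 3)))[0] for _ in range(15)]
K = np.array([[binet_cauchy_kernel(Yi, Yj) for Yj in bases] for Yi in bases])
print(np.linalg.eigvalsh(K).min() >= -1e-10)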
5. Grassmann Discriminant Analysis

In this section we give an example of discriminant analysis on the Grassmann space by using kernel LDA with the Grassmann kernels.

5.1. Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) (Fukunaga, 1990), followed by a K-NN classifier, has been successfully used for classification. Let $\{x_1, \dots, x_N\}$ be the data vectors and $\{y_1, \dots, y_N\}$ be the class labels, $y_i \in \{1, \dots, C\}$. Without loss of generality we assume the data are ordered according to the class labels: $1 = y_1 \le y_2 \le \dots \le y_N = C$. Each class $c$ has $N_c$ samples.

Let $\mu_c = \frac{1}{N_c} \sum_{\{i \mid y_i = c\}} x_i$ be the mean of class $c$, and $\mu = \frac{1}{N} \sum_i x_i$ be the overall mean. LDA searches for the discriminant direction $w$ which maximizes the Rayleigh quotient $L(w) = w^\top S_b w / w^\top S_w w$, where $S_b$ and $S_w$ are the between-class and within-class covariance matrices, respectively:

$S_b = \frac{1}{N} \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^\top$,
$S_w = \frac{1}{N} \sum_{c=1}^{C} \sum_{\{i \mid y_i = c\}} (x_i - \mu_c)(x_i - \mu_c)^\top$.

The optimal $w$ is obtained as the largest eigenvector of $S_w^{-1} S_b$. Since $S_w^{-1} S_b$ has rank $C-1$, there are $C-1$ local optima $W = \{w_1, \dots, w_{C-1}\}$. By projecting the data onto the space spanned by $W$, we achieve dimensionality reduction and feature extraction onto the most discriminant subspace.

5.2. Kernel LDA with Grassmann Kernels

Kernel LDA can be formulated using the kernel trick as follows. Let $\phi: \mathcal{G} \to \mathcal{H}$ be the feature map, and let $\Phi = [\phi_1 \dots \phi_N]$ be the feature matrix of the training points. Assuming $w$ is a linear combination of those feature vectors, $w = \Phi\alpha$, we can rewrite the Rayleigh quotient in terms of $\alpha$ as

$L(\alpha) = \dfrac{\alpha^\top K (V - \mathbf{1}_N \mathbf{1}_N^\top / N) K \alpha}{\alpha^\top \left( K (I_N - V) K + \sigma^2 I_N \right) \alpha}$,   (12)

where $K$ is the kernel matrix, $\mathbf{1}_N$ is the uniform vector $[1 \dots 1]^\top$ of length $N$, $V$ is a block-diagonal matrix whose $c$-th block is the uniform matrix $\mathbf{1}_{N_c} \mathbf{1}_{N_c}^\top / N_c$, and $\sigma^2 I_N$ is a regularizer that stabilizes the computation. Similarly to LDA, the set of optimal $\alpha$'s is computed from the eigenvectors.

The procedure for using kernel LDA with the Grassmann kernels is summarized below. Assume the $D$ by $m$ orthonormal bases $\{Y_i\}$ are already computed from the SVD of the sets in the data.

Training:
1. Compute the matrix $[K_{\mathrm{train}}]_{ij} = k_P(Y_i, Y_j)$ or $k_{BC}(Y_i, Y_j)$ for all $Y_i, Y_j$ in the training set.
2. Solve $\max_\alpha L(\alpha)$ by eigen-decomposition.
3. Compute the $(C-1)$-dimensional coefficients $F_{\mathrm{train}} = \alpha^\top K_{\mathrm{train}}$.

Testing:
1. Compute the matrix $[K_{\mathrm{test}}]_{ij} = k_P(Y_i, Y_j)$ or $k_{BC}(Y_i, Y_j)$ for all $Y_i$ in the training set and $Y_j$ in the test set.
2. Compute the $(C-1)$-dimensional coefficients $F_{\mathrm{test}} = \alpha^\top K_{\mathrm{test}}$.
3. Perform 1-NN classification with the Euclidean distance between $F_{\mathrm{train}}$ and $F_{\mathrm{test}}$.

Another way of applying LDA to subspaces is to use the Projection embedding $\Psi_P$ (7) or the Binet-Cauchy embedding $\Psi_{BC}$ (9) directly. A subspace is then represented by a $D$ by $D$ matrix in the former, or by a vector of length $n = \binom{D}{m}$ in the latter. However, using these embeddings to compute $S_b$ or $S_w$ is a waste of computation and storage when $D$ is large.
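A compact sketch of this training/testing procedure follows (ours; written for clarity rather than efficiency, and assuming the regularized quotient of Eq. (12) is solved as a generalized symmetric eigenvalue problem with SciPy). Labels are assumed to be NumPy integer arrays in {0, ..., C-1}.

import numpy as np
from scipy.linalg import eigh

def gda_fit(K_train, labels, sigma2=1e-3):
    # K_train: N x N Grassmann kernel matrix (Projection or Binet-Cauchy).
    # Returns the N x (C-1) coefficient matrix alpha maximizing Eq. (12).
    N = K_train.shape[0]
    C = int(labels.max()) + 1
    V = np.zeros((N, N))
    for c in range(C):
        idx = np.flatnonzero(labels == c)
        V[np.ix_(idx, idx)] = 1.0 / len(idx)
    Sb = K_train @ (V - np.ones((N, N)) / N) @ K_train
    Sw = K_train @ (np.eye(N) - V) @ K_train + sigma2 * np.eye(N)
    # Generalized eigenproblem Sb a = lambda Sw a; eigh returns ascending eigenvalues,
    # so keep the last C-1 eigenvectors (leading directions first).
    _, A = eigh(Sb, Sw)
    return A[:, -(C - 1):][:, ::-1]

def gda_predict(K_test, alpha, F_train, train_labels):
    # K_test: rows are test points evaluated against the training bases.
    F_test = K_test @ alpha
    d2 = ((F_test[:, None, :] - F_train[None, :, :]) ** 2).sum(axis=-1)
    return train_labels[d2.argmin(axis=1)]

# Usage: alpha = gda_fit(K_train, labels); F_train = K_train @ alpha;
#        preds = gda_predict(K_test, alpha, F_train, labels).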
5.3. Other Subspace-Based Algorithms

5.3.1. Mutual Subspace Method (MSM)

The original MSM (Yamaguchi et al., 1998) performs simple 1-NN classification with $d_{Max}$ and no feature extraction. The method can be extended to any of the distances described in this paper. There have been attempts to use kernels for MSM (Sakano, 2000). However, the kernel is used only to represent the data in the original space, and the algorithm remains a 1-NN classification (a minimal code sketch of this baseline follows at the end of this section).

5.3.2. Constrained MSM

Constrained MSM (Fukui & Yamaguchi, 2003) is a technique that applies dimensionality reduction to the bases of the subspaces in the original space. Let $G = \sum_i Y_i Y_i^\top$ be the sum of the projection matrices, and let $\{v_1, \dots, v_D\}$ be the eigenvectors corresponding to the eigenvalues $\lambda_1 \ge \dots \ge \lambda_D$ of $G$. The authors claim that the first few eigenvectors $v_1, \dots, v_d$ of $G$ are more discriminative than the later eigenvectors, and they suggest projecting the basis vectors of each subspace $Y_i$ onto $\mathrm{span}(v_1, \dots, v_d)$, followed by normalization and orthonormalization. However, this procedure lacks justification, as well as a clear criterion for choosing the dimension $d$, on which the result crucially depends in our experience.

5.3.3. Discriminant Analysis of Canonical Correlations (DCC)

DCC (Kim et al., 2007) can be understood as a nonparametric version of linear discriminant analysis using the Procrustes metric (6). The algorithm finds the discriminant direction $w$ which maximizes the ratio $L(w) = w^\top S_b w / w^\top S_w w$, where $S_b$ and $S_w$ are the nonparametric between-class and within-class 'covariance' matrices

$S_b = \sum_i \sum_{j \in B_i} (Y_i U - Y_j V)(Y_i U - Y_j V)^\top$,
$S_w = \sum_i \sum_{j \in W_i} (Y_i U - Y_j V)(Y_i U - Y_j V)^\top$,

where $U$ and $V$ are from (1), and $B_i$ and $W_i$ are the sets of between-class and within-class nearest neighbors of the $i$-th subspace. Recall that $\mathrm{tr}[(Y_i U - Y_j V)(Y_i U - Y_j V)^\top] = \|Y_i U - Y_j V\|_F^2$ is the squared Procrustes metric. However, unlike in our method, $S_b$ and $S_w$ do not admit a geometric interpretation as true covariance matrices, and they cannot be kernelized either. A main disadvantage of DCC is that the algorithm iterates two stages, 1) maximizing the ratio $L(w)$ and 2) recomputing $S_b$ and $S_w$, which results in computational overhead and more parameters to be determined. This reflects the complication of treating the problem in a Euclidean space with a non-Euclidean distance.
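The following sketch (ours, not from the original MSM paper) implements the plain MSM baseline of Sec. 5.3.1: 1-NN classification with the max-correlation distance $d_{Max}$ of Eq. (4) and no feature extraction.

import numpy as np

def d_max(Y1, Y2):
    # Max-correlation distance (Eq. (4)): the sine of the smallest principal
    # angle, obtained from the largest singular value of Y1.T @ Y2.
    c1 = np.clip(np.linalg.svd(Y1.T @ Y2, compute_uv=False)[0], 0.0, 1.0)
    return np.sqrt(1.0 - c1 ** 2)

def msm_classify(train_bases, train_labels, test_bases):
    # Plain MSM: the nearest training subspace under d_max decides the label.
    preds = []
    for Yt in test_bases:
        dists = [d_max(Yt, Yi) for Yi in train_bases]
        preds.append(train_labels[int(np.argmin(dists))])
    return preds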
6. Experiments

In this section we test the Grassmann Discriminant Analysis on 1) a face recognition task and 2) an object categorization task with real image databases.

6.1. Algorithms

We use the following six methods for feature extraction, each followed by a 1-NN classifier: 1) GDA1 (with the Projection kernel), 2) GDA2 (with the Binet-Cauchy kernel), 3) Min dist, 4) MSM, 5) cMSM, and 6) DCC. For GDA1 and GDA2, the optimal value of the regularizer $\sigma^2$ in (12) is found by scanning through a range of values; the results do not vary much as long as it is small enough. Min dist is a simple pairwise distance which is not subspace-based: if $Y_i$ and $Y_j$ are two sets of basis vectors, $Y_i = \{y_{i1}, \dots, y_{im_i}\}$ and $Y_j = \{y_{j1}, \dots, y_{jm_j}\}$, then $d_{Mindist}(Y_i, Y_j) = \min_{k,l} \|y_{ik} - y_{jl}\|_2$. For cMSM and DCC, the optimal dimension of the reduced space is found by exhaustive search. For DCC, we use two nearest neighbors for $B_i$ and $W_i$ in Sec. 5.3.3. Since $S_w$ and $S_b$ are likely to be rank-deficient, we first reduce the dimension of the data to $N - C$ using PCA, as recommended. Each optimization is iterated 5 times.

6.2. Testing Illumination-Invariance with the Yale Face Database

The Yale face database and the Extended Yale face database (Georghiades et al., 2001) together consist of pictures of 38 subjects with 9 different poses and 45 different lighting conditions. Face regions were cropped from the original pictures, resized to 24 × 21 pixels ($D = 504$), and normalized to have the same variance. For each subject and each pose, we model the illumination variations by a subspace of size $m = 1, \dots, 5$, spanned by the 1 to 5 largest eigenvectors from the SVD. We evaluate the recognition rate of subjects with nine-fold cross-validation, holding out one pose of all subjects from the training set and using it for testing.

The recognition rates are shown in Fig. 2. GDA1 outperforms the other methods consistently. GDA2 also performs well for small $m$, but performs worse as $m$ becomes large. The rates of the other methods also seem to decrease as $m$ increases. An interpretation of this observation is that the first few eigenvectors of the data already carry enough information, and the smaller eigenvectors are spurious for discriminating the subjects.

6.3. Testing Pose-Invariance with the ETH-80 Database

The ETH-80 database (Leibe & Schiele, 2003) consists of pictures of 8 object categories ('apple', 'pear', 'tomato', 'cow', 'dog', 'horse', 'cup', 'car'). Each category contains 10 objects, and each object is recorded under 41 different poses. Images were resized to 32 × 32 pixels ($D = 1024$) and normalized to have the same variance. For each category and each object, we model the pose variations by a subspace of size $m = 1, \dots, 5$, spanned by the 1 to 5 largest eigenvectors from the SVD. We evaluate the classification rate of the categories with ten-fold cross-validation, holding out one object instance of each category from the training set and using it for testing.

The recognition rates are also summarized in Fig. 2. GDA1 again outperforms the other methods most of the time, but cMSM performs better than GDA2 as $m$ increases. The rates seem to peak around $m = 4$ and then decrease as $m$ increases. This result is consistent with the observation that the eigenvalues from this database decrease more gradually than the eigenvalues from the Yale face database.

Figure 2. Recognition rates of subjects from the Yale face database (left), and classification rates of categories in the ETH-80 database (right). The bars represent the rates of the six algorithms (GDA1, GDA2, Min dist, MSM, cMSM, DCC) evaluated for $m = 1, \dots, 5$, where $m$ is the number of basis vectors for the subspaces. GDA1 achieves the best rates consistently, and GDA2 also performs competitively for small $m$.
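As in Secs. 6.2 and 6.3, each image set is summarized by a $D \times m$ orthonormal basis before any kernel is computed. A sketch of that preprocessing step follows (ours; the function name is illustrative, and the per-image variance normalization is our reading of "normalized to have the same variance").

import numpy as np

def image_set_to_basis(images, m):
    # images: list of equal-size grayscale images (2-D arrays) of one subject/pose
    # or one object/category. Returns a D x m orthonormal basis (D = number of
    # pixels) spanned by the m leading left singular vectors, as used to model
    # illumination or pose variation in Secs. 6.2 and 6.3.
    X = np.stack([im.ravel().astype(float) for im in images], axis=1)  # D x n
    X = X / (X.std(axis=0, keepdims=True) + 1e-12)  # equalize per-image variance
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]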
7. Conclusion

In this paper we have proposed a Grassmann framework for problems in which the data consist of subspaces. By using the Projection metric and the Binet-Cauchy metric, which are derived from the Grassmann kernels, we were able to apply kernel methods such as kernel LDA to subspace data. In addition to having theoretically sound grounds, the proposed method also outperformed state-of-the-art methods in two experiments with real data. As future work, we are pursuing a better understanding of probability distributions on the Grassmann manifold.

References

Absil, P., Mahony, R., & Sepulchre, R. (2004). Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Appl. Math., 80, 199-220.

Chang, J.-M., Beveridge, J. R., Draper, B. A., Kirby, M., Kley, H., & Peterson, C. (2006). Illumination face spaces are idiosyncratic. IPCV (pp. 390-396).

Chikuse, Y. (2003). Statistics on special manifolds. Lecture Notes in Statistics, vol. 174. New York: Springer.

Edelman, A., Arias, T. A., & Smith, S. T. (1999). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20, 303-353.

Faragó, A., Linder, T., & Lugosi, G. (1993). Fast nearest-neighbor search in dissimilarity spaces. IEEE Trans. Pattern Anal. Mach. Intell., 15, 957-962.

Fukui, K., & Yamaguchi, O. (2003). Face recognition using multi-viewpoint patterns for robot vision. Int. Symp. of Robotics Res. (pp. 192-201).

Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). San Diego, CA, USA: Academic Press Professional, Inc.

Georghiades, A. S., Belhumeur, P. N., & Kriegman, D. J. (2001). From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell., 23, 643-660.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD, USA: Johns Hopkins University Press.

Haasdonk, B. (2005). Feature space interpretation of SVMs with indefinite kernels. IEEE Trans. Pattern Anal. Mach. Intell., 27, 482-492.

Hein, M., Bousquet, O., & Schölkopf, B. (2005). Maximal margin classification for metric spaces. J. Comput. Syst. Sci., 71, 333-359.

Kim, T.-K., Kittler, J., & Cipolla, R. (2007). Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. Pattern Anal. Mach. Intell., 29, 1005-1018.

Kondor, R. I., & Jebara, T. (2003). A kernel between sets of vectors. Proc. of the 20th Int. Conf. on Mach. Learn. (pp. 361-368).

Leibe, B., & Schiele, B. (2003). Analyzing appearance and contour based methods for object categorization. CVPR, 02, 409.

Ong, C. S., Mary, X., Canu, S., & Smola, A. J. (2004). Learning with non-positive kernels. Proc. of the 21st Int. Conf. on Mach. Learn. (p. 81). New York, NY, USA: ACM.

Pekalska, E., Paclik, P., & Duin, R. P. W. (2002). A generalized kernel approach to dissimilarity-based classification. J. Mach. Learn. Res., 2, 175-211.

Sakano, H., & Mukawa, N. (2000). Kernel mutual subspace method for robust facial image recognition. Proc. of Int. Conf. on Knowledge-Based Intell. Eng. Sys. and App. Tech. (pp. 245-248).

Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Trans. Amer. Math. Soc., 44, 522-536.

Schölkopf, B., & Smola, A. J. (2001). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA, USA: MIT Press.

Shakhnarovich, G., Fisher, J. W., & Darrell, T. (2002). Face recognition from long-term observations. Proc. of the 7th Euro. Conf. on Computer Vision (pp. 851-868). London, UK.

Turk, M., & Pentland, A. P. (1991). Eigenfaces for recognition. J. Cog. Neurosc., 3, 71-86.

Vishwanathan, S., & Smola, A. J. (2004). Binet-Cauchy kernels. Proc. of Neural Info. Proc. Sys.

Wang, L., Wang, X., & Feng, J. (2006). Subspace distance analysis with application to adaptive Bayesian algorithm for face recognition. Pattern Recogn., 39, 456-464.

Wolf, L., & Shashua, A. (2003). Learning over sets using kernel principal angles. J. Mach. Learn. Res., 4, 913-931.

Wong, Y.-C. (1967). Differential geometry of Grassmann manifolds. Proc. of the Nat. Acad. of Sci., 57, 589-594.

Yamaguchi, O., Fukui, K., & Maeda, K. (1998). Face recognition using temporal image sequence. Proc. of the 3rd Int. Conf. on Face & Gesture Recognition (p. 318). Washington, DC, USA: IEEE Computer Society.
Zhou, S. K., & Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Trans. Pattern Anal. Mach. Intell., 28, 917-929.