Kernels on Attributed Pointsets with Applications

Mehul Parsana mehul.parsana@gmail.com Chiranjib Bhattacharyya chiru@csa.iisc.ernet.in

Sourangshu Bhattacharya sourangshu@gmail.com K. R. Ramakrishnan krr@ee.iisc.ernet.in

Abstract
This paper introduces kernels on attributed pointsets, whi ch are sets of vectors embedded in an euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attr ibute similarity and the other evaluating shape similarity. Shape similarity fu nction is motivated from spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results.

1

Introduction

In recent times, one of the major challenges in kernel methods has been design of kernels on structured data e.g. sets [9, 17, 15], graphs [8, 3], strings, auto mata, etc. In this paper, we propose kernels on a type of structured objects called attributed pointsets [18]. Attributed pointsets are points embedded in a euclidean space with a vector of attributes attached to each point. The embedding of points in the euclidean space yields a notion of neighborhood of each point which is exploited in designing new kernels. Also, we describe the notion of similarity between pointsets which model many real life scenarios and incorporate it in the proposed k ernels. The main contribution of this paper is definition of two different kernels on neighborhoods. These neighborhood kernels are then used to define kernels on the en tire pointsets. The first kernel treats the neighborhoods as sets of vectors for calculating the simila rity. Second kernel calculates similarity in shape of the two neighborhoods. It is motivated using spectral graph matching techniques [16]. We demonstrate practical applications of the kernels on the well known task of face recognition [20], and two other novel tasks of tagging photo albums and annotation of shots in video sequences. For the face recognition task, we test our kernels on benchmark d atasets and compare their performance with state-of-the-art algorithms. Our kernels outperform the existing methods in many cases. The kernels also perform according to expectation on the two novel applications. Section 2 defines attributed pointsets and contrasts it with related notions . Section 3 proposes two kernels and section 4 describes experimental results.

2

Definition and related work

An attributed pointset [18, 1] (a.k.a. point pattern) X is sets of points in Rk with attributes or labels (real vectors in this case) attached to each point. Thus, X = {(xi , di )|i = 1 . . . n}, where xi  Ru and di  Rv , l being the dimension of the attribute vector. The number of po ints in a pointset, n, is variable. Also, for practical purposes pointsets with u = 2, 3 are of interest. The construct of pointsets are richer than sets of vectors [17] because of t he structure formed by embedding of the points in a euclidean space. However, they are less general than attributed graphs because all 1


Figure 1: Correspondences implicitly found by sum and neigh borhood kernels

attributed graphs cannot be embedded onto a euclidean space. Pointsets are useful in several domains including computer vision [18], computational biology [5], etc. The notion of similarity between pointsets is also different from those between sets of vectors, or graphs. The main aspect of similarity is that there should be correspondences (1-1 mappings) between the points of a pointset such that the relative positions of corresponding point are same. Also the attribute vectors of the matching points should be similar. In case of sets of vectors, the kernel function captures the similarity between aggregate properties of the two sets, such as the principle angles between spanned subspaces [17], or distance between the distributions generating the vectors [9]. Kernels on graphs try to capture similarity in the graph topology by comparing the number of similar paths [3], or comparing steady state distr ibutions on of linear systems on graphs [8]. For example, consider recognizing faces using local descri ptors calculated at some descriptor points (corner points in this case) on the face. It is necessary that subsets of descriptor points found in two images of the same face should be approximately superimposable (slight changes may be due to change of expression) and that the descriptor values for the corresponding points should be roughly same to ensure similar local features. Thus, a face can be mod eled as an attributed pointset X = {(xi , di )|i = 1 . . . n}, where xi  R2 is the coordinate of ith descriptor point and di  Rv is the local descriptor vector at the ith descriptor point. Similar arguments can be provided for any object recognition task. A local descriptor based kernel was proposed for object reco gnition in similar setting in [12]. Suppose X A = {(xA , dA )|i = 1 . . . nA } and X B = {(xB , dB )|i = 1 . . . nB } are two pointsets. The i i i i nB nA normalized sum kernel [12] was defined as KN S (X A , X B ) = nA1 B i=1 j =1 (K(dA , dB ))p , i j n where K(dA , dB ) is some kernel function on the descriptors. It was argued in [ 12] that raising j i the kernel to a high power p approximately calculates similarity between matched pairs of vectors. Using the RBF kernel KRB F (x, y ) = e- normalized sum kernels as: KN S (X A , X B ) =
x-y 2 2

, and adjusting the parameter p in  , we get the

n n 1 iA jB KRB F (dA , dB ) i j nA nB =1 =1

(1)

choosing F = 11T (all entries 1) and multiplying the kernel by n2 1 2 and using KRB F as the A nB kernel on vectors, we get back the kernel defined in (1). The no rmalized sum kernel is used as the basic kernel for development and validation of the new kerne ls proposed here. In the next section, we incorporate position xi of the points using the concept of neighborhood. 2

Observe that this kernel doesn't use the in formation in xi anywhere, and thus is actually a kernel on a set of vectors. In fact, this kernel canrbe derived as a bpecial case of the set kernel proposed s T^ ^r ecomes K(A, B ) = i k (ai , bj )fij in [15]. The kernel K(A, B ) = trace (A Gr B )F j r ^ for Gr = I and F = Fr (whose entries are fij ) should be positive semidefinite [15]. Thus,


3
3.1

Kernels
Neighborhood kernels

The key idea in this section is to use spatially co-occurring points of a point to improve the similarity values given by the kernel function. In other words, we hypothesize that similar points from two pointsets should also have neighboring points which are similar. Thus, for each point we define a neighborhood of the point and weight the similarity between each pair of points with the similarity between their neighborhoods. The k -neighborhood Ni of a point (xi , di ) in a pointset X is defined as the set of points (including itself) that are closest to it in the embedding euclidean space. So, Ni = {(xj , dj )  X | xi - xj  xi - xl (xl , dl )  Ni and |Ni | = k }. The neighborhood kernel between two points (xA , dA ) i i and (xB , dB ) is defined as: j j KN ((xA , dA ), (xB , dB )) = KRB F (dA , dB )× i i j j i j 1 |NiA ||NjB | A A A B xs ,ds )Ni xB ,dB )Nj t t ( ( KRB F (dA , dB ) s t (2) The neighborhood kernel (NK) between two pointsets X A and X B is thus defined as: KN K (X A , X B ) = iA jB 1 B KN ((xA , dA ), (xj , dB )) × i i j nA nB =1 =1
n n

(3)

It is easy to see that KN K is a positive semidefinite kernel function. Even though KN K is a straightforward extension, it considerably improves accuracy of KN S . Figure 1 shows values of KN S and KN K for 4 pairs of point from two pointsets modeling faces. Dark b lue lines indicate best matches given by KN S while bright blue lines indicate best matches by the KN K . In both cases, KN K gives the correct match while the KN S fails. Computational complexity of KN K is O(k 2 n2 ), k being neighborhood size and n number of points. The next section proposes a kernel which us es positions of points (xi ) in a neighborhood more strongly to calculate similarity in shape. 3.2 Spectral Neighborhood Kernel

The kernel defined in the previous section still uses a set of vectors kernel for finding similarity between the neighborhoods. Here, we are interested in a kern el function which evaluates the similarity in relative position of the corresponding points. Si nce the neighborhoods being compared are of fixed size, we assume that all points in a neighborhood have a corresponding point in the other. Thus, the correspondences are given by a permutation of poin ts in one of the neighborhoods. This problem can be formulated as the weighted graph matching pro blem [16], for which spectral method is one of the popular heuristics. We use the features given by spectral decomposition of adjacency matrix of the neighborhood to define a kernel function. Given a neighborhood Ni we define its adjacency matrix Ai as Ai (s, t) = x -x  -st e , s, t|(xs , ds ), (xt , dt )  Ni , where  is a parameter. Given two neighborhoods NiA and NjB , we are thus interested in a permutation  of the basis of adjacency matrix of one of the neighborhoods (say NjB ), such that AA -  (AB ) F is minimized, . F being the frobenius i j norm of a matrix. It is well known that a matrix can be fully reconstructed from its spectral decomposition. Also, in the k n T case that fewer eigenvectors are used, the equation A - i=1 i i i 2 = j =k+1 2 , suggests j F that eigenvectors corresponding to the higher eigenvalues will give better reconstruction. We use one eigenvector corresponding to largest eigenvalue. Thus, th e approximate adjacency matrix becomes T ^ A = 1 1 1 . ^ Let   be the optimal permutation that minimizes AA -  (AB ) F . Note that here  applied on a ^i j matrix implies permutation of the basis. It is easy to see that same permutation is induced on basis B A B B of the eigenvectors j (1). Call fiA = |i (1)| and fj = |j (1)|, the spectral projection vectors A B corresponding to neighborhoods NiA and NjB . Here i (1), j (1) are eigenvectors corresponding 3


^^ to largest eigenvalue of AA , AB , and | (1)| is the vector of absolute values of components of  (1). i j f (s) can be thought of as projection of the sth point in the corresponding neighborhood on R1 . It is B equivalent to seek a permutation   which minimizes fiA -  (fj ) , for comparing neighborhoods A B Ni and Nj . The resulting similarity score is:
B S (NiA , NjB ) = max T - fiA -  (fj ) 2 2  

(4)

where, T is a threshold for converting the distance measure to simila rity, and  is the set of all permutations. However, this similarity function is not necessarily positive semidefinite.
B To construct a positive semidefinite kernel giving similari ty between the vectors fiA and fj , we use the convolution kernel technique [7] on discrete stru ctures. Let x  X be a composite object formed using parts from X1 , . . . , Xm . Let R be a relation over X1 × · · · × Xm × X such that R(x1 , . . . , xm , x) is true if x is composed of x1 , . . . , xm . Let R-1 (x) = (x1 , . . . , xm )  X1 × · · · × Xm |R(x1 , . . . , xm , x) = true and K1 , . . . , Km be kernels on X1 , . . . , Xm , respectively. The convolution kernel K over X is defined as: im ( K(x, y ) = Ki (xi , yi ) (5)

Haussler [7] showed that if K , . . . , K

1

x1 ,...,xm )R-1 (x),(y1 ,...,ym )R-1 (y ) =1 m

are symmetric and positive semidefinite, so is K.

For us, let X be the set of all neighborhoods and X1 , . . . , Xm be the sets of spectral projections of all points from all the neighborhoods. Here, note that even if the same point appears in different neighborhoods, the appearances will be considered t o be different because the projections are relative to the neighborhoods. Since, each neighborhoo d has size k , in our case m = k . The relation R is defined as R(f (1), . . . , f (k ), NiA ) is true iff the vector (f (1), . . . , f (k )) =  (fiA ) for some permutation  . In other words, R(f (1), . . . , f (k ), NiA ) is true iff f (1), . . . , f (k ) are spectral projections the points of neighborhood NiA ). Also, let Ki , i = 1 . . . k all be RBF kernels with the same parameter  . Thus, from the above equation, the convolution kernel becomes A B - fi - (fj ) 2 - 1 Pl   A B 2 B  e  l = 1 ( f i ( l ) -f j (  ( l ) ) ) = k ! K(NiA , Nj ) = k ! e . Dividing by the   constant (k !)2 , we get kernel KS N as: A B - fi - (fj ) 2 1  (6) e KS N (NiA , NjB ) = k!


The spectral kernel (SK) KS K between two pointsets X A and X B is thus defined as: n n 1 iA jB KS K (X A , X B ) = KRB F (dA , dB )KS N (NiA , NjB ) i j nA nB =1 =1 Following theorem relates KS N (NiA , NjB ) to S (NiA , NjB ) (eqn 4).

(7)

Theorem 3.1 Let Ni and Nj be two sub-structures with spectral projection vectors f i and f j . For large enough value of T such that all points are matched. e-T S (Ni ,Nj ) e lim KS N (Ni , Nj )) =  0 k! Proof: Let   be the permutation that gives the optimal score S (Ni , Nj ). By definition, eS (Ni ,Nj ) = i  j eT e- f - (f ) 2 .
- f i - (f j ) 2  1  ) lim 0 (KS N (Ni , Nj )) = lim 0 ( k! e (l) -1  i j i  j i  j 1  ( f - (f ) 2- f - (f ) = k! e- f - (f ) 2 lim 0 (1 + \{  } e i  j 1 = -! e- f - (f ) 2 k

2) 

)

C omputational complexity of this kernel is O(k !n2 ), where k is neighborhood size and n is no. of descriptor points. However, since in practice only small neighborhood sizes are considered, the computation time doesn't become prohibitive. 4


1-NN PCA LEM AMM Face-ARG Sum(eq (1)) NK (eq (3)) SK (eq (7))

Table 1: Smile 96.3% 94.1% 78.6% 96.0% 97.8% 96.19% 98.09% 99.04%

Recognition accuracy on AR face dataset (section 4.1) Angry Scream Glasses Scarf Left-Light Right-Light 88.9% 57.0% 48.1% 3.0% 22.2% 17.8% 79.3% 44.4% 32.9% 2.2% 7.4% 7.4% 92.9% 31.3% 74.8% 47.4% 92.9% 91.1% 96.0% 56.0% 80.0% 82.0% NA NA 96.3% 66.7% 80.7% 85.2% 98.5% 96.3% 95.23% 83.80% 89.52% 60.00% 86.66% 80.95% 98.09% 85.71% 94.28% 65.71% 92.38% 86.66% 99.04% 86.66% 93.33% 65.71% 90.47% 84.76%

4

Experimental Results

In order to study the effectiveness of proposed kernels for p ractical visual tasks, we applied them on three problems. Firstly, the kernels were applied to the w ell known problem of face recognition [20], and results on two benchmark datasets (AR and ORL) were compared to existing state-of-theart methods. Next we used the spectral kernel to tag images in personal photo albums using faces of people present in them. Finally, the spectral kernel was u sed for annotation of video sequences using faces of people present. Attribute For face recognition, faces were modeled as attributed poin tsets using local gabor descriptors [10] calculated at the corner points using Harris corne r point detector [6]. At each point, gabor despite for three different scales and four different orien tations were calculated. Descriptors for 5 points (4 pixel neighbors and itself) were used for each of th e 12 combinations, making a total of 60 descriptors per point. For image tagging and video annota tion, faces were modeled as attributed pointsets using SIFT local descriptors [11], having 128 descriptors per point. The kernels were implemented in GNU C/C++. LAPACK [2] was used for calculation of eigenvectors and GNU GSL for calculation of permutations. LIBSVM [4] was used as the SVM based classifier for classifying pointsets. The face detector provided in OpenCV was used for detecting faces in album images and video frames. Dataset The AR dataset [13] is composed of color images of 135 people (75 men and 60 women). The DB includes frontal view images with different facial expressions, illumination conditions, and occlusion by sunglasses and scarf as shown in figure 2-a. After removing persons with corrupted images or missing any of the 8 types of required images, a tota l of 105 persons (56 men and 49 women) were selected. All the images were converted to greys cale and rescaled to 154 × 115 pixels. The ORL dataset is composed of 10 images for each of the 40 persons. The images have minor variations in pose, illumination and scale. All the 400, 112 × 92 pixel images were used for experiments. Figure 2-b gives representative images from t he ORL dataset.

a Figure 2: Representative images from AR and ORL datasets 4.1 Face Recognition in AR face DB

b

The kernels proposed in this paper, were tested pointsets de rived from images in AR face DB. Face recognition was posed as a multiclass classification problem, and SVMs were along with the proposed kernels. The AR face DB is a standard benchmark dataset, on which a recent comparison of state of the art methods for face recognition has been give n in [14]. In table 1, we have restated the results provided in [14] along with the results of our ker nels. All the results reported in table 1 have been obtained using one normal (no occlusion or change of expression) face image as the training set. 5


Table 2: Recognition accuracy on ORL dataset (section 4.2) # of training images  Sum (eq (1)) NK (eq (3)) SK (eq (7)) 1 70.83% 71.38% 71.94% 3 92.50% 93.57% 93.92% 5 98.00% 98.00% 98.00%

It can be seen that for all the images showing change of expression (Smile, Angry and Scream), the pointset kernels outperform existing methods. Also, in case of occlusion of face by glasses, the pointset kernels give better results than existing methods . However, in case of occlusion by scarf, the kernel based method do not perform as well as the Face-ARG or AMM. This failure is due to introduction of a large number of points in the scarf themsel ves. It was observed that about 50% of the descriptor points in the faces having scarfs were in the s carf region of the image. Summing the similarities over such a large number of extra points makes t he overall kernel value noisy. The proposed approach doesn't perform better than existing methods on images taken under extreme variation in lighting conditions. This is due to the fact that values of the local descriptors change drastically with illumination. Also, some of the corner poi nts disappear under different lighting condition. However, performance of the kernels is comparab le to the existing methods, thus demonstrating the effectiveness of modeling faces as attributed pointsets. 4.2 Recognition performance on ORL Dataset

Real life problems in face recognition also show minor varia tions in pose, which are addressed by testing the kernels on images in the ORL dataset. The problem was posed as a multiclass classification problem and SVM was used along with the kernels for classification. Table 2 reports the recognition accuracies of all the three kernels for two diff erent values of parameters, and for 1, 3 and 5 training images. It can be seen that even with images showing minor variations in pose, the proposed kernels perform reasonably well. Also, due to change in pose the relative position of points in the pointsets change. This is reflected in the fact that improvement due to addition of position information in kernels is minor as compared to those shown in AR dataset. For higher n umber of training images, the performance of all the kernels saturate at 98%. 4.3 Tagging images in personal albums based on faces

The problem of tagging images in personal albums with names of people present in them, is a problem of high practical relevance [19]. The spectral kernels w ere used solve this problem. Images from publicly available sources like http://www.flickr.com 1 were used for experimentation. Five personal albums having 20 - 55 images each were downloaded and many images had upto 6 people. Face detector from openCV library was used to autom atically detect faces in images. Detected faces are cropped and resized to 100 × 100 px resolution. 47 - 265 such faces detected from each album. To the best of our knowledge, there are no openly available techniques to benchmark our method against. Due to non-availability of training data, the problem of image tagging was posed as a clustering problem. Faces detected from the images were represented as attributed pointsets using SIFT local descriptors, and spectral kernel was evaluated on them. A th resK ld based clustering scheme was ho (x, x) + K (y , y ) - 2  k (x, y )). used on the distance metric induced by the kernel (d(x, y ) = Ideally, each cluster thus obtained should represent a pers on and images containing faces from a given cluster should be tagged with the name of that person. Table 3 reports results from tagging experiments for five alb ums. No. of people identified reports the number clusters having more than one faces, as singleton cluster will always be correct for that person. Thus, people appearing only once in the entire album are not reported, which reduce the no. of identified people. % identified and % false +ve are averaged over all clusters detected in the
1

We intend to make the dataset publicly available if no copyrights are violated

6


Table 3: Face based album tagging Album no. 1 2 3 4 5 No. of people (Actual) (Identified) 2 14 6 8 4 4 2 3 2 % Identified 90% 84% 66.66% 83.33% 80.00% % False +ve 0% 10.52% 8.33% 19.44% 14.70%

Figure 3: Representative cluster from tagging of album album, and are calculated for each cluster as: % identif ied =
f alse +v es in the cluster T otal no. of f aces in the cluster . N o. of cor r ect f aces in the cluster T otal no. of f aces of the per son

and

% f alse + v e = It can be seen that the kernel performs reasonably well on the dataset. Figure 3 shows a representative cluster with the first 8 images as true +ves and rest as false +ves. 4.4 Video annotation based on faces

The kernels were also used to perform video shot annotation b ased faces detected in video sequences. Experimentation was performed on videos from "New s and Public affair" section of www.archive.org and music videos from www.youtube.com. Video was sampled at 1 frame per second and experimental methodology was similar sectio n 4.3 was used on the frames. Figure 4 shows two representative shots from corresponding to two candidates from "Election 2004, presidential debate part 2", and one from "Westlife- Season s in the Sun" video. The faces annotating the shots are shown in the left as thumbnails. It may be noted that for videos, high pose variation did not reduce accuracy of recognition due to gradual changing o f pose. The results on detecting shots were highly encouraging, thus demonstrating the varied app licability of proposed attributed pointset kernels.

5

Conclusion

In this article, we propose kernels on attributed pointsets . We define the notion of neighborhood in an attributed pointset and propose two new kernels. The fir st kernel evaluates attribute similari-

Figure 4: Keyframes of a few shots detected with annotation 7


ties between the neighborhoods and uses the co-occurrence i nformation to improve the performance of kernels on sets of vectors. The second kernel uses the posi tion information more strongly and matches the shapes of neighborhoods. This kernel function is motivated from spectral graph matching techniques. The proposed kernels were validated on the well known task on face recognition on two popular benchmark datasets. Results show that the current kernels p erform competitively with the state-ofthe-art techniques for face recognition. The spectral kern el was also used to perform two real life tasks of tagging images in personal photo albums and annotating shots in videos. The results were encouraging in both cases.

References
[1] Helmut Alt and Leonidas J. Guibas. Discrete geometric shapes: Matching, interpolation, and approximation A survey. Technical Report B 96-11, 1996. [2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999. [3] Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74­81, Washington, DC, USA, 2005. IEEE Computer Society. [4] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm. [5] Ingvar Eidhammer, Inge Jonassen, and William R. Taylor. Structure comparison and structure patterns. Journal of Computational Biology, 7(5):685­716, 2000. [6] C. Harris and M.J. Stephens. A combined corner and edge de tector. In Proc. of Alvey Vision Conf., 1988. [7] David Haussler. Convolution kernels on discrete struct ures. Technical report, University of California, Santa Cruz, 1999. [8] Koji Tsuda Hisashi Kashima and Akihiro Inokuchi. Marginalized kernels between labeled graphs. In Twentieth International Conference on Machine Learning (ICML), 2003. [9] Risi Kondor and Tony Jebara. A kernel between sets of vectors. In Twentieth International Conference on Machine Learning (ICML), 2003. [10] Tai Sing Lee. Image representation using 2d gabor wavelets. IEEE TPAMI, 18(10):959­971, 1996. [11] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision, 20:91­110, 2003. [12] Siwei Lyu. Mercer kernels for object recognition with l ocal features. In IEEE CVPR, 2005. [13] A.M. Martinez and R. Benavente. The ar face database. CVC Technical Report, 24, 1998. [14] Bo Gun Park, Kyoung Mu Lee, and Sang Uk Lee. Face recognition using face-arg matching. IEEE TPAMI, 27(12):1982­1988, 2005. [15] Amnon Shashua and Tamir Hazan. Algebraic set kernels with application to inference over local image representations. In Neural Information Processing Systems (NIPS), 2004. [16] Shinji Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE transactions on pattern analysis and machine intelligence, 10(5):695­703, 1988. [17] Lior Wolf and Amnon Shashua. Learning over sets using ke rnel principal angles. Journal of Machine Learning Research, (4):913­931, 2003. [18] Haim J. Wolfson and Isidore Rigoutsos. Geometric hashing: An overview. IEEE Comput. Sci. Eng., 4(4):10­21, 1997. [19] L. Zhang, L. Chen, M. Li, and H. Zhang. Automated annotation of human faces in family albums, 2003. [20] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Comput. Surv., 35(4):399­458, 2003.

8