SIGIR 2007 Proceedings Poster

Dimensionality Reduction for Dimension-specific Search

Zi Huang, Heng Tao Shen, Xiaofang Zhou
School of ITEE, University of Queensland, Australia
{huang,shenht,zxf}@itee.uq.edu.au

Dawei Song, Stefan Rüger
Knowledge Media Institute, The Open University, United Kingdom
{d.song,s.rueger}@open.ac.uk

ABSTRACT
Dimensionality reduction plays an important role in efficient similarity search, which is often based on k-nearest neighbor (k-NN) queries over a high-dimensional feature space. In this paper, we introduce a novel type of k-NN query, namely the conditional k-NN (ck-NN), which considers dimension-specific constraints in addition to the inter-point distances. However, existing dimensionality reduction methods are not applicable to this new type of query. We propose a novel Mean-Std (standard deviation) guided Dimensionality Reduction (MSDR) to support a pruning-based, efficient ck-NN query processing strategy. Our preliminary experimental results on 3D protein structure data demonstrate that the MSDR method is promising.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models; H.2.8 [Database Applications]: Scientific databases

General Terms: Algorithms

1. INTRODUCTION
The k-nearest neighbor (k-NN) similarity search plays a central role in a wide range of applications, such as multimedia retrieval, molecular biology, and medical imaging. Data objects are represented by automatically extracted content features, which are points (vectors) in a high-dimensional space. Similarity query processing finds the data objects similar to a query, often the k nearest neighboring points of the query object in the high-dimensional space, by measuring the distance (often the Euclidean distance) between each point in the database and the query point.

Conditional k-NN. Recently, there has been an emerging demand for a new type of query, which takes into account not only inter-point distances (as in the conventional k-NN), but also local constraints requiring that two objects match within a certain tolerance threshold along certain individual dimensions. This is meaningful for many practical applications. For example, in protein structure analysis, two structures that are close in terms of their Euclidean distance but differ greatly along a single dimension may have completely different biological functions [6]. We refer to this type of query as a conditional k-NN query (ck-NN query).

Curse of dimensionality. Similarity search over a large-scale complex data repository is computationally intensive, mainly due to the size of the data and the high dimensionality of the feature space, a problem known as the "curse of dimensionality" [2]. To alleviate it, dimensionality reduction techniques have been explored for finding a lower-dimensional approximation of the original space. However, existing dimensionality reduction methods, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Locality Preserving Projections (LPP) [3], are not applicable to ck-NN, since they consider the global correlation of dimensions among all points and neglect the difference along each individual dimension between two points. Consequently, they are not sufficient to support ck-NN queries.

In this paper, we formulate the conditional k-NN problem and propose a novel Mean-Std guided Dimensionality Reduction (MSDR) to facilitate an efficient ck-NN query processing strategy.

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

2. CONDITIONAL K-NN SEARCH
Definition 1 (Dimension-specific Similarity Measure): Given two objects represented by their feature vectors A = (a1, a2, ..., aD) and B = (b1, b2, ..., bD), A and B are similar (denoted A ≈ε B) if ∀i ∈ 1..D, ai ≈ε bi, where D is the dimensionality of the feature space. "≈ε" means "equal to within a tolerance ε" (i.e., |ai − bi| ≤ ε), where ε = 0 implies a rigid matching (the two objects are identical); otherwise the matching is semi-rigid.

However, one potential problem is that, depending on the value of ε, too many similar objects may be found. It is often necessary to add a global similarity measure, normally based on the Euclidean distance between objects. In summary, given a query object, the problem we investigate here is to find the k most similar objects in the dataset, where the similarity between two objects is measured by the Euclidean distance, subject to the maximum allowable variation ε on each dimension.

3. MEAN-STD GUIDED DIMENSIONALITY REDUCTION (MSDR)
MSDR first analyzes the statistics of each dimension globally. Dimensions with similar statistics are then summarized into single ones. For each point, each of its summarized dimensions is in turn represented by the local statistic of its original dimensions. The statistical parameters we use are the mean (μ) and the standard deviation (σ). We outline the MSDR algorithm as follows.

Compute μ and σ: MSDR first computes μ and σ of the values along each dimension over all points, and represents each dimension as a 2D point (μ, σ). The original space X is thus transformed into a 2D μ-σ space with D points corresponding to the original D dimensions.

Clustering: The K-means algorithm is employed in the μ-σ space to classify the D points into D′ clusters (D′ = K), each of which is described by its centroid c.

Generate the subspace: It is natural to combine the dimensions in each cluster into one, resulting in a new D′-dimensional space X′. Each object in X is mapped onto X′, where its value on each new dimension is the mean value of the dimensions in X that form the new one.
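The three MSDR steps above can be sketched in Python as follows. This is a minimal illustration under our own naming (`msdr` and the deterministically initialized `kmeans` are ours, not the authors' implementation); the paper does not specify the K-means initialization or whether the (μ, σ) points are normalized before clustering.

```python
import math

def dist2(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(pts, k, iters=50):
    # Simple Lloyd's algorithm; initial centroids are picked
    # deterministically (an illustrative choice, not from the paper).
    centroids = [pts[(i * len(pts)) // k] for i in range(k)]
    labels = [0] * len(pts)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in pts]
        for c in range(k):
            members = [p for p, l in zip(pts, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels

def msdr(points, k):
    """Reduce D-dimensional points to k dimensions, MSDR-style."""
    D, n = len(points[0]), len(points)
    # Step 1: mean and standard deviation along each original dimension.
    stats = []
    for d in range(D):
        vals = [p[d] for p in points]
        mu = sum(vals) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / n)
        stats.append((mu, sigma))
    # Step 2: cluster the D (mu, sigma) points into k groups.
    labels = kmeans(stats, k)
    # Step 3: each new dimension is the mean of the original
    # dimensions that fall in its cluster.
    clusters = [[d for d in range(D) if labels[d] == c] for c in range(k)]
    return [[sum(p[d] for d in cl) / len(cl) if cl else 0.0
             for cl in clusters]
            for p in points]
```

For example, in a 4-dimensional space where dimensions 0 and 1 share one (μ, σ) profile and dimensions 2 and 3 share another, `msdr(points, 2)` merges each pair into a single averaged dimension.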
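Combining Definition 1 with the global Euclidean measure gives the following baseline formulation of ck-NN search (a minimal sketch with our own function names; this is the naive filter-then-rank strategy, not the hybrid pruning algorithm described later):

```python
import math

def conditionally_similar(a, b, eps):
    # Definition 1: equal to within tolerance eps on every dimension.
    return all(abs(x - y) <= eps for x, y in zip(a, b))

def euclidean(a, b):
    # Global similarity measure between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ck_nn(query, data, k, eps):
    # Filter by the per-dimension tolerance, then rank the survivors
    # by global Euclidean distance and keep the k closest.
    candidates = [p for p in data if conditionally_similar(query, p, eps)]
    return sorted(candidates, key=lambda p: euclidean(query, p))[:k]
```

Note that an object close to the query in Euclidean distance is still rejected if any single dimension exceeds the tolerance ε, which is exactly what distinguishes ck-NN from conventional k-NN.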
4. ck-NN QUERY PROCESSING
An intuitive way to process a ck-NN query is to prune the data space by first performing conditional pruning along individual dimensions to form a candidate set, and then further pruning the candidate set by the global similarities, or vice versa. However, this generates a relatively large number of intermediate candidates, which is strongly undesirable for large-scale data. We propose a novel hybrid ck-NN query processing strategy that applies conditional pruning and k-NN pruning concurrently.

The basic index structure maintains a separate (object, value) table for each dimension. The tables are accessed one by one in a certain order. After a dimension is accessed, the lower and upper bounds of the candidates are updated correspondingly. The candidates are then sorted by their upper bounds in ascending order. Each candidate is first checked against the condition rule and removed from the candidate set if the condition is violated, followed by k-NN pruning: if the current candidate's lower bound is greater than the k-th largest upper bound, it can be safely pruned. In the next iteration, the same process is performed and the candidate set is further reduced, until all dimensions have been processed, since the measure of similar patches requires pair-wise comparisons on all dimensions. Due to the page limit, we do not give the formal description of the upper/lower bounds in this paper.

5. EXPERIMENTS
This section reports a preliminary performance study of our MSDR algorithm. As mentioned previously, the existing dimensionality reduction methods [3] and high-dimensional indexing methods [2] are not applicable to the ck-NN problem, and thus not directly comparable with the proposed MSDR approach. MSDR derives a lower-dimensional space X′ and therefore loses information. Average precision is used as the effectiveness indicator, with ck-NN search results from the original space as ground truth. The efficiency of our hybrid ck-NN algorithm is measured by the Pruning Power (PP), defined as the ratio of the number of pruned objects to the total number of objects. Clearly, a larger PP corresponds to a more powerful pruning strategy and hence a faster response.

We conduct experiments on 3D protein structure data. The feature space is constructed based on a compact data representation model, which represents each structure as a high-dimensional point [5][4]. From a total of 1,100 sample protein structures in the Protein Data Bank [1], we build a dataset of 2,207,018 53-dimensional points (feature vectors). 100 points are randomly selected as queries. k is set to 10. The distance tolerance ε along each dimension is set to 1.5 Å to be biologically meaningful.

[Figure 1: Effect of ε and D′. (a) Dimensionality vs. precision; (b) Pruning Power (PP) comparison of the hybrid ck-NN, k-NN pruning and conditional pruning methods.]

Fig. 1(a) shows that the average precision increases as more dimensions are retained. When D′ reaches 8, the precision rises sharply to more than 70%, while the new space is only 8/53 of the original space. When D′ is greater than 8, the precision increases slowly before finally reaching 100%. The result confirms that MSDR can achieve reasonable result quality even when the dimensionality is reduced to a very small number. Fig. 1(b) depicts the pruning power of the three methods: our hybrid ck-NN method achieves the best performance, pruning considerably more of the space than the other two methods.

6. CONCLUSION
In this paper, we propose a new query type called the ck-NN query and, correspondingly, a novel method, MSDR, to support efficient ck-NN search. The results demonstrate the encouraging performance of our method. A more extensive performance evaluation is currently being conducted on protein structure as well as image data, and will further involve comparison with human relevance judgments.

7. ACKNOWLEDGEMENTS
This work was funded in part by the UK's Engineering and Physical Sciences Research Council (EPSRC) through an Overseas Travel Grant (EP/E037402/1), and by the European Union Sixth Framework Programme (FP6) through the integrated project Pharos (IST-2006-045035).

8. REFERENCES
[1] Protein Data Bank. http://www.rcsb.org/pdb/.
[2] C. Böhm, S. Berchtold, and D. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322-373, 2001.
[3] X. He, D. Cai, H. Liu, and W.-Y. Ma. Locality preserving indexing for document representation. In SIGIR, pages 96-103, 2004.
[4] Z. Huang, X. Zhou, H.T. Shen, and D. Song. 3D protein structure matching by patch signatures. In DEXA, pages 528-537, 2006.
[5] Z. Huang, X. Zhou, D. Song, and P. Bruza. Dimensionality reduction in motif-signature based protein structure matching. In ADC, pages 89-97, 2006.
[6] R.V. Spriggs, P.J. Artymiuk, and P. Willett. Searching for patterns of amino acids in 3D protein structures. J. Chem. Inf. Comput. Sci., 43(2):412-421, 2003.