Image Annotation by Hierarchical Mapping of Features

Qiankun Zhao, Prasenjit Mitra, C. Lee Giles
College of Information Sciences and Technology
Pennsylvania State University, University Park, PA
{qzhao, pmitra, giles}@ist.psu.edu

Copyright is held by the author/owner(s). WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

ABSTRACT
In this paper, we propose a novel approach to image annotation that constructs a hierarchical mapping between low-level visual features and text features, utilizing the relations within and across both the visual and the textual features. Moreover, we propose a novel annotation strategy that maximizes both the accuracy and the diversity of the generated annotations by generalizing or specifying the annotation within the corresponding annotation hierarchy. Experiments with 4500 scientific images from Royal Society of Chemistry journals show that the proposed approach produces satisfactory results at different levels of annotation.

Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene Analysis - object recognition; H.3.3 [Information Storage and Retrieval]: Information Retrieval - search process

General Terms
Algorithm, Experiment, Performance

Keywords
Image Annotation, Hierarchical Relation, Feature Mapping

1. INTRODUCTION
Automatic image annotation is an important problem: annotation-based image retrieval outperforms content-based image retrieval [1], yet only a small fraction of the images on the web are annotated. Most existing annotation approaches learn a mapping between low-level visual features and keywords using co-occurrence, correlation, or probabilistic models [2, 4]. However, most of them ignore the relations among features within the visual features and within the textual annotations. This visual or textual context, which is reflected by the relations within the visual and textual features, plays an important role in determining the mapping model between visual and textual features. In this paper, we propose to construct a hierarchical mapping model between the visual and textual features of images by exploring the relations between features within and across the visual and textual dimensions. More importantly, we propose a novel annotation strategy that maximizes the diversity and accuracy of the predicted annotations based on the hierarchical mapping model.

2. HIERARCHICAL IMAGE CLUSTERING
First, we take into account the relations among features within the visual dimension and within the textual annotations to build two cluster hierarchies. An image usually contains multiple objects, and the correlations among these objects are expected to improve the annotation. We use hierarchical clustering because each cluster in the hierarchy is expected to be characterized by a subset of distinguishing features of the images in that cluster. The characteristics of an image are then the sum of the distinguishing features of all the clusters to which the image belongs. Images are clustered into two hierarchies based on visual features and on textual annotation features, respectively.

For each image, color, texture, and shape features are extracted as the visual features. The color features consist of 32 color histogram and cumulative histogram features and 36 gray-level co-occurrence features extracted from the co-occurrence matrix of the image. Texture features are extracted by computing the means and variances of the filtered image regions over 8 orientations at 6 scales. Shape features include edge-map-based features and line features. The figure caption, text references, and surrounding text in the scientific papers are extracted as the textual features of an image. The text segments are tokenized, part-of-speech tags are added, stop words are removed, and stemming is applied. As a result, for each type of text annotation a term vector is constructed for the corresponding image, and the term frequency combined with the inverted image frequency is used as the weight of each term in the vector.

To explore the hierarchical relations between images, we represent the set of images as a graph G = (V, E), where each vertex represents an image and each edge is weighted by the similarity between the pair of images it connects. The graph partitioning algorithm proposed by Shi and Malik [3] is then applied to cluster images into smaller groups, from which the hierarchical relations are constructed. For both the visual-feature-based clustering and the annotation-based clustering, each image is represented as a feature vector, and the cosine similarity between the vector representations is taken as the similarity between two images.
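The paper gives no implementation details beyond the description above. The following Python sketch (assuming NumPy) illustrates one way this clustering step could be realized: a cosine-similarity graph over image feature vectors is recursively bipartitioned with a spectral approximation of the normalized cut [3], yielding a cluster hierarchy. The feature dimensionality, the stopping criteria (min_size, max_depth), and the median-based split are illustrative assumptions, not values from the paper.

    import numpy as np

    def cosine_similarity_matrix(X):
        """Pairwise cosine similarity between the row vectors of X."""
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        Xn = X / norms
        return Xn @ Xn.T

    def normalized_cut_bipartition(W):
        """Approximate 2-way normalized cut (Shi & Malik) via the Fiedler
        vector of the symmetric normalized graph Laplacian."""
        d = W.sum(axis=1)
        d[d == 0] = 1e-12
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
        _, eigvecs = np.linalg.eigh(L_sym)
        fiedler = eigvecs[:, 1]            # second-smallest eigenvector
        return fiedler >= np.median(fiedler)

    def hierarchical_cluster(X, indices=None, min_size=10, depth=0, max_depth=4):
        """Recursively bipartition images into a cluster hierarchy.
        Returns a nested dict: {'members': [...], 'children': [...]}."""
        if indices is None:
            indices = np.arange(len(X))
        node = {"members": indices.tolist(), "children": []}
        if len(indices) <= min_size or depth >= max_depth:
            return node
        W = cosine_similarity_matrix(X[indices])
        mask = normalized_cut_bipartition(W)
        if mask.all() or (~mask).all():    # degenerate split: stop recursing
            return node
        for side in (indices[mask], indices[~mask]):
            node["children"].append(
                hierarchical_cluster(X, side, min_size, depth + 1, max_depth))
        return node

    # Example: 200 images with 100-dimensional visual feature vectors.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        visual_features = rng.random((200, 100))
        hierarchy = hierarchical_cluster(visual_features)
        print(len(hierarchy["children"]))

Each node of the returned hierarchy lists its member images, so a cluster can later be characterized by the discriminative features of those members, as described in Section 3. The same routine applies to the textual term vectors to obtain the annotation hierarchy.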
3. CONSTRUCTION OF MAPPING MODEL
Given an image cluster in the hierarchy, its features fall into two types: discriminative features and non-discriminative features. We propose a conditional Kullback-Leibler divergence metric to measure the discriminative power of a feature subset with respect to a child image cluster and its parent image cluster. By maximizing the conditional KL-divergence, the discriminative features of each image cluster in the hierarchy can be extracted. The conditional Kullback-Leibler divergence is defined as:

    D_KL(P || Q | f_i) = ∫ p(x | f_i) log [ p(x | f_i) / q(x | f_i) ] dx

where p and q denote the feature distributions, conditioned on feature f_i, in the child cluster and its parent cluster, respectively.

Then, to compute the strength of the links between image clusters across the two hierarchies, clusters are initially connected based on their common images. The weight of a link is measured by the mutual information of the two clusters X and Y, where both X and Y are represented by their discriminative features rather than by the entire set of visual or textual features. The intuition is that the larger the mutual information, the stronger the correlation between the corresponding discriminative features. For a given visual (textual) cluster, we rank the corresponding textual (visual) clusters by their mutual information values.
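As a concrete illustration of the two quantities used above, here is a hedged Python sketch (NumPy assumed): the conditional KL divergence is approximated with per-feature histograms of a child cluster versus its parent cluster, and the link weight between a visual and a textual cluster is computed as the mutual information of their binary membership indicators over the image collection. The paper scores feature subsets and does not specify its density estimates, binning, or cutoffs, so n_bins, top_k, and the per-dimension scoring are simplifying assumptions.

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        """Discrete KL divergence D(p || q) between two histograms."""
        p = p / (p.sum() + eps)
        q = q / (q.sum() + eps)
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    def discriminative_features(child_feats, parent_feats, n_bins=16, top_k=10):
        """Score each feature dimension by the KL divergence between its
        histogram in the child cluster and in the parent cluster, and keep
        the top_k most discriminative dimensions."""
        scores = []
        for j in range(child_feats.shape[1]):
            lo = min(child_feats[:, j].min(), parent_feats[:, j].min())
            hi = max(child_feats[:, j].max(), parent_feats[:, j].max())
            bins = np.linspace(lo, hi + 1e-9, n_bins + 1)
            p, _ = np.histogram(child_feats[:, j], bins=bins)
            q, _ = np.histogram(parent_feats[:, j], bins=bins)
            scores.append(kl_divergence(p.astype(float), q.astype(float)))
        return np.argsort(scores)[::-1][:top_k]

    def cluster_link_weight(visual_members, textual_members, n_images):
        """Mutual information between the membership indicators of a visual
        cluster X and a textual cluster Y over the whole image collection."""
        x = np.zeros(n_images, dtype=bool); x[list(visual_members)] = True
        y = np.zeros(n_images, dtype=bool); y[list(textual_members)] = True
        mi = 0.0
        for xv in (True, False):
            for yv in (True, False):
                pxy = np.mean((x == xv) & (y == yv))
                px, py = np.mean(x == xv), np.mean(y == yv)
                if pxy > 0:
                    mi += pxy * np.log(pxy / (px * py))
        return mi

The link weights returned by cluster_link_weight can be used directly to rank the textual clusters associated with a given visual cluster, as described above.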
4. ANNOTATION STRATEGY
Given the mapping model between visual features and textual annotation features, the goal of image annotation is to produce annotations that are as complete, diverse, and accurate as possible. Based on the mapping model, a ranked list of textual annotation clusters corresponds to a given image. Since one goal of image annotation is to produce annotations that are as diverse as possible, we define the diversity of an annotation set A as:

    A_d = max over A of  Σ_{a_i ≠ a_j ∈ A} Dist(a_i, a_j)

where Dist(a_i, a_j) is the distance between two annotation clusters, defined as

    Dist(a_i, a_j) = min over common ancestors a_c of  (|a_i| - |a_c|) + (|a_j| - |a_c|)

with |a_i| the depth of cluster a_i in the hierarchy and a_c a common ancestor of a_i and a_j.

To maximize the accuracy of the predicted annotation, both the strength of the relation between the predicted annotation cluster and the image cluster and the depth of the corresponding annotation cluster are taken into account. The accuracy of an annotation set A is defined as:

    A_a = max over A of  Σ_{a_i ∈ A} I(image, a_i) × |a_i|

where I(image, a_i) is the link strength between the annotation cluster a_i and the image, the image being represented by its visual image cluster. The final annotation is based on a combination of the diversity and the accuracy of the annotation results:

    A = max ( α · A_d + β · A_a )

where α and β are the weights of the diversity and accuracy terms, and α + β = 1.
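The combined objective can be made concrete with a small Python sketch, assuming the annotation hierarchy is given as a child-to-parent dictionary and the link strengths I(image, a_i) as a dictionary keyed by cluster; the subset size k and the brute-force search over the top-ranked candidate clusters are illustrative choices that the paper does not specify.

    import itertools

    def depth(cluster, parent):
        """Depth of a cluster node in the annotation hierarchy (root = 0)."""
        d = 0
        while parent.get(cluster) is not None:
            cluster = parent[cluster]
            d += 1
        return d

    def common_ancestor(a, b, parent):
        """Lowest common ancestor of two clusters in the hierarchy."""
        ancestors, node = set(), a
        while node is not None:
            ancestors.add(node)
            node = parent.get(node)
        node = b
        while node is not None:
            if node in ancestors:
                return node
            node = parent.get(node)
        return None

    def dist(a, b, parent):
        """Hierarchy distance: extra depth of a and b below their LCA."""
        c = common_ancestor(a, b, parent)
        dc = depth(c, parent) if c is not None else 0
        return (depth(a, parent) - dc) + (depth(b, parent) - dc)

    def annotation_score(clusters, strength, parent, alpha=0.3, beta=0.7):
        """alpha * diversity + beta * accuracy for one candidate cluster set."""
        diversity = sum(dist(a, b, parent)
                        for a, b in itertools.combinations(clusters, 2))
        accuracy = sum(strength[a] * depth(a, parent) for a in clusters)
        return alpha * diversity + beta * accuracy

    def best_annotation(candidates, strength, parent, k=3, alpha=0.3, beta=0.7):
        """Pick the k-cluster subset of the ranked candidates (at least k of
        them) that maximizes the combined diversity/accuracy objective."""
        return max(itertools.combinations(candidates, k),
                   key=lambda s: annotation_score(s, strength, parent, alpha, beta))

best_annotation realizes the max in the equations above by exhaustively scoring k-cluster subsets of the ranked candidates; for longer candidate lists a greedy selection could replace the exhaustive search.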
5. PERFORMANCE EVALUATION
To evaluate the performance of the proposed image annotation approach, training and testing images were taken from Royal Society of Chemistry journals. We extracted 4500 images; 4000 of them are used as training data and 500 for testing.

[Figure 1: Examples of image annotations. For three representative images from the Royal Society of Chemistry, the real and predicted annotation cluster IDs (Cls 6, Cls 24, Cls 29) and their top-5 keywords are shown.]

Figure 1 presents three example images with the original and predicted cluster IDs and keyword annotations (α = 0.3, β = 0.7). Three types of images are used as representatives. The results show that our prediction is more accurate at the image-cluster-ID level than at the keyword level. The predicted cluster ID and annotation cluster ID are the clusters with the maximum similarity, and the listed keywords are the top-5 keywords with the largest sums of weights.

The following experiments have been conducted: (1) using visual features to predict the annotation clusters; (2) using visual features to predict the detailed annotation keywords; (3) annotating images with partial knowledge such as the cluster ID, the top-1 keyword, and the top-2 keywords. The results are shown in Table 1. Our annotation approach produces satisfactory results at both the cluster-ID and keyword levels. Partial annotation improves the quality of the full annotation generated by our algorithm: the cluster ID improves the quality only slightly, whereas the first keyword improves the annotation most substantially.

Table 1: Performance of Partial Annotation

    Partial Annotation            Precision    Recall
    No Annotation                 0.81         0.79
    Cluster ID                    0.84         0.81
    Cluster ID, Top-1 Keyword     0.88         0.84
    Cluster ID, Top-2 Keywords    0.91         0.89

6. CONCLUSION
In this paper, we propose the first image annotation approach that utilizes not only the correlations between visual and textual features but also the correlations among features within the same modality. We propose a novel annotation prediction method that maximizes both diversity and accuracy. Experiments with real data show that the proposed image annotation approach produces satisfactory results.

7. REFERENCES
[1] T. A. S. Coelho, et al. Image retrieval using multiple evidence ranking. IEEE TKDE, 16(4):408-417, 2004.
[2] V. Lavrenko, et al. A model for learning the semantics of pictures. In NIPS, 2004.
[3] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE TPAMI, 22(8):888-905, 2000.
[4] R. Zhang, et al. A probabilistic semantic model for image annotation and multi-modal image retrieval. In ICCV, pages 846-851, 2005.