SIGIR 2007 Proceedings Session 5: Image Retrieval

Hierarchical Classification for Automatic Image Annotation

Jianping Fan, Dept of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA, jfan@uncc.edu
Yuli Gao, Dept of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA, ygao@uncc.edu
Hangzai Luo, Dept of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA, hluo@uncc.edu

ABSTRACT
In this paper, a hierarchical classification framework has been proposed for bridging the semantic gap effectively and achieving multi-level image annotation automatically. First, the semantic gap between the low-level computable visual features and the users' real information needs is partitioned into four smaller gaps, and multiple approaches are proposed to bridge these smaller gaps more effectively. To learn more reliable contextual relationships between the atomic image concepts and the co-appearances of salient objects, a multi-modal boosting algorithm is proposed. To enable hierarchical image classification and avoid inter-level error transmission, a hierarchical boosting algorithm is proposed by incorporating concept ontology and multi-task learning to achieve hierarchical image classifier training with automatic error recovery. To bridge the gap between the computable image concepts and the users' real information needs, a novel hyperbolic visualization framework is seamlessly incorporated to enable intuitive query specification and evaluation by acquainting the users with a good global view of large-scale image collections. Our experiments on large-scale image databases have also obtained very positive results.

Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene Analysis - object recognition

General Terms
Algorithms, Measurement, Experimentation

Keywords
Hierarchical Image Classification, Automatic Image Annotation, Concept Ontology, Multi-Modal Boosting, Hierarchical Boosting, Hyperbolic Visualization

1. INTRODUCTION
As high-resolution digital cameras become more affordable and widespread, personal collections of digital images are growing exponentially. Thus, image classification becomes increasingly important and necessary to support automatic image annotation and semantic image retrieval via keywords [1-2]. However, the performance of such a keyword-based image retrieval approach largely depends on three inter-related issues: (1) representative visual features for characterizing the diverse visual properties of images [3]; (2) effective algorithms for classifier training and automatic image annotation [4-5, 20-21]; (3) suitable system interfaces for intuitive query specification and evaluation [1-2].
To address the first issue, the underlying visual patterns for feature extraction should be able to characterize the image semantics at the object level effectively and efficiently [3-5]. To address the second issue, robust techniques for image classifier training are needed to bridge the semantic gap successfully [1-2, 20-21]. Because a single image may carry different meanings at multiple semantic levels, hierarchical image classification is strongly expected for achieving multi-level image annotation [20-21]. Hierarchical image classification can provide at least two advantages: (1) the classifiers for the high-level image concepts can effectively be learned by combining the classifiers for the relevant image concepts at the lower levels of the concept ontology (i.e., low-level image concepts with smaller within-concept variations of visual principles); (2) the computational complexity for training the classifiers for large amounts of image concepts can significantly be reduced by exploiting the strong correlations between the image concepts.
The major problem with such a hierarchical approach is that classification errors may be transmitted among different concept levels (i.e., inter-level error transmission) [20]. To address the third issue, there is an urgent need to develop a new visualization framework, so that users can visually be acquainted with which keywords are used to annotate and index the images and can be used for query specification.

2. CONCEPT ONTOLOGY CONSTRUCTION
As mentioned above, classifying images into the most relevant image concepts at different semantic levels is one promising solution for enabling automatic multi-level image annotation. Motivated by this observation, we have proposed a novel scheme that incorporates the concept ontology for image concept organization, hierarchical image classifier training, and visualizing large-scale image collections. Following the idea of WordNet [11], a hierarchical network is used as the representation of the concept ontology. In this network, each node represents either one image concept at a certain semantic level or one specific salient object class. We define the former nodes as concept nodes because they represent the semantics of the whole image, and the latter ones as content nodes because they represent the semantics of salient objects, which are the significant components of an image. The concept nodes at the first level of the concept ontology are defined as atomic image concept nodes, which are used to represent the image concepts with the most specific subjects. They can further be assigned to the most relevant image concepts at the higher levels of the concept ontology, which are used to interpret more general subjects of image contents with larger within-concept variations of visual properties.

Figure 1: Concept ontology for hierarchical organization of large amounts of image concepts.

In this paper, we have developed a semi-automatic scheme for concept ontology construction. We use our work on constructing the concept ontology for LabelMe (the database can be downloaded from http://people.csail.mit.edu/brussell/research/LabelMe) as an example to describe our algorithm: (1) Labels in LabelMe contain text information about dominant salient objects as well as their contours and locations, but there are no explicit labels at the image concept levels [8].
Each image is stored in a folder whose name strongly indicates the concepts of the images it contains. First, we automatically remove the most common stop words, dates, and other uninformative words from the folder names. The remaining meaningful words are then separated automatically by using standard text analysis techniques to determine the basic vocabulary of image concepts (i.e., text terms for interpreting the relevant image concepts). (2) Latent Semantic Analysis (LSA) is used to group similar text terms and identify the most significant image concepts [12]. The results of LSA are fuzzy clusters of text terms with the same sense, where each cluster describes one significant image concept. (3) The contextual and logical relationships among these significant image concepts are obtained automatically, where both their joint probability and their contextual dependency are integrated to formulate a new measurement for determining the concept associations effectively. The joint probability ψ(Ci, Cj) between the text terms for interpreting the corresponding image concepts Ci and Cj is directly obtained from the relevant image annotations:

    \psi(C_i, C_j) = \log \frac{P(C_i, C_j)}{P(C_i) \, P(C_j)}    (1)

where P(Ci, Cj) is the frequency of co-occurrence of the relevant text terms Ci and Cj, and P(Ci) and P(Cj) are the frequencies of the individual occurrences of the text terms Ci and Cj. WordNet is used as a prior knowledge source to accurately quantify the contextual dependency φ(Ci, Cj) [11]:

    \phi(C_i, C_j) = -\log \frac{length(C_i, C_j)}{2D}    (2)

where length(Ci, Cj) is the length of the shortest path between the two text terms Ci and Cj on WordNet, and D is the maximum depth of WordNet. The association between the given image concepts Ci and Cj is then determined by:

    \Phi(C_i, C_j) = \psi(C_i, C_j) \cdot \phi(C_i, C_j)    (3)

The value of Φ(Ci, Cj) increases with the strength of the semantic relationship between Ci and Cj. Thus each image concept is automatically linked with the most relevant image concepts, i.e., those with the highest values of the association Φ(·, ·). (4) The concept ontology that is learned automatically is further evaluated and modified to express the real contextual relationships among the image concepts more precisely. One concept ontology for our test image sets is given in Fig. 1.
After the concept ontology is constructed, it is further incorporated to enable more effective image labeling and to reduce the hand-labeling cost for image classifier training. Thus only the training images for the atomic image concepts at the first level of the concept ontology are labeled manually. Because the contextual and logical relationships between the atomic image concepts and the image concepts at the higher levels of the concept ontology are accurately characterized by the underlying concept ontology, the keywords for interpreting the relevant image concepts at the higher semantic levels can be propagated automatically and added as labels for the training images. Incorporating the concept ontology for automatic label propagation can reduce the hand-labeling cost significantly.
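To make the association measure of Eqs. (1)-(3) concrete, the following sketch estimates the joint probability from keyword co-occurrence counts over a toy annotation corpus and takes the WordNet shortest-path lengths and maximum depth from a small hand-made table rather than querying WordNet itself. The product form of Eq. (3), the sample keywords, and the value D = 16 are illustrative assumptions, not the paper's exact configuration.

```python
import math
from itertools import combinations
from collections import Counter

# Toy annotation corpus: each entry is the set of concept terms attached to one image.
annotations = [
    {"beach", "ocean view"}, {"beach", "garden"}, {"beach", "ocean view"},
    {"garden", "prairie"}, {"ocean view", "waterway"}, {"beach", "waterway"},
]

# Assumed WordNet statistics: shortest-path lengths between term pairs and the
# maximum taxonomy depth D (both would normally be obtained from WordNet itself).
path_length = {frozenset({"beach", "ocean view"}): 4,
               frozenset({"beach", "garden"}): 9,
               frozenset({"beach", "waterway"}): 6,
               frozenset({"garden", "prairie"}): 7,
               frozenset({"ocean view", "waterway"}): 5}
D = 16

def joint_probability(ci, cj):
    """Eq. (1): psi(Ci, Cj) = log[ P(Ci, Cj) / (P(Ci) P(Cj)) ] from annotation counts."""
    n = len(annotations)
    single = Counter(term for labels in annotations for term in labels)
    pair = sum(1 for labels in annotations if ci in labels and cj in labels)
    p_ij, p_i, p_j = pair / n, single[ci] / n, single[cj] / n
    return math.log(p_ij / (p_i * p_j)) if p_ij > 0 else float("-inf")

def contextual_dependency(ci, cj):
    """Eq. (2): phi(Ci, Cj) = -log[ length(Ci, Cj) / (2D) ] from the WordNet path length."""
    return -math.log(path_length[frozenset({ci, cj})] / (2 * D))

def concept_association(ci, cj):
    """Eq. (3), read here as the product of the two terms above."""
    return joint_probability(ci, cj) * contextual_dependency(ci, cj)

# Link each concept to its most strongly associated concepts.
terms = {"beach", "ocean view", "garden", "waterway", "prairie"}
for ci, cj in combinations(sorted(terms), 2):
    if frozenset({ci, cj}) in path_length:
        print(ci, "<->", cj, round(concept_association(ci, cj), 3))
```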
3. BRIDGING SEMANTIC GAP
The CBIR community has long struggled to bridge the semantic gap between successful low-level feature extraction and high-level human interpretation of image semantics; thus bridging the semantic gap is of crucial importance for achieving more effective image retrieval [1-2]. Our essential goal for image analysis is to provide a more precise image content representation that allows more effective solutions for image classification, indexing, and retrieval by bridging the semantic gap.
In this paper, we have developed a number of comprehensive techniques to bridge the semantic gap by: (a) using salient objects to achieve more precise image content representation, where the salient objects are defined as the salient image components that are roughly related to the real-world physical objects in an image [5]; (b) developing new machine learning tools that incorporate concept ontology and multi-task learning for exploiting the strong correlations between the image concepts to boost hierarchical image classifier training; (c) incorporating hyperbolic visualization to bridge the gap between the computable image concepts and the users' real needs by visually acquainting the users with a good global view of large-scale image collections.
To enable computational interpretation of image semantics, a hierarchical scheme is proposed to bridge the semantic gap (between the users' real needs and the low-level computable visual features) in four steps, as shown in Fig. 2: (1) The gap between the salient image components (i.e., real-world physical objects in an image) and the low-level computable visual features (i.e., Gap 1) is bridged by using salient objects [5] for image content representation.

Figure 2: The flowchart for bridging the semantic gap hierarchically: (a) users' real information needs (query specification and query result evaluation); (b) system information interpretation (keywords for concept interpretation and similar images under the same concept).
The salient objects are defined as the salient image components that capture the most significant perceptual properties linked to the semantic meaning of the corresponding physical objects in an image. Using salient objects for image content representation provides at least four significant benefits: (a) The salient objects can effectively characterize the most significant perceptual properties of the relevant real-world physical objects in an image [5], thus they can be used as building blocks to increase the expressiveness of intermediate image semantics. (b) The salient objects are not necessarily accurate segmentations of the real-world physical objects in an image [5], thus both the computational cost and the detection error rate are reduced significantly. (c) It is possible to achieve a good balance between the computational complexity, the detection accuracy, and the effectiveness for interpreting the intermediate image semantics at the object level. (d) Similar images are not necessarily similar in all their salient image components, thus partitioning the images into sets of salient objects can also support partial image matching and achieve more effective image classification and retrieval.
(2) The gap between the atomic image concepts and the salient objects (i.e., Gap 2) is bridged by using multi-modal boosting to exploit the strong correlations (i.e., contextual relationships) between the atomic image concepts and the co-appearances of the relevant salient objects. For example, the appearance of the atomic image concept "beach" is strongly related to the co-appearances of salient objects such as "sand field," "water," "tree," and "sky." (3) The gap between the high-level image concepts and the atomic image concepts (i.e., Gap 3) is bridged by incorporating concept ontology and multi-task learning to exploit their strong inter-concept correlations to boost hierarchical image classifier training. (4) The gap between the computable image concepts for image semantics interpretation and the users' real needs (i.e., Gap 4) is bridged by using hyperbolic visualization to visually acquaint the users with what keywords are used to annotate, index, and access the images.
After the salient objects are extracted, the original images are decomposed into sets of salient objects. It is well known that the diverse visual similarity between images can be characterized more effectively and efficiently by using different types of visual features, and thus 83-dimensional visual features are extracted to characterize the diverse visual properties of images. These 83-dimensional visual features are automatically partitioned into 9 homogeneous feature subsets, as shown in Fig. 3: 3-dimensional R,G,B average color; 4-dimensional R,G,B color variance; 3-dimensional L,U,V average color; 4-dimensional L,U,V color variance; 2-dimensional average and standard deviation of Gabor filter bank channel energy; 30-dimensional Gabor average channel energy; 30-dimensional Gabor channel energy deviation; 2-dimensional Tamura texture features (coarseness and contrast); and 5-dimensional angle histogram derived from Tamura texture. Because each feature subset is used to characterize a certain visual property of images, a more suitable kernel function can be selected for it.

Figure 3: The contextual relationship between the atomic image concept and the co-appearances of salient objects and their low-level visual features.
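As a concrete illustration of this feature organization, the sketch below slices an 83-dimensional feature vector into the nine homogeneous subsets listed above. Only the dimensionalities are taken from the text; the subset names, ordering, and helper function are assumptions for illustration.

```python
import numpy as np

# Dimensionalities of the 9 homogeneous feature subsets (they sum to 83).
FEATURE_SUBSETS = [
    ("rgb_average_color", 3),
    ("rgb_color_variance", 4),
    ("luv_average_color", 3),
    ("luv_color_variance", 4),
    ("gabor_energy_mean_std", 2),
    ("gabor_average_channel_energy", 30),
    ("gabor_channel_energy_deviation", 30),
    ("tamura_coarseness_contrast", 2),
    ("tamura_angle_histogram", 5),
]
assert sum(dim for _, dim in FEATURE_SUBSETS) == 83

def split_features(x):
    """Partition one 83-dimensional feature vector into the 9 named subsets."""
    x = np.asarray(x, dtype=float)
    assert x.shape == (83,)
    subsets, start = {}, 0
    for name, dim in FEATURE_SUBSETS:
        subsets[name] = x[start:start + dim]
        start += dim
    return subsets

# Example: a random salient-object feature vector split into homogeneous subsets,
# so that a different kernel can later be chosen per subset.
parts = split_features(np.random.rand(83))
print({name: v.shape for name, v in parts.items()})
```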
4. IMAGE CLASSIFIER TRAINING
We have proposed a novel scheme that incorporates the concept ontology for hierarchical image classifier training.

4.1 Multi-Modal Boosting
Because of the diversity of the visual similarity between semantically similar images, we have proposed a new framework to interpret the contextual relationships between the atomic image concepts and the co-appearances of the relevant salient objects. As shown in Fig. 3, the co-appearances of multiple salient objects are used to interpret the appearances of the relevant atomic image concepts, and such contextual relationships are well defined by our concept ontology. Due to the diversity and richness of the patterns of co-appearances of salient objects (i.e., different images may consist of different numbers and types of salient objects), it is very attractive to learn classifiers for all these potential co-appearance patterns and to integrate them to achieve more reliable image classification.
If a certain atomic image concept Cj is relevant to n classes of salient objects, the total number M of co-appearance patterns of these n salient object classes is:

    M = \sum_{i=2}^{n} C_n^i = 2^n - n - 1    (4)

For each of these M co-appearance patterns, one SVM image classifier can be trained to interpret the contextual relationship between the given atomic image concept Cj and the relevant co-appearance pattern. In addition, the visual property of each of these M co-appearance patterns is further characterized by the 9 homogeneous feature subsets. For a certain co-appearance pattern, the weak classifiers for these 9 homogeneous feature subsets are integrated to boost the corresponding ensemble classifier [6]:

    f^i_{C_j}(X) = sign\left( \sum_{j=1}^{9} \sum_{t=1}^{T} \alpha_t^j f_t^j(X) \right)    (5)

where f_t^j(X) is the weak classifier for the j-th homogeneous feature subset at the t-th boosting iteration, \alpha_t^j is its combination weight, and T = 50 is the total number of boosting iterations. For the given atomic image concept Cj, its ensemble classifier is determined by a multi-modal boosting of the classifiers for all these M potential co-appearance patterns:

    f_{C_j}(X) = \sum_{i=1}^{M} p_i(X) \, f^i_{C_j}(X)    (6)

where p_i(X) is the posterior distribution of the i-th classifier f^i_{C_j}(X) to be combined, and p_i(X) is determined by [18]:

    p_i(X) = \frac{\exp\left(f^i_{C_j}(X)\right)}{\sum_{i=1}^{M} \exp\left(f^i_{C_j}(X)\right)}    (7)

By incorporating high-dimensional multi-modal visual features and various co-appearance patterns for image semantics characterization, our multi-modal boosting technique is able to handle the huge diversity of within-concept visual properties, and it further provides a natural way to select the most representative feature subsets and the most suitable kernel functions for each atomic image concept.
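A minimal sketch of the combination rule in Eqs. (5)-(7): pattern-specific ensemble outputs are turned into softmax weights and averaged into one concept-level score. The weak-classifier scores and boosting weights are random stand-ins here; in the paper they would come from the boosted per-feature-subset SVMs.

```python
import numpy as np

def pattern_ensemble_output(weak_scores, alphas):
    """Eq. (5): sign of the weighted sum of weak-classifier outputs over the
    9 feature subsets and T boosting iterations for one co-appearance pattern."""
    return float(np.sign(np.sum(alphas * weak_scores)))

def multimodal_boosting(pattern_outputs):
    """Eqs. (6)-(7): softmax-weight the M pattern-specific outputs and combine them."""
    f = np.asarray(pattern_outputs, dtype=float)
    p = np.exp(f) / np.sum(np.exp(f))          # Eq. (7): posterior weight per pattern
    return float(np.sum(p * f))                # Eq. (6): weighted combination

# Toy example: an atomic concept related to n = 3 salient-object classes,
# hence M = 2**3 - 3 - 1 = 4 co-appearance patterns (Eq. (4)).
rng = np.random.default_rng(0)
T, n_subsets = 50, 9
M = 2 ** 3 - 3 - 1

pattern_outputs = []
for _ in range(M):
    weak_scores = rng.normal(size=(n_subsets, T))          # stand-in weak classifier outputs
    alphas = rng.uniform(0.0, 1.0, size=(n_subsets, T))    # stand-in boosting weights
    pattern_outputs.append(pattern_ensemble_output(weak_scores, alphas))

score = multimodal_boosting(pattern_outputs)
print("concept-level score:", round(score, 3), "-> label:", 1 if score > 0 else -1)
```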
4.2 Hierarchical Boosting
Because of the inherent complexity of the task, automatic detection of high-level image concepts with larger within-concept variations of visual properties is still beyond the ability of state-of-the-art techniques [1-2]. The image concepts are dependent, and such dependencies can be characterized effectively by the concept ontology. Unfortunately, most existing techniques for hierarchical image classifier training suffer from the problem of inter-level error transmission. Thus there is an urgent need to develop a new scheme for hierarchical image classifier training with automatic error recovery.
In this paper, we have proposed a novel algorithm that incorporates the concept ontology for hierarchical image classifier training. First, the concept ontology is used to identify the related tasks, e.g., training the classifiers for the sibling image concepts under the same parent node. Second, such task relatedness is used to determine the transferable knowledge and common features among the classifiers for the sibling image concepts, so that their classifiers generalize significantly better from fewer training images. Because the classifiers for the sibling image concepts under the same parent node are used to characterize both their individual visual properties and the common visual properties of their parent node, their outputs are strongly correlated according to the new task (i.e., learning a biased classifier for their parent node). For a given second-level image concept Ck, its child image concepts (i.e., the sibling atomic image concepts under Ck) are strongly correlated and share some common visual properties of their parent node Ck, thus multi-task learning can be used to train their classifiers simultaneously [9-10]. Because the related tasks are characterized effectively by the concept ontology, our hierarchical classifier training algorithm provides a good environment for more effective multi-task learning.
To integrate multi-task learning into SVM image classifier training, a common regularization term W0 of the SVM image classifier is used to represent and quantify the transferable knowledge and common features among the SVM image classifiers for the sibling image concepts under the same parent node. The weak classifier for the atomic image concept Cj can be defined as [10]:

    f_{C_j}(X) = W_j^T X + b    (8)

where W_j = W_0 + V_j, W_0 is the common regularization term shared between the classifiers for the sibling atomic image concepts under the same parent node, and V_j is the specific regularization term for the atomic image concept Cj. Given the labeled training samples for the L sibling atomic image concepts under Ck, \{X_{ij}, Y_{ij} \mid i = 1, \cdots, N; \; j = 1, \cdots, L\}, training multiple classifiers for the sibling atomic image concepts under the same parent node Ck is transformed into a joint optimization problem:

    \min \; C \sum_{j=1}^{L} \sum_{i=1}^{N} \xi_{ij} + \lambda_1 \sum_{j=1}^{L} \|V_j\|^2 + \lambda_2 \|W_0\|^2    (9)

subject to:

    \forall_{i=1}^{N} \forall_{j=1}^{L}: \; Y_{ij}\left[(W_0 + V_j) \cdot X_{ij} + b\right] \ge 1 - \xi_{ij}, \quad \xi_{ij} \ge 0

where \xi_{ij} \ge 0 represents the training error (slack) of the sample X_{ij}, L > 0 is the total number of atomic image concepts under the same parent node, \lambda_1 and \lambda_2 are positive regularization parameters, and C is the penalty term. The dual optimization problem for Eq. (9) is to determine the optimal \alpha_{ij} by:

    \max_{\alpha} \; \sum_{j=1}^{L} \sum_{i=1}^{N} \alpha_{ij} - \frac{1}{2} \sum_{j,h=1}^{L} \sum_{i,l=1}^{N} \alpha_{ih} Y_{ih} \, \alpha_{jl} Y_{jl} \, K_{jh}(X_{ih}, X_{jl})    (10)

subject to:

    \forall_{i=1}^{N} \forall_{j=1}^{L}: \; 0 \le \alpha_{ij} \le C, \qquad \sum_{j=1}^{L} \sum_{i=1}^{N} \alpha_{ij} Y_{ij} = 0

where K_{jh}(\cdot, \cdot) is the underlying kernel function. By exploiting the transferable knowledge and common features for image classifier training, our multi-task learning algorithm is able to handle the inter-concept visual similarity effectively.
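To make the shared-regularization idea concrete, here is a minimal sketch that optimizes a primal form of Eq. (9) by subgradient descent: one shared vector W0 plus a per-concept offset Vj, trained jointly on the sibling concepts' data. The synthetic data, learning rate, and plain gradient loop are illustrative assumptions; the paper instead solves the kernelized dual of Eq. (10).

```python
import numpy as np

def train_multitask_svm(tasks, lam1=0.1, lam2=0.1, C=1.0, lr=0.05, epochs=300):
    """Joint training of L sibling-concept classifiers f_j(x) = (W0 + Vj) . x + b_j
    by subgradient descent on a hinge-loss version of Eq. (9)."""
    L, d = len(tasks), tasks[0][0].shape[1]
    W0, V, b = np.zeros(d), np.zeros((L, d)), np.zeros(L)
    for _ in range(epochs):
        gW0, gV, gb = 2 * lam2 * W0, 2 * lam1 * V, np.zeros(L)
        for j, (X, y) in enumerate(tasks):
            margins = y * (X @ (W0 + V[j]) + b[j])
            active = margins < 1                      # samples violating the margin
            grad = (y[active, None] * X[active]).sum(axis=0) / len(X)
            gW0 = gW0 - C * grad
            gV[j] = gV[j] - C * grad
            gb[j] = gb[j] - C * y[active].sum() / len(X)
        W0, V, b = W0 - lr * gW0, V - lr * gV, b - lr * gb
    return W0, V, b

# Toy data: two sibling atomic concepts that share a common discriminative direction
# (absorbed into W0) but differ in a concept-specific direction (absorbed into Vj).
rng = np.random.default_rng(1)
def make_task(specific_direction):
    X = rng.normal(size=(60, 2))
    y = np.sign(X @ (np.array([1.0, 0.0]) + specific_direction) + 1e-9)
    return X, y

tasks = [make_task(np.array([0.0, 0.8])), make_task(np.array([0.0, -0.8]))]
W0, V, b = train_multitask_svm(tasks)
print("shared term W0:", np.round(W0, 2))
for j, (X, y) in enumerate(tasks):
    acc = np.mean(np.sign(X @ (W0 + V[j]) + b[j]) == y)
    print(f"sibling concept {j}: training accuracy {acc:.2f}")
```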
The common regularization term W0 for the sibling atomic image concepts is further treated as a prior regularization term to bias the SVM classifier for their parent node. Setting such a prior regularization term exploits the inter-concept correlations between the SVM classifiers according to the new task and can reduce the training cost significantly. Based on this prior regularization term, a biased classifier for the parent node is trained effectively by using only a few new training images. Thus the biased classifier for the parent node Ck is determined by:

    \min_{W} \; \frac{1}{2} \|W - W_0\|^2 + C \sum_{l=1}^{m} \left[1 - Y_l (W \cdot X_l + b)\right]_{+}    (11)

where W0 is the common regularization term for the sibling atomic image concepts under Ck, [\cdot]_{+} denotes the hinge loss, and (X_l, Y_l), l = 1, \cdots, m, are the new training samples for learning the biased classifier for Ck. The dual problem for Eq. (11) is solved by:

    \min_{\beta} \; \frac{1}{2} \sum_{l=1}^{m} \sum_{h=1}^{m} \beta_l \beta_h Y_l Y_h X_l^T X_h - \sum_{l=1}^{m} \beta_l \left(1 - Y_l W_0^T X_l\right)    (12)

subject to:

    \forall_{l=1}^{m}: \; 0 \le \beta_l \le C, \qquad \sum_{l=1}^{m} \beta_l Y_l = 0

The optimal solution of Eq. (12) satisfies:

    W = W_0 + \sum_{l=1}^{m} \beta_l Y_l X_l    (13)

Thus the biased classifier for the given second-level image concept Ck is obtained as:

    f_{C_k}(X) = W^T X + b    (14)

To learn the ensemble classifier for the given second-level image concept Ck, a novel hierarchical boosting scheme has been developed by combining its biased classifier with the classifiers for its child image concepts. Unfortunately, existing boosting techniques can only combine weak classifiers that are learned in different ways (i.e., in different input spaces) but for the same task [6], and they do not include the regularization between different tasks that is essential for hierarchical image classifier training. We have developed a new scheme for multi-task classifier combination (called hierarchical boosting) that is able to integrate the classifiers trained for multiple tasks, leverage their distinct strengths, and exploit the strong correlations of their outputs according to the new task. Our hierarchical boosting scheme can search for an optimal combination of these multi-task classifiers by sharing their transferable knowledge and common features according to the new task (i.e., learning the ensemble classifier for their parent node Ck), and thus it is able to generalize the ensemble classifier significantly while reducing the computational complexity dramatically. For the given second-level image concept Ck, the final prediction of its ensemble classifier is obtained by a logistic boosting of the predictions of its biased classifier and the classifiers for its child image concepts [18]:

    H_{C_k}(X) = \sum_{h=1}^{L+1} p_h(C_h) \, f_{C_h}(X)    (15)

By exploiting the strong inter-concept correlations for hierarchical image classifier training, our hierarchical boosting algorithm is able to learn the classifiers for large amounts of image concepts simultaneously.
Our hierarchical boosting algorithm can significantly outperform traditional techniques such as multi-class boosting and multi-task boosting [7-8]. The multi-class boosting techniques do not explicitly exploit the transferable knowledge and common features among the classifiers to enhance their classification performance [7]. The multi-task boosting algorithms have recently been proposed to enable multi-class object detection by sharing common features among the classifiers [8]. Rather than incorporating the transferable knowledge and common features to learn a biased classifier, the ensemble classifier for each object class is simply composed of the classifiers that are trained for all the pair-wise object class combinations [8]; thus the strong inter-concept correlations between the classifiers cannot be exploited effectively. In contrast, our hierarchical boosting algorithm can integrate the transferable knowledge and common features to enhance all these single-task classifiers at the same semantic level simultaneously, exploit their strong inter-concept correlations to learn a biased classifier, and boost an ensemble classifier for their parent node with higher discrimination power.

Figure 4: The comparison results between our hierarchical boosting algorithm, multi-class boosting and multi-task boosting for four sibling image concepts.

Figure 5: The comparison results between our hierarchical boosting algorithm, multi-class boosting and multi-task boosting for four sibling image concepts.

As shown in Fig. 4 and Fig. 5, one can find that our hierarchical boosting algorithm significantly outperforms the traditional techniques such as multi-class boosting and multi-task boosting.
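The sketch below illustrates the parent-node step under simple assumptions: the biased classifier of Eq. (11) is fit by subgradient descent around a given W0 (rather than via the dual of Eq. (12)), and Eq. (15) then mixes the biased classifier with the child classifiers using fixed illustrative weights in place of p_h(C_h). All data, weights, and helper names are hypothetical.

```python
import numpy as np

def train_biased_classifier(X, y, W0, C=1.0, lr=0.05, epochs=300):
    """Eq. (11): fit W close to the prior W0 using hinge loss on a few new samples."""
    W, b = W0.copy(), 0.0
    for _ in range(epochs):
        margins = y * (X @ W + b)
        active = margins < 1
        gW = (W - W0) - C * (y[active, None] * X[active]).sum(axis=0) / len(X)
        gb = -C * y[active].sum() / len(X)
        W, b = W - lr * gW, b - lr * gb
    return W, b

def hierarchical_prediction(x, child_classifiers, biased, weights):
    """Eq. (15): combination of the biased parent classifier and the child-concept
    classifiers; `weights` plays the role of p_h(C_h)."""
    scores = [f(x) for f in child_classifiers] + [biased(x)]
    return float(np.dot(weights, scores))

# Toy setup: W0 is the shared term learned from the sibling atomic concepts
# (here simply assumed), and a handful of new parent-level samples refine it.
rng = np.random.default_rng(2)
W0 = np.array([1.0, 0.5])
X_new = rng.normal(size=(20, 2))
y_new = np.sign(X_new @ np.array([1.0, 1.0]) + 1e-9)

W, b = train_biased_classifier(X_new, y_new, W0)
biased = lambda x: float(x @ W + b)
children = [lambda x: float(x @ np.array([1.0, 0.0])),
            lambda x: float(x @ np.array([0.0, 1.0]))]
weights = np.array([0.3, 0.3, 0.4])          # illustrative p_h values summing to 1

x_test = np.array([0.5, 0.2])
print("H_Ck(x) =", round(hierarchical_prediction(x_test, children, biased, weights), 3))
```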
5. MULTI-LEVEL IMAGE ANNOTATION
After our hierarchical image classifiers are available, they are used to classify the images into the most relevant image concepts at different semantic levels. We carry out the experimental studies of our proposed algorithms on three public image databases: LabelMe, Corel Images, and Google Images. The LabelMe database contains more than 25,000 images, and our experiments are done on a snapshot of this database downloaded in April 2006. For the Corel image database, we have included 2,800 images of natural scenes. For Google Images, we have included 9,000 natural images. Table 1 gives the average accuracy of our hierarchical image classifiers for some image concepts.

Table 1: The classification accuracy (i.e., precision/recall) for some image concepts.
    mountain view  80.6% / 85.6%    beach     90.4% / 90.6%    garden    89.5% / 88.3%
    sailing        85.8% / 84.6%    skiing    83.6% / 84.2%    desert    79.5% / 74.7%
    ocean view     82.3% / 81.5%    waterway  85.4% / 85.7%    prairie   80.5% / 82.4%
    shopping       83.6% / 84.2%    office    89.8% / 88.7%    bathroom  86.5% / 87.3%
    sidewalk       85.4% / 84.9%    corridor  86.4% / 85.8%    kitchen   89.6% / 90.2%
    home           90.5% / 91.2%    avenue    92.3% / 95.2%    campus    90.2% / 91.5%

In our hierarchical image classification scheme, the initial classification of a test image is critical because the classifiers at the subsequent levels cannot recover from a misclassification of the test image that occurs at a higher concept level, and this misclassification can be propagated to the terminal concept node (i.e., inter-level error transmission) [20]. We have seamlessly integrated two innovative solutions to address this inter-level error transmission problem: (1) enhancing the classifiers for the image concepts at the higher levels of the concept ontology, so that they have higher discrimination power (i.e., classify the images more accurately); (2) integrating new protocols that are able to detect such a misclassification path early and take appropriate actions for automatic error recovery.
Three significant aspects of our hierarchical image classification scheme address the inter-level error transmission problem effectively: (a) The transferable knowledge and common features can be shared among the classifiers for the sibling image concepts at the same semantic level of the concept ontology to maximize their margins and enhance their discrimination power significantly, so that the test images can be classified more accurately at the beginning, i.e., at the image concepts at the higher levels of the concept ontology. By exploiting the strong correlations between the classifiers for their child image concepts, our hierarchical boosting scheme is able to learn more reliable ensemble classifiers for the high-level image concepts. (b) The classification decision for each test image is determined by a vote over the multiple multi-task classifiers for the sibling image concepts at the same semantic level, which makes their errors transparent. (c) An overall probability is calculated to determine the best path for hierarchical image classification. For a given test image, an optimal classification path should provide the maximum value of the overall probability among all possible classification paths. The overall probability h(Ck) for the optimal classification path (from a certain higher-level image concept Ck to the most relevant lower-level image concept Cj) is defined as [17]:

    h(C_k) = p(C_k) + g(C_j), \qquad g(C_j) = \max\{p(C_i) \mid i = 1, \cdots, L\}    (16)

where p(Ck) is the posterior probability that the given test image is classified into the current image concept Ck at the higher level of the concept ontology, p(Ci) is the posterior probability that the given test image is classified into the child image concept Ci of Ck, and g(Cj) is the maximum posterior probability that the given test image is classified into the most relevant child concept node Cj. Thus a good path should achieve higher classification accuracy for both the high-level concept node and the most relevant child concept node. By using the overall probability, we are able to detect an incorrect classification path early and take appropriate actions for automatic error recovery.
It is important to note that once a test image is classified, the keywords for interpreting the salient object classes (what is visible in the image) and the relevant image concepts (what the image is about and what can be evoked by the visible salient objects) at different levels of the concept ontology become the keywords for interpreting its semantics more sufficiently. Our multi-level image annotation scheme is very attractive for supporting semantic image retrieval via keywords, such that naive users have more flexibility to specify their query concepts via various keywords at different semantic levels. Our experimental results on hierarchical image classification and multi-level annotation are given in Fig. 6. From these experimental results, one can find that our proposed hierarchical image classification scheme is able to achieve more sufficient image annotations.

Figure 6: Multi-level image annotation results both at the object level and at multiple concept levels.
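A small sketch of the path-selection rule in Eq. (16), under the assumption that per-concept posteriors are already available (here they are hard-coded): at each level the overall probability h(Ck) = p(Ck) + max_i p(Ci) scores the step from a concept to its best child, and the test image follows the highest-scoring path top-down. The concept names, posterior values, and the fixed threshold used to flag a suspicious path are all illustrative assumptions.

```python
# Toy concept ontology: parent concept -> child concepts.
ontology = {
    "outdoor scene": ["beach", "mountain view", "garden"],
    "beach": [], "mountain view": [], "garden": [],
    "indoor scene": ["kitchen", "office"],
    "kitchen": [], "office": [],
}

# Assumed posterior probabilities p(C) for one test image (normally classifier outputs).
posterior = {
    "outdoor scene": 0.7, "indoor scene": 0.3,
    "beach": 0.6, "mountain view": 0.25, "garden": 0.15,
    "kitchen": 0.5, "office": 0.5,
}

def overall_probability(parent):
    """Eq. (16): h(Ck) = p(Ck) + g(Cj), with g(Cj) the best child posterior."""
    children = ontology[parent]
    if not children:
        return posterior[parent], None
    best_child = max(children, key=lambda c: posterior[c])
    return posterior[parent] + posterior[best_child], best_child

def classify_top_down(roots, threshold=1.0):
    """Follow the highest-scoring path; stop early if h(Ck) falls below a threshold,
    which is where automatic error recovery would be triggered."""
    path, current = [], max(roots, key=lambda r: overall_probability(r)[0])
    while current is not None:
        h, nxt = overall_probability(current)
        path.append(current)
        if nxt is not None and h < threshold:
            break                      # suspicious path: hand off to error recovery
        current = nxt
    return path

print(classify_top_down(["outdoor scene", "indoor scene"]))
```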
6. HYPERBOLIC IMAGE VISUALIZATION
For naive users to harvest the research achievements of the CBIR community, it is very important to develop a more comprehensive framework for intuitive query specification and evaluation, but this is also a problem without a good solution so far. The problem, in essence, is about how to present a good global view of large-scale image collections to users [13-16], so that users can easily specify their queries. Therefore, there is a great need to generate the overall information of large-scale image collections conceptually and to incorporate the concept ontology to organize and visualize such concept-oriented overall information more effectively and intuitively.
To achieve a multi-modal representation of the concept ontology, each concept node on the concept ontology is jointly characterized by: a keyword to interpret its semantics, the most representative images to display its concept-oriented summary, decision principles (i.e., support vectors and importance factors for classifier combination) to characterize its feature-based principal properties, and contextual relationships with the relevant image concepts.
We have developed a novel scheme to generate the concept-oriented summarization of large-scale image collections. For a given image concept on the concept ontology, three types of images are automatically selected to generate its concept-oriented summary: (a) images which are located on the decision boundaries of the SVM image classifier; (b) images which have higher confidence scores in the classification procedure; and (c) images which are located at the centers of dense areas and can be used to represent large amounts of semantically similar images for the given image concept. To obtain such representative images, kernel PCA is used to cluster the semantically similar images for the given image concept into multiple significant groups [19].
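As an illustration of this summarization step, the sketch below selects the three kinds of summary images from classifier scores and a feature matrix. It uses scikit-learn's KernelPCA for the embedding, in the spirit of [19], but adds k-means on top to form the groups; the k-means step, the candidate counts, and all variable names are assumptions for illustration rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

def summarize_concept(features, svm_scores, n_each=3, n_groups=3, seed=0):
    """Pick (a) boundary images, (b) high-confidence images, and (c) group-center
    images for one concept's image collection."""
    features = np.asarray(features, dtype=float)
    svm_scores = np.asarray(svm_scores, dtype=float)

    # (a) Images closest to the SVM decision boundary (smallest |f(x)|).
    boundary = np.argsort(np.abs(svm_scores))[:n_each]

    # (b) Images with the highest classification confidence.
    confident = np.argsort(-svm_scores)[:n_each]

    # (c) Images at the centers of dense groups: embed with kernel PCA,
    #     then take the sample nearest to each k-means center.
    embedded = KernelPCA(n_components=2, kernel="rbf").fit_transform(features)
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed).fit(embedded)
    centers = []
    for c in range(n_groups):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embedded[members] - km.cluster_centers_[c], axis=1)
        centers.append(int(members[np.argmin(dists)]))

    return {"boundary": boundary.tolist(), "confident": confident.tolist(),
            "group_centers": centers}

# Toy usage with random 83-dimensional features and random SVM scores.
rng = np.random.default_rng(3)
summary = summarize_concept(rng.normal(size=(100, 83)), rng.normal(size=100))
print(summary)
```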
Our approach for concept ontology visualization exploits hyperbolic geometry [16]. Hyperbolic geometry is particularly well suited to graph-based layout of a large-scale concept ontology, as it has "more space" than Euclidean geometry. The essence of our approach is to project the concept ontology onto a hyperbolic plane according to the contextual relationships between the image concepts, and to lay out the concept ontology by mapping the relevant concept nodes onto a circular display region. Thus our concept ontology visualization framework takes the following steps: (a) The image concept nodes on the concept ontology are projected onto a hyperbolic plane according to their contextual relationships, and this projection accurately preserves the original contextual relationships between the image concept nodes. (b) The Poincaré disk model [16] is used to map the concept nodes on the hyperbolic plane to a 2D display coordinate system. Each concept node on the graph is assigned a location z = (x, y) within the unit disk, which represents the Poincaré coordinates of the corresponding image concept node. By treating the location of an image concept node as a complex number, we can define such a mapping as the linear fractional transformation [16]:

    z_t = \frac{\theta z + P}{1 + \bar{P} \theta z}    (17)

where P and \theta are complex numbers, |P| < 1 and |\theta| = 1, and \bar{P} is the complex conjugate of P. This transformation indicates a rotation by \theta around the origin followed by moving the origin to P (and -P to the origin).
After the hyperbolic visualization of the concept ontology is available, it can be used to enable interactive exploration and navigation of large-scale image collections at the concept level via change of focus. The change of focus is implemented by changing the mapping of the image concept nodes from the hyperbolic plane to the unit disk for display; the positions of the image concept nodes in the hyperbolic plane need not be altered during focus manipulation. Users can change their focus of image concepts by clicking on any visible image concept node to bring it into focus at the center, or by dragging any visible image concept node interactively to any other location without losing the contextual relationships between the image concept nodes, while the rest of the layout of the concept ontology transforms appropriately. Thus our hyperbolic framework for concept ontology visualization has demonstrated remarkable capabilities for interactively exploring large-scale image collections at the concept level.
By supporting change of focus, our hyperbolic visualization framework can theoretically display an unlimited number of image concepts in a 2D unit disk. Moving the focus point over the display disk is equivalent to translating the concept ontology on the hyperbolic plane; such change of focus provides a mechanism for controlling which portion of the concept ontology receives the most space and for changing the relative amount of space given to the image concept nodes near the current focus. Through such change of focus on the display disk for concept ontology visualization and manipulation, users can interactively explore and navigate large-scale image archives at the concept level. Therefore, users can always see the details of the regions of interest by changing the focus.
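The focus change of Eq. (17) is straightforward to express with complex arithmetic. The sketch below applies the transformation to a few node positions so that a chosen node moves to the disk center; the sample coordinates, the choice of P = -θ·focus, and θ = 1 (no extra rotation) are illustrative assumptions.

```python
import numpy as np

def change_of_focus(z, focus, theta=1 + 0j):
    """Eq. (17): z_t = (theta*z + P) / (1 + conj(P)*theta*z), with P chosen so that
    the node at `focus` is moved to the center of the Poincare disk."""
    P = -theta * focus
    return (theta * z + P) / (1 + np.conj(P) * theta * z)

# Poincare-disk coordinates of a few concept nodes (complex numbers, |z| < 1).
nodes = {
    "outdoor scene": 0.10 + 0.05j,
    "beach": 0.45 + 0.30j,
    "mountain view": 0.50 - 0.25j,
    "kitchen": -0.60 + 0.40j,
}

# Bring "beach" into focus at the disk center; the other nodes move accordingly,
# but their hyperbolic-plane positions (and mutual relationships) are unchanged.
focus = nodes["beach"]
for name, z in nodes.items():
    zt = change_of_focus(z, focus)
    assert abs(zt) < 1.0          # the transformation maps the unit disk to itself
    print(f"{name:15s} -> ({zt.real:+.3f}, {zt.imag:+.3f})")
```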
Two different views of the layout results of our concept ontology visualization are given in Fig. 7. By changing the focus points, our hyperbolic framework for concept ontology visualization provides an effective solution for interactive exploration of large-scale image collections at the concept level. In addition, users can visually be acquainted with which keywords are used to annotate and index the images, and thus they can easily and intuitively specify their queries with better knowledge of large-scale image collections.

Figure 7: Two different views of our hyperbolic visualization of a large-scale concept ontology.

Because the concept ontology is used to represent the abstract information of large-scale image collections, the number of images returned for such keyword-based queries may be very large, and it is too expensive for users to look for particular images effectively. Our solution to this problem is to project the returned images onto a 2D visualization space according to their kernel-based visual similarity distances. In addition, the visual similarity relationships between the returned images are explained more intuitively by using hyperbolic visualization; thus users can easily judge the relevance between the returned images and interactively browse large amounts of returned images according to their visual similarity. Such interactive exploration also allows users to zoom into regions of interest, obtain additional images of interest that may not be found by using a 1D or 2D top list (i.e., interesting images which are only loosely relevant to the users' queries), and enables fortunate discoveries of unexpected images by accident.
For a given returned image I, let \delta be the hyperbolic distance between I and the center of the hyperbolic plane used for image projection, and let r be the distance between I and the center of the display unit circle. The relationship between their derivatives is described by:

    d\delta = \frac{2 \, dr}{1 - r^2}    (18)

An example of this hyperbolic visualization is shown in Fig. 8, where the returned images for the query "nature scene" are laid out according to the global color histogram by using kernel PCA projection [19]. One can observe that such a 2D hyperbolic visualization of the returned images provides a more intuitive interpretation of their visual similarity, where similar images are placed closer together according to their kernel-based visual similarity. Therefore, users are allowed to manipulate not only the images but also their visual similarity relationships. By allowing users to zoom into regions of interest via changing the focus, our algorithm can easily support visualization of large amounts of returned images.

Figure 8: Hyperbolic visualization of the visual similarity of the returned images for the query "nature scene".
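Integrating Eq. (18) gives δ = 2·artanh(r), i.e., r = tanh(δ/2), which is how a hyperbolic distance from the projection center can be turned into a display radius inside the unit circle. The sketch below applies this mapping to a few assumed hyperbolic distances; the angular placement is left untouched and the sample values are illustrative.

```python
import numpy as np

def display_radius(delta):
    """Invert Eq. (18): delta = 2*artanh(r), so r = tanh(delta/2).
    Large hyperbolic distances are squeezed smoothly toward the unit-circle rim."""
    return np.tanh(np.asarray(delta, dtype=float) / 2.0)

# Assumed hyperbolic distances of five returned images from the projection center.
delta = np.array([0.2, 1.0, 2.0, 4.0, 8.0])
r = display_radius(delta)
for d, radius in zip(delta, r):
    print(f"hyperbolic distance {d:4.1f} -> display radius {radius:.3f}")

# Sanity check: the derivative d(delta)/dr matches 2 / (1 - r**2) from Eq. (18).
eps = 1e-6
numeric = (2.0 * np.arctanh(r + eps) - 2.0 * np.arctanh(r - eps)) / (2 * eps)
assert np.allclose(numeric, 2.0 / (1.0 - r ** 2), rtol=1e-4)
```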
7. CONCLUSIONS
In this paper, we have proposed a novel algorithm for automatic multi-level image annotation via hierarchical classification. A novel multi-modal boosting algorithm is proposed to achieve more reliable interpretation of the contextual relationships between the atomic image concepts and the co-appearances of salient objects. To avoid the inter-level error transmission problem, a novel hierarchical boosting algorithm is proposed by incorporating concept ontology and multi-task learning to boost hierarchical image classifier training. Our hierarchical image classifier training algorithm is able to simultaneously learn the classifiers for large amounts of image concepts with huge within-concept visual diversities and inter-concept visual similarities. A novel hyperbolic visualization framework is seamlessly incorporated to enable intuitive query specification and similarity-based evaluation of large amounts of returned images. Our experiments on large-scale image databases have also obtained very positive results.

8. REFERENCES
[1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, "Content-based image retrieval at the end of the early years", IEEE Trans. on PAMI, 2000.
[2] X.-J. Wang, L. Zhang, F. Jing, W.-Y. Ma, "AnnoSearch: Image auto-annotation by search", IEEE CVPR, 2006.
[3] W.-Y. Ma, B.S. Manjunath, "Texture features and learning similarity", IEEE CVPR, pp. 425-430, 1996.
[4] J. Jeon, R. Manmatha, "Using maximum entropy for automatic image annotation", CIVR, pp. 24-32, 2004.
[5] J. Fan, Y. Gao, H. Luo, G. Xu, "Automatic image annotation by using concept-sensitive salient objects for image content representation", ACM SIGIR, 2004.
[6] Y. Freund, R.E. Schapire, "Experiments with a new boosting algorithm", Proc. ICML, pp. 148-156, 1996.
[7] R.E. Schapire, Y. Singer, "BoosTexter: A boosting-based system for text categorization", Machine Learning, vol. 39, pp. 135-168, 2000.
[8] A. Torralba, K.P. Murphy, W.T. Freeman, "Sharing features: efficient boosting procedures for multiclass object detection", IEEE CVPR, 2004.
[9] K. Yu, A. Schwaighofer, V. Tresp, W.-Y. Ma, H.J. Zhang, "Collaborative ensemble learning: Combining content-based information filtering via hierarchical Bayes", Proc. of Intl. Conf. on Uncertainty in Artificial Intelligence (UAI), 2003.
[10] T. Evgeniou, C.A. Micchelli, M. Pontil, "Learning multiple tasks with kernel methods", Journal of Machine Learning Research, vol. 6, pp. 615-637, 2005.
[11] G.A. Miller, "WordNet: A lexical database for English", Comm. of the ACM, vol. 38, no. 11, 1995.
[12] S. Deerwester, S.T. Dumais and R. Harshman, "Indexing by latent semantic analysis", Journal of the American Society of Information Science, 1990.
[13] Y. Rubner, C. Tomasi, L. Guibas, "A metric for distributions with applications to image databases", IEEE ICCV, pp. 59-66, 1998.
[14] J.A. Walter, D. Webling, K. Essig, H. Ritter, "Interactive hyperbolic image browsing - towards an integrated multimedia navigator", KDD, 2006.
[15] G.P. Nguyen, M. Worring, "Similarity based visualization of image collections", AVIVDiLib, 2005.
[16] J. Lamping, R. Rao, "The hyperbolic browser: A focus+context technique for visualizing large hierarchies", Journal of Visual Languages and Computing, 1996.
[17] D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, 1978.
[18] J. Friedman, T. Hastie, R. Tibshirani, "Additive logistic regression: a statistical view of boosting", Annals of Statistics, vol. 28, no. 2, pp. 337-374, 2000.
[19] K. Grauman, T. Darrell, "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features", MIT-CSAIL-TR-2006-20, MIT, 2006.
[20] J. Fan, H. Luo, Y. Gao, M.-S. Hacid, "Mining image databases on semantics via statistical learning", KDD, 2005.
[21] K. Barnard and D. Forsyth, "Learning the semantics of words and pictures", IEEE ICCV, 2001.