Learning Hybrid Models for Image Annotation with Partially Labeled Data

Anonymous Author(s)

Abstract

Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure.

1 Introduction

Image annotation, or image labeling, in which the task is to label each pixel or region of an image with a class label, is becoming an increasingly popular problem in the machine learning and machine vision communities [7, 14]. State-of-the-art methods formulate image annotation as a structured prediction problem and utilize methods such as Conditional Random Fields [8, 4], which output multiple values for each input item. These methods typically rely on fully labeled data for optimizing model parameters. It is widely acknowledged that consistently labeled images are tedious and expensive to obtain, which limits the applicability of discriminative approaches. However, a large number of partially labeled images, with only a subset of regions labeled in an image, or only captions for images, are available (e.g., [12]). Learning labeling models from such data would help improve segmentation performance and relax the labeled-data requirement of discriminative labeling methods.

A wide range of learning methods have been developed for using partially labeled image data. One approach adopts a discriminative formulation and treats the unlabeled regions as missing data [16]. Others take a semi-supervised learning approach by viewing unlabeled image regions as unlabeled data. One class of these methods generalizes traditional semi-supervised learning to structured prediction tasks [1, 10]. However, the common assumption that the label distribution is smooth with respect to the input data may not be valid in image labeling, due to large intra-class variation of object appearance. Other semi-supervised methods adopt a hybrid approach, combining a generative model of the input data with a discriminative model for image labeling, in which the unlabeled data are used to regularize the learning of the discriminative model [6, 9]. Only relatively simple probabilistic models are considered in these approaches, without capturing the contextual information in images.

Our approach extends the hybrid modeling strategy by incorporating a more flexible generative model of image data. In particular, we introduce a set of latent variables that capture image feature patterns in a hidden feature space, which are used to facilitate the labeling task. First, we extend the Latent Dirichlet Allocation (LDA) model [3] to include not only input features but also label information, capturing co-occurrences within and between image feature patterns and object classes in the data set. Unlike other topic models used in image modeling [11, 18], our model integrates a generative model of image appearance with a discriminative model of region labels. Second, the original LDA structure does not impose any spatial smoothness constraint on label prediction, yet incorporating such a spatial prior is important for scene segmentation.
Previous approaches have introduced lateral connections between latent topic variables [17, 15]. However, this complicates model learning, and as a latent representation of image data, the topic variables can in general be non-smooth over the image plane. In this paper, we model the spatial dependency of labels with two different structures: one introduces directed connections between each label variable and its neighboring topic variables, and the other incorporates lateral connections between label variables. We investigate whether these structures effectively capture the spatial prior and lead to accurate label predictions.

The remainder of this paper is organized as follows. The next section presents the base model and two different extensions that handle label spatial dependencies. Sections 3 and 4 define inference and learning procedures for these models. Section 5 describes experimental results, and in the final section we discuss the model's limitations and future directions.

2 Model description

The structured prediction problem in image labeling can be formulated as follows. Let an image x be represented as a set of subregions $\{x_i\}_{i=1}^{N_x}$. The aim is to assign each $x_i$ a label $l_i$ from a categorical set L. For instance, the subregions $x_i$ can be image patches or pixels, and L consists of object classes. Denote the set of labels for x as $l = \{l_i\}_{i=1}^{N_x}$. A key issue in structured prediction concerns how to capture the interactions between labels in l given the input image.

Model I. We first introduce our base model for capturing individual patterns in image appearance and label space. Assume each subregion $x_i$ is represented by two features $(a_i, t_i)$, in which $a_i$ describes its appearance (including color, texture, etc.) in some appearance feature space A and $t_i$ is its position on the image plane T. Our method focuses on the joint distribution of labels and subregion appearances given positions, modeling co-occurring patterns in the joint space L × A. We achieve this by extending the latent Dirichlet allocation model to include both label and appearance. More specifically, we assume each observation pair $(a_i, l_i)$ in image x is generated from a mixture of K hidden `topic' components shared across the whole dataset, given the position information $t_i$. Following the LDA notation, the mixture proportion is denoted as $\theta$, which is image-specific and shares a common Dirichlet prior parameterized by $\alpha$. Also, $z_i$ is used as an indicator variable specifying from which hidden topic component the pair $(a_i, l_i)$ is generated. In addition, we use a to denote the appearance feature vector of each image, z for the indicator vector and t for the position vector. Our model defines a joint distribution of the label variables l and appearance feature variables a given the positions t as follows:

P_b(l, a \mid t, \alpha) = \int \Big[ \prod_i \sum_{z_i} P(l_i \mid a_i, t_i, z_i)\, P(a_i \mid z_i)\, P(z_i \mid \theta) \Big] P(\theta \mid \alpha)\, d\theta \qquad (1)

where $P(\theta \mid \alpha)$ is the Dirichlet distribution. We specify the appearance model $P(a_i \mid z_i)$ to be position invariant, while the label predictor $P(l_i \mid a_i, t_i, z_i)$ depends on the position information. These two components are formulated as follows, and the graphical representation of the model is shown in the left panel of Figure 1.

(a) Label prediction module $P(l_i \mid a_i, t_i, z_i)$. The label predictor $P(l_i \mid a_i, t_i, z_i)$ is modeled by a probabilistic classifier that takes $(a_i, t_i, z_i)$ as its input and produces a properly normalized distribution over $l_i$. Note that we represent $z_i$ in its `0-1' vector form when it is used as the classifier input, so if the dimension of A is M, the input dimension of the classifier is M + K + 2. We use an MLP with one hidden layer in our experiments, although other strong classifiers are also feasible.
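To make the classifier input concrete, the following is a minimal sketch (ours, not the authors' code) of a one-hidden-layer MLP label predictor over the concatenated input $[a_i, t_i, \mathrm{onehot}(z_i)]$ of dimension M + 2 + K; the class name, initialization, and layer sizes are illustrative assumptions (the experiments reported later use 15 to 30 hidden units).

```python
import numpy as np

def one_hot(z, K):
    """Represent a topic index z in {0, ..., K-1} as a K-dimensional 0-1 vector."""
    v = np.zeros(K)
    v[z] = 1.0
    return v

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class LabelPredictor:
    """One-hidden-layer MLP giving a normalized P(l_i | a_i, t_i, z_i).
    Input dimension is M (appearance) + 2 (position) + K (one-hot topic).
    Illustrative sketch only; sizes and initialization are assumptions."""

    def __init__(self, M, K, num_classes, num_hidden=15, seed=0):
        rng = np.random.default_rng(seed)
        d_in = M + 2 + K
        self.W1 = rng.normal(scale=0.1, size=(num_hidden, d_in))
        self.b1 = np.zeros(num_hidden)
        self.W2 = rng.normal(scale=0.1, size=(num_classes, num_hidden))
        self.b2 = np.zeros(num_classes)
        self.K = K

    def predict(self, a_i, t_i, z_i):
        """Return a distribution over the label set L for one subregion."""
        x = np.concatenate([a_i, t_i, one_hot(z_i, self.K)])
        h = np.tanh(self.W1 @ x + self.b1)
        return softmax(self.W2 @ h + self.b2)
```

In Model II below, the one-hot topic vector is replaced by a weighted combination of neighboring topic indicators, which is still a point in $[0, 1]^K$, so the same classifier interface applies.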
(b) Image appearance module $P(a_i \mid z_i)$. We follow the convention of topic models and model the topic-conditional distributions of the image appearance using a multinomial distribution with parameters $\beta_{z_i}$. As the appearance features typically take on real values, we first apply k-means clustering to the image features $\{a_i\}$ to build a visual vocabulary V. Thus a feature $a_i$ in the appearance space A can be represented as a visual word v, and we have $P(a_i = v \mid z_i = k) = \beta_{k,v}$.

While the topic prediction model in Equation 1 is able to capture regularly co-occurring patterns in the joint space of label and appearance, it ignores spatial priors on the label prediction. However, spatial priors, such as spatial smoothness, are crucial for labeling tasks, as neighboring labels are usually strongly correlated. To incorporate spatial information, we extend our base model in two different ways, as follows.

Figure 1: Left: A graphical representation of the base topic prediction model (Model I). Middle: Model II. Right: Model III. Circular nodes are random variables, and shaded nodes are observed. N is the number of image features in each image, and D denotes all the training data.

Model II. We introduce a dependency between each label variable and its neighboring topic variables. In this model, each label value is predicted based on summary information about the topics within a neighborhood. More specifically, we change the label prediction model into the following form:

P(l_i \mid a_i, t_i, z_{N(i)}) = P\Big(l_i \,\Big|\, a_i, t_i, \sum_{j \in N(i)} w_j z_j\Big) \qquad (2)

where N(i) is a predefined neighborhood of site i, and $w_j$ is the weight for the topic variable $z_j$. We set $w_j \propto \exp(-|t_i - t_j|/\sigma^2)$, normalized to sum to 1, i.e., $\sum_{j \in N(i)} w_j = 1$. The graphical representation is shown in the middle panel of Figure 1. This model variant can be viewed as an extension of supervised LDA [2]; here, however, rather than a single label applying to each input example, there are multiple labels, one for each element of x.

Model III. We add lateral connections between label variables to build a Conditional Random Field over the labels. The joint label distribution given the input image is defined as

P(l \mid a, t) = \frac{1}{Z} \exp\Big\{ \sum_i \sum_{j \in N(i)} f(l_i, l_j) + \lambda \sum_i \log P_b(l_i \mid a, t) \Big\} \qquad (3)

where Z is the partition function. The pairwise potential is $f(l_i, l_j) = \sum_{a,b} u_{ab}\, \delta_{l_i,a}\, \delta_{l_j,b}$, and the unary potential is defined as the log output of the base topic prediction model weighted by $\lambda$; here $\delta$ is the Kronecker delta function. Note that $P_b(l_i \mid a, t) = \sum_{z_i} P(l_i \mid a_i, t_i, z_i) P(z_i \mid a, t)$. This model is shown in the right panel of Figure 1.

Note that the base model (Model I) obtains spatially smooth labels simply through the topics capturing location-dependent co-occurring appearance/label patterns, which tend to be nearby in image space. Model II explicitly predicts a region's label from the topics in its local neighborhood, so that neighboring labels share a similar context defined by the latent topics. In both of these models, the interaction between labels takes effect through the hidden input representation.
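As an illustration of how the latent representation feeds the Model II predictor, the sketch below computes the normalized neighborhood weights and the weighted topic summary $\sum_{j \in N(i)} w_j q(z_j)$ used at inference time (Section 3); the exact exponent in the weight and the function names reflect our reading of the description above, not code from the paper.

```python
import numpy as np

def neighborhood_weights(positions, i, sigma=10.0):
    """Weights w_j over sites j in a circular neighborhood of radius 2*sigma around
    site i, with w_j proportional to exp(-|t_i - t_j| / sigma^2), normalized to 1.
    positions: (N, 2) array of super-pixel centers."""
    d = np.linalg.norm(positions - positions[i], axis=1)   # distances |t_i - t_j|
    w = np.where(d <= 2.0 * sigma, np.exp(-d / sigma**2), 0.0)
    return w / w.sum()

def topic_summary(positions, i, q, sigma=10.0):
    """Weighted topic vector sum_j w_j q(z_j) for site i; q is an (N, K) array of
    per-site topic posteriors, and the result lies in [0, 1]^K."""
    w = neighborhood_weights(positions, i, sigma)
    return w @ q
```

The Model I classifier can then consume this vector in place of the one-hot topic indicator.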
The third model uses a conventional form of spatial dependency by directly incorporating local smoothing in the label field. While this structure may impose a stronger spatial prior than the other two, it also requires more complicated learning methods.

3 Inference and Label Prediction

Given a new image x = {a, t} and our topic models, we predict its labeling based on the Maximum Posterior Marginals (MPM) criterion:

l_i^* = \arg\max_{l_i} P(l_i \mid a, t) \qquad (4)

We consider the label inference procedure for the three models separately, as follows.

Models I & II: The marginal label distribution $P(l_i \mid a, t)$ can be computed as

P(l_i \mid a, t) = \sum_{z_{N(i)}} P\Big(l_i \,\Big|\, a_i, t_i, \sum_{j \in N(i)} w_j z_j\Big) P(z_{N(i)} \mid a, t) \qquad (5)

The summation here is difficult when N(i) is large. However, it can be approximated as follows. Denote $v_i = \sum_{j \in N(i)} w_j z_j$ and $v_{i,q} = \sum_{j \in N(i)} w_j q(z_j)$, where $q(z_j) = \{P(z_j \mid a, t)\}$ is the vector form of the posterior distribution. Both $v_i$ and $v_{i,q}$ lie in $[0, 1]^K$. The marginal label distribution can be written as $P(l_i \mid a, t) = \langle P(l_i \mid a_i, t_i, v_i) \rangle_{P(z_{N(i)} \mid a, t)}$. We take the first-order Taylor approximation of $P(l_i \mid a_i, t_i, v_i)$ around $v_{i,q}$:

P(l_i \mid a_i, t_i, v_i) \approx P(l_i \mid a_i, t_i, v_{i,q}) + (v_i - v_{i,q})^T \cdot \nabla_{v_i} P(l_i \mid a_i, t_i, v_i)\big|_{v_{i,q}} \qquad (6)

Taking the expectation of both sides of Equation 6 with respect to $P(z_{N(i)} \mid a, t)$ (note that $\langle v_i \rangle_{P(z_{N(i)} \mid a, t)} = v_{i,q}$), we obtain the approximation $P(l_i \mid a, t) \approx P(l_i \mid a_i, t_i, \sum_{j \in N(i)} w_j q(z_j))$.

Model III: We first compute the unary potential of the CRF model from the base topic prediction model, i.e., $P_b(l_i \mid a, t) = \sum_{z_i} P(l_i \mid a_i, t_i, z_i) P(z_i \mid a, t)$. Then the label marginals in Equation 4 are computed by applying loopy belief propagation to the conditional random field.

In both situations, we need the conditional distribution of the hidden topic variables z given the observed data components to compute the label prediction. We take a Gibbs sampling approach, integrating out the Dirichlet variable $\theta$. From Equation 1, we can derive the posterior of each topic variable $z_i$ given the other variables, as required by Gibbs sampling:

P(z_i = k \mid z_{-i}, a_i) \propto P(a_i \mid z_i = k)\Big(\alpha_k + \sum_{m \in S \setminus i} \delta_{z_m, k}\Big) \qquad (7)

where $z_{-i}$ denotes all the topic variables in z except $z_i$, and S is the set of all sites. Given the samples of the topic variables, we estimate their posterior marginal distribution $P(z_i \mid a, t)$ by simply computing normalized histograms.

4 Learning with partially labeled data

Here we consider estimating the parameters of our models from a partially labeled image set $D = \{x^n, l^n\}$. For an image $x^n$, its label set is $l^n = (l^n_o, l^n_h)$, in which $l^n_o$ denotes the observed labels and $l^n_h$ the missing ones. We also use o to denote the set of labeled regions. As the three models are built with different components, we treat them separately.

Models I & II. We use the Maximum Likelihood criterion to estimate the model parameters. Let $\Theta$ be the parameter set of the model:

\Theta^* = \arg\max_{\Theta} \sum_n \log P(l^n_o, a^n \mid t^n; \Theta) \qquad (8)

We maximize the log data likelihood by Monte Carlo EM. The lower bound of the likelihood can be written as

Q = \sum_n \Big\langle \sum_{i \in o} \log P(l^n_i \mid a^n_i, t^n_i, z^n_{N(i)}) + \sum_i \log P(a^n_i \mid z^n_i) + \log P(z^n) \Big\rangle_{P(z^n \mid l^n_o, a^n)} \qquad (9)

In the E step, the posterior distributions of the topic variables are estimated by a Gibbs sampling procedure similar to Equation 7, using the following conditional probability:

P(z_i = k \mid z_{-i}, a, l, t) \propto \Big[\prod_{j \in N(i) \cap o} P(l_j \mid a_j, t_j, z_{N(j)})\Big] P(a_i \mid z_i = k)\Big(\alpha_k + \sum_{m \in S \setminus i} \delta_{z_m, k}\Big) \qquad (10)

Note that any label variable is marginalized out if it is missing.
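The E-step sampler can be summarized as follows. This is a minimal sketch of the collapsed Gibbs update of Equation 7 for a single image, with per-image topic counts standing in for $\sum_{m \in S\setminus i} \delta_{z_m,k}$; the label factor of Equation 10 is omitted for brevity, and all names are illustrative.

```python
import numpy as np

def gibbs_sweep(z, words, beta, alpha, rng):
    """One collapsed Gibbs sweep over the topic assignments of a single image.
    z: (N,) current integer topic assignments; words: (N,) visual-word indices;
    beta: (K, V) appearance multinomials; alpha: (K,) Dirichlet parameters."""
    K = beta.shape[0]
    counts = np.bincount(z, minlength=K).astype(float)
    for i in range(len(z)):
        counts[z[i]] -= 1.0                        # leave site i out of the counts
        p = beta[:, words[i]] * (alpha + counts)   # Equation 7, up to normalization
        z[i] = rng.choice(K, p=p / p.sum())
        counts[z[i]] += 1.0
    return z

def topic_posteriors(z_samples, K):
    """Normalized histograms of the samples, i.e., the estimate of P(z_i | a, t).
    z_samples: (num_samples, N) array of sampled assignments."""
    hist = np.stack([np.bincount(z_samples[:, i], minlength=K)
                     for i in range(z_samples.shape[1])])
    return hist / z_samples.shape[0]
```

Several sweeps would typically be run before collecting samples; the experiments described later use 500 samples per E step.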
In the M step, we update the model parameters by maximizing the lower bound Q. Denoting the posterior distribution of z as $q(\cdot)$, the update equation for the parameters of the appearance module $P(a \mid z)$ can be derived from the stationary point of Q:

\beta_{k,v} \propto \sum_{n,i} q(z^n_i = k)\, \delta(a^n_i, v) \qquad (11)

The classifier in the label prediction module is learned by maximizing the following log likelihood:

L_c = \sum_{n,\, i \in o} \Big\langle \log P\Big(l^n_i \,\Big|\, a^n_i, t^n_i, \sum_{j \in N(i)} w_j z^n_j\Big) \Big\rangle_{q(z^n_{N(i)})} \approx \sum_{n,\, i \in o} \log P\Big(l^n_i \,\Big|\, a^n_i, t^n_i, \sum_{j \in N(i)} w_j q(z^n_j)\Big) \qquad (12)

where the approximation takes the same form as in Equation 6. We use a gradient ascent algorithm to update the classifier parameters. Note that we need to run only a few iterations in each M step, which reduces training time.

Model III. We estimate the parameters of Model III in two stages. (1) The parameters of the base topic prediction model are learned using the same procedure as for Models I & II; more specifically, we set N(i) = {i} and estimate the parameters of the appearance module and label classifier by Maximum Likelihood. (2) Given the base topic prediction model, we compute the marginal label probability $P_b(l_i \mid a, t)$ and plug it into the unary potential function of the CRF model (see Equation 3). We then estimate the parameters of the CRF by maximizing the conditional pseudo-likelihood:

L_p = \sum_n \sum_{i \in o} \Big[ \sum_{j \in N(i)} \sum_{a,b} u_{ab}\, \delta_{l^n_i,a}\, \delta_{l^n_j,b} + \lambda \log P_b(l^n_i \mid a^n, t^n) - \log Z^n_i \Big] \qquad (13)

where $Z^n_i = \sum_{l_i} \exp\big\{ \sum_{j \in N(i)} \sum_{a,b} u_{ab}\, \delta_{l_i,a}\, \delta_{l^n_j,b} + \lambda \log P_b(l_i \mid a^n, t^n) \big\}$ is the normalizing constant. As this cost function is convex, we use a simple gradient ascent method to optimize the conditional pseudo-likelihood.

5 Experimental evaluation

Data sets and representation. Our experiments are based on three image datasets. The first is a subset of the Microsoft Research Cambridge (MSRC) Image Database [14], as in [16]. This subset includes 240 images and 9 label classes. The second set is the full MSRC image dataset, comprising 591 images and 21 object classes. The third set is a labeled subset of the Corel database as in [5] (referred to therein as Corel-B). It includes 305 manually labeled images with 11 classes, focusing on animals and natural scenes.

We use the normalized cut segmentation algorithm [13] to build a super-pixel representation of the images, with the segmentation algorithm tuned to generate approximately 1000 segments per image on average. We extract a set of basic image features, including color, edge and texture information, from each pixel site. For the color information, we transform the RGB values into CIE L*a*b* color space. The edge and texture features are extracted by a set of filter banks, including a difference-of-Gaussian filter at 3 different scales, and quadrature pairs of oriented even- and odd-symmetric filters at 4 orientations and 3 scales. The color descriptor of a super-pixel is the average color over the pixels in that super-pixel. For the edge and texture descriptors, we first discretize the edge/texture feature space by k-means and use each cluster as a bin; we then compute the normalized histograms of the edge/texture features within a super-pixel as the edge/texture descriptor. In the experiments reported here, we used 20 bins for edge information and 50 bins for texture information. We also augment each feature vector with a SIFT descriptor extracted from a 30 × 30 image patch centered at the center of the super-pixel. The image position of a super-pixel is the average position of its pixels.
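As a rough illustration of the representation just described, the following sketch assembles one super-pixel descriptor from precomputed per-pixel quantities; the array shapes, bin counts, and helper name are our assumptions, and the SIFT vector is assumed to be computed elsewhere.

```python
import numpy as np

def superpixel_descriptor(lab_pixels, edge_ids, texture_ids, sift_vec,
                          n_edge_bins=20, n_texture_bins=50):
    """Concatenate the mean CIE L*a*b* color, normalized edge and texture histograms
    over quantized filter responses, and a SIFT vector taken from a patch at the
    super-pixel center.
    lab_pixels: (P, 3) colors; edge_ids, texture_ids: (P,) cluster indices."""
    color = lab_pixels.mean(axis=0)
    edge_hist = np.bincount(edge_ids, minlength=n_edge_bins).astype(float)
    edge_hist /= max(edge_hist.sum(), 1.0)
    tex_hist = np.bincount(texture_ids, minlength=n_texture_bins).astype(float)
    tex_hist /= max(tex_hist.sum(), 1.0)
    return np.concatenate([color, edge_hist, tex_hist, sift_vec])
```

These descriptors are then quantized by k-means to form the visual vocabulary, as described next.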
To compute the vocabulary of visual words for the topic model, we apply k-means to group the super-pixel descriptors into clusters. The cluster centers are used as visual words, and each descriptor is encoded by its word index.

Comparison methods. We compare our approach directly with two baseline systems: a super-pixel-wise classifier and a basic CRF model. We also report the experimental results from [16], although they adopt a different data representation in their experiments (patches rather than super-pixels). The super-pixel-wise classifier is an MLP with one hidden layer, which predicts labels for each super-pixel independently. The MLP has 30 hidden units, a number chosen based on validation performance. In the basic CRF, the conditional distribution of the labels of an image is defined as

P(l \mid a, t) \propto \exp\Big\{ \sum_i \sum_{j \in N(i)} \sum_{u,v} \gamma_{u,v}\, \delta_{l_i,u}\, \delta_{l_j,v} + \sum_i h(l_i \mid a_i, t_i) \Big\} \qquad (14)

where $h(\cdot)$ is the log output of the super-pixel classifier. We train the CRF model by maximizing its conditional pseudo-likelihood, and label the image based on the marginal distribution of each label variable, computed by loopy belief propagation.

Performance on MSRC-9. Following the setting in [16], we randomly split the dataset into training and testing sets of equal size, and use 10% of the training data as our validation set. In this experiment, we set the vocabulary size to 500, the number of hidden topics to 50, and each symmetric Dirichlet parameter $\alpha_k = 0.5$, based on validation performance. For Model II, we define the neighborhood of each site i as the subset of sites that fall into a circular region centered at i with radius $2\sigma$, where $\sigma$ is the fall-off rate of the weights. We set $\sigma$ to 10 pixels, which is roughly 1/20 of the image size. The classifiers for label prediction have 15 hidden units. The appearance model for the topics and the classifier are initialized randomly. In the learning procedure, the E step uses 500 samples to estimate the posterior distribution of the topics; in the M step, we take 3 steps of gradient ascent on the classifiers per iteration.

The performance of our models is first evaluated on the dataset with all labels available. We compare the three model variants to the super-pixel classifier (S Class) and the CRF model. Table 1 shows the average classification accuracy of our models and the baselines for each class and in total, over 10 different random partitions of the dataset.

Table 1: A comparison of the classification accuracy of the 3 variants of our model with other methods. The average classification accuracy is at the pixel level.

Label      S Class   CRF    Model I   Model II   Model III   [16]
building   61.2      69.8   64.8      79.2       78.1        73.6
grass      93.2      94.4   93.0      94.1       92.5        91.1
tree       71.3      82.1   76.6      81.4       85.4        82.1
cow        57.0      73.3   72.0      80.2       86.7        73.6
sky        92.9      94.2   93.5      93.5       94.6        95.7
plane      37.5      62.0   65.1      72.4       77.9        78.3
face       69.0      80.5   74.4      86.3       83.5        89.5
car        56.0      80.1   61.3      69.5       74.7        84.5
bike       54.1      78.6   77.7      86.2       88.3        81.4
Total      74.2      83.5   79.7      85.5       86.7        84.9

We can see that Model I, which uses the latent feature representation as an additional input, achieves much better performance than the S Class baseline. Models II and III improve the accuracy further by incorporating the label spatial priors. We notice that the lateral connections between label variables are more effective than integrating information from neighboring latent topic variables; this is also demonstrated by the good performance of the simple CRF.
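For reference, a small utility of the kind behind the pixel-level numbers in Table 1 might look like the following; the treatment of `void' pixels and the function name are our assumptions, not the authors' evaluation code.

```python
import numpy as np

def pixel_accuracy(pred, gt, num_classes, void_label=-1):
    """Per-class and overall pixel-level accuracy, ignoring 'void' pixels.
    pred, gt: integer label maps of the same shape."""
    valid = gt != void_label
    per_class = np.full(num_classes, np.nan)
    for c in range(num_classes):
        m = valid & (gt == c)
        if m.any():
            per_class[c] = float((pred[m] == c).mean())
    total = float((pred[valid] == gt[valid]).mean())
    return per_class, total
```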
Learning with different amounts of label data. To test the robustness of the latent feature representation, we evaluate our models using data with different amounts of labeling information. We apply an image dilation operator to the image regions labeled as `void', and control the proportion of labeled data by varying the diameter of the dilation operator (see [16] for similar processing). Specifically, we use diameter values of 5, 10, 15, 20, 25, 30 and 35, which change the proportion of labeled pixels to 62.9%, 52.1%, 44.1%, 36.4%, 30.5%, 24.9% and 20.3%, respectively; the original proportion is 71.9%. We report the average accuracies over 5 runs of training and testing with random equal partitions of the dataset in Figure 2.

Figure 2: Left: Classification accuracy with gradually decreasing proportion of labeled pixels. Right top: An example image and its super-pixelization. Right bottom: Examples of the original labeling and the labeling after dilation (at a labeled proportion of 36.4%).

The figure shows that the performance of all three models degrades with less labeled data, but the degradation is relatively gradual. When the proportion of labeled data decreases from 72% to 20%, the total loss in accuracy is less than 10%. This suggests that incorporating latent features makes our models more robust to missing labels than the previous work (cf. [16]). We also note that the performance of Model III is more robust than that of the other two variants, which may derive from its stronger smoothing.

Performance on other sets. We further evaluate our models on two larger datasets to see whether they scale up. The first is the full version of the MSRC dataset, for which we use the same training/testing partition as in [14]. The model setting is the same as for MSRC-9, except that we use an MLP with 20 hidden units for label prediction. The second is the Corel-B dataset, which is divided randomly into 175 training images and 130 testing images. We use the same model settings as in the experiments on the full MSRC set. Table 2 summarizes the classification accuracies of our models as well as some previous methods.

Table 2: A comparison of the classification accuracy of our three model variants with other methods on the full MSRC dataset and the Corel-B dataset.

            MSRC   Corel-B
S Class     60.0   68.2
Model I     65.9   69.2
Model II    72.3   73.4
Model III   74.0   75.5
[14]        72.2   -
[5]         -      75.3

For the full MSRC set, the two extended versions of our model achieve performance similar to [14], and we can see that the latent topic representation provides useful cues. Also, our models reach the same accuracy as reported in [5] on the Corel-B dataset, while using a simpler label random field and a smaller training set. It is interesting to note that the topics and spatial smoothness play a smaller role in the labeling performance on Corel-B. Figure 3 shows some examples of labeling results from both datasets.

Figure 3: Some labeling results for the Corel-B (bottom panel) and MSRC-9 (top panel) datasets, based on the best performance of our models. The `void' region is annotated in black.

We can see that our models handle extended regions better than fine object structures, due to the tendency toward (over)smoothing caused by the super-pixelization and the two spatial dependency structures.

6 Discussion

In this paper, we presented a hybrid framework for image labeling which combines a generative topic model with discriminative label prediction models. The generative model extends latent Dirichlet allocation to capture joint patterns in the label and appearance space of images.
This latent representation of an image then provides an additional input to the label predictor. We also incorporated spatial dependencies into the model structure in two different ways, both imposing a prior of spatial smoothness on the labeling of the image plane. The results of applying our methods to three different image datasets suggest that this integrated approach may extend to a variety of image databases for which only partial labeling is available. The labeling system consistently outperforms alternative approaches, such as a standard classifier and a standard CRF. Its performance also matches that of state-of-the-art approaches, and it is robust to different amounts of missing labels.

Several avenues exist for future work. First, we would like to understand when the simple first-order approximation used in inference for Model II holds, e.g., how it behaves when the local curvature of the classifier with respect to its input is large. In addition, it is important to address model selection issues, such as the number of topics; we currently rely on the validation set, but more principled approaches are possible. A final issue concerns the reliance on visual words formed by clustering features in a complicated appearance space. Using a stronger appearance model may help us understand the role of different visual cues, as well as construct a more powerful generative model.

References

[1] Yasemin Altun, David McAllester, and Mikhail Belkin. Maximum margin semi-supervised learning for structured variables. In NIPS 18, 2006.
[2] David Blei and Jon McAuliffe. Supervised topic models. In NIPS 20, 2008.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[4] Xuming He, Richard Zemel, and Miguel Carreira-Perpinan. Multiscale conditional random fields for image labelling. In CVPR, 2004.
[5] Xuming He, Richard S. Zemel, and Debajyoti Ray. Learning and incorporating top-down cues in image segmentation. In ECCV, 2006.
[6] Michael Kelm, Chris Pal, and Andrew McCallum. Combining generative and discriminative methods for pixel classification with multi-conditional learning. In ICPR, 2006.
[7] Sanjiv Kumar and Martial Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In ICCV, 2003.
[8] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
[9] Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled hybrids of generative and discriminative models. In CVPR, 2006.
[10] Chi-Hoon Lee, Shaojun Wang, Feng Jiao, Dale Schuurmans, and Russell Greiner. Learning to model spatial dependency: Semi-supervised discriminative random fields. In NIPS 19, 2007.
[11] Nicolas Loeff, Himanshu Arora, Alexander Sorokin, and David Forsyth. Efficient unsupervised learning for localization and detection in object categories. In NIPS, 2006.
[12] B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: A database and web-based tool for image annotation. Technical report, MIT AI Lab Memo AIM-2005-025, 2005.
[13] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 2000.
[14] Jamie Shotton, John M. Winn, Carsten Rother, and Antonio Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.
[15] Jakob Verbeek and Bill Triggs. Region classification with Markov field aspect models. In CVPR, 2007.
[16] Jakob Verbeek and Bill Triggs. Scene segmentation with CRFs learned from partially labeled images. In NIPS 20, 2008.
[17] Gang Wang, Ye Zhang, and Li Fei-Fei. Using dependent regions for object categorization in a generative framework. In CVPR, 2006.
[18] Xiaogang Wang and Eric Grimson. Spatial latent Dirichlet allocation. In NIPS, 2008.