Scene Segmentation with CRFs Learned from Partially Labeled Images

Jakob Verbeek and Bill Triggs
INRIA and Laboratoire Jean Kuntzmann, 655 avenue de l'Europe, 38330 Montbonnot, France

Abstract

Conditional Random Fields (CRFs) are an effective tool for a variety of different data segmentation and labelling tasks including visual scene interpretation, which seeks to partition images into their constituent semantic-level regions and assign appropriate class labels to each region. For accurate labelling it is important to capture the global context of the image as well as local information. We introduce a CRF-based scene labelling model that incorporates both local features and features aggregated over the whole image or large sections of it. Second, traditional CRF learning requires fully labelled datasets. Complete labellings are typically costly and troublesome to produce. We introduce an algorithm that allows CRF models to be learned from datasets where a substantial fraction of the nodes are unlabeled. It works by marginalizing out the unknown labels so that the log-likelihood of the known ones can be maximized by gradient ascent. Loopy Belief Propagation is used to approximate the marginals needed for the gradient and log-likelihood calculations, and the Bethe free-energy approximation to the log-likelihood is monitored to control the step size. Our experimental results show that incorporating top-down aggregate features significantly improves the segmentations and that effective models can be learned from fragmentary labellings. The resulting methods give scene segmentation results comparable to the state-of-the-art on three different image databases.

1 Introduction

In visual scene interpretation the goal is to assign image pixels to one of several semantic classes or scene elements, thus jointly performing segmentation and recognition. Scene interpretation is useful in a variety of applications ranging from keyword-based image retrieval (using the segmentation to automatically index images) to autonomous vehicle navigation [1]. Random field approaches are a popular way of modelling spatial regularities in images. Their applications range from low-level noise reduction [2] to high-level object or category recognition (this paper) and semi-automatic object segmentation [3]. Early work focused on generative modeling using Markov Random Fields, but recently Conditional Random Field (CRF) models [4] have become popular owing to their ability to directly predict the segmentation/labelling given the observed image and the ease with which arbitrary functions of the observed features can be incorporated into the training process. CRF models can be applied either at the pixel level [5, 6, 7] or at the coarser level of super-pixels or patches [8, 9]. In this paper we label images at the level of small patches, using CRF models that incorporate both purely local (single patch) feature functions and more global `context capturing' feature functions that depend on aggregates of observations over the whole image or large regions. Traditional CRF training algorithms require fully labelled training data. In practice it is difficult and time-consuming to label every pixel in an image, and most of the available image interpretation datasets contain unlabelled pixels. Working at the patch level exacerbates this because patches often contain several different pixel-level labels.
Our CRF training algorithm handles both situations by allowing partial and mixed labellings and optimizing the probability for the model segmentation to be consistent with the given labelling constraints. The rest of the paper is organized as follows. We describe our CRF model in the next section, present our training algorithm in Section 3, provide experimental results in Section 4, and conclude in Section 5.

2 A Conditional Random Field using Local and Global Image Features

We represent images as rectangular grids of patches at a single scale, associating a hidden class label with each patch. For CRF models the labels are coupled to their 4-neighbours. The local image content of each patch is encoded using texture, color and position descriptors as in [9]. For texture we compute the 128-dimensional SIFT descriptor [10] of the patch and vector quantize it by nearest-neighbour assignment against a k_s = 1000 word texton dictionary learned by k-means clustering of all patches in the training dataset. Similarly, for color we take the 36-D hue descriptor of [11] and vector quantize it against a k_h = 100 word color dictionary learned from the training set. Position is encoded by overlaying the image with an m × m grid of cells (m = 8) and using the index of the cell in which the patch falls as its position feature. In terms of observation functions, patch i is coded by three binary vectors with respectively k_s, k_h, and k_p = m^2 bits, each with a single bit set corresponding to the observed visual word. Note that we treat the three modalities as independent sources of information about the patch labels in the sense that we do not include CRF observation functions that couple them. (Generatively, the descriptors are modeled as being independent given the patch label.)

The naive Bayes model of the image drops all of the spatial couplings and assumes that each patch's label depends only on its three observation functions. Parameter estimation reduces to trivially counting observed visual word frequencies for each label class and feature type. On the MSRC 9-class image database this model returns an average classification rate of 67.4% (see Section 4), so isolated appearance alone does not suffice for reliable patch labelling. In recent years models based on histograms of visual words have proven very successful for image categorization (deciding whether or not the image as a whole belongs to a given category of scenes) [12]. Motivated by this, many of our models take the global image context into account by including observation functions based on image-wide histograms of their patches' visual words. The hope is that this will help to overcome the ambiguities that arise when patches are classified in isolation. To this end, we define a conditional model for patch labels that incorporates both the patch's local visual words and a global bias or `prior' based on the image-wide histogram. Let x ∈ {1, ..., C} denote the patch label, y ∈ {0, 1}^W (with W = k_s + k_h + k_p) denote the concatenated binary indicator vectors of the patch's three visual words, and h denote the normalized (sum-to-one) W-dimensional histogram of the visual words of all patches in the image (i.e., the normalized sum of the patches' y vectors). The conditional probability of the patch label is then modelled as

p(x = l \mid y, h) \propto \exp\Big( \sum_{w=1}^{W} \big( w_{lw}\, y_w + w'_{lw}\, h_w \big) \Big), \qquad (1)

where w and w' are W × C matrices of coefficients to be learned.
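To make Eq. (1) concrete, here is a minimal NumPy sketch of the single-patch conditional model. It is illustrative rather than the authors' implementation: the dictionary sizes follow the text, but the weights and visual-word assignments are random placeholders.

```python
# Single-patch conditional model of Eq. (1): local evidence plus global
# context, normalized with a softmax over the C classes.
import numpy as np

ks, kh, kp = 1000, 100, 64        # texton, color, and position vocabularies
W = ks + kh + kp                  # total number of visual words
C = 9                             # number of classes (MSRC experiments)

rng = np.random.default_rng(0)
w_loc = rng.normal(size=(C, W))   # coefficients for the patch's own words
w_glo = rng.normal(size=(C, W))   # coefficients for the image-wide histogram

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def encode(texton, hue, cell):
    """Concatenated indicator vectors of a patch's three visual words."""
    return np.concatenate([one_hot(texton, ks), one_hot(hue, kh),
                           one_hot(cell, kp)])

# Toy image of 600 patches; h is the normalized sum of their y vectors.
Y_all = np.stack([encode(rng.integers(ks), rng.integers(kh), rng.integers(kp))
                  for _ in range(600)])
h = Y_all.sum(axis=0) / Y_all.sum()

y = Y_all[0]                      # one patch's observation vector
scores = w_loc @ y + w_glo @ h    # sum_w (w_lw y_w + w'_lw h_w) per class l
p = np.exp(scores - scores.max())
p /= p.sum()                      # p(x = l | y, h) for l = 1..C
print(p)
```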
We can think of this as a multiplicative combination of a local classifier based on the patch-level observation y, and a global context or bias based on the image-wide histogram h. Taking the global context into account improves the average classification rate to 74.4%. We will refer to these single-patch classification models as `IND' below, reserving `CRF' for models with couplings between neighbouring labels.

To account for correlations among spatially neighboring patch labels, we add couplings between the labels of neighboring patches to the single-patch model (1). Let X denote the collection of all patch labels in the image, let Y denote the collected patch features, and use x_i, y_i to denote the label and features of patch i. Then our CRF model for the coupled patch labels is:

p(X \mid Y) \propto \exp\big( -E(X \mid Y) \big), \qquad (2)

E(X \mid Y) = -\sum_i \sum_{w=1}^{W} \big( w_{x_i w}\, y_{iw} + w'_{x_i w}\, h_w \big) + \sum_{i \sim j} \phi_{ij}(x_i, x_j), \qquad (3)

where i ∼ j ranges over the set of all adjacent (4-neighbor) pairs of patches i, j. We can write E(X|Y) without explicitly including h as an argument because h is a function of Y. We have explored two forms of pairwise potential: φ_ij(x_i, x_j) = σ_{x_i, x_j} [x_i ≠ x_j], and φ_ij(x_i, x_j) = (σ + τ d_ij) [x_i ≠ x_j], where [·] is one if its argument is true and zero otherwise, and d_ij is some similarity measure over the appearance of the patches i and j. In the first form, σ_{x_i, x_j} is a general symmetric weight matrix that needs to be learned. The second potential is a simplified special case of this that is designed to favor label transitions at image locations with high contrast. As in [3] we use d_ij = exp(−‖z_i − z_j‖² / (2λ)), with z_i ∈ ℝ³ denoting the average RGB value in the patch and λ the average of ‖z_i − z_j‖² between neighboring RGB values in the image. Models using the first form of potential will be denoted `CRFσ', and those using the second will be denoted `CRFστ', or `CRFτ' if σ has been fixed to zero. A graphical representation of the model is given in Figure 1.

[Figure 1: Graphical representation of the model with a single image-wide aggregate feature function denoted by h. Circular nodes indicate variable nodes x_i (in practice connected in a large regular 4-neighborhood over the image), and square nodes indicate feature functions. Arrows denote single-node potentials due to feature functions, and undirected edges represent pairwise potentials. The dashed lines indicate the aggregation of the single-patch observations y_i in h.]

3 Estimating a Conditional Random Field from Partially Labeled Images

Conditional models p(X|Y) are trained by maximizing the log-likelihood of correct classification of the training data, \sum_{n=1}^{N} \log p(X_n \mid Y_n). This usually requires completely labelled training data, i.e. a collection of N pairs (X_n, Y_n), n = 1, ..., N. When learning CRFs for image interpretation the requirement for complete labelling is burdensome, and it is interesting to consider the case where the training images are only partially labelled. Formally, we will assume that we are not given X exactly, but know that it takes a value in some given set A. This allows us to deal both with patches that are completely unlabelled and with ones that have a restricted but nontrivial set of possible labels. We thus propose to maximise the log-likelihood for the model to predict a value of X in the set A:

L = \log p(X \in A \mid Y) = \log \sum_{X \in A} p(X \mid Y) = \log \sum_{X \in A} \exp(-E(X \mid Y)) - \log \sum_{X} \exp(-E(X \mid Y)). \qquad (4)

Note that this reduces trivially to the usual maximum likelihood criterion if the training image is completely labelled.
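To make Eqs. (3) and (4) concrete, the following brute-force sketch (illustrative, not the authors' code) evaluates the partial-labelling objective exactly on a toy 2 × 2 grid of patches. The unary scores are random placeholders and the pairwise term is a constant Potts-style penalty standing in for the potentials above; for realistic grid sizes these sums are intractable, which is what motivates the approximations of the next section.

```python
# Brute-force evaluation of L = log p(X in A | Y) on a tiny grid (Eq. 4).
import itertools
import numpy as np

H, W_grid, C = 2, 2, 3                     # toy patch grid and label count
rng = np.random.default_rng(1)
unary = rng.normal(size=(H, W_grid, C))    # stand-in for the w y_i + w' h scores
penalty = 2.0                              # constant Potts cost for label disagreement

def energy(X):
    """Eq. (3): negated unary scores plus pairwise disagreement costs."""
    e = -sum(unary[i, j, X[i, j]] for i in range(H) for j in range(W_grid))
    for i in range(H):
        for j in range(W_grid):
            if i + 1 < H:      e += penalty * (X[i, j] != X[i + 1, j])
            if j + 1 < W_grid: e += penalty * (X[i, j] != X[i, j + 1])
    return e

def log_partition(allowed):
    """log of the sum of exp(-E(X|Y)) over labellings consistent with `allowed`."""
    logs = []
    for labels in itertools.product(*[allowed[i][j]
                                      for i in range(H) for j in range(W_grid)]):
        logs.append(-energy(np.array(labels).reshape(H, W_grid)))
    m = max(logs)
    return m + np.log(sum(np.exp(v - m) for v in logs))

full = [[range(C)] * W_grid for _ in range(H)]   # A with no constraints
partial = [[[0], range(C)],                      # patch (0,0) is class 0,
           [[1], [1, 2]]]                        # (1,0) is 1, (1,1) in {1,2}
print(log_partition(partial) - log_partition(full))   # Eq. (4), always <= 0
```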
To estimate the maximum likelihood parameters by gradient ascent we need the partial derivatives of L with respect to each model parameter θ:

\frac{\partial L}{\partial \theta} = \Big\langle \frac{\partial E(X \mid Y)}{\partial \theta} \Big\rangle_{p(X \mid Y)} - \Big\langle \frac{\partial E(X \mid Y)}{\partial \theta} \Big\rangle_{p(X \mid Y,\, X \in A)}. \qquad (5)

For partially labelled images the set A has O(C^k) elements, where k is the number of unlabeled patches and C is the number of possible labels. Therefore both of the terms in the gradient and the log-likelihood are intractable. The log-likelihood can be recognized as the difference between the log partition functions of p(X|Y, X ∈ A) and p(X|Y), which can be approximated using the Bethe free-energy approximation of these partition functions:

L \approx F_{\mathrm{Bethe}}\big( p(X \mid Y) \big) - F_{\mathrm{Bethe}}\big( p(X \mid Y,\, X \in A) \big). \qquad (6)

Both the gradient and the approximate log-likelihood can be expressed in terms of single-node and pairwise marginals, which we approximate using loopy belief propagation [13]. We evaluate (6) to adaptively set the gradient ascent step size. We did not encounter any convergence problems with loopy belief propagation, although we did not use damping. Note that this approach differs from heuristically ignoring all the gradient contributions from unlabeled patches, since the CRF model carries information from the partial labelling to unlabeled regions, in particular to regions close to, or surrounded by, labelled patches.

4 Experimental Results

In this section we analyze the performance of our segmentation models in detail and compare it to related state-of-the-art work. In our first set of experiments we use the Microsoft Research Cambridge (MSRC) database¹. This consists of 240 images of 213 × 320 pixels and their partial pixel-level labelings. The labelings assign pixels to one of nine classes: building, grass, tree, cow, sky, plane, face, car, and bike. However, about 30% of the pixels are not labeled. Some sample images and labelings are shown in Figure 4. In our experiments we divide the database into 120 images for training and 120 for testing, reporting average results over 20 random test-train divisions. We used 20 × 20 pixel patches with centers at 10 pixel intervals. To obtain a labeling of the patches, pixels are assigned to the nearest patch center. The set of possible labels for a patch is then the set of labels seen among the labels of its pixels, and we allow all labels if unlabeled pixels are assigned to the patch. Learning and inference take place at the patch level. To map the patch-level segmentation back to the pixel level we assign each pixel the marginal of the patch with the nearest center.² In the evaluation unlabeled pixels are not counted.

[Figure 2: Classification accuracy as a function of the quantization level c: aggregate features (AF) were computed in each of the cells of the c × c grid. Performance without AF (solid line), using the AF of a single c (dotted curve), using AFs for grids 1 × 1 up to c × c (solid curve), and using AFs for grids c × c up to 10 × 10 (dashed curve). Performances were measured with the individual patch classifier, using a single training and test set.]

Aggregate features at different scales. Clearly, the idea of using an image-wide histogram feature can be generalized to scales smaller than the complete image. We experimented with a range of c × c grids over the image. In each cell of the grid we compute a frequency histogram over the visual words, and for all the patches in the cell we add a term to the energy in the same way the image-wide histogram appears in Eq. (1).
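The sketch below shows one way such aggregate features could be computed; it is an assumption on our part rather than the authors' code. Each cell of a c × c grid receives the normalized visual-word histogram of the patches falling in it, and that histogram is broadcast back to each of those patches so it can enter the energy in the same way as the image-wide h in Eq. (1).

```python
# Illustrative sketch of c x c aggregate features over a patch grid.
import numpy as np

def cell_histograms(word_ids, vocab_size, c):
    """word_ids: (H, Wp) array of one visual-word index per patch.
    Returns (H, Wp, vocab_size): the aggregate feature of each patch's cell."""
    H, Wp = word_ids.shape
    hists = np.zeros((c, c, vocab_size))
    # Map each patch row/column to a cell of the c x c grid.
    rows = np.minimum((np.arange(H) * c) // H, c - 1)
    cols = np.minimum((np.arange(Wp) * c) // Wp, c - 1)
    for i in range(H):
        for j in range(Wp):
            hists[rows[i], cols[j], word_ids[i, j]] += 1
    # Normalize each cell's histogram (guard against empty cells).
    hists /= np.maximum(hists.sum(axis=2, keepdims=True), 1)
    return hists[rows][:, cols]      # broadcast back to the patch grid

rng = np.random.default_rng(2)
ids = rng.integers(0, 50, size=(20, 31))        # toy 20 x 31 patch grid
agg1 = cell_histograms(ids, 50, 1)              # image-wide histogram (c = 1)
agg4 = cell_histograms(ids, 50, 4)              # finer 4 x 4 aggregates
print(agg1.shape, agg4.shape)                   # (20, 31, 50) for both
```

With c = 1 this reduces to an image-wide histogram; combining several values of c corresponds to the multi-scale variants compared in Figure 2.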
In Figure 2 we show how the performance of the individual patch classifier depends on the use of aggregate features. From the dotted curve in the figure we see that the aggregate features over large cells are generally more informative, although even a fine 10 × 10 grid (with only 6 to 12 patches per cell) leads to a significant performance increase. Furthermore, combining aggregate features computed at different scales does help, but the increase in performance is small compared to the gain obtained with just the image-wide aggregate feature. Therefore we used only image-wide features in the subsequent experiments.

¹ Available from http://research.microsoft.com/vision/cambridge/recognition.
² The segmentations shown in Figure 4 were post-processed by applying a Gaussian filter over the pixel marginals, with the scale set to half the patch spacing.

| Model | building (16.1%) | grass (32.4%) | tree (12.3%) | cow (6.2%) | sky (15.4%) | plane (2.2%) | face (4.4%) | car (9.5%) | bike (1.5%) | average |
|---|---|---|---|---|---|---|---|---|---|---|
| Schroff et al. [14] | 56.7 | 84.8 | 76.4 | 83.8 | 81.1 | 53.8 | 68.5 | 71.4 | 72.0 | 75.2 |
| Verbeek et al. [9] | 74.0 | 88.7 | 64.4 | 77.4 | 95.7 | 92.2 | 88.8 | 81.1 | 78.7 | 82.3 |
| unlabeled removed | 84.6 | 91.0 | 76.6 | 70.6 | 91.3 | 43.9 | 77.8 | 71.4 | 30.6 | 78.4 |
| IND loc only | 63.8 | 88.3 | 51.9 | 56.7 | 88.4 | 28.6 | 64.0 | 60.7 | 24.9 | 67.1 |
| IND loc+glo | 69.2 | 88.1 | 70.1 | 69.3 | 89.1 | 44.8 | 78.1 | 67.8 | 40.8 | 74.4 |
| CRFτ loc only | 75.0 | 88.6 | 72.7 | 70.5 | 94.7 | 55.5 | 83.2 | 81.4 | 69.1 | 80.7 |
| CRFτ loc+glo | 73.6 | 91.1 | 82.1 | 73.6 | 95.7 | 78.3 | 89.5 | 84.5 | 81.4 | 84.9 |
| CRFσ loc only | 71.4 | 86.8 | 80.2 | 81.0 | 94.2 | 63.8 | 86.3 | 85.7 | 77.3 | 82.3 |
| CRFσ loc+glo | 74.6 | 88.7 | 82.5 | 82.2 | 93.9 | 61.7 | 88.8 | 82.8 | 76.8 | 83.3 |
| CRFστ loc only | 65.6 | 85.4 | 78.2 | 74.3 | 95.4 | 61.8 | 84.8 | 85.2 | 79.4 | 80.3 |
| CRFστ loc+glo | 75.0 | 88.5 | 82.3 | 81.0 | 94.4 | 60.6 | 88.7 | 82.2 | 76.1 | 83.1 |

Table 1: Classification accuracies on the 9-class data set using different models. In brackets are the frequencies of the different classes in the data set.

Comparing the performance of different forms of the model. To determine the relative contribution of different components of our CRF model we compare several variants in Table 1. Models not using aggregate features are marked `loc only'; models that also use the global histogram are marked `loc+glo'. First of all, it is clear that adding the local coupling of the `CRFτ' model to the individual patch classifiers significantly increases the accuracy: by 10.5% and 13.6% for `loc+glo' and `loc only' respectively. The improvement is particularly large for relatively rare classes when the global image histogram is not used; in that case the single-node potential is less informative and frequent classes are favored due to their large a priori probability. Interestingly, in our experiments we find that in combination with the image-wide aggregate feature (`loc+glo') the simplest form of the pairwise potential, `CRFτ', works better than `CRFσ' and `CRFστ', which use a more general form of pairwise potential. If we only use the local feature function (`loc only'), `CRFσ', which uses a class-dependent pairwise potential, outperforms `CRFτ'. Furthermore, the performance increase from adding the global feature is smallest when using the `CRFσ' model, which also implements contextual information, albeit only of a local nature. Apparently, the local label transition preferences in the `CRFσ' model help only in the absence of the global contextual information provided by the image-wide aggregate feature.

Comparison with ad-hoc removal of unlabeled patches from training images.
The maximum likelihood estimation procedure of Section 3 requires running loopy BP twice: once in p(X|Y) and once in p(X|Y, X ∈ A). A simple alternative for dealing with partially labeled training data, one which avoids running loopy BP twice, is to ensure that unlabeled patches do not enter into the model at all. To do this, we delete from the random field all nodes that correspond to unlabeled or partially labeled patches. After this operation one or more completely labeled connected components remain, and we maximize the log-likelihood of these labeled patches. This is equivalent to using the complete model and setting the pairwise potentials connected to unlabeled patches to zero. To see this, note that in this case the labels of unlabeled patches become independent of the other labels, so p(X|Y) and p(X|Y, X ∈ A) are equivalent for these patches and their contribution to the log-likelihood in Eq. (4) and the gradient in Eq. (5) vanishes.

Looking at the labeling of the training images in Figure 3 and Figure 4, we see that typically it is the transitions between different classes that remain unlabeled. Since we leave patches unlabeled if they contain unlabeled pixels, neighboring patches with different class labels are quite rare. Therefore, we expect that ignoring the unlabeled data as described above will lead to overestimation of the pairwise interaction. Indeed, learning the `CRFτ loc+glo' model by removing all unlabeled patches leads to an estimate τ = 11.5, whereas the maximum-likelihood estimate of (4) leads to τ = 1.9. The recognition accuracy obtained using the model learned by removal of unlabeled patches is given in Table 1 in the row `unlabeled removed', which should be compared to the results in the row `CRFτ loc+glo'. We see that accuracy drops in particular for the classes plane and bike, for which the labeling has a large boundary relative to the surface covered by the class, and which thus have many partially labeled patches. It is interesting to note that even though τ has been severely over-estimated, the CRF model still improves over the individual patch classification obtained with `IND loc+glo', though not for bike and only marginally for plane.

[Figure 3: Per-pixel recognition accuracy of `CRFτ loc+glo' and `IND loc+glo' when learning from increasingly eroded label images (left); the x-axis gives the disc radius together with the percentage of pixels that remain labeled: 0/71, 10/54, 20/40, 30/28, 40/21, 50/15. Example image with its original annotation, and erosions thereof with discs of radius 10 and 20 (right).]

Recognition as a function of the amount of labeling. In this experiment we consider how performance drops as the amount of labeled pixels decreases. We applied a morphological erosion operator to the manual annotations, varying the radius of the disk-shaped structuring element. In this way we obtain a series of annotations that resemble increasingly sloppy manual annotations; see the example in Figure 3. In the same figure we show the recognition performance of `CRFτ loc+glo' and `IND loc+glo' as a function of the size of the structuring element. Clearly the spatial structure of the CRF not only allows improved performance when learning from relatively well labeled images, but also maintains its advantage when learning from far sparser labelings. Note that the performance of `CRFτ loc+glo' when learning from label images eroded with a disc of radius 30 (28% of pixels labeled) is still above that of `IND loc+glo' learned from the original labeling (71% of pixels labeled). A sketch of the erosion procedure is given below.
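A minimal sketch of this erosion procedure, assuming SciPy's binary_erosion is available; the disc radii mirror the experiment, while the label image here is a synthetic stand-in rather than an MSRC annotation:

```python
# Illustrative sketch: erode each class's mask with a disk-shaped structuring
# element; pixels that survive in no class become unlabeled. Not the authors'
# code -- the blocky ground-truth image below is a synthetic placeholder.
import numpy as np
from scipy.ndimage import binary_erosion

def erode_labels(labels, radius, unlabeled=-1):
    """labels: (H, W) int array; `unlabeled` marks missing ground truth."""
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = (yy ** 2 + xx ** 2) <= radius ** 2      # disk structuring element
    out = np.full_like(labels, unlabeled)
    for c in np.unique(labels[labels != unlabeled]):
        out[binary_erosion(labels == c, structure=disk)] = c
    return out

# Synthetic stand-in for a 213 x 320 MSRC annotation: blocky class regions.
gt = (np.add.outer(np.arange(213) // 40, np.arange(320) // 40)) % 9
for r in (10, 20, 30):
    frac = (erode_labels(gt, r) != -1).mean()
    print(f"disc radius {r}: {100 * frac:.0f}% of pixels still labeled")
```

Comparison with related work.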
In Table 1 we compare our recognition results on the MSRC database with those reported in [14, 9]. We see that our CRF model clearly outperforms the sliding-window based approach of [14], and performs slightly better than the generative approach of [9].³

³ However, note that the results of [9] were obtained using a 90%–10% train-test partition of the images.

Using the Sowerby database and a subset of the Corel database we also compare our model with two CRF models that operate at the pixel level. The Sowerby database consists of 104 images of 96 × 64 pixels of urban and rural scenes labeled with 7 different classes: sky, vegetation, road marking, road surface, building, street objects, and cars. The subset of the Corel database contains 100 images of 180 × 120 pixels of nature scenes, also labeled with 7 classes: rhino/hippo, polar bear, water, snow, vegetation, ground, and sky. Here we used 10 × 10 pixel patches, with a spacing of 2 and 5 pixels for the Sowerby and Corel databases respectively. The other parameters were kept as before. In Table 2 we give the recognition accuracy averaged over pixels using our CRF and independent patch models, as well as results reported on these databases using TextonBoost [7] and the multi-scale CRF model of [5]. The total training time and the test time per image are also listed. The results show that on these databases our model performs comparably to the pixel-level approaches, while being relatively fast to train and test since it operates at the patch level and uses standard features, as opposed to the boosting procedure of [7].

| Model | Sowerby IND | Sowerby CRF | Sowerby train | Sowerby test | Corel IND | Corel CRF | Corel train | Corel test |
|---|---|---|---|---|---|---|---|---|
| TextonBoost [7] | 85.6% | 88.6% | 5h | 10s | 68.4% | 74.6% | 12h | 30s |
| He et al. [5] | 82.4% | 89.5% | Gibbs | Gibbs | 66.9% | 80.0% | Gibbs | Gibbs |
| CRF loc+glo (this paper) | 86.0% | 87.4% | 20min | 5s | 66.9% | 74.6% | 15min | 3s |

Table 2: Recognition accuracy and speeds on the Sowerby and Corel databases. `IND' and `CRF' give the accuracy of the independent and spatially coupled models; `train' is the total training time and `test' the test time per image.

5 Conclusion

We presented several image-patch-level CRF models for semantic image labelling that incorporate both local patch-level observations and more global contextual features based on aggregates of observations at several scales. We showed that partially labeled training images can be handled by maximizing the total likelihood of the image segmentations that comply with the partial labeling, using loopy BP and Bethe free-energy approximations for the calculations. This allowed us to learn effective CRF models from images where only a small fraction of the pixels were labeled and class transitions were not observed. Experiments on the MSRC database indicated that including image-wide aggregates of features is very helpful, while adding aggregates at smaller scales gives relatively little additional benefit. Comparative experiments showed that our patch-level CRFs perform comparably to state-of-the-art pixel-level models while being much more efficient, because the number of patches is much smaller than the number of pixels.

References

[1] P. Jansen, W. van der Mark, W. van den Heuvel, and F. Groen. Colour based off-road environment and terrain type classification. In Proceedings of the IEEE Conference on Intelligent Transportation Systems, pages 216–221, 2005.
[2] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
[3] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309–314, 2004.
[4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, volume 18, pages 282–289, 2001.
[5] X. He, R. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labelling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 695–702, 2004.
[6] S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1284–1291, 2005.
[7] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European Conference on Computer Vision, pages 1–15, 2006.
[8] P. Carbonetto, G. Dorkó, C. Schmid, H. Kück, and N. de Freitas. A semi-supervised learning approach to object recognition with spatial integration of local features and segmentation cues. In Toward Category-Level Object Recognition, pages 277–300, 2006.
[9] J. Verbeek and B. Triggs. Region classification with Markov field aspect models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007. To appear.
[10] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[11] J. van de Weijer and C. Schmid. Coloring local feature extraction. In Proceedings of the European Conference on Computer Vision, pages 334–348, 2006.
[12] J. Ponce, T. Berg, M. Everingham, D. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. Russell, A. Torralba, C. Williams, J. Zhang, and A. Zisserman. The 2005 PASCAL visual object classes challenge. In Selected Proceedings of the First PASCAL Challenges Workshop, 2006.
[13] B. Frey and D. MacKay. A revolution: belief propagation in graphs with cycles. In Advances in Neural Information Processing Systems, volume 10, 1998.
[14] F. Schroff, A. Criminisi, and A. Zisserman. Single-histogram class models for image segmentation. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2006.

[Figure 4: Samples from the MSRC, Sowerby, and Corel databases with segmentation and labeling. For each database, the panels show the input images, their manual labeling, and the `CRF loc+glo' result.]