Accepted to appear at Neural Information Processing Systems 2008. Preliminary version -- subject to revision.

Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes

Erik B. Sudderth and Michael I. Jordan
Electrical Engineering & Computer Science, University of California, Berkeley
sudderth@cs.berkeley.edu, jordan@cs.berkeley.edu

Abstract

We develop a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases. Examining a large set of manually segmented scenes, we show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman-Yor (PY) process. This nonparametric prior distribution leads to learning algorithms which discover an unknown set of objects, and segmentation methods which automatically adapt their resolution to each image. Generalizing previous applications of PY processes, we use Gaussian processes to discover spatially contiguous segments which respect image boundaries. Using a novel family of variational approximations, our approach produces segmentations which compare favorably to state-of-the-art methods, while simultaneously discovering categories shared among natural scenes.

1 Introduction

Images of natural environments contain a rich diversity of spatial structure at both coarse and fine scales. We would like to build systems which can automatically discover the visual categories (e.g., foliage, mountains, buildings, oceans) which compose such scenes. Because the "objects" of interest lack rigid forms, they are poorly suited to traditional, fixed aspect detectors. In simple cases, topic models can be used to cluster local textural elements, coarsely representing categories via a bag of visual features [1, 2].
However, spatial structure plays a crucial role in general scene interpretation [3], particularly when few labeled training examples are available. One approach to modeling additional spatial dependence begins by precomputing one, or several, segmentations of each input image [4-6]. However, low-level grouping cues are often ambiguous, and fixed partitions may improperly split or merge objects. Markov random fields (MRFs) have been used to segment images into one of several known object classes [7, 8], but these approaches require manual segmentations to train category-specific appearance models.

In this paper, we instead develop a statistical framework for the unsupervised discovery and segmentation of visual object categories. We approach this problem by considering sets of images depicting related natural scenes (see Fig. 1(a)). Using color and texture cues, our method simultaneously groups dense features into spatially coherent segments, and refines these partitions using shared appearance models. This extends the cosegmentation framework [9], which matches two views of a single object instance, to simultaneously segment multiple object categories across a large image database. Some recent work has pursued similar goals [6, 10], but robust object discovery remains an open challenge.

Our models are based on the Pitman-Yor (PY) process [11], a nonparametric Bayesian prior on infinite partitions. This generalization of the Dirichlet process (DP) [12] leads to heavier-tailed, power law distributions for the frequencies of observed objects or topics [13, 14]. Using a large database of manual scene segmentations, Sec. 2 uses statistical tests to quantitatively show that PY priors closely match the true distributions of natural segment sizes, and frequencies with which different object categories are observed. Generalizing the hierarchical DP [15], Sec.
3 then describes a hierarchical Pitman-Yor (HPY) mixture model which shares "bag of features" appearance models among related scenes. Importantly, our statistical analysis shows that this approach coherently models uncertainty in the number of object categories and instances.

Figure 1: Validation of stick-breaking priors for the statistics of human segmentations of the forest (top) and insidecity (bottom) scene categories. We compare observed frequencies (black) to those predicted by Pitman-Yor process (PY, red circles) and Dirichlet process (DP, green squares) models. For each model, we also display 95% confidence intervals (dashed). (a) Example human segmentations, where each segment has a text label such as sky, tree trunk, car, or person walking. The full segmented database is available from LabelMe [19]. (b) Frequency with which different semantic text labels, sorted from most to least frequent on a log-log scale, are associated with segments. (c) Number of segments occupying varying proportions of the image area, on a log-log scale. (d) Counts of segments of size at least 5,000 pixels in 256 × 256 images of natural scenes.
Our primary technical contribution is a novel image segmentation framework based on spatially dependent HPY processes [16]. As described in Sec. 4, we use thresholded Gaussian processes to link assignments of features to regions, and thereby produce smooth, coherent segments. Simulations show that our use of continuous latent variables captures long-range dependencies neglected by MRFs, including intervening contour cues derived from image boundaries [17, 18]. Furthermore, our formulation naturally leads to an efficient variational learning algorithm, which automatically searches over segmentations of varying resolution. Sec. 5 concludes by demonstrating accurate segmentation of complex images, and discovery of appearance patterns shared across natural scenes.

2 Statistics of Natural Scene Categories

To better understand the statistical relationships underlying natural scenes, we analyze manual segmentations of Oliva and Torralba's eight categories [3]. A non-expert user partitioned each image into a variable number of polygonal segments corresponding to distinctive objects or scene elements (see Fig. 1(a)). Each segment has a semantic text label, allowing study of object co-occurrence frequencies across related scenes. There are over 29,000 segments in the collection of 2,688 images.¹

¹ See LabelMe [19]: http://labelme.csail.mit.edu/browseLabelMe/spatial_envelope_256x256_static_8outdoorcategories.html

2.1 Stick Breaking and Pitman-Yor Processes

The relative frequencies of different object categories, as well as the image areas they occupy, can be statistically modeled via distributions on potentially infinite partitions. Let π = (π₁, π₂, π₃, . . .), Σ_{k=1}^{∞} π_k = 1, denote the probability mass associated with each subset. In nonparametric Bayesian statistics, prior models for partitions are often defined via a stick-breaking construction [12]:

    π_k = w_k ∏_{ℓ=1}^{k−1} (1 − w_ℓ),    w_k ~ Beta(1 − a, b + ka)    (1)

The Pitman-Yor (PY) process [11], denoted by π ~ GEM(a, b), is defined by two hyperparameters satisfying 0 ≤ a < 1, b > −a. When a = 0, we recover a Dirichlet process (DP) with concentration parameter b. This construction induces a size-biased permutation on π, so that subsets with more mass π_k typically have smaller indexes k. When a > 0, E[w_k] decreases with k, and the resulting partition frequencies follow heavier-tailed, power law distributions.

While the sequences of beta distributions underlying PY processes lead to infinite partitions, only a random, finite subset of size K_ε = |{k | π_k > ε}| will have probability greater than any fixed significance level ε. Implicitly, these nonparametric models thus also place priors on the number of underlying classes or objects. In clustering applications, observations are often partitioned by independently sampling an assignment to subset k with probability π_k. The number of resulting clusters is then a random variable, whose mean slowly grows as more data is observed.

Table 1: Chi-square goodness of fit tests for Pitman-Yor (PY) and Dirichlet process (DP) models of text label frequencies (top) and region size distributions (bottom). Small significance levels p̂ indicate evidence against the corresponding model. For each scene category, the most accurate models (p̂ > 0.1) are highlighted.

    Model Type   coast   forest  mountain  opencoun.  highway  insidecity  street  tallbuil.
    PY Labels    0.961   0.957   0.401     0.896      0.938    0.940       0.163   0.653
    DP Labels    0.082   0.017   0.011     0.030      0.383    <0.001      0.034   0.051
    PY Regions   1.000   1.000   1.000     1.000      1.000    1.000       1.000   0.938
    DP Regions   1.000   1.000   1.000     1.000      <0.001   <0.001      1.000   0.873

2.2 Object Label Frequencies

Pitman-Yor processes have been previously used to model the well-known power law behavior of text sequences [13, 14].
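The stick-breaking construction of eq. (1) is easy to simulate directly. The following sketch (illustrative hyperparameter values and a finite truncation, not code from the paper) contrasts a PY draw with the DP special case a = 0, and reports K_ε, the number of subsets above a fixed mass threshold:

```python
import numpy as np

def sample_gem(a, b, n_sticks=1000, seed=None):
    """Draw a truncated stick-breaking sample pi ~ GEM(a, b), as in eq. (1)."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, n_sticks + 1)
    w = rng.beta(1.0 - a, b + k * a)                  # w_k ~ Beta(1 - a, b + k a)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - w)[:-1]))
    return w * remaining                              # pi_k = w_k * prod_{l<k} (1 - w_l)

pi_py = sample_gem(0.5, 5.0, seed=0)    # Pitman-Yor: heavy, power law tail
pi_dp = sample_gem(0.0, 5.0, seed=0)    # a = 0 recovers the Dirichlet process
# K_eps: number of subsets with mass above a fixed level eps (Sec. 2.1)
print((pi_py > 1e-3).sum(), (pi_dp > 1e-3).sum())
```

Because the DP's stick proportions share a single Beta(1, b) distribution, its sorted weights decay geometrically, while the PY weights decay polynomially, matching the log-log behavior in Fig. 1(b).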
Intuitively, the labels assigned to segments in the natural scene database have similar properties: some (like sky, trees, and building) occur frequently, while others (rainbow, lichen, scaffolding, obelisk, etc.) are more rare. In this section, we quantitatively verify that PY processes provide an excellent fit to these empirical object frequencies.

For a scene category with T segments and K unique text labels, let c₁ ≥ c₂ ≥ ··· ≥ c_K denote the number of occurrences of each label, sorted in descending order. To compare these counts to GEM(a, b), we also sort the sampled frequencies, producing a two-parameter Poisson-Dirichlet (PD) distributed partition π̄ = (π̄₁, π̄₂, π̄₃, . . .) satisfying π̄_k > π̄_{k+1} with probability one [11]. Pearson's chi-square statistic [20] then provides a standard test for histogram similarity:

    χ̂² = Σ_{k=1}^{K} (c_k − T π̄_k)² / (T π̄_k),    π̄_k ≜ E[π_k | a, b]    (2)

Large values of χ̂² indicate a mismatch between GEM(a, b) and the observed frequencies. For stability, we merge all labels with c_k < 3 instances into a single "tail" bin when evaluating eq. (2). For each category, we use observed label counts to find maximum likelihood (ML) hyperparameter estimates θ̂ = (â, b̂). Figure 1(b) plots observed frequencies c_k, along with quantiles of the PD distribution induced by GEM(θ̂). When â > 0, log E[π̄_k | θ̂] ≈ −â⁻¹ log(k) + c(â, b̂) for large k [11], producing power law behavior which accurately predicts observed object frequencies. In contrast, the closest fitting DP model significantly underestimates the number of rare labels.

Because PY processes define distributions on random measures, rather than fixed histograms, standard asymptotic goodness of fit tests which assume the sampling distribution of χ̂² is chi-square are inappropriate. We instead use a parametric bootstrap test [20]. Thousands of samples π⁽ⁱ⁾ ~ GEM(â, b̂) are drawn from the assumed model, and eq. (2) evaluated with pseudo-counts c_k⁽ⁱ⁾ computed by drawing T samples from π⁽ⁱ⁾.
As summarized in Table 1, a significance level p̂ is then estimated as the percentage of simulated chi-square scores which exceed χ̂². We find that the PY process provides a good fit for all categories, while there is significant evidence against the DP in most cases. By varying PY hyperparameters, we capture interesting differences among scene types: urban, man-made environments have many more unique objects than natural ones.

2.3 Segment Counts and Size Distributions

As illustrated by the human segmentations in Fig. 1(a), natural scenes contain widely varying numbers of objects at both large and small scales. When examining low-level intensity statistics, power laws naturally arise from scale invariance properties [21]. For human segmentations, related power law behavior has been observed in histograms of contour lengths [22] and segment areas [23]. Inspired by this work, we use the larger LabelMe database of semantic scene segmentations to validate PY priors for image partitions. For each image j in a given scene category, let T_j denote the number of segmented regions, and 1 ≥ a_{j1} ≥ a_{j2} ≥ ··· ≥ a_{jT_j} > 0 their proportions of the image area. Letting π̄ denote sorted samples π ~ GEM(a, b), we define a related chi-square test statistic:

    χ̂² = Σ_{j=1}^{J} Σ_{t=1}^{T_j} T_j (a_{jt} − π̄_t)² / π̄_t,    π̄_t ≜ E[π_t | a, b]    (3)

Table 1 summarizes goodness of fit scores p̂, computed as before from a parametric bootstrap. For natural environments, the DP and PY processes both provide accurate fits. However, some urban environments have many more small objects, producing power laws (see Fig. 1(c)) which are captured significantly better by PY processes. As illustrated in Fig. 1(d), in addition to modeling segment areas, PY priors directly capture uncertainty in the number of segments at various resolutions.

While power laws are often used simply as a descriptive summary of observed statistics [23], PY processes provide a consistent generative model which we use to develop effective segmentation algorithms. We do not claim that PY processes are the only valid prior for image areas; for example, log-normal distributions have similar properties, and may also provide a good model [24]. However, PY priors lead to efficient variational inference algorithms, avoiding the costly MCMC search required by other segmentation methods with region size priors [24, 25].

Figure 2: Stick-breaking representation of a hierarchical Pitman-Yor (HPY) model for groups of features. Left: Directed graphical model in which global category frequencies π ~ GEM(γ) are constructed from stick-breaking proportions w_k ~ Beta(1 − γ_a, γ_b + kγ_a), as in eq. (1). Similarly, v_{jt} ~ Beta(1 − α_a, α_b + tα_a) define image-specific region weights π_j ~ GEM(α). Features x_{ji} are sampled as in eq. (4). Upper right: Beta distributions from which stick proportions w_k are sampled for three different PY processes, GEM(0, 10), GEM(0.1, 2), and GEM(0.5, 5), with k = 1 (blue), k = 10 (red), k = 20 (green). Lower right: Corresponding distributions on thresholds for an equivalent generative model employing zero mean, unit variance Gaussians (dashed black). See Sec. 4.1.

3 A Hierarchical Model for Bags of Image Features

We now develop hierarchical Pitman-Yor (HPY) process models for visual scenes. We first describe a "bag of features" model [1, 2] capturing prior knowledge about region counts and sizes, and then extend it to model spatially coherent shapes in Sec. 4.
Our baseline bag of features model directly generalizes the stick-breaking representation of the hierarchical DP developed by Teh et al. [15], and applied to visual object recognition by Sudderth et al. [26]. N-gram language models based on HPY processes [13, 14] have somewhat different forms.

3.1 Hierarchical Pitman-Yor Processes

Each image is first divided into roughly 1,000 superpixels [24] using a variant of the normalized cuts spectral clustering algorithm [17]. We describe the texture and color of each superpixel via local histograms: a gradient-based SIFT descriptor [27], as well as a robust hue descriptor [28]. As in texton representations [17], we use K-means to build dictionaries encoding W_s prototypical texture patterns, and W_c color patterns, across a database of natural scenes. Superpixel i in image j is then represented by visual words x_{ji} = (x^s_{ji}, x^c_{ji}) indicating its texture x^s_{ji} and color x^c_{ji}.

Figure 2 contains a directed graphical model summarizing our HPY model for collections of local image features. Each of the potentially infinite set of global object categories occurs with frequency π_k, where π ~ GEM(γ_a, γ_b) as motivated in Sec. 2.2. Each category k also has an associated appearance model θ_k = (θ^s_k, θ^c_k), where θ^s_k and θ^c_k parameterize multinomial distributions on the W_s texture and W_c color words, respectively. These parameters are regularized by Dirichlet priors θ^s_k ~ Dir(λ_s), θ^c_k ~ Dir(λ_c), with hyperparameters chosen to encourage sparse distributions.

Consider a dataset containing J images of related scenes, each of which is allocated an infinite set of potential segments or regions. As in Sec. 2.3, region t occupies a random proportion π_{jt} of the area in image j, where π_j ~ GEM(α_a, α_b). Each region is also associated with a particular global object category k_{jt}.
For each superpixel i, we then independently select a region t_{ji} ~ π_j, and sample features using parameters determined by that segment's global object category:

    p(x^s_{ji}, x^c_{ji} | t_{ji}, k_j, θ) = Mult(x^s_{ji} | θ^s_{z_{ji}}) Mult(x^c_{ji} | θ^c_{z_{ji}}),    z_{ji} ≜ k_{j t_{ji}}    (4)

As in other adaptations of topic models to visual data [8, 26], we assume that different feature channels vary independently within individual object categories and segments.

3.2 Variational Learning for HPY Mixture Models

To allow efficient learning of HPY model parameters from large image databases, we have developed a mean field variational method which combines and extends previous approaches for DP mixtures [29, 30] and finite topic models [31]. Using the stick-breaking representation of Fig. 2, and a factorized variational posterior, we optimize the following lower bound on the marginal likelihood:

    log p(x | α, γ, λ) ≥ H(q) + E_q[log p(x, k, t, v, w, θ | α, γ, λ)]    (5)

    q(k, t, v, w, θ) = ∏_{k=1}^{K} q(w_k | ω_k) q(θ_k | λ_k) ∏_{j=1}^{J} ∏_{t=1}^{T} q(v_{jt} | ν_{jt}) q(k_{jt} | κ_{jt}) ∏_{i=1}^{N_j} q(t_{ji} | τ_{ji})

Here, H(q) is the entropy. We truncate the variational posterior [29] by setting q(v_{jT} = 1) = 1 for each image or group, and q(w_K = 1) = 1 for the shared global clusters. Multinomial assignments q(k_{jt} | κ_{jt}), q(t_{ji} | τ_{ji}), and beta stick proportions q(w_k | ω_k), q(v_{jt} | ν_{jt}), then have closed form update equations generalizing [29]. Dirichlet appearance updates q(θ_k | λ_k) follow standard formulas [31]. To avoid bias, we sort the current sets of image segments, and global categories, in order of decreasing aggregate assignment probability after each iteration [30]. While the bounds produced by our approach are in general looser than those of so-called collapsed variational methods [30, 32], they empirically give accurate parameter estimates when initialized by deterministic annealing [33].

The computational cost of optimizing eq. (5) is O(NT + TK); we alternate between softly assigning N observations to T segments, and T segments to K global categories. From the analysis in Sec. 2, reasonable bounds on the number of image segments satisfy T ≪ N and T ≪ K, a substantial savings over the O(NK) computational cost of collapsed methods. Moreover, our approach naturally generalizes to spatially dependent HPY processes.

4 Segmentation with Spatially Dependent Pitman-Yor Processes

We now generalize the HPY image segmentation model of Fig. 2 to capture spatial dependencies. For simplicity, we consider a single-image model in which features x_i are assigned to regions by indicator variables z_i, and each segment k has its own appearance parameters θ_k (see Fig. 3). As in Sec. 3.1, however, this model is easily extended to share appearance parameters among images.

4.1 Coupling Assignments using Thresholded Gaussian Processes

Consider a generative model which partitions data into two clusters via assignments z_i ∈ {0, 1} sampled such that P[z_i = 1] = v. One representation of this sampling process first generates a Gaussian auxiliary variable u_i ~ N(0, 1), and then chooses z_i according to the following rule:

    z_i = { 1 if u_i < Φ⁻¹(v); 0 otherwise },    Φ(u) ≜ ∫_{−∞}^{u} (2π)^{−1/2} e^{−s²/2} ds    (6)

Here, Φ(u) is the standard normal cumulative distribution function (CDF). Since Φ(u_i) is uniformly distributed on [0, 1], we immediately have P[z_i = 1] = P[u_i < Φ⁻¹(v)] = P[Φ(u_i) < v] = v.

We adapt this idea to PY processes using the stick-breaking representation of eq. (1). In particular, we note that if z_i ~ π where π_k = v_k ∏_{ℓ=1}^{k−1}(1 − v_ℓ), a simple induction argument shows that v_k = P[z_i = k | z_i ∉ {1, . . . , k − 1}]. The stick-breaking proportion v_k is thus the conditional probability of choosing cluster k, given that clusters with indexes ℓ < k have been rejected.
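The two-cluster threshold rule of eq. (6) can be checked with a quick Monte Carlo simulation; a minimal sketch, using SciPy's `norm.ppf` for Φ⁻¹:

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of eq. (6): thresholding a standard normal auxiliary
# variable at Phi^{-1}(v) yields Bernoulli(v) cluster assignments.
rng = np.random.default_rng(0)
v = 0.3
u = rng.standard_normal(200_000)        # u_i ~ N(0, 1)
z = (u < norm.ppf(v)).astype(int)       # threshold rule of eq. (6)
print(abs(z.mean() - v))                # small: empirical P[z_i = 1] is close to v
```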
Combining this insight with eq. (6), we can generate samples z_i as follows:

    z_i = min { k | u_{ki} < Φ⁻¹(v_k) },    where u_{ki} ~ N(0, 1) and u_{ki} ⊥ u_{ℓi}, ℓ ≠ k    (7)

As illustrated in Fig. 3, each cluster k is now associated with a zero mean Gaussian process (GP) u_k, and assignments are determined by the sequence of thresholds in eq. (7). If the GPs have identity covariance functions, we recover the basic HPY model of Sec. 3.1. More general covariances can be used to encode the prior probability that each feature pair occupies the same segment. Intuitively, the ordering of segments underlying this dependent PY model is analogous to layered appearance models [34], in which foreground layers occlude those that are farther from the camera.

To retain the power law prior on segment sizes justified in Sec. 2.3, we transform priors on stick proportions v_k ~ Beta(1 − a, b + ka) into corresponding random thresholds v̄_k ≜ Φ⁻¹(v_k):

    p(v̄_k | a, b) = N(v̄_k | 0, 1) · Beta(Φ(v̄_k) | 1 − a, b + ka)    (8)

Fig. 2 illustrates the threshold distributions corresponding to several different PY stick-breaking priors. As the number of features N becomes large relative to the GP covariance length-scale, the proportion assigned to segment k approaches π_k, where π ~ GEM(a, b) as desired.

Figure 3: A nonparametric Bayesian approach to image segmentation in which thresholded Gaussian processes generate spatially dependent Pitman-Yor processes. Left: Directed graphical model in which expected segment areas π ~ GEM(α) are constructed from stick-breaking proportions v_k ~ Beta(1 − α_a, α_b + kα_a). Zero mean Gaussian processes (u_{ki} ~ N(0, 1)) are cut by thresholds Φ⁻¹(v_k) to produce segment assignments z_i, and thereby features x_i. Right: Three randomly sampled image partitions (columns), where assignments (bottom, color-coded) are determined by the first of the ordered Gaussian processes u_k to cross Φ⁻¹(v_k).
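The generative process of eq. (7) and Fig. 3 can be prototyped in a few lines. This is an illustrative sketch under assumed settings (a squared exponential covariance on a small pixel grid, truncation at K layers), not the paper's implementation:

```python
import numpy as np
from scipy.stats import norm

def sample_spatial_py(H=32, W=32, a=0.3, b=2.0, K=20, length=8.0, seed=0):
    """Sample one image partition from the thresholded-GP construction of eq. (7):
    pixel i joins the first layer k whose GP value u_ki falls below Phi^{-1}(v_k)."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:H, 0:W]
    pts = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    cov = np.exp(-d2 / (2.0 * length**2))            # squared exponential, unit variance
    evals, evecs = np.linalg.eigh(cov)               # robust square root of the covariance
    root = evecs * np.sqrt(np.clip(evals, 0.0, None))
    v = rng.beta(1.0 - a, b + a * np.arange(1, K + 1))   # stick proportions v_k
    v[-1] = 1.0                                      # truncate: last layer absorbs the rest
    z = np.full(H * W, -1)
    for k in range(K):
        u_k = root @ rng.standard_normal(H * W)      # zero mean, unit variance GP layer
        z[(z < 0) & (u_k < norm.ppf(v[k]))] = k      # min-rule assignment of eq. (7)
    return z.reshape(H, W)

partition = sample_spatial_py()
```

Larger length-scales produce fewer, smoother segments; replacing the covariance with the identity recovers the independent assignments of the basic HPY model.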
4.2 Variational Learning for Dependent PY Processes

Substantial innovations are required to extend the variational method of Sec. 3.2 to the Gaussian processes underlying our dependent PY processes. Complications arise due to the threshold assignment process of eq. (7), which is "stronger" than the likelihoods typically used in probit models for GP classification [35], as well as the non-standard threshold prior of eq. (8). In the simplest case, we place factorized Gaussian variational posteriors on thresholds q(v̄_k) = N(v̄_k | ν_k, δ_k) and assignment surfaces q(u_{ki}) = N(u_{ki} | µ_{ki}, δ_{ki}), and exploit the following key identities:

    P_q[u_{ki} < v̄_k] = Φ((ν_k − µ_{ki}) / √(δ_k + δ_{ki})),    E_q[log Φ(v̄_k)] ≤ log E_q[Φ(v̄_k)] = log Φ(ν_k / √(1 + δ_k))    (9)

The first expression leads to closed form updates for Dirichlet appearance parameters q(θ_k | λ_k), while the second evaluates the beta normalization constants in eq. (8). We then jointly optimize each layer's threshold q(v̄_k) and assignment surface q(u_k), fixing all other layers, via backtracking conjugate gradient (CG) with line search.

Figure 4: Ten samples (columns) from each of four prior models for image partitions (color coded). For the MRF models, we use two edge parameters chosen to roughly match human segments. Row 1: Nearest neighbor Potts MRF with K = 10 states. Row 2: Potts MRF with potentials biased by DP samples [40]. Row 3: Softmax model in which assignment probabilities for K = 10 segments are coupled by logistically transformed GPs [36-39]. Row 4: PY process assignments coupled by thresholded GPs (as in Fig. 3).

4.3 Related Work

Recently, Duan et al. [16] proposed a generalized spatial Dirichlet process which links assignments via thresholded GPs, as in Sec. 4.1. Specializing to DP threshold priors (a = 0), the model outlined in Fig. 3 and the generalized spatial DP produce identical distributions on partitions z.
However, the application considered in [16] is a regression task where partitions are "nuisance" variables, as opposed to the segmentation tasks which motivate our PY process generalization. Unlike our HPY extension, they do not consider approaches to sharing parameters among related groups or images. Moreover, their basic Gibbs sampler takes 12 hours on a toy dataset with 2,000 observations; our variational method jointly segments 200 scenes in comparable time.

Several authors have independently proposed an alternative spatial model based on a pointwise, multinomial logistic transformation of K latent GPs [36-39]. This produces a field of smoothly varying multinomial distributions π_i, from which segment assignments are independently sampled as z_i ~ π_i. As illustrated in Fig. 4, this softmax construction produces noisy, less spatially coherent partitions. For example, when K = 2 it is equivalent to thresholding a non-smooth GP which has been corrupted by additive white noise [35]. Unlike the power law prior induced by our dependent PY process, such softmax models are biased towards partitions with K segments of similar size.

A previous nonparametric image segmentation method defined its prior as a normalized product of a DP sample π ~ GEM(0, α) and a nearest neighbor MRF with Potts potentials [40]. This construction effectively treats π as the canonical, rather than moment, parameters of the MRF, and does not produce partitions whose size distribution matches GEM(0, α). Due to the phase transition which occurs with increasing potential strength, Potts models assign low probability to realistic image partitions [41]. Empirically, the DP product construction seems to have similar issues (see Fig. 4), although it can still be effective with strongly informative likelihoods [40].

5 Results

Figure 5 shows segmentation results for several images from three of the natural scene categories considered in Sec. 2.
We compare the bag of features PY model (PY-BOF), dependent PY with distance-based squared exponential covariance (PY-Dist), and dependent PY with covariance that incorporates intervening contour cues (PY-Edge) based on the probability of boundary (Pb) detector [18]. For the results in Figs. 5 and 6, we independently segment each test image, without sharing appearance models. The PY-Edge covariance function produces a discriminatively trained model, in which pairwise assignment probabilities are calibrated by image features. Since our approach is based on GPs, such discriminative training does not require iterative parameter estimation. Instead, we use an eigendecomposition to compute a globally consistent, positive definite covariance matrix from local, calibrated estimates of the probability that feature pairs lie in the same segment.

Figure 5: Segmentation results for two images (rows) from each of the coast, mountain, and tallbuilding scene categories. From left to right, columns show LabelMe human segments, image with boundaries inferred by PY-Edge, and segments for PY-Edge, PY-Dist, PY-BOF, NCut(3), NCut(4), and NCut(6). Best viewed in color.

Figure 6: Quantitative comparison of segmentation results to human segments, using the Rand index. (a) Scatter plot of PY-Edge and NCut(4) Rand indexes for 200 mountain images. (b) Average Rand indexes for mountain images.
We plot the performance of NCut(K) versus the number of segments K, compared to the variable resolution segmentations of PY-Edge, PY-Dist, and PY-BOF. (c) Scatter plot of PY-Edge and NCut(6) Rand indexes for 200 tallbuilding images. (d) Average Rand indexes for tallbuilding images.

As a baseline, we consider the normalized cuts spectral clustering method with varying numbers of segments (NCut(K)), and a high-quality affinity function based on color, texture, and intervening contour cues [17]. Note that our PY models consistently capture variability in the number of true segments, and detect both large and small regions. In contrast, normalized cuts has an implicit bias towards equal sized segments, which produces distortions. To quantitatively evaluate segmentation results, we measure their overlap with LabelMe human segments via the Rand index [42]. As summarized in Fig. 6, PY-BOF performs reasonably for some images with unambiguous features, but for many images PY-Edge is substantially better.

We have also experimented with the full hierarchical Pitman-Yor model, in which color and texture appearance distributions are shared between images. As shown by the examples in Fig. 7, many of the inferred global visual categories align reasonably with semantic categories (e.g., sky, foliage, mountains, or buildings). However, the variational methods outlined in Sec. 3.2 appear to suffer from more local optima when jointly modeling hundreds of images. We suspect that higher-order methods will be needed to realize the full potential of our HPY model.

Figure 7: Most significant segments associated with each of four shared, global visual categories (rows) for hierarchical PY-Edge models trained with 200 images of mountain (left) or tallbuilding (right) scenes.

6 Discussion

We have developed a nonparametric framework for image segmentation which uses thresholded Gaussian processes to produce spatially coupled Pitman-Yor processes.
This approach produces empirically justified power law priors for region areas and object frequencies, allows visual appearance models to be flexibly shared among natural scenes, and leads to efficient variational inference algorithms which automatically search over segmentations of varying resolution. We believe this provides a promising starting point for discovery of shape-based visual appearance models, as well as weakly supervised nonparametric learning in other, non-visual application domains.

Acknowledgments

We thank Charless Fowlkes and David Martin for providing Pb boundary estimation and segmentation software, Antonio Torralba for helpful conversations, and Sra. Torralba for her image labeling expertise. We acknowledge research support from ONR grant N00014-06-1-0734, and DARPA under contract NBCHD030010.

References

[1] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, volume 2, pages 524-531, 2005.
[2] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In ICCV, volume 1, pages 370-377, 2005.
[3] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145-175, 2001.
[4] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification. In ICCV, 2007.
[5] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, volume 2, pages 1605-1614, 2006.
[6] S. Todorovic and N. Ahuja. Learning the taxonomy and models of categories present in arbitrary images. In ICCV, 2007.
[7] X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, volume 2, pages 695-702, 2004.
[8] J. Verbeek and B. Triggs. Region classification with Markov field aspect models. In CVPR, 2007.
[9] C. Rother, V. Kolmogorov, T. Minka, and A. Blake. Cosegmentation of image pairs by histogram matching: Incorporating a global constraint into MRFs. In CVPR, volume 1, pages 993­1000, 2006. [10] M. Andreetto, L. Zelnik-Manor, and P. Perona. Non-parametric probabilistic image segmentation. In ICCV, 2007. [11] J. Pitman and M. Yor. The two-parameter Poisson­Dirichlet distribution derived from a stable subordinator. Ann. Prob., 25(2):855­900, 1997. [12] M. I. Jordan. Dirichlet processes, Chinese restaurant processes and all that. Tutorial at NIPS, Dec. 2005. [13] S. Goldwater, T. L. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. In NIPS 18, pages 459­466. MIT Press, 2006. [14] Y. W. Teh. A hierarchical Bayesian language model based on Pitman­Yor processes. In Coling/ACL, 2006. 9 [15] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Stat. Assoc., 101(476):1566­1581, December 2006. [16] J. A. Duan, M. Guindani, and A. E. Gelfand. Generalized spatial Dirichlet process models. Biometrika, 94(4):809­825, 2007. [17] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: Combining patch-based and gradient-based approaches. In CVPR, volume 2, pages 54­61, 2003. [18] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. PAMI, 26(5):530­549, May 2004. [19] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. IJCV, 77:157­173, 2008. [20] J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, Belmont, California, 1995. [21] A. Torralba and A. Oliva. Statistics of natural image categories. Network: Comp. Neural Sys., 14:391­ 412, 2003. [22] X. Ren and J. Malik. A probabilistic multi-scale model for contour completion based on image statistics. 
In ECCV, volume 1, pages 312­327, 2002. [23] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001. [24] X. Ren and J. Malik. Learning a classification model for segmentation. In ICCV, 2003. [25] Z. Tu and S. C. Zhu. Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans. PAMI, 24(5):657­673, May 2002. [26] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Describing visual scenes using transformed objects and parts. IJCV, 77:291­330, 2008. [27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91­110, 2004. [28] J. van de Weijer and C. Schmid. Coloring local feature extraction. In ECCV, 2006. [29] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayes. Anal., 1(1):121­ 144, 2006. [30] K. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. In IJCAI 20, pages 2796­2801, 2007. [31] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993­1022, 2003. [32] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In NIPS 20, pages 1481­1488. MIT Press, 2008. [33] N. Ueda and R. Nakano. Deterministic annealing EM algorithm. Neural Net., 11:271­282, 1998. [34] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Trans. IP, 3(5):625­ 638, September 1994. [35] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. ´ [36] C. Fernandez and P. J. Green. Modelling spatially correlated data via mixtures: A Bayesian approach. J. R. Stat. Soc. B, 64(4):805­826, 2002. [37] M. A. T. Figueiredo. Bayesian image segmentation using Gaussian field priors. In CVPR Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2005. [38] M. W. Woolrich and T. E. Behrens. 
Variational Bayes inference of spatial mixture models for segmentation. IEEE Trans. MI, 25(10):1380­1391, October 2006. [39] D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, pages 113­120. ACM, 2006. [40] P. Orbanz and J. M. Buhmann. Smooth image segmentation by nonparametric Bayesian inference. In ECCV, volume 1, pages 444­457, 2006. [41] R. D. Morris, X. Descombes, and J. Zerubia. The Ising/Potts model is not well suited to segmentation tasks. In IEEE DSP Workshop, 1996. [42] R. Unnikrishnan, C. Pantofaru, and M. Hebert. Toward objective evaluation of image segmentation algorithms. IEEE Trans. PAMI, 29(6):929­944, June 2007. 10