Non-Local Contrastive Objectives

David Vickrey (dvickrey@cs.stanford.edu)
Cliff Chiung-Yu Lin (chiungyu@cs.stanford.edu)
Daphne Koller (koller@cs.stanford.edu)
Stanford University, Stanford, CA 94305-9010 USA

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Pseudo-likelihood and contrastive divergence are two well-known examples of contrastive methods. These algorithms trade off the probability of the correct label with the probabilities of other "nearby" instantiations. In this paper we explore more general types of contrastive objectives, which trade off the probability of the correct label against an arbitrary set of other instantiations. We prove that a large class of contrastive objectives are consistent with maximum likelihood, even for finite amounts of data. This result generalizes asymptotic consistency for pseudo-likelihood. The proof gives significant insight into contrastive objectives, suggesting that they enforce (soft) probability-ratio constraints between pairs of instantiations. Based on this insight, we propose Contrastive Constraint Generation (CCG), an iterative constraint-generation style algorithm that allows us to learn a log-linear model using only MAP inference. We evaluate CCG on a scene classification task, showing that it significantly outperforms pseudo-likelihood, contrastive divergence, and a well-known margin-based method.

1. Introduction

Learning Markov random fields is difficult because computation of the normalization term, generally referred to as the partition function Z, is intractable for many types of networks. Recently there has been significant interest in contrastive methods such as pseudo-likelihood (Besag, 1975) and contrastive divergence (Hinton, 2002). The main idea of these algorithms is to trade off the probability of the correct assignment for each labeled example with the probabilities of other, "nearby" assignments. This means that these algorithms do not need to compute the partition function Z. Unfortunately, these algorithms can suffer when the distribution is highly multi-modal, with multiple distant regions of high probability.

LeCun & Huang (2005), Smith & Eisner (2005), and Liang & Jordan (2008) all consider the general case of contrastive objectives, where the contrasting set can consist of arbitrary assignments. However, previous work has not pursued the idea of non-local contrastive objectives. Rather than restrict the objective to considering assignments which are close to the correct label, as in pseudo-likelihood and contrastive divergence, we allow comparison to any assignment.

We prove several results which justify the use of non-local contrastive objectives. We show that a wide class of contrastive objectives are consistent with maximum likelihood, even for finite data under certain conditions. This generalizes, and is considerably stronger than, the asymptotic consistency of pseudo-likelihood. A central idea of this result is that contrastive objectives attempt to enforce probability-ratio constraints between different assignments, based on the structure of the objective. Among other consequences, this result clearly points out cases in which pseudo-likelihood (and other local methods) may fail.

Based on this insight, we propose Contrastive Constraint Generation (CCG), a constraint-generation style algorithm that iteratively constructs a contrastive objective based only on a MAP-inference procedure. While similar in flavor to the max-margin cutting-plane algorithm suggested by Tsochantaridis et al. (2005), our method has the ability to obtain accurate probability estimates. We compare CCG to pseudo-likelihood, contrastive divergence, and the cutting-plane algorithm on a real-world machine vision problem; CCG achieves a 12% error reduction over the best of these, a statistically significant improvement.

2. Contrastive Objectives

We consider prediction problems in which we try to predict a discrete label variable (or set of variables) $\mathbf{Y}$ given a set of features $\mathbf{X}$. We are given a data set $D$ of $m$ examples. The $i$th example $d^i = (x^i, y^i)$ consists of observed features $x^i$ and a correct label $y^i$. We use $\hat{P}(\mathbf{Y}|\mathbf{X}) = |\{d^i : (x^i, y^i) = (\mathbf{X}, \mathbf{Y})\}| / \hat{Z}(\mathbf{X})$ to refer to the empirical distribution observed in our data set $D$, where $\hat{Z}(\mathbf{X}) = |\{d^i : x^i = \mathbf{X}\}|$. We are also given a set of $R$ feature functions $f_1(\mathbf{X}, \mathbf{Y}), \dots, f_R(\mathbf{X}, \mathbf{Y})$; let $f(\mathbf{X}, \mathbf{Y})$ be a vector containing all feature functions. Given a vector of $R$ weights $\theta$, our model assigns probability $P_\theta(\mathbf{Y} = y \mid \mathbf{X} = x) = e^{\theta^T f(x,y)} / Z_\theta(x)$, where $Z_\theta(x) = \sum_{y'} e^{\theta^T f(x,y')}$. The (log) likelihood objective is $LL(\theta; D) = \sum_{(x^i, y^i)} \log P_\theta(y^i | x^i)$.

The main idea of our approach is to define smaller terms over subsets of $\mathbf{Y}$.

Definition 1. Let $S_j$ be some subset of values of $\mathbf{Y}$. The (conditional) contrastive probability distribution for $S_j$ is $P_{\theta,j}(y|x) = e^{\theta^T f(x,y)} / Z_{\theta,j}(x)$, where $Z_{\theta,j}$ is the contrastive partition function $Z_{\theta,j}(x) = \sum_{y' \in S_j} e^{\theta^T f(x,y')}$.

We refer to this distribution as contrastive because it compares the (unnormalized) probabilities of the values of $\mathbf{Y}$ in $S_j$. One important property of this distribution is that it also implicitly compares normalized probabilities:
$$\frac{e^{\theta^T f(x,y)}}{Z_{\theta,j}(x)} = \frac{P_\theta(y|x)}{\sum_{y' \in S_j} P_\theta(y'|x)},$$
due to cancellation of the global partition function.

Definition 2. A contrastive sub-objective $C_j(\theta; D)$ is a weighted maximum-likelihood objective with the model distribution $P_\theta$ replaced by the contrastive distribution $P_{\theta,j}$ for some subset $S_j$:
$$C_j(\theta; D) = \sum_{(x^i, y^i) : y^i \in S_j} w_j(x^i) \left[ \theta^T f(x^i, y^i) - \log Z_{\theta,j}(x^i) \right].$$

$w_j(x)$ is a parameter of the sub-objective that determines the overall strength of the sub-objective as well as the relative importance of each value of $x$. A contrastive objective $C(\theta; D)$ is a sum of $J$ sub-objectives $C_j$, each with a different subset $S_j$ and set of weights $w_j(x)$. $C$ is tractable to compute (and optimize) if the contrastive partition functions are tractable to compute. In some cases, we can compute the contrastive partition function $Z_{\theta,j}(x)$ even if $S_j$ contains an exponential number of values, e.g., using dynamic programming for tractable sub-structures. We say that sub-objective $C_j$ is active for example $(x^i, y^i)$ if $y^i \in S_j$ and $w_j(x^i) > 0$. The number of active sub-objectives for a particular data set $D$ may be much smaller than the total number of sub-objectives.

For now, we assume that $C$ is given. We will discuss how to construct contrastive objectives in Section 5.

2.1. Related Learning Methods

The log-likelihood objective $LL(\theta; D)$ is a contrastive objective with one sub-objective $C_1$, where $S_1$ contains all values in $\mathbf{Y}$ and $w_1(x) = 1$ for all $x$.

If $\mathbf{Y}$ is an MRF (or CRF), contrastive objectives are a generalization of pseudo-likelihood (PL). Let $y_l$ be the value of node $l$ in our network, $y_{-l}$ be the value of all nodes except node $l$, and $(y_{-l}, y_l)$ be a combined instantiation to $y$ which matches $y_l$ for node $l$ and $y_{-l}$ for all other nodes. Let $dom(Y_l)$ denote the set of possible values of $Y_l$. For each $l$, for every possible instantiation $y_{-l}$, we have one sub-objective $S_{y_{-l}}$ that contains exactly the set of instantiations consistent with $y_{-l}$, i.e., $(y_{-l}, Y_l = a)$ for all $a \in dom(Y_l)$. All sub-objective weights $w_{y_{-l}}(x)$ are set to 1. Since a sub-objective $C_{y_{-l}}$ is only active for examples where $y^i \in S_{y_{-l}}$, it follows that each example participates in $n$ sub-objectives, where $n$ is the number of variables in the network. This yields the contrastive objective
$$\sum_{(x^i, y^i)} \sum_l \left[ \theta^T f(x^i, y^i) - \log \sum_{a \in dom(Y_l)} e^{\theta^T f(x^i, (y^i_{-l}, a))} \right],$$
which is the definition of pseudo-likelihood.
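To make Definitions 1 and 2 concrete, the following minimal numpy sketch (our illustration, not code from the paper) evaluates one example's term of a contrastive sub-objective and its gradient for an explicitly enumerated subset $S_j$; the toy features and weights are assumptions.

```python
import numpy as np

def sub_objective_term(theta, f_correct, f_S, w):
    """One example's contribution to C_j (Definition 2).

    f_correct: feature vector f(x, y^i) of the correct label, shape (R,)
    f_S:       feature vectors f(x, y') for every y' in S_j (which must
               contain the correct label), shape (|S_j|, R)
    w:         sub-objective weight w_j(x)
    Returns the term's value and its gradient with respect to theta.
    """
    scores = f_S @ theta                    # theta^T f(x, y') for y' in S_j
    log_Zj = np.logaddexp.reduce(scores)    # log contrastive partition function
    value = w * (theta @ f_correct - log_Zj)
    p = np.exp(scores - log_Zj)             # contrastive distribution P_{theta,j}
    grad = w * (f_correct - p @ f_S)        # moment matching restricted to S_j
    return value, grad

# Toy usage: |S_j| = 3 candidate labels, R = 2 features.
theta = np.array([0.5, -1.0])
f_S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(sub_objective_term(theta, f_S[0], f_S, w=1.0))
```

Note that the gradient has the usual moment-matching form $w_j(x)\,(f(x, y^i) - E_{P_{\theta,j}}[f(x, \mathbf{Y})])$, except that the expectation runs only over $S_j$, so no global partition function is needed.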
All of the sub-objectives in PL are local; they only involve instantiations that differ on a single node. Generalized pseudo-likelihood (GPL) can also easily be expressed in this framework. In GPL, two or more variables are allowed to vary, which can lead to large, potentially exponential sub-objectives; in some cases, dynamic programming can render inference tractable within sub-objectives. Unlike GPL, our framework allows us to vary multiple variables at a time without including all possible combinations of these variables, giving us considerably more flexibility.

Another related learning method is contrastive divergence (CD), which approximates the gradient of maximum likelihood using a non-mixed Markov chain, initialized at the label of the current example. CD is generally defined by the update rule
$$\Delta\theta_t = \sum_{(x^i, y^i)} \left[ f_t(x^i, y^i) - E_{\tilde{P}^k}[f_t(x^i, \mathbf{Y})] \right],$$
where $\tilde{P}^k$ is the distribution over $y$ obtained by initializing some MCMC procedure at $y^i$ and running it for $k$ steps. (In practice, it is intractable to compute the expectation over $\tilde{P}^k$ exactly; instead, it is estimated through sampling.) CD cannot be expressed as a contrastive objective, because CD uses $\tilde{P}^k$ to compute expectations rather than $P_\theta$. This means that the probability-ratio matching intuitions in the next section do not hold for CD. In fact, CD does not optimize any objective. This means that CD requires using stochastic gradient methods for optimization, whereas a contrastive objective can be optimized using a variety of methods (in this paper, BFGS). Furthermore, similar to PL, standard implementations of CD are local: they compare the correct label $y^i$ only to nearby values $y'$.
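As a sketch of this update (our illustration; `mcmc_k_steps` is a hypothetical sampler, not an interface from the paper), with the expectation estimated from a single sample as is standard for CD:

```python
import numpy as np

def cd_k_update(theta, f, x, y, mcmc_k_steps, k, lr=0.1):
    """One stochastic CD-k step for a single example (x, y).

    f:            feature function; f(x, y) returns an array of shape (R,)
    mcmc_k_steps: hypothetical sampler that starts a Markov chain (e.g.
                  Gibbs) at the observed label y and returns the state
                  after k steps under the current parameters theta
    """
    y_tilde = mcmc_k_steps(theta, x, y, k)  # approximate sample from P~^k
    grad_est = f(x, y) - f(x, y_tilde)      # one-sample estimate of the update
    return theta + lr * grad_est            # stochastic gradient ascent step
```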
3. Theoretical Results

The main results of this section show the finite and asymptotic consistency of contrastive objectives, under suitable conditions on the model distribution $P_\theta$ and the sub-objectives $C_j$. The proofs of these theoretical results also illustrate a key feature of contrastive objectives: if two label values $y, y'$ are connected through a series of sub-objectives, then the objective will encourage $P_\theta(y|x) / P_\theta(y'|x)$ to match $\hat{P}(y|x) / \hat{P}(y'|x)$. This will be the main motivation for the methods we propose in Section 5 for choosing non-local sub-objectives.

3.1. Finite Consistency

Let $\Theta$ be the set of all possible parameter vectors $\theta$, and let $\mathcal{P}_\Theta$ denote the set of all possible models $P_\theta$ obtainable using $\Theta$. We say that $\mathcal{P}_\Theta$ can represent a probability distribution $P'(\mathbf{Y}|\mathbf{X})$ if there exist parameters $\theta$ such that $P_\theta(\mathbf{Y}|x) = P'(\mathbf{Y}|x)$ for all $x$. Let $\Theta[P']$ denote the set of such $\theta$.

While asymptotic representability (i.e., $\mathcal{P}_\Theta$ being able to represent $P^*(\mathbf{Y}|\mathbf{X})$) is a standard concept in the analysis of learning algorithms, finite representability ($\mathcal{P}_\Theta$ being able to represent $\hat{P}(\mathbf{Y}|\mathbf{X})$) is less common. Suppose $\mathcal{P}_\Theta$ can represent $\hat{P}(\mathbf{Y}|\mathbf{X})$. In many cases, we only see each $\mathbf{X} = x$ one time in our data set $D$, which means that $\hat{P}(\mathbf{Y}|x)$ will place a point estimate on the correct label $y^i$ for each $x^i$. Thus, finite representability can be a fairly strong condition on our model class $\mathcal{P}_\Theta$. However, if we expand the set of allowed weight vectors to include a certain type of infinite-length weight vector (defined by a direction in weight space), it is possible to show that representability is actually a weaker condition than separability, a commonly-used condition in the analysis of learning algorithms. Even without infinite-length weight vectors, if $D$ is separable then we can obtain an arbitrarily close approximation to $\hat{P}$ using finite-length weight vectors. We will assume exact representability for the remainder of this section; replacing it with "near-exact" representability would only slightly weaken the results (specifically, it would add an arbitrarily small approximation factor).

Let $\hat{P}_j$ be the contrastive observed data distribution relative to $S_j$:
$$\hat{P}_j(y|x) = \frac{|\{(x^i, y^i) = (x, y)\}|}{\hat{Z}_j(x)} = \frac{\hat{P}(x, y)}{\sum_{y' \in S_j} \hat{P}(x, y')},$$
where $\hat{Z}_j(x) = |\{(x^i, y^i) : x^i = x, y^i \in S_j\}|$.

Lemma 1. Suppose that $\mathcal{P}_\Theta$ can represent $\hat{P}(y|x)$. Then for any contrastive objective $C(\theta; D)$:
i. If $\theta \in \Theta[\hat{P}]$, then $C(\theta; D)$ has a global optimum at $\theta$.
ii. If $\theta^*$ optimizes $C(\theta; D)$, then for any $j, x$ such that $\hat{P}(x)\, w_j(x) \left(\sum_{y \in S_j} \hat{P}(y|x)\right) > 0$, we have $P_{\theta^*,j}(y|x) = \hat{P}_j(y|x)$.

We omit proofs of this and subsequent results for lack of space.

Corollary 1. Suppose that $\mathcal{P}_\Theta$ can represent $\hat{P}(y|x)$ and $\theta^*$ optimizes $C(\theta; D)$. Also suppose $y_1, y_2 \in S_j$, $\hat{P}(x) > 0$, $\hat{P}(y_1|x) > 0$, and $w_j(x) > 0$. Then
$$\frac{P_{\theta^*}(y_2|x)}{P_{\theta^*}(y_1|x)} = \frac{\hat{P}(y_2|x)}{\hat{P}(y_1|x)}.$$
Thus, probability ratios according to $P_{\theta^*}$ match those according to $\hat{P}$ within a sub-objective set.

Definition 3. We are given a set of sub-objectives $C_j$ with weights $w_j(x)$. For a fixed feature value $x$, we say that there is a path from $y_1$ to $y_2$ relative to a probability distribution $P'(\mathbf{Y}|x)$ if there is a sequence $S_{j_b}$, $b = 1, \dots, k$, such that:
i. $y_1 \in S_{j_1}$ and $y_2 \in S_{j_k}$;
ii. for every pair $S_{j_b}, S_{j_{b+1}}$, there exists $z_b \in dom(\mathbf{Y})$ such that $z_b \in S_{j_b}$, $z_b \in S_{j_{b+1}}$, and $P'(z_b|x) > 0$;
iii. $w_{j_b}(x) > 0$ for all $j_b$.

Intuitively, this definition means that it is possible to "walk" from $y_1$ to $y_2$: if you are currently at value $y$, you are allowed to move to any other value $y'$ if $y$ and $y'$ are both contained in some sub-objective set $S_j$ with positive weight $w_j(x)$. If $P'(y'|x) = 0$, the walk must stop; otherwise it can continue.

Lemma 2. Suppose that $\mathcal{P}_\Theta$ can represent $\hat{P}(y|x)$ and $\theta^*$ optimizes $C(\theta; D)$. Also, suppose that $\hat{P}(x) > 0$, $\hat{P}(y_1|x) > 0$, and there is a path from $y_1$ to $y_2$ relative to $\hat{P}(\mathbf{Y}|x)$. Then $\frac{P_{\theta^*}(y_2|x)}{P_{\theta^*}(y_1|x)} = \frac{\hat{P}(y_2|x)}{\hat{P}(y_1|x)}$.

This result follows from applying Corollary 1 to each sub-objective along the path from $y_1$ to $y_2$. We now have that the probability ratios according to $P_{\theta^*}(y|x)$ match those according to $\hat{P}(y|x)$ for any pair of values $y_1, y_2$ connected by a path (relative to $\hat{P}(\mathbf{Y}|x)$).

Definition 4. For fixed $x$, a set of sub-objectives $S_j$ with weights $w_j(x)$ spans $\mathbf{Y}$ relative to a probability distribution $P'(\mathbf{Y}|x)$ if for every pair of values $y_1, y_2$ there is a path from $y_1$ to $y_2$ relative to $P'(\mathbf{Y}|x)$. Note that this condition requires the total size of our sub-objectives to be at least the cardinality of $\mathbf{Y}$.
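The path and span conditions can be checked mechanically. Below is a small self-contained sketch (our own construction, not from the paper) that tests Definition 4 by running union-find over the active sub-objective sets, merging two sets whenever they share a value with positive probability (condition ii):

```python
from itertools import combinations

def spans(sub_objectives, weights, support, all_values):
    """Check whether the sub-objectives span Y relative to P'(.|x).

    sub_objectives: list of sets of label values (the S_j)
    weights:        list of w_j(x); only S_j with w_j(x) > 0 are active
    support:        set of values z with P'(z|x) > 0 (legal stepping stones)
    all_values:     all values of Y, which must be pairwise connected
    """
    active = [S for S, w in zip(sub_objectives, weights) if w > 0]
    parent = list(range(len(active)))
    def find(i):                              # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in combinations(range(len(active)), 2):
        if active[a] & active[b] & support:   # shared positive-probability value
            parent[find(a)] = find(b)
    # y1 and y2 are connected iff one component contains sets covering both.
    comp = {y: {find(i) for i, S in enumerate(active) if y in S}
            for y in all_values}
    return all(comp[y] for y in all_values) and \
           all(comp[y1] & comp[y2] for y1, y2 in combinations(all_values, 2))

# Toy usage: Y = {0,1,2,3}; the two sets overlap on value 1, so they span Y.
print(spans([{0, 1}, {1, 2, 3}], [1.0, 1.0], {0, 1, 2, 3}, [0, 1, 2, 3]))
```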
Let $\Theta^*$ be the set of $\theta$ that optimize $C(\theta; D)$.

Theorem 1. Suppose that $\mathcal{P}_\Theta$ can represent $\hat{P}(y|x)$. Furthermore, suppose that for every $x^i$ (i.e., every value of $\mathbf{X}$ observed in $D$), our sub-objectives $S_j$ with weights $w_j(x^i)$ span $\mathbf{Y}$ relative to $\hat{P}(\mathbf{Y}|x^i)$. Then $\Theta^* = \Theta[\hat{P}]$. That is, $\theta$ optimizes $C(\theta; D)$ if and only if $P_\theta(\mathbf{Y}|x^i) = \hat{P}(\mathbf{Y}|x^i)$ for all $x^i$.

This result follows directly from Lemma 2 and the definition of span. The optima of the log-likelihood objective are exactly $\Theta[\hat{P}]$ (provided $\mathcal{P}_\Theta$ can represent $\hat{P}(y|x)$). Thus, the optima of any contrastive objective fulfilling the conditions of the previous theorem are exactly the same as those of the log-likelihood objective.

3.2. Asymptotic Consistency

Asymptotic consistency also holds for certain contrastive objectives. If the true data distribution $P^*$ is in our model class (with parameters $\Theta[P^*]$), then in the limit of infinite data, our objective will recover the correct parameters, provided that it spans $\mathbf{Y}$. Asymptotic representability is a considerably weaker condition than finite representability, because the model does not need to capture noise in the data.

Let $\{d^1, d^2, \dots\}$ be an infinite sequence of examples drawn i.i.d. from $P^*(\mathbf{X}, \mathbf{Y})$. We refer to the data set composed of the first $n$ of these as $D_n$, and to its empirical distribution as $\hat{P}(y|x; D_n)$. By the strong law of large numbers, $P(\lim_{n \to \infty} \hat{P}(y|x; D_n) = P^*(y|x)) = 1$; in shorthand, $\hat{P}$ converges to $P^*$ almost surely: $\hat{P}(y|x; D_n) \xrightarrow{a.s.} P^*(y|x)$ as $n \to \infty$. Let $\theta(n)$ be a sequence of weight vectors $\{\theta^1, \theta^2, \dots\}$ such that for all $n$, $\theta^n$ optimizes $C(\theta; D_n)$.

Theorem 2. Suppose $\mathcal{P}_\Theta$ can represent $P^*(y|x)$. Furthermore, suppose that for every $x$ such that $P^*(x) > 0$, our sub-objectives $S_j$ with weights $w_j(x)$ span $\mathbf{Y}$ relative to $P^*(\mathbf{Y}|x)$. Then for any $\theta(n)$ and any $x$ such that $P^*(x) > 0$, $P_{\theta^n}(\mathbf{Y}|x) \xrightarrow{a.s.} P^*(\mathbf{Y}|x)$ as $n \to \infty$.

The proof of this theorem follows the same lines as finite consistency, with the additional use of the law of large numbers. All of the intermediate lemmas hold in the infinite-data case; in particular, values $y, y'$ that are connected through a series of sub-objectives have calibrated probability ratios. Note that this theorem does not depend on the previous one: asymptotic consistency can hold even if finite consistency does not.

From this result we can derive consistency of pseudo-likelihood for strictly positive data distributions $P^*$ (i.e., $P^*(y|x) > 0$ for all $y, x$), simply by noting that in the limit of infinite data, the set of active PL sub-objectives will span $\mathbf{Y}$. It is also easy to see why PL is usually not consistent with finite data (it is unlikely to span the space), and why it may not be consistent for non-positive data distributions (again, because it may not span the space).

The most important practical implication from this section is that a contrastive objective attempts to calibrate probabilities within connected components of sub-objectives, but cannot calibrate probabilities between disconnected components. This has important implications for the performance of PL (and other local contrastive objectives), as we will see in Section 6.

4. Analyzing the Weights

So far we have only considered whether $w_j(x) > 0$. To better understand the effect of the weights on the objective, we write the weights as a combination of three terms,
$$w_j(x) = w(x) \sum_y P(C_j \mid y, x)\, Q(y|x),$$
each chosen by the designer of the contrastive objective. $w(x)$ allows the designer to reweight the relative importance of terms in the objective corresponding to different values of $x$.
This could be useful if we believe that the empirical distribution $\hat{P}(x)$ observed in our data does not match the true (or desired) distribution over $x$. $P(C_j|y, x)$ allows the designer to choose the relative importance of different sub-objectives for a particular value of the label variable $y$ (and also relative to a feature value $x$); $P(C_j|y, x)$ is constrained to be a probability distribution such that $P(C_j|y, x) > 0$ only when $y \in S_j$. $Q(y|x)$ is an auxiliary probability distribution that allows the designer to choose how important each $y$ is to the objective. Since $P$ and $Q$ are probability distributions, the sum of all weights $w_j(x)$ for a given $x$ is $w(x)$; thus, $P$ and $Q$ do not affect the relative influence of different values of $x$. This decomposition of the weights $w_j(x)$ is over-parametrized; it is not hard to show that we can write any choice of $w_j(x)$ in this form.

We now state a strong relationship between a particular contrastive objective and $LL(\theta; D)$:

Lemma 3. Let $C$ contain a sub-objective $S_{jk}$ for every pair of instantiations $y_j, y_k$ (including singleton sub-objectives where $y_j = y_k$). Let $w(x) = |\mathbf{Y}|$ for all $x$, let $P(C_{jk}|y, x) = \frac{1}{|\mathbf{Y}|}$ if $y \in S_{jk}$ and 0 otherwise (i.e., $P$ is uniform over sub-objectives containing $y$), and let $Q(y|x) = P_{\theta_0}(y|x)$ for some fixed parameter vector $\theta_0$. Then $\frac{dC}{d\theta}\big|_{\theta_0} = \frac{dLL}{d\theta}\big|_{\theta_0}$.

Thus, at any point $\theta$ in weight space, we can construct a contrastive objective tangent to log-likelihood at $\theta$. As a result, we can optimize $LL$ using an EM-like algorithm. We initialize with weight vector $\theta^0$. During the $i$th Expectation step, we update $Q(y|x) = P_{\theta^{i-1}}(y|x)$. During the $i$th Maximization step, we fix $Q$ and use $C$ to compute the gradient of log-likelihood at $\theta^{i-1}$. We then take a step in the direction of the gradient, obtaining a new weight vector $\theta^i$. This algorithm has some similarities to an algorithm proposed by Hoefling & Tibshirani (2009) for optimizing log-likelihood using a series of pseudo-likelihoods. We omit further discussion for lack of space.

5. Selecting Sub-objectives

In practice we are not going to be able to span $\mathbf{Y}$ with our sub-objectives. In this section, we propose several techniques for constructing tractable objectives and examine some of their properties.

5.1. Basic Methods

One simple way to construct sub-objectives is to use expert knowledge to determine useful values to compare. In pseudo-likelihood, for example, each sub-objective corresponds to instantiations which differ on a (particular) single variable. In generalized PL, sub-objectives contain all instantiations which differ on a particular subset of variables.

As a concrete example of a sub-objective which is not possible using (generalized) PL, we consider a binary chain MRF ($x$ is empty) with 10 nodes and two parameters: a single bias term $\theta_0$ specifying the relative weight of 0 vs. 1, and a single affinity term $\theta_1$ specifying how likely two neighboring nodes are to have the same value. The log-score of an instantiation is $\theta_0 |\{i : y_i = 1\}| + \theta_1 |\{i : y_i = y_{i+1}\}|$. For large $\theta_1$, the instantiations 0000000000 and 1111111111 have much higher probability than any other instantiation; we expect PL to have trouble fitting $\theta_0$ in this case, since it does not directly compare the probabilities of these two instantiations. However, we can augment the PL objective with an additional sub-objective containing exactly these two values; we refer to this objective as Simple Contrastive (see the sketch below).
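A minimal sketch (ours, with toy parameter values) of the chain log-score and the added two-element sub-objective:

```python
import numpy as np

def chain_features(y):
    """f(y) for the binary chain: (#ones, #neighboring agreements)."""
    y = np.asarray(y)
    return np.array([y.sum(), (y[:-1] == y[1:]).sum()], dtype=float)

def simple_contrastive_term(theta, y_obs, n=10):
    """Log contrastive probability of y_obs within S = {all-zeros, all-ones}.

    The sub-objective is active only when the observed label is one of
    the two dominant modes; otherwise it contributes nothing.
    """
    S = [np.zeros(n, dtype=int), np.ones(n, dtype=int)]
    if not any(np.array_equal(y_obs, y) for y in S):
        return 0.0                                   # sub-objective inactive
    scores = np.array([theta @ chain_features(y) for y in S])
    log_Z = np.logaddexp.reduce(scores)              # two-element partition fn
    return theta @ chain_features(np.asarray(y_obs)) - log_Z

theta = np.array([0.139, 2.0])    # bias theta_0, affinity theta_1
print(simple_contrastive_term(theta, np.ones(10, dtype=int)))
```

Within this sub-objective, the log-odds of all-ones versus all-zeros is exactly $10\theta_0$ (both instantiations have 9 neighboring agreements, so the $\theta_1$ terms cancel), which is why this single added term pins down $\theta_0$ even when $\theta_1$ is large.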
Figure 1 (plot omitted) shows the error in the estimate of $\theta_0$ (y-axis) as we vary the $\theta_1$ used to generate the data (x-axis): the median learned parameter value over 100 synthetic data sets, each with 1000 instances, with error bars indicating 25th and 75th percentile estimates; the correct value is $\theta_0 = .139$. Simple Contrastive accurately reconstructs $\theta_0$ for all values of $\theta_1$, while PL does not. Contrastive objectives constructed in this way can be quite powerful, but are somewhat difficult to design. We omit further discussion for lack of space.

A different approach is to use the data $D$ to guide the selection of sub-objectives. For unconditioned Markov networks, a simple approach is to construct sub-objectives which compare different observed values of the label variables $y^i$. We employed this strategy for the MRF described above, augmenting PL with a single sub-objective containing all values observed in $D$. This objective is referred to as Contrastive. As shown in Figure 1, Contrastive is also effective at recovering $\theta_0$, although for low values of $\theta_1$ the estimate is slightly inaccurate.

Simple Contrastive used a static method to choose sub-objectives: the weights $w_j(x)$ do not depend on the examples observed in $D$. Contrastive used a dynamic method.

Lemma 4. Suppose the $w_j(x)$ do not depend on $D$. Then $E_{P^*}\left[\frac{dC(\theta; D)}{d\theta}\big|_{\theta_0}\right] = \frac{dC(\theta; D')}{d\theta}\big|_{\theta_0}$ for all $\theta_0$, where $D'$ is a data set (of possibly infinite size) such that $\hat{P}(y|x; D') = P^*(y|x)$.

This lemma shows that for a static method, we get an unbiased estimate of the gradient at any point $\theta_0$. Suppose that $\mathcal{P}_\Theta$ can represent $P^*(y|x)$. In this case, we can apply this lemma at $\theta_0 = \theta^* \in \Theta[P^*]$ to get that the expectation over $D$ of the gradient at $\theta^*$ is 0. Loosely speaking, this means that the learned parameters for different $D$ sampled from $P^*$ will be centered around $\theta^*$. Dynamic methods have no such guarantee. This bias in the gradient is precisely the reason why Contrastive in Figure 1 gives an inaccurate estimate of $\theta_0$ for small values of $\theta_1$. However, since dynamic contrastive objectives are more flexible than static ones, this bias may often be acceptable in practice.

5.2. Contrastive Constraint Generation

For many problems, the basic approaches presented above are not sufficiently powerful. We propose a (dynamic) method for constructing contrastive objectives called Contrastive Constraint Generation (CCG). In CCG, we begin by building an initial contrastive objective $C^0$ containing relatively few sub-objectives. During iteration $t$, we first optimize $C^{t-1}$ to obtain a new weight vector $\theta^t$. Next, for each example $d^i$, we find one or more "interesting" instantiations based on the current model $P_{\theta^t}(y|x)$. Finally, we construct a new contrastive objective $C^t$ that incorporates these new instantiations into $C^{t-1}$. We repeat this process until convergence (or until we decide to stop). We now describe the details of each of these steps.
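To fix ideas, here is a skeleton of the CCG loop (a sketch under our own assumptions: `build_objective`, `find_interesting`, and `init_sub_objectives` are hypothetical callables standing in for the components described next, and scipy's BFGS stands in for the paper's optimizer):

```python
import numpy as np
from scipy.optimize import minimize

def ccg(data, build_objective, find_interesting, init_sub_objectives,
        theta0, max_iters=100):
    """Skeleton of Contrastive Constraint Generation.

    data:              list of examples (x_i, y_i), labels stored as tuples
    build_objective:   returns a function theta -> (negated C, gradient)
                       for the current sub-objective sets
    find_interesting:  e.g. approximate MAP inference (ICM, max-product BP)
                       or a few rounds of Gibbs sampling
    """
    # S[i] is the sub-objective set for example d_i; it always contains y_i.
    S = init_sub_objectives(data)
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iters):
        # Optimize C^{t-1} (BFGS in the paper).
        neg_C = build_objective(data, S)
        theta = minimize(neg_C, theta, jac=True, method='BFGS').x
        # Find one "interesting" instantiation per example under P_theta.
        proposals = [find_interesting(theta, x_i) for (x_i, y_i) in data]
        if all(y_new in S[i] for i, y_new in enumerate(proposals)):
            break                        # convergence: nothing new was found
        for i, y_new in enumerate(proposals):
            S[i].add(y_new)              # grow C^t; sets ignore duplicates
    return theta
```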
Initialization. We consider two simple initializations: empty (no sub-objectives), and adding all sub-objectives from Pseudo-Likelihood.

Optimization. This step is straightforward: we simply optimize $C^{t-1}$ using a method such as BFGS.

Finding Interesting Instantiations. We consider two general methods for finding new instantiations. For simplicity, we assume that only one new instantiation is generated per round per example, referred to as $y^i_t$. The first general method is to use a maximum a-posteriori (MAP) inference algorithm to find the highest-probability $y$ according to $P_{\theta^t}(y|x^i)$. In practice, the MAP algorithm will be approximate, i.e., $y$ will not be guaranteed to actually maximize $P_{\theta^t}(y|x^i)$. We considered two such methods in this paper. The first, iterated conditional modes (ICM), proposed by Besag (1986), is a simple greedy ascent algorithm: at each round, a region is chosen at random, and the label of this region is changed to the value that gives the highest score; this is repeated until a local maximum is reached. The second is max-product belief propagation (MP) (Pearl, 1988); we use the variant known as residual belief propagation, proposed by Elidan et al. (2006). ICM and MP are both randomly initialized to encourage finding different local optima from iteration to iteration. We also tried a third inference method based on dual decomposition, proposed by Komodakis et al. (2007), but it obtained results similar to MP while being significantly slower; we do not present results for dual decomposition in this paper.

The second general method uses a sampling algorithm such as Gibbs sampling to generate one or more instantiations. Contrastive divergence takes this second approach, with the sampling algorithm initialized at $y^i$ and run for only a few steps. If we use this approximate sampling procedure, we end up with an algorithm that has many similarities to CD. The main differences are that CCG uses $P_\theta$ to score the instantiations while CD uses $\tilde{P}^k$, and that CD can only use stochastic gradient methods for optimization.

Building a New Objective. There are many possible ways to construct a new contrastive objective incorporating the new instantiations; we consider one simple option. For each example $d^i$, we construct a sub-objective $C_{d^i}$ such that, at iteration $t$, $S_{d^i}$ contains the correct label $y^i$ as well as $y^i_{t'}$ for all $t' \le t$ (since $S_{d^i}$ is a set, duplicate values are ignored). All weights $w_j(x)$ are set to 1.

Convergence. Convergence is reached when, for all examples, $y^i_t$ has already been seen.

6. Experimental Results

In this section, we apply CCG to a real-world machine vision problem. We use the street scenes image data set described by Gould et al. (2009), consisting of 715 images. Every pixel in each image is labeled with one of 8 classes. To reduce the computational burden and to have access to more coherent features, we took as input the regions predicted by Gould et al. (2009). This limits the maximum pixel-wise accuracy: the best possible labeling of regions for this data obtains a pixel-wise error of 12.0% (Lower Bound in Table 1). Our model is a CRF using intra-region (single node) and inter-region (pairwise) features, also taken from Gould et al. (2009). We tested the following learning algorithms:

Independent (I). Only the singleton potentials are used during training. This is equivalent to logistic regression with individual regions as training examples.

Pseudo-Likelihood (PL). See Section 2.1.

Contrastive Divergence (CD). Each iteration, we generate a single sample for each data example $d^i$ and use it to compute a stochastic approximation to the gradient. We used Gibbs sampling to generate the samples, following standard practice for CD (see, for example, Bengio (2009)). We tested three variants, CD-1, CD-10, and CD-100, which generate samples using 1, 10, and 100 rounds of Gibbs sampling, respectively; in one round of Gibbs sampling, each node is resampled once, in random order. We ran each variant for 10,000 seconds, corresponding to 50k, 16.5k, and 2k iterations, respectively.
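For concreteness, a minimal sketch of one such Gibbs round (our illustration; `cond_dist` is a hypothetical callable supplying the model's per-node conditional distributions):

```python
import numpy as np

def gibbs_round(y, cond_dist, rng):
    """One round of Gibbs sampling: resample each node once, in random order.

    y:         current instantiation, integer array of node labels
    cond_dist: cond_dist(l, y) returns P(Y_l = . | y_{-l}, x) under the
               current model, as a probability vector over labels
    """
    y = y.copy()
    for l in rng.permutation(len(y)):       # visit nodes in random order
        p = cond_dist(l, y)                 # conditional distribution of node l
        y[l] = rng.choice(len(p), p=p)      # resample node l
    return y

# CD-n / Gibbs-n draws its sample by applying n such rounds,
# starting from the observed label y^i of the current example.
```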
Max-Margin Cutting Planes (MM). This is a constraint-generation algorithm proposed by Tsochantaridis et al. (2005). It uses a margin-based objective, which tries to find a weight vector $\theta$ such that for every $d^i$, the score $\theta^T f(x^i, y^i)$ of the correct label $y^i$ is larger than $\theta^T f(x^i, y) + \Delta(y, y^i)$ for any other $y$, where $\Delta(y, y^i)$ is a loss function which measures how much $y$ and $y^i$ differ. For these experiments, we used pixel-wise error as our loss function. At each step, the cutting-plane algorithm finds the most violated constraint, which corresponds to finding, for each $d^i$, the $y$ which maximizes $\theta^T f(x^i, y) + \Delta(y, y^i)$; it then adds a new constraint based on these values. To find the most violated constraint, we tried (appropriately modified versions of) both ICM and MP, which we refer to as ICM-MV and MP-MV. For our reported results, we initialized with an empty constraint set; initializing with constraints corresponding to PL instantiations did not improve performance. This method has a hyper-parameter $C$, which we chose to maximize performance on a small subset of the test set. It usually converged in about 150-170 iterations.

Contrastive Constraint Generation (CCG). This is our method as described in Section 5.2. For initialization, we tried both empty initialization and using the PL sub-objectives. We tried seven total ways of generating instantiations: the same sampling procedures as the CD variants (Gibbs-1, Gibbs-10, and Gibbs-100), the approximate MAP procedures ICM and MP, and the most-violated-constraint procedures ICM-MV and MP-MV. We refer to, for example, CCG with MP instantiations and empty initialization as CCG(MP); with PL initialization, CCG(MP+PL). The number of iterations required to reach convergence varied with the initialization and instantiation method, from about 20 iterations for CCG(ICM) to 85 for CCG(MP+PL); the Gibbs variants did not reach convergence within 100 iterations (we stopped at this point).

To evaluate the learned weights, we needed a maximum a-posteriori (MAP) inference algorithm to produce the most likely labeling at test time. We found that ICM consistently outperformed MP as a test-time inference algorithm, so we only report results using ICM for test-time inference. Results were generated using 10-fold cross-validation on the 715 images, reporting pixel-wise error. Standard deviations are computed from the individual results for each fold.

Table 1. Pixel-wise test error (ICM used for test-time inference).

    Learning Method      Test Error   Std Dev
    Lower Bound          .120         .005
    Independent          .225         .014
    PL                   .461         .044
    CD-1                 .225         .016
    CD-10                .219         .015
    CD-100               .225         .014
    MM(ICM-MV)           .217         .009
    MM(MP-MV)            .218         .007
    CCG(Gibbs-1+PL)      .225         .016
    CCG(Gibbs-10+PL)     .218         .015
    CCG(Gibbs-100+PL)    .217         .015
    CCG(ICM+PL)          .200         .013
    CCG(MP+PL)           .198         .015
    CCG(ICM-MV+PL)       .192         .011
    CCG(MP-MV+PL)        .190         .013

Figure 2 (plot omitted). Test error vs. running time (in seconds) for CCG(ICM+PL), CD-1, CD-10, and CD-100.

Table 1 shows all tested algorithms except for CCG with empty initialization. Based on the computed standard deviations, a difference in error of about .01 is statistically significant according to an unpaired t-test. There are two important things to note in this table. First, methods using non-local instantiations clearly outperform methods using the more local instantiations generated by Gibbs sampling (even when run for 100 rounds).
The best non-local method, CCG(MP-MV+PL), decreases absolute error over the best local method, CCG(Gibbs-100+PL), by 2.7%, a 12% relative reduction in error. Second, CCG significantly outperforms the other non-local method, MM; CCG(MP-MV+PL) reduces absolute error over MM(ICM-MV) by 2.7% (a 12% relative error reduction).

CCG is the only algorithm to improve substantially over Independent. PL more than doubles the pixel-wise error rate. This is because the labels of neighboring regions are highly correlated; PL relies heavily on this correlation during training, but at test time the neighbors are no longer given. The strong locality of PL is a significant disadvantage for this problem.

Initializing CCG using PL resulted in small but noticeable gains when using ICM at test time (absolute differences ranged from .004 to .007). It also significantly reduced the number of iterations required to reach convergence. When using MP as the test-time inference method, we got very bad results with empty initialization (differences ranged from .145 to .295).

Each algorithm performs a different amount of work at each iteration, ranging from CD (least) to CCG (most). Figure 2 shows test error vs. running time for the CD variants as well as for CCG(ICM+PL). The numbers of iterations pictured are 50k, 16.5k, 2k, and 6 for CD-1, CD-10, CD-100, and CCG(ICM+PL), respectively. CD-1 has converged, CD-10 probably has, while CD-100 has not. Despite the very small number of iterations for CCG, it is already significantly outperforming CD at this point. This shows that the non-local instantiations generated by ICM are much more informative than the instantiations generated by Gibbs sampling. In fact, the difference between local and non-local methods is even greater than this graph suggests: after six iterations, CCG(Gibbs-1+PL), CCG(Gibbs-10+PL), and CCG(Gibbs-100+PL) have error rates .234, .227, and .229, vs. .205 for CCG(ICM+PL), all with similar running times. CD-n is much faster than CCG(Gibbs-n+PL), with comparable results at convergence. The main reason for this is that batch optimization is much slower than stochastic optimization, at least initially. In future work, we plan to implement an SGD version of CCG and compare it to CD.
7. Discussion and Related Work

LeCun & Huang (2005) and Smith & Eisner (2005) present frameworks for learning energy functions by comparing the scores of sets of instantiations. The latter framework, called contrastive estimation, has the same functional form as one of our sub-objectives. However, Smith & Eisner (2005) primarily focus on unsupervised learning tasks, while this work is mostly aimed at supervised learning. More importantly, these two works focus on local contrastive objectives, whereas we propose non-local contrastive objectives, which address weaknesses in methods such as PL and CD.

Hyvärinen (2007) proposes an objective for learning binary Markov networks by trying to match ratios of probabilities between the model and the observed data. This objective minimizes squared loss instead of log loss; the advantage of log loss is that contrastive objectives are then a direct generalization of both log-likelihood and PL. Additionally, Hyvärinen (2007) only proposes matching local probability ratios.

Recently, Gutmann & Hyvärinen (2010) proposed a method based on learning probability ratios between the data distribution and some hand-constructed noise distribution. Like our method, it does not require computation of the global partition function. Unlike our method, it looks at probability ratios between different distributions, while our method looks at probability ratios between different instantiations.

Liang & Jordan (2008) provide an asymptotic analysis of contrastive objectives. They show that under certain conditions, the more different assignments are covered by the objective, the more efficient it is as an estimator. They apply this result to compare the efficiency of pseudo-likelihood to that of maximum likelihood. Their results suggest that increasing the number of assignments covered by the contrastive objective leads to improved learning efficiency.

Max-margin-based methods, such as those proposed by Taskar et al. (2003), also do not need to compute the global partition function. As discussed above, the cutting-plane algorithm proposed by Tsochantaridis et al. (2005) is similar in spirit to CCG. Like max-margin methods, CCG can learn using only MAP inference. The primary advantage of contrastive objectives over margin-based methods is that they can calibrate probabilities between instantiations.

Hinton et al. (2004) and Tieleman (2008) improve CD by adding non-local contrastive terms. Like CD, these methods do not correspond to an objective. Our analysis gives a theoretical grounding to non-local contrastive learning, including a well-defined objective.

References

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, pp. 1-127, 2009.

Besag, J. Statistical analysis of non-lattice data. The Statistician, pp. 179-195, 1975.

Besag, J. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, pp. 259-302, 1986.

Elidan, G., McGraw, I., and Koller, D. Residual belief propagation: Informed scheduling for asynchronous message passing. Uncertainty in AI, 2006.

Gould, S., Gao, T., and Koller, D. Region-based segmentation and object detection. Neural Information Processing Systems, 2009.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Artificial Intelligence and Statistics, pp. 297-304, 2010.

Hinton, G. Training products of experts by minimizing contrastive divergence. Neural Computation, pp. 1771-1800, 2002.

Hinton, G. E., Welling, M., and Mnih, A. Wormholes improve contrastive divergence. Neural Information Processing Systems, 2004.

Hoefling, H. and Tibshirani, R. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, pp. 883-906, 2009.

Hyvärinen, A. Some extensions of score matching. Computational Statistics & Data Analysis, pp. 2499-2512, 2007.

Komodakis, N., Paragios, N., and Tziritas, G. MRF optimization via dual decomposition: Message-passing revisited. Computer Vision & Pattern Recognition, 2007.

LeCun, Y. and Huang, F. J. Loss functions for discriminative training of energy-based models. Artificial Intelligence and Statistics, 2005.

Liang, P. and Jordan, M. I. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. International Conference on Machine Learning, 2008.

Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, 1988.

Smith, N. and Eisner, J. Contrastive estimation: Training log-linear models on unlabeled data. Association for Computational Linguistics, pp. 354-362, 2005.

Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. Neural Information Processing Systems, 2003.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. International Conference on Machine Learning, pp. 1064-1071, 2008.

Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, pp. 1435-1484, 2005.