Structured Learning with Approximate Inference

Alex Kulesza and Fernando Pereira
Department of Computer and Information Science, University of Pennsylvania
{kulesza, pereira}@cis.upenn.edu

Abstract

In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufficient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that learning performance under approximate inference cannot be guaranteed without understanding which combinations of inference and learning are appropriately compatible.

1 Introduction

Structured prediction models commonly involve complex inference problems for which finding exact solutions is intractable [1]. There are two ways to address this difficulty. Most directly, models used in practice can be restricted to those for which inference is feasible, such as conditional random fields on trees [2] or associative Markov networks with binary labels [3]. More generally, however, efficient but approximate inference procedures have been devised that apply to a wide range of models, including loopy belief propagation [4, 5], tree-reweighted message passing [6], and linear programming relaxations [7, 3], all of which give efficient approximate predictions for graphical models of arbitrary structure. Since some form of inference is the dominant subroutine for all structured learning algorithms, it is natural to see good approximate inference techniques as solutions to the problem of tractable learning as well. A number of authors have taken this approach, using inference approximations as drop-in replacements during training, often with empirical success [3, 8]. And yet there has been little theoretical analysis of the relationship between approximate inference and reliable learning.

We demonstrate with two counterexamples that the characteristics of approximate inference algorithms relevant for learning can be distinct from those, such as approximation guarantees, that make them appropriate for prediction. First, we show that approximations can reduce the expressivity of a model, making previously simple concepts impossible to implement and hence to learn, even though inference meets an approximation guarantee. Second, we show that standard learning algorithms can be led astray by inexact inference, failing to find valid model parameters. It is therefore crucial to choose compatible inference and learning procedures. With these considerations in mind, we prove that LP-relaxation-based approximate inference procedures are compatible with the structured perceptron [9] as well as with empirical risk minimization under a margin criterion using the PAC-Bayes framework [10, 11]. (This work is based on research supported by NSF ITR IIS 0428193.)
2 Setting

Given a scoring model $S_\theta(y|x)$ over candidate labelings y for input x, exact Viterbi inference is the computation of the optimal labeling

$h_\theta(x) = \arg\max_y S_\theta(y|x)$. (1)

In a prediction setting, the goal of approximate inference is to compute efficiently a prediction with the highest possible score. However, in learning, a tight relationship between the scoring model and true utility cannot be assumed; after all, learning seeks to find such a relationship. Instead, we assume a fixed loss function L(y|x) that measures the true cost of predicting y given x, a distribution D over inputs x, and a parameterized scoring model $S_\theta(y|x)$ with associated optimal labeling function $h_\theta$ and inference algorithm $A_\theta$. Exact inference implies $A_\theta = h_\theta$. Learning seeks the risk minimizer

$\theta^* = \arg\min_\theta E_{x \sim D}\left[L(A_\theta(x)\,|\,x)\right]$. (2)

Successful learning, then, requires two things: the existence of $\theta^*$ for which risk is suitably low, and the ability to find such $\theta^*$ efficiently. In this work we consider the impact of approximate inference on both criteria. We model our examples as pairwise Markov random fields (MRFs) defined over a graph G = (V, E) with probabilistic scoring model

$P(y|x) \propto \prod_{i \in V} \phi_i(y_i|x) \prod_{(i,j) \in E} \phi_{ij}(y_i, y_j|x)$, (3)

where $\phi_i(y_i|x)$ and $\phi_{ij}(y_i, y_j|x)$ are positive potentials. For learning, we use log-linear potentials $\phi_i(y_i|x) = \exp(w \cdot f(x, y_i))$, assuming a feature function f(·) and a parameter vector w. Since MRFs are probabilistic, we also refer to Viterbi inference as maximum a posteriori (MAP) inference.

3 Algorithmic separability

The existence of suitable model parameters is captured by the standard notion of separability.

Definition 1. A distribution D (which can be empirical) is separable with respect to a model $S_\theta(y|x)$ and loss L(y|x) if there exists $\theta$ such that $E_{x \sim D}[L(h_\theta(x)\,|\,x)] = 0$. (Separability can be weakened to allow nonzero risk, but for simplicity we focus on the strict case.)

However, approximate inference may not be able to match exactly the separating hypothesis $h_\theta$. We need a notion of separability that takes into account the (approximate) inference algorithm.

Definition 2. A distribution D is algorithmically separable with respect to a parameterized inference algorithm $A_\theta$ and loss L(y|x) if there exists $\theta$ such that $E_{x \sim D}[L(A_\theta(x)\,|\,x)] = 0$.

While separability characterizes data distributions with respect to models, algorithmic separability characterizes data distributions with respect to inference algorithms. Note that algorithmic separability is more general than standard separability for any decidable model, since we can design an (inefficient) algorithm $A_\theta(x) = h_\theta(x)$. (Note further that algorithmic separability supports inference algorithms that are not based on any abstract model at all; such algorithms can describe arbitrary "black box" functions from parameters to predictions. It seems unlikely, however, that such algorithms are of much use, since their parameters cannot be easily learned.) However, we show by counterexample that even algorithms with provable approximation guarantees can make separable problems algorithmically inseparable.
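To make these objects concrete, here is a minimal brute-force implementation of exact inference $h_w(x)$ for a pairwise MRF with log-linear potentials. This is a sketch only: the input x is folded into the feature callbacks, the helper names and the toy example are ours, and enumeration is exponential in general, which is exactly what approximate inference is meant to avoid.

```python
import itertools
import numpy as np

def exact_map(labels, edges, w, node_feat, edge_feat):
    """Brute-force Viterbi/MAP inference: enumerate every labeling y and return
    the argmax of the total log-score sum_i w.f(y_i) + sum_(i,j) w.g(y_i, y_j)."""
    best, best_score = None, -np.inf
    for y in itertools.product(*labels):
        score = sum(w @ node_feat(i, yi) for i, yi in enumerate(y))     # log node potentials
        score += sum(w @ edge_feat(i, j, y[i], y[j]) for i, j in edges) # log edge potentials
        if score > best_score:
            best, best_score = y, score
    return best

# Toy usage: a 3-node chain with binary labels and an agreement-rewarding edge feature.
labels = [(0, 1)] * 3
edges = [(0, 1), (1, 2)]
node_feat = lambda i, yi: np.array([1.0 if yi == 1 else 0.0, 0.0])
edge_feat = lambda i, j, yi, yj: np.array([0.0, 1.0 if yi == yj else 0.0])
print(exact_map(labels, edges, np.array([0.5, 1.0]), node_feat, edge_feat))  # -> (1, 1, 1)
```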
3.1 LP-relaxed inference

Consider the simple Markov random field pictured in Figure 1, a triangle in which each node has as its set of allowed labels a different pair of the three possible labels A, B, and C. Let the node potentials $\phi_i(y_i)$ be fixed to 1 so that labeling preferences derive only from the edge potentials. For positive constants $\alpha_{ij}$, define edge potentials $\psi_{ij}(y_i, y_j) = \exp(\alpha_{ij})$ whenever $y_i = y_j$ and $\psi_{ij}(y_i, y_j) = 1$ otherwise.

[Figure 1: A simple MRF. Each node is annotated with its allowed labels.]

Then the joint probability of a configuration $y = (y_1, y_2, y_3)$ is given by

$P(y) \propto \prod_{i,j:\, y_i = y_j} \exp(\alpha_{ij}) = \exp\Big(\sum_{i,j} I(y_i = y_j)\,\alpha_{ij}\Big)$ (4)

and the MAP labeling is $\arg\max_y \sum_{i,j} I(y_i = y_j)\,\alpha_{ij}$. Note that this example is associative; that is, neighboring nodes are encouraged to take identical labels ($\alpha_{ij} > 0$). We can therefore perform approximate inference using a linear programming (LP) relaxation and get a multiplicative approximation guarantee [3]. We begin by writing an integer program for computing the MAP labeling; below, $\mu_i(y_i)$ indicates node i taking label $y_i$ (which ranges over the two allowed labels for node i) and $\mu_{ij}(y_i, y_j)$ indicates nodes i and j taking labels $y_i$ and $y_j$, respectively.

$\max_\mu\;\; \alpha_{12}\,\mu_{12}(B,B) + \alpha_{23}\,\mu_{23}(C,C) + \alpha_{31}\,\mu_{31}(A,A)$
s.t. $\sum_{y_i} \mu_i(y_i) = 1 \quad \forall i$
$\mu_{ij}(y_i, y_j) \le \mu_i(y_i) \quad \forall ij,\, y_i,\, y_j$
$\mu \in \{0,1\}^{\dim(\mu)}$

Integer programming is NP-hard, so we use an LP relaxation, replacing the integrality constraint with $\mu \ge 0$. Letting $(i^*, j^*) = \arg\max_{ij} \alpha_{ij}$, it is easy to see that the correct MAP configuration assigns matching labels to nodes $i^*$ and $j^*$ and an arbitrary label to the third. The score of this configuration is $\alpha_{i^*j^*}$. However, the LP relaxation may generate fractional solutions. In particular, whenever $(\alpha_{12} + \alpha_{23} + \alpha_{31})/2 > \alpha_{i^*j^*}$, the configuration that assigns to every node both of its allowed labels in equal proportion, $\mu \equiv 1/2$, is optimal. The fractional labeling $\mu \equiv 1/2$ is the most uninformative possible; it suggests that all labelings are equally valid. Even so, $(\alpha_{12} + \alpha_{23} + \alpha_{31})/2 \le 3\alpha_{i^*j^*}/2$ by the definition of $(i^*, j^*)$, so LP-relaxed inference for this MRF has a relatively good approximation ratio of 3/2.

3.2 Learning with LP-relaxed inference

Suppose now that we wish to learn to predict labelings y from instances of the MRF in Figure 1 with positive features given by $x = (x_{12}, x_{23}, x_{31})$. We parameterize the model using a positive weight vector $w = (w_{12}, w_{23}, w_{31})$, letting $\alpha_{ij} = w_{ij} x_{ij}$. Suppose the data distribution gives equal probability to the inputs x = (4, 3, 3), (3, 4, 3), and (3, 3, 4), and that the loss function is defined as follows. Given x, let $(i^*, j^*) = \arg\max_{ij} x_{ij}$. Then assigning matching labels to nodes $i^*$ and $j^*$ and an arbitrary label to the third node yields a 0-loss configuration. All other configurations have positive loss. It is clear, first of all, that this problem is separable: if w = (1, 1, 1), then $\alpha_{ij} = x_{ij}$ and the solution to the integer program above coincides with the labeling rule. Furthermore, there is a margin: any weight vector in a neighborhood of (1, 1, 1) assigns the highest probability to the correct labeling.

Using LP-relaxed inference, however, the problem is impossible to learn. In order to correctly label the instance x = (4, 3, 3) we must have, at a minimum, $\alpha_{12} > \alpha_{23}, \alpha_{31}$ (equivalently $4w_{12} > 3w_{23}, 3w_{31}$), since the 0-loss labeling must have a higher objective score than any other labeling. Reasoning similarly for the remaining instances, any separating weight vector must satisfy $4w_{ij} > 3w_{kl}$ for each ordered pair of edges (ij, kl). Without loss of generality, assume an instance to be labeled has feature vector x = (4, 3, 3). Then

$\frac{1}{2}(\alpha_{12} + \alpha_{23} + \alpha_{31}) = \frac{1}{2}(4w_{12} + 3w_{23} + 3w_{31}) > \frac{1}{2}\left(4w_{12} + 3\cdot\tfrac{3}{4}w_{12} + 3\cdot\tfrac{3}{4}w_{12}\right) > 4w_{12} = \alpha_{12}.$

As a result, LP-relaxed inference predicts $\mu \equiv 1/2$. The data cannot be correctly labeled using an LP relaxation with any choice of weight vector, and the example is therefore algorithmically inseparable.
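This failure is easy to reproduce numerically. The sketch below solves the relaxed LP for the instance x = (4, 3, 3) at the intended separator w = (1, 1, 1), keeping the node indicators plus only the edge indicators with nonzero objective weight; the variable ordering is our own convention.

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: mu1(A), mu1(B), mu2(B), mu2(C), mu3(C), mu3(A),
#                 mu12(B,B), mu23(C,C), mu31(A,A)
alpha = np.array([4.0, 3.0, 3.0])   # alpha_ij = w_ij * x_ij with w = (1,1,1), x = (4,3,3)

c = np.zeros(9)
c[6:] = -alpha                      # linprog minimizes, so negate the maximization objective

# Relaxed constraints mu_ij(y,y) <= mu_i(y) and mu_ij(y,y) <= mu_j(y)
A_ub, b_ub = [], []
for edge_var, endpoints in [(6, (1, 2)), (7, (3, 4)), (8, (5, 0))]:
    for node_var in endpoints:
        row = np.zeros(9)
        row[edge_var], row[node_var] = 1.0, -1.0
        A_ub.append(row)
        b_ub.append(0.0)

# Each node's two label indicators sum to one
A_eq = np.zeros((3, 9))
A_eq[0, 0:2] = A_eq[1, 2:4] = A_eq[2, 4:6] = 1.0
b_eq = np.ones(3)

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x)      # every entry is 0.5: the uninformative fractional labeling mu = 1/2
print(-res.fun)   # 5.0, beating the best integral score of 4
```

Any positive weight vector satisfying the separation constraints produces the same fractional optimum, matching the argument above.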
4 Insufficiency of algorithmic separability

We cannot expect to learn without algorithmic separability; no amount of training can hope to be successful when there simply do not exist acceptable model parameters. Nevertheless, in that case we could draw upon the usual techniques for dealing with (geometric) inseparability. Approximate inference introduces another complication, however. Learning techniques exploit assumptions about the underlying model to search parameter space; the perceptron, for example, assumes that increasing the weights of features present in correct labelings but not in incorrect labelings will lead to better predictions. While this is formally true with respect to an underlying linear model, inexact inference methods can disturb and even invert such assumptions.

4.1 Loopy inference

Loopy belief propagation (LBP) is a common approximate inference procedure in which max-product message passing, known to be exact for trees, is applied to arbitrary, cyclic graphical models [5]. While LBP is, of course, inexact, its behavior can be even more problematic for learning. Because LBP does not respond to model parameters in the usual way, its predictions can lead a learner away from appropriate parameters even for algorithmically separable problems.

Consider the simple MRF shown in Figure 2 and discussed previously in [6]. All nodes are binary and take labels from the set {-1, 1}. Suppose that node potentials are assigned by type, where each node is of type A or B as indicated and $\alpha$ and $\beta$ are real-valued parameters:

$\phi_A(-1) = 1, \quad \phi_A(1) = e^\alpha; \qquad \phi_B(-1) = 1, \quad \phi_B(1) = e^\beta.$

[Figure 2: An MRF on which LBP is inexact.]

Also let edge potentials $\psi_{ij}(y_i, y_j)$ be equal to the constant $e^\gamma$ when $y_i = y_j$ and 1 otherwise. Define $\gamma$ to be sufficiently positive that the MAP configuration is either (-1, -1, -1, -1) or (1, 1, 1, 1), abbreviated -1 and 1, respectively. In particular, the solution is -1 when $\alpha + \beta < 0$ and 1 otherwise. With slight abuse of notation we can write $y_{MAP} = \mathrm{sign}(\alpha + \beta)$.

We now investigate the behavior of LBP on this example. In general, max-product LBP on pairwise MRFs requires iterating the following rule to update the messages $m_{ij}(y_j)$ from node i to node j, where $y_j$ ranges over the possible labels for node j and N(i) is the neighbor set of node i:

$m_{ij}(y_j) = \max_{y_i}\; \psi_{ij}(y_i, y_j)\,\phi_i(y_i) \prod_{k \in N(i)\setminus\{j\}} m_{ki}(y_i)$. (5)

Since we take $\gamma$ to be suitably positive in our example, we can eliminate the max, letting $y_i = y_j$, and then divide through to remove the edge potentials $\psi_{ij}(y_j, y_j) = e^\gamma$. When messages are initialized uniformly to 1 and passed in parallel, symmetry also implies that messages are completely determined by the types of the relevant nodes. The updates are then as follows:

$m_{AB}(-1) = m_{BA}(-1) \qquad\qquad m_{AB}(1) = e^\alpha\, m_{BA}(1)$
$m_{BA}(-1) = m_{AB}(-1)\, m_{BB}(-1) \qquad m_{BA}(1) = e^\beta\, m_{AB}(1)\, m_{BB}(1)$
$m_{BB}(-1) = m_{AB}^2(-1) \qquad\qquad m_{BB}(1) = e^\beta\, m_{AB}^2(1)$

Note that the messages $m_{ij}(-1)$ remain fixed at 1 after any number of updates. The messages $m_{AB}(1)$, $m_{BA}(1)$, and $m_{BB}(1)$ always take the form $\exp(p\alpha + q\beta)$ for appropriate values of p and q, and it is easy to show by iterating the updates that, for all three messages, p and q go to $\infty$ while the ratio q/p converges to a constant $c \approx 1.089339$. The label-1 messages, therefore, approach 0 when $\alpha + c\beta < 0$ and $\infty$ when $\alpha + c\beta > 0$. Note that after message normalization ($m_{ij}(-1) + m_{ij}(1) = 1$ for all ij) the algorithm converges in either case.
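The claimed ratio can be checked directly by iterating the updates. The sketch below tracks each log-message $\log m(1) = p\alpha + q\beta$ as a coefficient pair (p, q), which the updates above transform linearly; the coefficient representation is our own device, but the updates are exactly those just derived.

```python
import numpy as np

# Each log-message log m(1) = p*alpha + q*beta is stored as the pair (p, q).
a = np.zeros(2)                    # coefficients for m_AB(1)
b = np.zeros(2)                    # coefficients for m_BA(1)
c = np.zeros(2)                    # coefficients for m_BB(1)
ea, eb = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # stand-ins for alpha, beta

for _ in range(200):               # parallel updates from Section 4.1
    a, b, c = ea + b, eb + a + c, eb + 2 * a

print(a[1] / a[0])                 # -> 1.089339...; b and c give the same ratio

alpha, beta = 1.0, -0.95
print(np.sign(alpha + beta))                    # +1: the true MAP labeling
print(np.sign(alpha + (a[1] / a[0]) * beta))    # -1: the labeling LBP converges to
```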
[Figure 3: A two-instance training set: (a) y = -1, (b) y = 1. Within each instance, nodes of the same shading share a feature vector, as annotated. Below each instance is its correct labeling.]

Beliefs are computed from the converged messages as $b_i(y_i) \propto \phi_i(y_i) \prod_{j \in N(i)} m_{ji}(y_i)$, so we can express the prediction of LBP as $y_{LBP} = \mathrm{sign}(\alpha + c\beta)$. Intuitively, then, LBP gives a slight preference to the B-type nodes because of their shared edge. If $\alpha$ and $\beta$ are both positive or both negative, or if they differ in sign but $|\alpha| > c|\beta|$ or $|\beta| > |\alpha|$, LBP finds the correct MAP solution. However, when the strength of the A nodes only slightly exceeds that of the B nodes ($c|\beta| > |\alpha| > |\beta|$), the preference exerted by LBP is significant enough to flip the labels. For example, if $\alpha = 1$ and $\beta = -0.95$, the true MAP configuration is 1 but LBP converges to -1.

4.2 Learning with LBP

Suppose now that we wish to use the perceptron algorithm with LBP inference to learn the two-instance data set shown in Figure 3. For each instance the unshaded nodes are annotated with a feature vector $x = (x_1, x_2)$ and the shaded nodes are annotated with a feature vector $x' = (x'_1, x'_2)$. We wish to learn weights $w = (w_1, w_2)$, modeling node potentials as before with $\alpha = w \cdot x$ and $\beta = w \cdot x'$. Assume that edge potentials remain fixed using a suitably positive $\gamma$.

By the previous analysis, the data are algorithmically separated by w = (1, -1): on instance (a), $\alpha = 1$, $\beta = -0.95$, and LBP correctly predicts -1. Instance (b) is symmetric. Note that although the predicted configurations are not the true MAP labelings, they correctly match the training labels. The weight vector (1, -1) is therefore an ideal choice in the context of learning. The problem is also separated in the usual sense, by the weight vector (-1, 1).

Since we can think of the MAP decision problem as computing $\mathrm{sign}(\alpha + \beta) = \mathrm{sign}(w \cdot (x + x'))$, we can apply the perceptron algorithm with update $w \leftarrow w - \hat{y}(x + x')$, where $\hat{y}$ is the sign of the proposed labeling. The standard perceptron mistake bound guarantees that, with exact inference, separable problems require only a finite number of iterations to find a separating weight vector. Here, however, LBP causes the perceptron to diverge even though the problem is not only separable but also algorithmically separable. Figure 4 shows the path of the weight vector as it progresses from the origin over the first 20 iterations of the algorithm.

[Figure 4: Perceptron learning path.]

During each pass through the data the weight vector is updated twice: once after mislabeling instance (a) ($w \leftarrow w - (1, 0.95)$), and again after mislabeling instance (b) ($w \leftarrow w + (0.95, 1)$). The net effect is $w \leftarrow w + (-0.05, 0.05)$. The weight vector continually moves in the opposite direction of w = (1, -1), and learning diverges.

4.3 Discussion

To understand why perceptron learning fails with LBP, it is instructive to visualize the feasible regions of weight space. Exact inference correctly labels instance (a) whenever $w_1 + 0.95 w_2 < 0$ and, similarly, instance (b) requires a weight vector with $0.95 w_1 + w_2 > 0$. Weights that satisfy both constraints are feasible, as depicted in Figure 5(a). For LBP, the preference given to nodes 2 and 3 is effectively a scaling of $x'$ by $c \approx 1.089339$, so a feasible weight vector must satisfy $w_1 + 0.95c\, w_2 < 0$ and $0.95c\, w_1 + w_2 > 0$. Since $0.95c > 1$, these constraints define a completely different feasible region of weight space, shown in Figure 5(b).

[Figure 5: The feasible regions of weight space for (a) exact inference and (b) LBP. Each numbered gray halfspace indicates the region in which the corresponding instance is correctly labeled; their intersection is the feasible region, colored black.]
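The divergence is easy to reproduce. The sketch below simulates the perceptron of Section 4.2 against the LBP decision rule $\mathrm{sign}(w \cdot x + c\, w \cdot x')$. Figure 3 is not reproduced here, so the feature vectors are assumptions, chosen to be consistent with $\alpha = 1$, $\beta = -0.95$ at w = (1, -1) and with the updates quoted above.

```python
import numpy as np

C = 1.089339   # LBP's effective scaling of the shaded nodes' contribution (Section 4.1)

# (x, x', true label) for instances (a) and (b); assumed values consistent with the text.
data = [(np.array([1.0, 0.0]), np.array([0.0, 0.95]), -1),
        (np.array([0.0, 1.0]), np.array([0.95, 0.0]), +1)]

def lbp_sign(w, x, xp):
    # LBP's prediction: sign(alpha + C*beta) with alpha = w.x, beta = w.x'
    return 1 if w @ x + C * (w @ xp) > 0 else -1

w = np.zeros(2)
for epoch in range(20):
    for x, xp, y in data:
        y_hat = lbp_sign(w, x, xp)
        if y_hat != y:
            w = w - y_hat * (x + xp)   # perceptron update from Section 4.2
    print(epoch, w)                     # settles into a net drift of (-0.05, +0.05) per pass
```

The printed trajectory matches Figure 4: each pass adds a net (-0.05, 0.05), moving w directly away from the algorithmically separating vector (1, -1).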
It is clear from the figures why the perceptron does not succeed; it assumes that pushing weights into the feasible region of Figure 5(a) will produce correct labelings, while under LBP the exact opposite is required. Algorithmic separability, then, is necessary for learning but may not be sufficient. This does not imply that no algorithm can learn using LBP; a grid search on weight space, for example, will be slow but successful. Instead, care must be taken to ensure that learning and inference are appropriately matched. In particular, it is generally invalid to assume that an arbitrary choice of approximate inference will lead to useful results when the learning method expects exact feedback.

5 Learning bounds for approximate inference

In contrast to the failure of LBP in Section 4, appropriate pairs of inference and learning algorithms do exist. We give two bounds using LP-relaxed inference for MRFs with log-linear potentials. First, under the assumption of algorithmic separability, we show that the structured perceptron of Collins [9] makes only a finite number of mistakes. Second, we show using the PAC-Bayesian framework [11] that choosing model parameters to minimize a margin-based empirical risk function (assuming "soft" algorithmic separability) gives rise to a bound on the true risk. In both cases, the proofs are directly adapted from known results using the following characterization of LP relaxation.

Claim 1. Let $z = (z_1, \ldots, z_k)$ be the vector of 0/1 optimization variables for an integer program P. Let $Z \subseteq \{0,1\}^{\dim(z)}$ be the feasible set of P. Then replacing the integrality constraints in P with box constraints $0 \le z_i \le 1$ yields an LP whose feasible polytope has a vertex set $Z'$ with $Z \subseteq Z'$.

Proof. Each $z \in Z$ is integral and thus a vertex of the polytope defined by the box constraints alone. The remaining constraints appear in P and by definition do not exclude any element of Z. The addition of constraints cannot eliminate a vertex without rendering it infeasible. Thus, $Z \subseteq Z'$.

We can encode the MAP inference problem for MRFs as an integer program over indicators z with objective $w \cdot \Phi(x, z)$ for some $\Phi$ linear in z (see, for example, [6]). By Claim 1 and the fact that an optimal vertex always exists, LP-relaxed inference given an input x computes

$\mathrm{LP}_w(x) = \arg\max_{z \in Z'(x)} w \cdot \Phi(x, z)$. (6)

We can think of this as exact inference over an expanded set of labelings $Z'(x)$, some of which may not be valid (i.e., $z \in Z'(x)$ may be fractional). To simplify notation, we will assume that labelings y are always translated into their corresponding indicator vectors z.
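For concreteness, the algorithm analyzed in Section 5.1 below is the standard structured perceptron with $\mathrm{LP}_w$ plugged in as the inference subroutine. A schematic sketch follows; the function names are ours, and `lp_infer` stands for any solver computing equation (6).

```python
import numpy as np

def structured_perceptron(data, phi, lp_infer, epochs=10):
    """Structured perceptron (Collins [9]) with LP-relaxed inference.

    data:     list of (x, z) pairs, z the 0/1 indicator encoding of the labeling
    phi:      feature map Phi(x, z), linear in z
    lp_infer: returns argmax_{z in Z'(x)} of w . Phi(x, z), possibly fractional
    """
    w = np.zeros_like(phi(*data[0]))
    for _ in range(epochs):
        for x, z in data:
            z_hat = lp_infer(w, x)
            if not np.allclose(z_hat, z):          # mistake (z_hat may be fractional)
                w = w + phi(x, z) - phi(x, z_hat)  # standard perceptron update
    return w
```

Theorem 1 below bounds the number of updates this loop can make when the data are algorithmically separable with a margin.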
5.1 Perceptron

Theorem 1 (adapted from Theorem 1 in [9]). Given a sequence of input/labeling pairs $\{(x_i, z_i)\}$, suppose that there exists a weight vector $w^*$ with unit norm and a margin $\delta > 0$ such that, for all i, $w^* \cdot (\Phi(x_i, z_i) - \Phi(x_i, z)) \ge \delta$ for all $z \in Z'(x_i) \setminus \{z_i\}$. (The instances are algorithmically separable with margin $\delta$.) Suppose that there also exists R such that $\|\Phi(x_i, z_i) - \Phi(x_i, z)\| \le R$ for all $z \in Z'(x_i)$. Then the structured perceptron makes at most $R^2/\delta^2$ mistakes.

Proof sketch. Let $w^k$ be the weight vector before the k-th mistake; $w^1 = 0$. Following the proof of Collins without modification, we can show that $\|w^{k+1}\| \ge k\delta$. We now bound $\|w^{k+1}\|$ in the other direction. If $(x_k, z_k)$ is the instance on which the k-th update occurs and $z_{LP(k)} = \mathrm{LP}_{w^k}(x_k)$, then by the update rule,

$\|w^{k+1}\|^2 = \|w^k\|^2 + 2\, w^k \cdot \big(\Phi(x_k, z_k) - \Phi(x_k, z_{LP(k)})\big) + \|\Phi(x_k, z_k) - \Phi(x_k, z_{LP(k)})\|^2 \le \|w^k\|^2 + R^2.$ (7)

The inequality follows from the fact that LP-relaxed inference maximizes $w^k \cdot \Phi(x_k, z)$ over all $z \in Z'(x_k)$, so the middle term is nonpositive. Hence, by induction, $\|w^{k+1}\|^2 \le kR^2$. Combining the two bounds, $k^2\delta^2 \le \|w^{k+1}\|^2 \le kR^2$, hence $k \le R^2/\delta^2$.

5.2 PAC-Bayes

The perceptron bound applies when data are perfectly algorithmically separable, but we might also hope to use LP-relaxed inference in the presence of noisy or otherwise almost-separable data. The following theorem adapts an empirical risk minimization bound using the PAC-Bayes framework to show that LP-relaxed inference can also be used to learn successfully in these cases. The measure of empirical risk for a weight vector w over a sample $S = (x_1, \ldots, x_m)$ is defined as

$\hat{R}(w, S) = \frac{1}{m} \sum_{i=1}^{m} \max_{z \in H_w(x_i)} L(z|x_i),$ (8)
$H_w(x) = \big\{ z \in Z'(x) \;\big|\; w \cdot \big(\Phi(x, \mathrm{LP}_w(x)) - \Phi(x, z)\big) \le |\mathrm{LP}_w(x) - z|_1 \big\}.$

Intuitively, $\hat{R}$ accounts for the maximum loss of any z that is closer in score than in 1-norm to the LP prediction. Such z are considered "confusable" at test time. The PAC-Bayesian setting requires that, after training, weight vectors be drawn from some distribution Q(w); however, a deterministic version of the bound can also be proved.

Theorem 2 (adapted from Theorem 3 in [11]). Suppose that the loss function L(y|x) is bounded between 0 and 1 and can be expanded to L(z|x) for all $z \in Z'(x)$; that is, loss can be defined for every potential value of $\mathrm{LP}_w(x)$. Let $\ell = \dim(z)$ be the number of indicator variables in the LP, and let R bound the 2-norm of the feature vector for a single clique. Let Q(w) be a symmetric Gaussian centered at w as defined in [11]. Then with probability at least $1 - \delta$ over the choice of a sample S of size m from distribution D over inputs x, the following holds for all w:

$E_{x \sim D,\, w' \sim Q(w)}\big[L(\mathrm{LP}_{w'}(x)\,|\,x)\big] \;\le\; \hat{R}(w, S) + \sqrt{\frac{R^2 \|w\|^2 \ln(2\ell m) + \ln(\frac{m}{\delta})}{2(m-1)}} + \frac{R^2 \|w\|^2}{m}.$ (9)

The proof in [11] can be directly adapted; the only significant changes are the use of $Z'$ in place of the set Y of possible labelings and reasoning as above using the definition of LP-relaxed inference.

6 Related work

A number of authors have applied inference approximations to a wide range of learning problems, sometimes with theoretical analysis of approximation quality and often with good empirical results [8, 12, 3]. However, none to our knowledge has investigated the theoretical relationship between approximation and learning performance. Daumé and Marcu [13] developed a method for using a linear model to make decisions during a search-based approximate inference process. They showed that perceptron updates give rise to a mistake bound under the assumption that parameters leading to correct decisions exist. Such results are analogous to those presented in Section 5 in that performance bounds follow from an (implicit) assumption of algorithmic separability. Wainwright [14] proved that when approximate inference is required at test time due to computational constraints, using an inconsistent (approximate) estimator for learning can be beneficial. His result suggests that optimal performance is obtained when the methods used for training and testing are appropriately aligned, even if those methods are not independently optimal. In contrast, we consider learning algorithms that use identical inference for both training and testing and that minimize a general measure of empirical risk rather than maximizing data likelihood, and we argue for compatibility between the learning method and the inference process. Roth et al.
[15] consider learning independent classifiers for single labels, essentially using a trivial form of approximate inference. They show that this method can outperform learning with exact inference when algorithmic separability holds, precisely because approximation reduces expressivity; i.e., less complex models require fewer samples to train accurately. When the data are not algorithmically separable, exact inference provides better performance if a large enough sample is available. It is interesting to note that both of our counterexamples involve strong edge potentials. These are precisely the kinds of examples that are difficult to learn using independent classifiers.

7 Conclusion

Effective use of approximate inference for learning depends on two considerations that are irrelevant for prediction. First, the expressivity of approximate inference, and consequently the bias for learning, can vary significantly from that of exact inference. Second, learning algorithms can misinterpret feedback received from approximate inference methods, leading to poor results or even divergence. However, when algorithmic separability holds, the use of LP-relaxed inference with standard learning frameworks yields provably good results. Future work includes the investigation of alternate inference methods that, while potentially less suitable for prediction alone, give better feedback for learning. Conversely, learning methods that are tailored specifically to particular inference algorithms might show improved performance over those that assume exact inference. Finally, the notion of algorithmic separability and the ways in which it might relate (through approximation) to traditional separability deserve further study.

References

[1] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks (research note). Artif. Intell., 42(2-3):393-405, 1990.
[2] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282-289, 2001.
[3] Ben Taskar, Vassil Chatalbashev, and Daphne Koller. Learning associative Markov networks. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, page 102, 2004.
[4] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
[5] Kevin Murphy, Yair Weiss, and Michael Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 467-475, 1999.
[6] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697-3717, 2005.
[7] D. Roth and W. Yih. A linear programming formulation for global inference in natural language tasks. In Proc. of the Conference on Computational Natural Language Learning (CoNLL), pages 1-8, 2004.
[8] Charles Sutton and Andrew McCallum. Collective segmentation and labeling of distant entities in information extraction. Technical Report TR # 04-49, University of Massachusetts, 2004.
[9] Michael Collins. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms.
In EMNLP '02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 1-8, 2002.
[10] David A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5-21, 2003.
[11] David McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, to appear.
[12] Charles Sutton and Andrew McCallum. Piecewise training of undirected models. In 21st Conference on Uncertainty in Artificial Intelligence, 2005.
[13] Hal Daumé III and Daniel Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. In International Conference on Machine Learning (ICML), 2005.
[14] Martin J. Wainwright. Estimating the "wrong" graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research, 7:1829-1859, 2006.
[15] V. Punyakanok, D. Roth, W. Yih, and D. Zimak. Learning and inference over constrained output. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1124-1129, 2005.