Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction

Shay B. Cohen and Noah A. Smith
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
{scohen,nasmith}@cs.cmu.edu

Abstract

We present a family of priors over probabilistic grammar weights, called the shared logistic normal distribution. This family extends the partitioned logistic normal distribution, enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, providing a new way to encode prior knowledge about an unknown grammar. We describe a variational EM algorithm for learning a probabilistic grammar based on this family of priors. We then experiment with unsupervised dependency grammar induction and show significant improvements using our model for both monolingual learning and bilingual learning with a non-parallel, multilingual corpus.

1 Introduction

Probabilistic grammars have become an important tool in natural language processing. They are most commonly used for parsing and linguistic analysis (Charniak and Johnson, 2005; Collins, 2003), but are now commonly seen in applications like machine translation (Wu, 1997) and question answering (Wang et al., 2007). An attractive property of probabilistic grammars is that they permit the use of well-understood parameter estimation methods for learning, both from labeled and unlabeled data. Here we tackle the unsupervised grammar learning problem, specifically for unlexicalized context-free dependency grammars, using an empirical Bayesian approach with a novel family of priors.

There has been an increased interest recently in employing Bayesian modeling for probabilistic grammars in different settings, ranging from putting priors over grammar probabilities (Johnson et al., 2007) to putting non-parametric priors over derivations (Johnson et al., 2006) to learning the set of states in a grammar (Finkel et al., 2007; Liang et al., 2007). Bayesian methods offer an elegant framework for combining prior knowledge with data. The main challenge in Bayesian grammar learning is efficiently approximating probabilistic inference, which is generally intractable. Most commonly, variational (Johnson, 2007; Kurihara and Sato, 2006) or sampling techniques are applied (Johnson et al., 2006). Because probabilistic grammars are built out of multinomial distributions, the Dirichlet family (or, more precisely, a collection of Dirichlets) is a natural candidate for probabilistic grammars because of its conjugacy to the multinomial family. Conjugacy implies a clean form for the posterior distribution over grammar probabilities (given the data and the prior), bestowing computational tractability.

Following work by Blei and Lafferty (2006) for topic models, Cohen et al. (2008) proposed an alternative to Dirichlet priors for probabilistic grammars, based on the logistic normal (LN) distribution over the probability simplex. Cohen et al. used this prior to softly tie grammar weights through the covariance parameters of the LN. The prior encodes information about which grammar rules' weights are likely to covary, a more intuitive and expressive representation of knowledge than offered by Dirichlet distributions.[1]

[1] Although the task, underlying model, and weights being tied were different, Eisner (2002) also showed evidence for the efficacy of parameter tying in grammar learning.

The contribution of this paper is two-fold. First, from the modeling perspective, we present a generalization of the LN prior of Cohen et al.
(2008), showing how to extend the use of the LN prior to tie between any grammar weights in a probabilistic grammar (instead of only allowing weights within the same multinomial distribution to covary). Second, from the experimental perspective, we show how such flexibility in parameter tying can help in unsupervised grammar learning in the well-known monolingual setting and in a new bilingual setting where grammars for two languages are learned at once (without parallel corpora). Our method is based on a distribution which we call the shared logistic normal distribution, which is a distribution over a collection of multinomials from different probability simplices. We provide a variational EM algorithm for inference.

The rest of this paper is organized as follows. In §2, we give a brief explanation of probabilistic grammars and introduce some notation for the specific type of dependency grammar used in this paper, due to Klein and Manning (2004). In §3, we present our model and a variational inference algorithm for it. In §4, we report on experiments for both monolingual settings and a bilingual setting and discuss them. We discuss future work (§5) and conclude in §6.

2 Probabilistic Grammars and Dependency Grammar Induction

A probabilistic grammar defines a probability distribution over grammatical derivations generated through a step-by-step process. HMMs, for example, can be understood as a random walk through a probabilistic finite-state network, with an output symbol sampled at each state. Each "step" of the walk and each symbol emission corresponds to one derivation step. PCFGs generate phrase-structure trees by recursively rewriting nonterminal symbols as sequences of "child" symbols (each itself either a nonterminal symbol or a terminal symbol analogous to the emissions of an HMM). Each step or emission of an HMM and each rewriting operation of a PCFG is conditionally independent of the other rewriting operations given a single structural element (one HMM or PCFG state); this Markov property permits efficient inference for the probability distribution defined by the probabilistic grammar. In general, a probabilistic grammar defines the joint probability of a string x and a grammatical derivation y:

p(x, y | \theta) = \prod_{k=1}^{K} \prod_{i=1}^{N_k} \theta_{k,i}^{f_{k,i}(x,y)} = \exp \sum_{k=1}^{K} \sum_{i=1}^{N_k} f_{k,i}(x, y) \log \theta_{k,i}    (1)

where f_{k,i} is a function that "counts" the number of times the kth distribution's ith event occurs in the derivation. The θ are a collection of K multinomials θ_1, ..., θ_K, the kth of which includes N_k events. Note that there may be many derivations y for a given string x, perhaps even infinitely many in some kinds of grammars.

2.1 Dependency Model with Valence

HMMs and PCFGs are the best-known probabilistic grammars, but there are many others. In this paper, we use the "dependency model with valence" (DMV), due to Klein and Manning (2004). DMV defines a probabilistic grammar for unlabeled, projective dependency structures. Klein and Manning (2004) achieved their best results with a combination of DMV with a model known as the "constituent-context model" (CCM). We do not experiment with CCM in this paper, because it does not fit directly in a Bayesian setting (it is highly deficient) and because state-of-the-art unsupervised dependency parsing results have been achieved with DMV alone (Smith, 2006).

Using the notation above, DMV defines x = ⟨x_1, x_2, ..., x_n⟩ to be a sentence. x_0 is a special "wall" symbol, $, on the left of every sentence. A tree y is defined by a pair of functions y_left and y_right (both mapping {0, 1, 2, ..., n} to subsets of {1, 2, ..., n}) that map each word to its sets of left and right dependents, respectively. Here, the graph is constrained to be a projective tree rooted at x_0 = $: each word except $ has a single parent, and there are no cycles or crossing dependencies. y_left(0) is taken to be empty, and y_right(0) contains the sentence's single head. Let y(i) denote the subtree rooted at position i. The probability P(y(i) | x_i, θ) of generating this subtree, given its head word x_i, is defined recursively, as described in Fig. 1 (Eq. 2). The probability of the entire tree is given by p(x, y | θ) = P(y(0) | $, θ).
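Both Eq. 1 and the DMV factorization just described come down to the same computation once the event counts f_{k,i}(x, y) of a derivation are known: a sum of counted log-probabilities. The short Python sketch below makes this concrete; the dictionary layout and the toy numbers are illustrative assumptions of this sketch, not the paper's implementation.

import math
from collections import Counter

def log_prob_derivation(theta, counts):
    # log p(x, y | theta) = sum over (k, i) of f_{k,i}(x, y) * log theta_{k,i}
    return sum(c * math.log(theta[k][i]) for (k, i), c in counts.items())

theta = {0: [0.3, 0.7], 1: [0.6, 0.4]}         # K = 2 toy multinomials
counts = Counter({(0, 1): 2, (1, 0): 1})       # f_{0,1}(x,y) = 2, f_{1,0}(x,y) = 1
print(math.exp(log_prob_derivation(theta, counts)))   # 0.7 * 0.7 * 0.6 = 0.294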
The θ are the multinomial distributions θ_s(· | ·, ·, ·) and θ_c(· | ·, ·). To follow the general setting of Eq. 1, we index these distributions as θ_1, ..., θ_K.

P(y(i) | x_i, \theta) = \prod_{D \in \{left, right\}} \theta_s(stop | x_i, D, [y_D(i) = \emptyset]) \prod_{j \in y_D(i)} \theta_s(\neg stop | x_i, D, first_y(j)) \times \theta_c(x_j | x_i, D) \times P(y(j) | x_j, \theta)    (2)

Figure 1: The "dependency model with valence" recursive equation. first_y(j) is a predicate defined to be true iff x_j is the closest child (on either side) to its parent x_i. The probability of the tree is p(x, y | θ) = P(y(0) | $, θ).

Headden et al. (2009) extended DMV so that the distributions θ_c condition on the valence as well, with smoothing, and showed significant improvements for short sentences. Our experiments found that these improvements do not hold on longer sentences. Here we experiment only with DMV, but note that our techniques are also applicable to richer probabilistic grammars like that of Headden et al.

2.2 Learning DMV

Klein and Manning (2004) learned the DMV probabilities θ from a corpus of part-of-speech-tagged sentences using the EM algorithm. EM manipulates θ to locally optimize the likelihood of the observed portion of the data (here, x), marginalizing out the hidden portions (here, y). The likelihood surface is not globally concave, so EM only locally optimizes the surface. Klein and Manning's initialization, though reasonable and language-independent, was an important factor in performance.

Various alternatives to EM were explored by Smith (2006), achieving substantially more accurate parsing models by altering the objective function. Smith's methods did require substantial hyperparameter tuning, and the best results were obtained using small annotated development sets to choose hyperparameters. In this paper, we consider only fully unsupervised methods, though the Bayesian ideas explored here might be merged with the biasing approaches of Smith (2006) for further benefit.

3 Parameter Tying in the Bayesian Setting

As stated above, θ comprises a collection of multinomials that weights the grammar. Taking the Bayesian approach, we wish to place a prior on those multinomials, and the Dirichlet family is a natural candidate for such a prior because of its conjugacy, which makes inference algorithms easier to derive. For example, if we make a "mean-field assumption" with respect to hidden structure and weights, the variational algorithm for approximately inferring the distribution over θ and trees y resembles the traditional EM algorithm very closely (Johnson, 2007). In fact, variational inference in this case takes an action similar to smoothing the counts using the exp(Ψ(·)) function during the E-step.
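As a concrete illustration of the exp(Ψ(·)) smoothing just mentioned, the sketch below computes the standard mean-field E-step weights for a single multinomial under a Dirichlet prior (cf. Johnson, 2007; Kurihara and Sato, 2006) and contrasts them with the plain EM re-normalization. The numbers are toy values and the function names are assumptions of this sketch, not the paper's code.

import numpy as np
from scipy.special import digamma

def exp_psi_weights(expected_counts, alpha):
    # Mean-field E-step weight for each event of one multinomial under a
    # Dirichlet(alpha) prior: exp(digamma(alpha + count) - digamma(total)).
    post = np.asarray(expected_counts, dtype=float) + alpha
    return np.exp(digamma(post) - digamma(post.sum()))

counts = np.array([4.2, 1.3, 0.5])          # expected event counts from inside-outside
print(exp_psi_weights(counts, alpha=1.0))   # smoothed, unnormalized weights
print(counts / counts.sum())                # ordinary EM normalization, for contrast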
Variational inference can be embedded in an empirical Bayes setting, in which we optimize the variational bound with respect to the hyperparameters as well, repeating the process until convergence.

3.1 Logistic Normal Distributions

While Dirichlet priors over grammar probabilities make learning algorithms easy, they are limiting. In particular, as noted by Blei and Lafferty (2006), there is no explicit flexible way for the Dirichlet's parameters to encode beliefs about covariance between the probabilities of two events. To illustrate this point, we describe how a multinomial θ of dimension d is generated from a Dirichlet distribution with parameters α = ⟨α_1, ..., α_d⟩:

1. Generate η_j ~ Γ(α_j, 1) independently for j ∈ {1, ..., d}.
2. θ_j ← η_j / Σ_i η_i.

where Γ(α, 1) is a Gamma distribution with shape α and scale 1. Correlation among θ_i and θ_j, i ≠ j, cannot be modeled directly, only through the normalization in step 2. In contrast, LN distributions (Aitchison, 1986) provide a natural way to model such correlation. The LN draws a multinomial θ as follows:

1. Generate η ~ Normal(µ, Σ).
2. θ_j ← exp(η_j) / Σ_i exp(η_i).

Blei and Lafferty (2006) defined correlated topic models by replacing the Dirichlet in latent Dirichlet allocation models (Blei et al., 2003) with a LN distribution. Cohen et al. (2008) compared Dirichlet and LN distributions for learning DMV using empirical Bayes, finding substantial improvements for English using the latter. In that work, we obtained improvements even without specifying exactly which grammar probabilities covaried. While empirical Bayes learning permits these covariances to be discovered without supervision, we found that by initializing the covariance to encode beliefs about which grammar probabilities should covary, further improvements were possible. Specifically, we grouped the Penn Treebank part-of-speech tags into coarse groups based on the treebank annotation guidelines and biased the initial covariance matrix for each child distribution θ_c(· | ·, ·) so that the probabilities of child tags from the same coarse group covaried. For example, the probability that a past-tense verb (VBD) has a singular noun (NN) as a right child may be correlated with the probability that it has a plural noun (NNS) as a right child. Hence linguistic knowledge, specifically a coarse grouping of word classes, can be encoded in the prior.
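To make the contrast concrete, the sketch below draws a multinomial both ways: by normalizing independent Gamma draws (the Dirichlet) and by applying the logistic transformation to a correlated Gaussian draw (the LN), whose covariance can encode exactly the kind of coarse-group correlation just described. The block covariance values here are illustrative only, not a matrix from the paper.

import numpy as np

rng = np.random.default_rng(0)
d = 4

# Dirichlet(alpha): normalize independent Gamma(alpha_j, 1) draws.
alpha = np.ones(d)
eta_gamma = rng.gamma(shape=alpha, scale=1.0)
theta_dirichlet = eta_gamma / eta_gamma.sum()

# Logistic normal: eta ~ Normal(mu, Sigma), theta = softmax(eta). Sigma lets
# chosen coordinates covary, which the Dirichlet cannot express directly.
mu = np.zeros(d)
Sigma = np.array([[1.0, 0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.5],
                  [0.0, 0.0, 0.5, 1.0]])
eta = rng.multivariate_normal(mu, Sigma)
theta_ln = np.exp(eta) / np.exp(eta).sum()

print(theta_dirichlet.round(3), theta_ln.round(3))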
A per-distribution LN distribution only permits probabilities within a multinomial to covary. We will generalize the LN to permit covariance among any probabilities in θ, throughout the model. For example, the probability of a past-tense verb (VBD) having a noun as a right child might correlate with the probability that other kinds of verbs (VBZ, VBN, etc.) have a noun as a right child.

The partitioned logistic normal distribution (PLN) is a generalization of the LN distribution that takes the first step towards our goal (Aitchison, 1986). Generating θ from a PLN involves drawing a random vector from a multivariate normal distribution, but the logistic transformation is applied to different parts of the vector, leading to sampled multinomial distributions of the required lengths from different probability simplices. This is in principle what is required for arbitrary covariance between grammar probabilities, except that DMV has O(t^2) weights for a part-of-speech vocabulary of size t, requiring a very large multivariate normal distribution with O(t^4) covariance parameters.

3.2 Shared Logistic Normal Distributions

To solve this problem, we suggest a refinement of the class of PLN distributions. Instead of using a single normal vector for all of the multinomials, we use several normal vectors, partition each one, and then recombine the parts which correspond to the same multinomial, as a mixture. Next, we apply the logistic transformation to the mixed vectors (each of which is normally distributed as well). Fig. 2 gives an example of a non-trivial case of using an SLN distribution, where three multinomials are generated from four normal experts.

[Figure 2: An example of a shared logistic normal distribution, illustrating Def. 1. N = 4 experts are used to sample K = 3 multinomials; L_1 = 3, L_2 = 2, L_3 = 2, L_4 = 1; ℓ_1 = 9, ℓ_2 = 6, ℓ_3 = 7, ℓ_4 = 2; N_1 = 2, N_2 = 4, and N_3 = 3.]

We now formalize this notion. For a natural number N, we denote by 1:N the set {1, ..., N}. For a vector v ∈ R^N and a set I ⊆ 1:N, we denote by v_I the vector created from v by using the coordinates in I. Recall that K is the number of multinomials in the probabilistic grammar, and N_k is the number of events in the kth multinomial.

Definition 1. We define a shared logistic normal distribution with N "experts" over a collection of K multinomial distributions. Let η_n ~ Normal(µ_n, Σ_n) be a set of multivariate normal variables for n ∈ 1:N, where the length of η_n is denoted ℓ_n. Let I_n = {I_{n,j}}_{j=1}^{L_n} be a partition of 1:ℓ_n into L_n sets, such that ∪_{j=1}^{L_n} I_{n,j} = 1:ℓ_n and I_{n,j} ∩ I_{n,j'} = ∅ for j ≠ j'. Let J_k, for k ∈ 1:K, be a collection of (disjoint) subsets of {I_{n,j} | n ∈ 1:N, j ∈ 1:L_n, |I_{n,j}| = N_k}, such that all sets in J_k are of the same size, N_k. Let η̃_k = (1/|J_k|) Σ_{I_{n,j} ∈ J_k} η_{n,I_{n,j}}, and θ_{k,i} = exp(η̃_{k,i}) / Σ_{i'} exp(η̃_{k,i'}). We then say θ distributes according to the shared logistic normal distribution with partition structure S = ({I_n}_{n=1}^{N}, {J_k}_{k=1}^{K}) and normal experts {(µ_n, Σ_n)}_{n=1}^{N}, and denote it by θ ~ SLN(µ, Σ, S).

The partitioned LN distribution in Aitchison (1986) can be formulated as a shared LN distribution where N = 1. The LN collection used by Cohen et al. (2008) is the special case where N = K, each L_n = 1, each ℓ_k = N_k, and each J_k = {I_{k,1}}.

The covariance among arbitrary θ_{k,i} is not defined directly; it is implied by the definition of the normal experts η_{n,I_{n,j}}, for each I_{n,j} ∈ J_k. We note that an SLN can be represented as a PLN by relying on the distributivity of the covariance operator, and merging all the partition structure into one (perhaps sparse) covariance matrix. However, if we are interested in keeping a factored structure on the covariance matrices which generate the grammar weights, we cannot represent every SLN as a PLN.
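A hedged sketch of how one might sample θ ~ SLN(µ, Σ, S) following Def. 1 appears below: each expert contributes a normal vector, the sub-vectors listed in J_k are averaged, and the result is passed through the logistic transformation. The partition structure, names, and numbers are a toy illustration, not a configuration used in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Two experts: expert 0 has length 4 split into two blocks; expert 1 has length 2.
experts = [
    {"mu": np.zeros(4), "Sigma": np.eye(4), "I": [[0, 1], [2, 3]]},
    {"mu": np.zeros(2), "Sigma": np.eye(2), "I": [[0, 1]]},
]
# J[k] lists (expert index, block index) pairs whose blocks are averaged to form
# multinomial k; all blocks in J[k] must have the same size N_k.
J = [
    [(0, 0), (1, 0)],   # multinomial 0 mixes a block of expert 0 with expert 1
    [(0, 1)],           # multinomial 1 comes from expert 0 alone
]

def sample_sln(experts, J, rng):
    etas = [rng.multivariate_normal(e["mu"], e["Sigma"]) for e in experts]
    thetas = []
    for blocks in J:
        # average the relevant sub-vectors (eta-tilde), then apply softmax
        tilde = np.mean([etas[n][experts[n]["I"][j]] for n, j in blocks], axis=0)
        thetas.append(np.exp(tilde) / np.exp(tilde).sum())
    return thetas

print([t.round(3) for t in sample_sln(experts, J, rng)])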
It is convenient to think of each η_{i,j} as a weight associated with a unique event's probability, a certain outcome of a certain multinomial in the probabilistic grammar. By letting different η_{i,j} covary with each other, we loosen the relationships among the θ_{k,j} and permit the model, at least in principle, to learn patterns from the data. Def. 1 also implies that we multiply several multinomials together in a product-of-experts style (Hinton, 1999), because the exponential of a mixture of normals becomes a product of (unnormalized) probabilities.

Our extension to the model in Cohen et al. (2008) follows naturally after we have defined the shared LN distribution. The generative story for this model is as follows:

1. Generate θ ~ SLN(µ, Σ, S), where θ is a collection of vectors θ_k, k = 1, ..., K.
2. Generate x and y from p(x, y | θ) (i.e., sample from the probabilistic grammar).

3.3 Inference

In this work, the partition structure S is known, the sentences x are observed, the trees y and the grammar weights θ are hidden, and the parameters of the shared LN distribution µ and Σ are learned.[2]

[2] In future work, we might aim to learn S.

Our inference algorithm aims to find the posterior over the grammar probabilities θ and the hidden structures (grammar trees y). To do that, we use variational approximation techniques (Jordan et al., 1999), which treat the problem of finding the posterior as an optimization problem aimed at finding the best approximation q(θ, y) of the posterior p(θ, y | x, µ, Σ, S). The posterior q needs to be constrained to be within a family of tractable and manageable distributions, yet rich enough to represent good approximations of the true posterior. "Best approximation" is defined as the KL divergence between q(θ, y) and p(θ, y | x, µ, Σ, S).

Our variational inference algorithm uses a mean-field assumption: q(θ, y) = q(θ)q(y). The distribution q(θ) is assumed to be an LN distribution with all off-diagonal covariances fixed at zero (i.e., the variational parameters consist of a single mean μ̃_{k,i} and a single variance σ̃²_{k,i} for each η_{k,i}). There is an additional variational parameter, ζ̃_k per multinomial, which is the result of an additional variational approximation because of the lack of conjugacy of the LN distribution to the multinomial distribution. The distribution q(y) is assumed to be defined by a DMV with unnormalized probabilities ψ̃. Inference optimizes the bound B given in Fig. 3 (Eq. 3) with respect to the variational parameters.

\log p(x | \mu, \Sigma, S) \ge B \triangleq \sum_{n=1}^{N} E_q[\log p(\eta_n | \mu_n, \Sigma_n)] + \sum_{k=1}^{K} \sum_{i=1}^{N_k} \tilde{f}_{k,i} \tilde{\psi}_{k,i} + H(q)    (3)

\tilde{f}_{k,i} \triangleq \sum_{y} q(y) f_{k,i}(x, y)    (4)

\tilde{\psi}_{k,i} \triangleq \tilde{\mu}^C_{k,i} - \log \tilde{\zeta}_k + 1 - \frac{1}{\tilde{\zeta}_k} \sum_{i'=1}^{N_k} \exp\left( \tilde{\mu}^C_{k,i'} + (\tilde{\sigma}^C_{k,i'})^2 / 2 \right)    (5)

\tilde{\mu}^C_{k} \triangleq \frac{1}{|J_k|} \sum_{I_{r,j} \in J_k} \tilde{\mu}_{r,I_{r,j}}    (6)

(\tilde{\sigma}^C_{k})^2 \triangleq \frac{1}{|J_k|^2} \sum_{I_{r,j} \in J_k} \tilde{\sigma}^2_{r,I_{r,j}}    (7)

Figure 3: Variational inference bound. Eq. 3 is the bound itself, using notation defined in Eqs. 4-7 for clarity. Eq. 4 defines expected counts of the grammar events under the variational distribution q(y), calculated using dynamic programming. Eq. 5 describes the weights for the weighted grammar defined by q(y). Eq. 6 and Eq. 7 describe the mean and the variance, respectively, for the multivariate normal eventually used with the weighted grammar. These values are based on the parameterization of q by μ̃_{i,j} and σ̃²_{i,j}. An additional set of variational parameters is ζ̃_k, which helps resolve the non-conjugacy of the LN distribution through a first-order Taylor approximation.
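To make Eqs. 5-7 concrete, the sketch below computes the unnormalized weights exp(ψ̃_k) for one multinomial from the variational means and variances of the experts behind it. The closed-form choice of ζ̃_k shown is the usual minimizer of this first-order bound and is an assumption of the sketch, as are all variable names; it is not presented as the paper's exact update schedule.

import numpy as np

def grammar_weights(mu_blocks, var_blocks):
    # mu_blocks, var_blocks: per-expert mean/variance vectors (the blocks in J_k
    # for one multinomial), each of length N_k.
    J = len(mu_blocks)
    mu_c = np.mean(mu_blocks, axis=0)                   # Eq. 6: combined mean
    var_c = np.sum(var_blocks, axis=0) / (J ** 2)       # Eq. 7: combined variance
    zeta = np.sum(np.exp(mu_c + var_c / 2.0))           # closed-form zeta choice
    psi = mu_c - np.log(zeta) + 1.0 - np.sum(np.exp(mu_c + var_c / 2.0)) / zeta  # Eq. 5
    return np.exp(psi)                                  # unnormalized weights for q(y)

w = grammar_weights([np.array([0.2, -0.1, 0.0]), np.array([0.1, 0.0, -0.2])],
                    [np.ones(3) * 0.5, np.ones(3) * 0.5])
print(w, w.sum())   # sums to at most 1 under this zeta choice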
Our variational inference algorithm is derived similarly to that of Cohen et al. (2008). Because we wish to learn the values of µ and Σ, we embed variational inference as the E step within a variational EM algorithm, shown schematically in Fig. 4.

Input: initial parameters µ^(0), Σ^(0), partition structure S, observed data x, number of iterations T
Output: learned parameters µ, Σ
t ← 1
while t ≤ T do
    E-step (for ℓ = 1, ..., M) do:
        repeat
            optimize B w.r.t. μ̃_r^(ℓ,t), r = 1, ..., N
            optimize B w.r.t. σ̃_r^(ℓ,t), r = 1, ..., N
            update ζ̃_r^(ℓ,t), r = 1, ..., N
            update ψ̃_r^(ℓ,t), r = 1, ..., N
            compute counts f̃_r^(ℓ,t), r = 1, ..., N
        until convergence of B
    M-step: optimize B w.r.t. µ^(t) and Σ^(t)
    t ← t + 1
end
return µ^(T), Σ^(T)

Figure 4: Main details of the variational inference EM algorithm with empirical Bayes estimation of µ and Σ. B is the bound defined in Fig. 3 (Eq. 3). N is the number of normal experts for the SLN distribution defining the prior. M is the number of training examples. The full algorithm is given in Cohen and Smith (2009).

In our experiments, we use this variational EM algorithm on a training set, and then use the normal experts' means to get a point estimate for θ, the grammar weights. This is called empirical Bayesian estimation. Our approach differs from maximum a posteriori (MAP) estimation, since we re-estimate the parameters of the normal experts. Exact MAP estimation is probably not feasible; a variational algorithm like ours might be applied, though better performance is expected from adjusting the SLN to fit the data.
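Read as code, Fig. 4 is an outer EM loop around a per-sentence coordinate-ascent E-step. The Python skeleton below is only a schematic rendering under that reading; e_step and m_step are hypothetical callables standing in for the updates listed in Fig. 4, not functions from the authors' implementation (the full algorithm is given in Cohen and Smith, 2009).

def variational_em(mu, Sigma, S, sentences, T, e_step, m_step):
    # Schematic outer loop of Fig. 4. e_step is assumed to run the coordinate
    # ascent over (mu~, sigma~, zeta~, psi~) plus the inside-outside pass that
    # produces expected counts f~ for one sentence; m_step re-fits the normal
    # experts (mu, Sigma) from the collected per-sentence statistics.
    for t in range(T):
        per_sentence_stats = [e_step(x, mu, Sigma, S) for x in sentences]  # E-step
        mu, Sigma = m_step(per_sentence_stats, S)                          # M-step
    return mu, Sigma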
4 Experiments

Our experiments involve data from two treebanks: the Wall Street Journal Penn treebank (Marcus et al., 1993) and the Chinese treebank (Xue et al., 2004). In both cases, following standard practice, sentences were stripped of words and punctuation, leaving part-of-speech tags for the unsupervised induction of dependency structure. For English, we train on §2-21, tune on §22 (without using annotated data), and report final results on §23. For Chinese, we train on §1-270, use §301-1151 for development, and report testing results on §271-300.[3]

[3] Unsupervised training for these datasets can be costly, and requires iteratively running a cubic-time inside-outside dynamic programming algorithm, so we follow Klein and Manning (2004) in restricting the training set to sentences of ten or fewer words in length. Short sentences are also less structurally ambiguous and may therefore be easier to learn from.

To evaluate performance, we report the fraction of words whose predicted parent matches the gold-standard corpus. This performance measure is also known as attachment accuracy. We considered two parsing methods after extracting a point estimate for the grammar: the most probable "Viterbi" parse (argmax_y p(y | x, θ)) and the minimum Bayes risk (MBR) parse (argmin_y E_{p(y'|x,θ)}[ℓ(y; x, y')]) with dependency attachment error as the loss function (Goodman, 1996). Performance with MBR parsing is consistently higher than its Viterbi counterpart, so we report only performance with MBR parsing.

4.1 Nouns, Verbs, and Adjectives

In this paper, we use a few simple heuristics to decide which partition structure S to use. Our heuristics rely mainly on the centrality of content words: nouns, verbs, and adjectives. For example, in the English treebank, the most common attachment errors (with the LN prior from Cohen et al., 2008) happen with a noun (25.9%) or a verb (16.9%) parent. In the Chinese treebank, the most common attachment errors happen with noun (36.0%) and verb (21.2%) parents as well. The errors involving such attachments are the direct result of nouns and verbs being the most common parents in these data sets. Following this observation, we compare four different settings in our experiments (all SLN settings include one normal expert for each multinomial on its own, equivalent to the regular LN setting from Cohen et al.):

· TIE V: We add normal experts that tie all probabilities corresponding to a verbal parent (any parent, using the coarse tags of Cohen et al., 2008). Let V be the set of part-of-speech tags which belong to the verb category. For each direction D (left or right), the set of multinomials of the form θ_c(· | v, D), for v ∈ V, all share a normal expert. For each direction D and each boolean value B of the predicate first_y(·), the set of multinomials θ_s(· | v, D, B), for v ∈ V, share a normal expert.
· TIE N: This is the same as TIE V, only for nominal parents.
· TIE V&N: Tie both verbs and nouns (in separate partitions). This is equivalent to taking the union of the partition structures of the above two settings.
· TIE A: This is the same as TIE V, only for adjectival parents.

Since inference for a model with parameter tying can be computationally intensive, we first run the inference algorithm without parameter tying, and then add parameter tying to the rest of the inference algorithm's execution until convergence.

Initialization is important for the inference algorithm, because the variational bound is a non-concave function. For the expected values of the normal experts, we use the initializer from Klein and Manning (2004). For the covariance matrices, we follow the setting of Cohen et al. (2008), also described in §3.1. For each treebank, we divide the tags into twelve disjoint tag families.[4] The covariance matrices for all dependency distributions were initialized with 1 on the diagonal, 0.5 between tags which belong to the same family, and 0 otherwise. This initializer has been shown to be more successful than an identity covariance matrix.

[4] These are simply coarser tags: adjective, adverb, conjunction, foreign word, interjection, noun, number, particle, preposition, pronoun, proper noun, verb.
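The initializer just described is easy to reproduce; the sketch below builds it for a handful of tags. The tag-to-family map here is a tiny stand-in for the twelve coarse families of footnote 4, and the function name is an assumption of this sketch.

import numpy as np

def init_covariance(tags, family):
    # 1.0 on the diagonal, 0.5 between tags in the same coarse family, 0 otherwise.
    d = len(tags)
    Sigma = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if i == j:
                Sigma[i, j] = 1.0
            elif family[tags[i]] == family[tags[j]]:
                Sigma[i, j] = 0.5
    return Sigma

tags = ["NN", "NNS", "NNP", "VBD", "VBZ", "JJ"]
family = {"NN": "noun", "NNS": "noun", "NNP": "proper noun",
          "VBD": "verb", "VBZ": "verb", "JJ": "adjective"}
print(init_covariance(tags, family))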
4.2 Monolingual Experiments

We begin our experiments with a monolingual setting, where we learn grammars for English and Chinese (separately) using the settings described above. The attachment accuracy for this set of experiments is described in Table 1. The baselines include right attachment (where each word is attached to the word to its right), MLE via EM (Klein and Manning, 2004), and empirical Bayes with Dirichlet and LN priors (Cohen et al., 2008). We also include a "ceiling" (DMV trained using supervised MLE from the training sentences' trees). For English, we see that tying nouns, verbs or adjectives improves performance compared to the LN baseline. Tying both nouns and verbs improves performance a bit more.

English                    10     20     all
  Attach-Right            38.4   33.4   31.7
  EM (K&M, 2004)          46.1   39.9   35.9
  Dirichlet               46.1   40.6   36.9
  LN (CG&S, 2008)         59.4   45.9   40.5
  SLN, TIE V              60.2   46.2   40.0
  SLN, TIE N              60.2   46.7   40.9
  SLN, TIE V&N            61.3   47.4   41.4
  SLN, TIE A              59.9   45.8   40.9
  Biling. SLN, TIE V      61.6   47.6   41.7
  Biling. SLN, TIE N      61.8   48.1   42.1
  Biling. SLN, TIE V&N    62.0   48.0   42.2
  Biling. SLN, TIE A      61.3   47.6   41.7
  Supervised MLE          84.5   74.9   68.8

Chinese                    10     20     all
  Attach-Right            34.9   34.6   34.6
  EM (K&M, 2004)          38.3   36.1   32.7
  Dirichlet               38.3   35.9   32.4
  LN                      50.1   40.5   35.8
  SLN, TIE V              51.9   42.0   35.8
  SLN, TIE N              43.0   38.4   33.7
  SLN, TIE V&N            45.0   39.2   34.2
  SLN, TIE A              47.4   40.4   35.2
  Biling. SLN, TIE V      51.9   42.0   35.8
  Biling. SLN, TIE N      48.0   38.9   33.8
  Biling. SLN, TIE V&N    51.5   41.7   35.3
  Biling. SLN, TIE A      52.0   41.3   35.2
  Supervised MLE          84.3   66.1   57.6

Table 1: Attachment accuracy (%) of different models, on test data from the Penn Treebank and the Chinese Treebank at varying levels of difficulty imposed through a length filter (sentence length of at most 10, at most 20, and all sentences). Attach-Right attaches each word to the word on its right and the last word to $. Bold marks the best overall accuracy per length bound; figures not significantly worse than the best are also marked (binomial sign test, p < 0.05).

4.3 Bilingual Experiments

Leveraging information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages.[5]

[5] Haghighi et al. (2008) presented a technique to learn bilingual lexicons from two non-parallel monolingual corpora.

Our bilingual experiments use the English and Chinese treebanks, which are not parallel corpora, to train parsers for both languages jointly. Sharing information between those two models is done by softly tying grammar weights in the two hidden grammars. We first merge the models for English and Chinese by taking a union of the multinomial families of each and the corresponding prior parameters. We then add a normal expert that ties between the parts of speech in the respective partition structures for both grammars together. Parts of speech are matched through the single coarse tagset (footnote 4). For example, with TIE V, let V = V_Eng ∪ V_Chi be the set of part-of-speech tags which belong to the verb category for either treebank. Then, we tie parameters for all part-of-speech tags in V. We tested this joint model for each of TIE V, TIE N, TIE V&N, and TIE A. After running the inference algorithm which learns the two models jointly, we use unseen data to test each learned model separately.

Table 1 includes the results for these experiments. The performance on English improved significantly in the bilingual setting, achieving the highest performance with TIE V&N. Performance with Chinese is also highest in the bilingual setting, with TIE A and TIE V&N.
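The merging step described in this section can be sketched as bookkeeping over which child distributions point at a shared normal expert across the two grammars. The tag lists below are illustrative and incomplete, and the tuple format is an assumption of this sketch, not the paper's data structure.

V_ENG = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}   # Penn Treebank verb tags (illustrative)
V_CHI = {"VV", "VC", "VE"}                          # Chinese Treebank verb tags (illustrative)
V = V_ENG | V_CHI                                   # matched through the coarse "verb" family

def tie_v_bilingual():
    experts = {}
    for direction in ("left", "right"):
        # one shared normal expert per direction, covering theta_c(. | v, D)
        # for every verb tag v of either grammar
        tied = [("eng" if v in V_ENG else "chi", "child", v, direction) for v in sorted(V)]
        experts[("TIE-V", direction)] = tied
    return experts

for key, tied in tie_v_bilingual().items():
    print(key, "ties", len(tied), "child distributions")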
5 Future Work

In future work we plan to lexicalize the model, including a Bayesian grammar prior that accounts for the syntactic patterns of words. Nonparametric models (Teh, 2006) may be appropriate. We also believe that Bayesian discovery of cross-linguistic patterns is an exciting topic worthy of further exploration.

6 Conclusion

We described a Bayesian model that allows soft parameter tying among any weights in a probabilistic grammar. We used this model to improve unsupervised parsing accuracy on two different languages, English and Chinese, achieving state-of-the-art results. We also showed how our model can be effectively used to simultaneously learn grammars in two languages from non-parallel multilingual data.

Acknowledgments

This research was supported by NSF IIS-0836431. The authors thank the anonymous reviewers and Sylvia Rebholz for helpful comments.

References

J. Aitchison. 1986. The Statistical Analysis of Compositional Data. Chapman and Hall, London.
D. M. Blei and J. D. Lafferty. 2006. Correlated topic models. In Proc. of NIPS.
D. M. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.
D. Burkett and D. Klein. 2008. Two languages are better than one (for syntactic parsing). In Proc. of EMNLP.
E. Charniak and M. Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proc. of ACL.
S. B. Cohen and N. A. Smith. 2009. Inference for probabilistic grammars with shared logistic normal distributions. Technical report, Carnegie Mellon University.
S. B. Cohen, K. Gimpel, and N. A. Smith. 2008. Logistic normal priors for unsupervised probabilistic grammar induction. In NIPS.
M. Collins. 2003. Head-driven statistical models for natural language processing. Computational Linguistics, 29:589-637.
I. Dagan. 1991. Two languages are more informative than one. In Proc. of ACL.
J. Eisner. 2002. Transformational priors over grammars. In Proc. of EMNLP.
J. R. Finkel, T. Grenager, and C. D. Manning. 2007. The infinite tree. In Proc. of ACL.
J. Goodman. 1996. Parsing algorithms and metrics. In Proc. of ACL.
A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proc. of ACL.
W. P. Headden, M. Johnson, and D. McClosky. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proc. of NAACL-HLT.
G. E. Hinton. 1999. Products of experts. In Proc. of ICANN.
M. Johnson, T. L. Griffiths, and S. Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In NIPS.
M. Johnson, T. L. Griffiths, and S. Goldwater. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proc. of NAACL.
M. Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In Proc. of EMNLP-CoNLL.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233.
D. Klein and C. D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proc. of ACL.
K. Kurihara and T. Sato. 2006. Variational Bayesian grammar induction for natural language. In Proc. of ICGI.
P. Liang, S. Petrov, M. Jordan, and D. Klein. 2007. The infinite PCFG using hierarchical Dirichlet processes. In Proc. of EMNLP.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19:313-330.
D. A. Smith and N. A. Smith. 2004. Bilingual parsing with factored estimation: Using English to parse Korean. In Proc. of EMNLP, pages 49-56.
N. A. Smith. 2006. Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text. Ph.D. thesis, Johns Hopkins University.
B. Snyder and R. Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proc. of ACL.
Y. W. Teh. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proc. of COLING-ACL.
M. Wang, N. A. Smith, and T. Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for question answering. In Proc. of EMNLP.
D. Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-404.
N. Xue, F. Xia, F.-D. Chiou, and M. Palmer. 2004. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 10(4):1-30.