Learning Bilingual Lexicons from Monolingual Corpora

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein Computer Science Division, University of California at Berkeley { aria42,pliang,tberg,klein }@cs.berkeley.edu

Abstract
We present a method for learning bilingual translation lexicons from monolingual corpora. Word types in each language are characterized by purely monolingual features, such as context counts and orthographic substrings. Translations are induced using a generative model based on canonical correlation analysis, which explains the monolingual lexicons in terms of latent matchings. We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.

pairs deemed to be word-level translations. Precision and recall are then measured over these bilingual lexicons. This setting has been considered before, most notably in Koehn and Knight (2002) and Fung (1995), but the current paper is the first to use a probabilistic model and present results across a variety of language pairs and data conditions. In our method, we represent each language as a monolingual lexicon (see figure 2): a list of word types characterized by monolingual feature vectors, such as context counts, orthographic substrings, and so on (section 5). We define a generative model over (1) a source lexicon, (2) a target lexicon, and (3) a matching between them (section 2). Our model is based on canonical correlation analysis (CCA)1 and explains matched word pairs via vectors in a common latent space. Inference in the model is done using an EM-style algorithm (section 3). Somewhat surprisingly, we show that it is possible to learn or extend a translation lexicon using monolingual corpora alone, in a variety of languages and using a variety of corpora, even in the absence of orthographic features. As might be expected, the task is harder when no seed lexicon is provided, when the languages are strongly divergent, or when the monolingual corpora are from different domains. Nonetheless, even in the more difficult cases, a sizable set of high-precision translations can be extracted. As an example of the performance of the system, in English-Spanish induction with our best feature set, using corpora derived from topically similar but non-parallel sources, the system obtains 89.0% precision at 33% recall.
1

1

Introduction

Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994). Although parallel text is plentiful for some language pairs such as English-Chinese or EnglishArabic, it is scarce or even non-existent for most others, such as English-Hindi or French-Japanese. Moreover, parallel text could be scarce for a language pair even if monolingual data is readily available for both languages. In this paper, we consider the problem of learning translations from monolingual sources alone. This task, though clearly more difficult than the standard parallel text approach, can operate on language pairs and in domains where standard approaches cannot. We take as input two monolingual corpora and perhaps some seed translations, and we produce as output a bilingual lexicon, defined as a list of word 771

See Hardoon et al. (2003) for an overview.

Proceedings of ACL-08: HLT, pages 771­779, Columbus, Ohio, USA, June 2008. c 2008 Association for Computational Linguistics


s
state

m

t
sociedad

matched to at most one other word type.2 We take M AT C H I N G - P R I O R to be uniform over M.3 Then, for each matched pair of word types (i, j )  m, we need to generate the observed feature vectors of the source and target word types, fS (si )  RdS and fT (tj )  RdT . The feature vector of each word type is computed from the appropriate monolingual corpus and summarizes the word's monolingual characteristics; see section 5 for details and figure 2 for an illustration. Since si and tj are translations of each other, we expect fS (si ) and fT (tj ) to be connected somehow by the generative process. In our model, they are related through a vector zi,j  Rd representing the shared, language-independent concept. Specifically, to generate the feature vectors, we first generate a random concept zi,j  N (0, Id ), where Id is the d × d identity matrix. The source feature vector fS (si ) is drawn from a multivariate Gaussian with mean WS zi,j and covariance S , where WS is a dS × d matrix which transforms the language-independent concept zi,j into a languagedependent vector in the source space. The arbitrary covariance parameter S 0 explains the sourcespecific variations which are not captured by WS ; it does not play an explicit role in inference. The target fT (tj ) is generated analogously using WT and T , conditionally independent of the source given zi,j (see figure 2). For each of the remaining unmatched source word types si which have not yet been generated, we draw the word type features from a baseline normal distribution with variance  2 IdS , with hyperparameter  2 0; unmatched target words are similarly generated. If two word types are truly translations, it will be better to relate their feature vectors through the latent space than to explain them independently via the baseline distribution. However, if a source word type is not a translation of any of the target word types, we can just generate it independently without requiring it to participate in the matching.
2 Our choice of M permits unmatched word types, but does not allow words to have multiple translations. This setting facilitates comparison to previous work and admits simpler models. 3 However, non-uniform priors could encode useful information, such as rank similarities.

society

estado

enlargement

. . .

amplificación

. . .

control

importancia

importance

control

Figure 1: Bilingual lexicon induction: source word types s are listed on the left and target word types t on the right. Dashed lines between nodes indicate translation pairs which are in the matching m.

2

Bilingual Lexicon Induction

As input, we are given a monolingual corpus S (a sequence of word tokens) in a source language and a monolingual corpus T in a target language. Let s = (s1 , . . . , snS ) denote nS word types appearing in the source language, and t = (t1 , . . . , tnT ) denote word types in the target language. Based on S and T , our goal is to output a matching m between s and t. We represent m as a set of integer pairs so that (i, j )  m if and only if si is matched with tj . 2.1 Generative Model We propose the following generative model over matchings m and word types (s, t), which we call matching canonical correlation analysis (MCCA).
MCCA model m  M AT C H I N G - P R I O R [matching m] For each matched edge (i, j )  m: -zi,j  N (0, Id ) [latent concept] -fS (si )  N (WS zi,j , S ) [source features] -fT (ti )  N (WT zi,j , T ) [target features] For each unmatched source word type i: -fS (si )  N (0,  2 IdS ) [source features] For each unmatched target word type j : -fT (tj )  N (0,  2 IdT ) [target features]

First, we generate a matching m  M, where M is the set of matchings in which each word type is 772


fS (si )

Canonical Space

Rd z

fT (tj )
#ti mpo pe#
1.0 1.0 1.0

{

Orthographic Features

#ti ime me#

1.0 1.0 1.0

{

. . .
change dawn period necessary
20.0 5.0 100.0 50.0

Contextual Features

si
time

tj
tiempo

. . .
suficiente período mismo
40.0 65.0 120.0 45.0

Source Space

Rds

Target Space

Rdt

adicional

Figure 2: Illustration of our MCCA model. Each latent concept zi,j originates in the canonical space. The observed word vectors in the source and target spaces are generated independently given this concept.

3

Inference

Given our probabilistic model, we would like to maximize the log-likelihood of the observed data (s, t): m () = log p(s, t; ) = log p(m, s, t; ) with respect to the model parameters  = (WS , WT , S , T ). We use the hard (Viterbi) EM algorithm as a starting point, but due to modeling and computational considerations, we make several important modifications, which we describe later. The general form of our algorithm is as follows:
Summary of learning algorithm E-step: Find the maximum weighted (partial) bipartite matching m  M M-step: Find the best parameters  by performing canonical correlation analysis (CCA)

RdS ×d of the source and UT  RdT ×d of the target such that the components of the projections US fS (si ) and UT fT (tj ) are maximally correlated.4 US and UT can be found by solving an eigenvalue problem (see Hardoon et al. (2003) for details). Then the maximum likelihood estimates are as follows: WS = CS S US P 1/2 , WT = CT T UT P 1/2 , S = CS S - WS WS , and T = CT T - WT WT , where P is a d × d diagonal matrix of the canonical ( 1 is correlations, CS S = |m| i,j )m fS (si )fS (si ) the empirical covariance matrix in the source domain, and CT T is defined analogously. E-step To perform a conventional E-step, we would need to compute the posterior over all matchings, which is #P-complete (Valiant, 1979). On the other hand, hard EM only requires us to compute the best matching under the current model:5 m = argmax og p(m , s, t; ).
m
l

(2)

M-step Given a matching m, the M-step optimizes log p(m, s, t; ) with respect to , which can be rewritten as ( max log p(si , tj ; ). (1)
 i,j )m

We cast this optimization as a maximum weighted bipartite matching problem as follows. Define the edge weight between source word type i and target word type j to be wi,j = log p(si , tj ; ) - log p(si ; ) - log p(tj ; ),
Since dS and dT can be quite large in practice and often greater than |m|, we use Cholesky decomposition to rerepresent the feature vectors as |m|-dimensional vectors with the same dot products, which is all that CCA depends on. 5 If we wanted softer estimates, we could use the agreementbased learning framework of Liang et al. (2008) to combine two tractable models.
4

(3)

This objective corresponds exactly to maximizing the likelihood of the probabilistic CCA model presented in Bach and Jordan (2006), which proved that the maximum likelihood estimate can be computed by canonical correlation analysis (CCA). Intuitively, CCA finds d-dimensional subspaces US  773


which can be loosely viewed as a pointwise mutual information quantity. We can check that the objective log p(m, s, t; ) is equal to the weight of a matching plus some constant C : ( wi,j + C. (4) log p(m, s, t; ) =
i,j )m

are presented for other languages in section 6. In this section, we describe the data and experimental methodology used throughout this work. 4.1 Data

Each experiment requires a source and target monolingual corpus. We use the following corpora: · E N - E S - W: 3,851 Wikipedia articles with both English and Spanish bodies (generally not direct translations). · E N - E S - P : 1st 100k sentences of text from the parallel English and Spanish Europarl corpus (Koehn, 2005). · E N - E S ( F R ) - D : English: 1st 50k sentences of Europarl; Spanish (French): 2nd 50k sentences of Europarl.7 · E N - C H - D : English: 1st 50k sentences of Xinhua parallel news corpora;8 Chinese: 2nd 50k sentences. · E N - A R - D : English: 1st 50k sentences of 1994 proceedings of UN parallel corpora;9 Arabic: 2nd 50k sentences. · E N - E S - G : English: 100k sentences of English Gigaword; Spanish: 100k sentences of Spanish Gigaword.10 Note that even when corpora are derived from parallel sources, no explicit use is ever made of document or sentence-level alignments. In particular, our method is robust to permutations of the sentences in the corpora. 4.2 Lexicon

To find the optimal partial matching, edges with weight wi,j < 0 are set to zero in the graph and the optimal full matching is computed in O((nS + nT )3 ) time using the Hungarian algorithm (Kuhn, 1955). If a zero edge is present in the solution, we remove the involved word types from the matching.6 Bootstrapping Recall that the E-step produces a partial matching of the word types. If too few word types are matched, learning will not progress quickly; if too many are matched, the model will be swamped with noise. We found that it was helpful to explicitly control the number of edges. Thus, we adopt a bootstrapping-style approach that only permits high confidence edges at first, and then slowly permits more over time. In particular, we compute the optimal full matching, but only retain the highest weighted edges. As we run EM, we gradually increase the number of edges to retain. In our context, bootstrapping has a similar motivation to the annealing approach of Smith and Eisner (2006), which also tries to alter the space of hidden outputs in the E-step over time to facilitate learning in the M-step, though of course the use of bootstrapping in general is quite widespread (Yarowsky, 1995).

4

Experimental Setup

In section 5, we present developmental experiments in English-Spanish lexicon induction; experiments
Empirically, we obtained much better efficiency and even increased accuracy by replacing these marginal likelihood weights with a simple proxy, the distances between the words' mean latent concepts:
  wi,j = A - ||zi - zj ||2 , 6

Each experiment requires a lexicon for evaluation. Following Koehn and Knight (2002), we consider lexicons over only noun word types, although this is not a fundamental limitation of our model. We consider a word type to be a noun if its most common tag is a noun in our monolingual corpus.11 For
Note that the although the corpora here are derived from a parallel corpus, there are no parallel sentences. 8 LDC catalog # 2002E18. 9 LDC catalog # 2004E13. 10 These corpora contain no parallel sentences. 11 We use the Tree Tagger (Schmid, 1994) for all POS tagging except for Arabic, where we use the tagger described in Diab et al. (2004).
7

(5)

 where A is a thresholding constant, zi = E(zi,j | fS (si )) = 1/2 S  P U fS (si ), and zj is defined analogously. The increased accuracy may not be an accident: whether two words are translations is perhaps better characterized directly by how close their latent concepts are, whereas log-probability is more sensitive to perturbations in the source and target spaces.

774


1 EN-ES-P EN-ES-W 0.95

Setting
EDITDIST O RT H O CONTEXT MCCA

p0.1

p0.25

p0.33

p0.50

Best-F1

0.9

Precision

0.85

0.8

58.6 76.0 91.1 87.2

62.6 81.3 81.3 89.7

61.1 80.1 80.2 89.0

--52.3 65.3 89.7

47.4 55.0 58.0 72.0

0.75

0.7

0.65

Table 1: Performance of E D I T D I S T and our model with various features sets on E N - E S - W. See section 5.
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

0.6

Recall

Figure 3: Example precision/recall curve of our system on E N - E S - P and E N - E S - W settings. See section 6.1.

all languages pairs except English-Arabic, we extract evaluation lexicons from the Wiktionary online dictionary. As we discuss in section 7, our extracted lexicons have low coverage, particularly for proper nouns, and thus all performance measures are (sometimes substantially) pessimistic. For EnglishArabic, we extract a lexicon from 100k parallel sentences of UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding (s, t) to the lexicon if s was aligned to t at least three times and more than any other word. Also, as in Koehn and Knight (2002), we make use of a seed lexicon, which consists of a small, and perhaps incorrect, set of initial translation pairs. We used two methods to derive a seed lexicon. The first is to use the evaluation lexicon Le and select the hundred most common noun word types in the source corpus which have translations in Le . The second method is to heuristically induce, where applicable, a seed lexicon using edit distance, as is done in Koehn and Knight (2002). Section 6.2 compares the performance of these two methods. 4.3 Evaluation We evaluate a proposed lexicon Lp against the evaluation lexicon Le using the F1 measure in the standard fashion; precision is given by the number of proposed translations contained in the evaluation lexicon, and recall is given by the fraction of possible translation pairs proposed.12 Since our model
We should note that precision is not penalized for (s, t) if s does not have a translation in Le , and recall is not penalized for failing to recover multiple translations of s.
12

naturally produces lexicons in which each entry is associated with a weight based on the model, we can give a full precision/recall curve (see figure 3). We summarize these curves with both the best F1 over all possible thresholds and various precisions px at recalls x. All reported numbers exclude evaluation on the seed lexicon entries, regardless of how those seeds are derived or whether they are correct. In all experiments, unless noted otherwise, we used a seed of size 100 obtained from Le and considered lexicons between the top n = 2, 000 most frequent source and target noun word types which were not in the seed lexicon; each system proposed an already-ranked one-to-one translation lexicon amongst these n words. Where applicable, we compare against the E D I T D I S T baseline, which solves a maximum bipartite matching problem where edge weights are normalized edit distances. We will use M C C A (for matching CCA) to denote our model using the optimal feature set (see section 5.3).

5 Features
In this section, we explore feature representations of word types in our model. Recall that fS (·) and fT (·) map source and target word types to vectors in RdS and RdT , respectively (see section 2). The features used in each representation are defined identically and derived only from the appropriate monolingual corpora. For a concrete example of a word type to feature vector mapping, see figure 2. 5.1 Orthographic Features

For closely related languages, such as English and Spanish, translation pairs often share many orthographic features. One direct way to capture orthographic similarity between word pairs is edit distance. Running E D I T D I S T (see section 4.3) on E N -

775


ES-W yielded 61.1 p0.33 , but precision quickly degrades for higher recall levels (see E D I T D I S T in ta-

(a) Corpus Variation

Setting
EN-ES-G EN-ES-W EN-ES-D EN-ES-P

p0.1

p0.25

p0.33

p0.50

Best-F1

ble 1). Nevertheless, when available, orthographic clues are strong indicators of translation pairs. We can represent orthographic features of a word type w by assigning a feature to each substring of length  3. Note that MCCA can learn regular orthographic correspondences between source and target words, which is something edit distance cannot capture (see table 5). Indeed, running our MCCA model with only orthographic features on E N - E S W, labeled O RT H O in table 1, yielded 80.1 p0.33 , a 31% error-reduction over E D I T D I S T in p0.33 . 5.2 Context Features While orthographic features are clearly effective for historically related language pairs, they are more limited for other language pairs, where we need to appeal to other clues. One non-orthographic clue that word types s and t form a translation pair is that there is a strong correlation between the source words used with s and the target words used with t. To capture this information, we define context features for each word type w, consisting of counts of nouns which occur within a window of size 4 around w. Consider the translation pair (time, tiempo) illustrated in figure 2. As we become more confident about other translation pairs which have active period and periodico context features, we learn that translation pairs tend to jointly generate these features, which leads us to believe that time and tiempo might be generated by a common underlying concept vector (see section 2).13 Using context features alone on E N - E S - W, our MCCA model (labeled C O N T E X T in table 1) yielded a 80.2 p0.33 . It is perhaps surprising that context features alone, without orthographic information, can yield a best-F1 comparable to E D I T D I S T. 5.3 Combining Features We can of course combine context and orthographic features. Doing so yielded 89.03 p0.33 (labeled M C C A in table 1); this represents a 46.4% error reduction in p0.33 over the E D I T D I S T baseline. For the remainder of this work, we will use M C C A to refer
It is important to emphasize, however, that our current model does not directly relate a word type's role as a participant in the matching to that word's role as a context feature.
13

75.0 87.2 91.4 97.3
p0.1

71.2 89.7 94.3 94.8
p0.25

68.3 89.0 92.3 93.8
p0.33

--89.7 89.7 92.9
p0.50

49.0 72.0 63.7 77.0

(b) Seed Lexicon Variation

Corpus
EDITDIST MCCA M C C A - AU T O

Best-F1

58.6 91.4 91.2
p0.1

62.6 94.3 90.5
p0.25

61.1 92.3 91.8
p0.33

-- 89.7 77.5
p0.50

47.4 63.7 61.7

(c) Language Variation

Languages
EN-ES EN-FR EN-CH EN-AR

Best-F1

91.4 94.5 60.1 70.0

94.3 89.1 39.3 50.0

92.3 88.3 26.8 31.1

89.7 78.6 -----

63.7 61.9 30.8 33.1

Table 2: (a) varying type of corpora used on system performance (section 6.1), (b) using a heuristically chosen seed compared to one taken from the evaluation lexicon (section 6.2), (c) a variety of language pairs (see section 6.3).

to our model using both orthographic and context features.

6 Experiments
In this section we examine how system performance varies when crucial elements are altered. 6.1 Corpus Variation

There are many sources from which we can derive monolingual corpora, and M C C A performance depends on the degree of similarity between corpora. We explored the following levels of relationships between corpora, roughly in order of closest to most distant: · Same Sentences: E N - E S - P · Non-Parallel Similar Content: E N - E S - W · Distinct Sentences, Same Domain: E N - E S - D · Unrelated Corpora: E N - E S - G Our results for all conditions are presented in table 2(a). The predominant trend is that system performance degraded when the corpora diverged in

776


content, presumably due to context features becoming less informative. However, it is notable that even in the most extreme case of disjoint corpora from different time periods and topics (e.g. E N - E S - G), we are still able to recover lexicons of reasonable accuracy. 6.2 Seed Lexicon Variation All of our experiments so far have exploited a small seed lexicon which has been derived from the evaluation lexicon (see section 4.3). In order to explore system robustness to heuristically chosen seed lexicons, we automatically extracted a seed lexicon similarly to Koehn and Knight (2002): we ran E D I TD I S T on E N - E S - D and took the top 100 most confident translation pairs. Using this automatically derived seed lexicon, we ran our system on E N - E S D as before, evaluating on the top 2,000 noun word types not included in the automatic lexicon.14 Using the automated seed lexicon, and still evaluating against our Wiktionary lexicon, M C C A - AU T O yielded 91.8 p0.33 (see table 2(b)), indicating that our system can produce lexicons of comparable accuracy with a heuristically chosen seed. We should note that this performance represents no knowledge given to the system in the form of gold seed lexicon entries. 6.3 Language Variation We also explored how system performance varies for language pairs other than English-Spanish. On English-French, for the disjoint E N - F R - D corpus (described in section 4.1), M C C A yielded 88.3 p0.33 (see table 2(c) for more performance measures). This verified that our model can work for another closely related language-pair on which no model development was performed. One concern is how our system performs on language pairs where orthographic features are less applicable. Results on disjoint English-Chinese and English-Arabic are given as E N - C H - D and E N - A R in table 2(c), both using only context features. In these cases, M C C A yielded much lower precisions of 26.8 and 31.0 p0.33 , respectively. For both languages, performance degraded compared to E N - E S Note that the 2,000 words evaluated here were not identical to the words tested on when the seed lexicon is derived from the evaluation lexicon.
14

Rank 1. 2. 3. 6. 7. 9. 10. 11. 12. 14. 21. 23. 24. 25.

(a) English-Spanish S o u rc e Target e d u c a tio n e d u c a c ió n p a c to pact s ta b ility e s ta b ilid a d c o rru p tio n c o rru p c ió n to u ris m tu ris m o organisation organización c o n v e n ie n c e c o n v e n ie n c ia s y ria s iria c o o p e ra tio n c o o p e ra c ió n c u ltu re c u ltu ra p ro to c o l p ro to c o lo n o rth n o rte h e a lth s a lu d a c tio n re a c c ió n (b) English-French S o u rc e Target x e n o p h o b ia x é n o p h o b ie c o rru p tio n c o rru p tio n s u b s id ia rity s u b s id ia rité p ro g ra m m e p ro g ra m m e -c a d re tra c e a b ility tra ç a b ilité

C o rre c t
Y Y Y Y Y Y Y Y Y Y Y Y Y N

Rank 3. 4. 5. 6. 8.

C o rre c t
Y Y Y N Y

Rank 1. 2. 3. 4. 5.

(c) English-Chinese S o u rc e Target Correct p ric e s !" Y n e tw o rk #$ Y p o p u la tio n %& Y re p o rte r ' N o il () Y

Table 3: Sample output from our (a) Spanish, (b) French, and (c) Chinese systems. We present the highest confidence system predictions, where the only editing done is to ignore predictions which consist of identical source and target words. D and E N - F R - D, presumably due in part to the lack of orthographic features. However, M C C A still achieved surprising precision at lower recall levels. For instance, at p0.1 , M C C A yielded 60.1 and 70.0 on Chinese and Arabic, respectively. Figure 3 shows the highest-confidence outputs in several languages.

6.4

Comparison To Previous Work

There has been previous work in extracting translation pairs from non-parallel corpora (Rapp, 1995; Fung, 1995; Koehn and Knight, 2002), but generally not in as extreme a setting as the one considered here. Due to unavailability of data and specificity in experimental conditions and evaluations, it is not possible to perform exact comparisons. How-

777


(a) Example Non-Cognate Pairs

health traceability youth report advantages

salud rastreabilidad juventud informe ventajas

(b) Interesting Incorrect Pairs

liberal Kirkhope action Albanians a.m. Netherlands

partido Gorsel reacci´n o Bosnia horas Breta~a n

Table 4: System analysis on E N - E S - W: (a) non-cognate pairs proposed by our system, (b) hand-selected representative errors.
(a) Orthographic Feature
Source Feat. Closest Target Feats. Example Translation (statue, estatua) (felicity, felicidad)

#st ty# ogy

#es, est ad#, d# g´a, g´ i i

(geology, geolog´a) i

(b) Context Feature
Source Feat. Closest Context Features

party democrat beijing

partido, izquierda socialistas, dem´cratas o pek´n, kioto i

Table 5: Hand selected examples of source and target features which are close in canonical space: (a) orthographic feature correspondences, (b) context features.

ever, we attempted to run an experiment as similar as possible in setup to Koehn and Knight (2002), using English Gigaword and German Europarl. In this setting, our M C C A system yielded 61.7% accuracy on the 186 most confident predictions compared to 39% reported in Koehn and Knight (2002).

English-Spanish lexicon produced by our system on E N - E S - W. Of the top 100 errors: 21 were correct translations not contained in the Wiktionary lexicon (e.g. pintura to painting), 4 were purely morphological errors (e.g. airport to aeropuertos), 30 were semantically related (e.g. basketball to b´isbol), 15 were words with e strong orthographic similarities (e.g. coast to costas), and 30 were difficult to categorize and fell into none of these categories. Since many of our `errors' actually represent valid translation pairs not contained in our extracted dictionary, we supplemented our evaluation lexicon with one automatically derived from 100k sentences of parallel Europarl data. We ran the intersected HMM wordalignment model (Liang et al., 2008) and added (s, t) to the lexicon if s was aligned to t at least three times and more than any other word. Evaluating against the union of these lexicons yielded 98.0 p0.33 , a significant improvement over the 92.3 using only the Wiktionary lexicon. Of the true errors, the most common arose from semantically related words which had strong context feature correlations (see table 4(b)). We also explored the relationships our model learns between features of different languages. We projected each source and target feature into the shared canonical space, and for each projected source feature we examined the closest projected target features. In table 5(a), we present some of the orthographic feature relationships learned by our system. Many of these relationships correspond to phonological and morphological regularities such as the English suffix ing mapping to the Spanish suffix g´a. In table 5(b), we present context feature i correspondences. Here, the broad trend is for words which are either translations or semantically related across languages to be close in canonical space.

7

Analysis

8 Conclusion
We have presented a generative model for bilingual lexicon induction based on probabilistic CCA. Our experiments show that high-precision translations can be mined without any access to parallel corpora. It remains to be seen how such lexicons can be best utilized, but they invite new approaches to the statistical translation of resource-poor languages.

We have presented a novel generative model for bilingual lexicon induction and presented results under a variety of data conditions (section 6.1) and languages (section 6.3) showing that our system can produce accurate lexicons even in highly adverse conditions. In this section, we broadly characterize and analyze the behavior of our system. We manually examined the top 100 errors in the 778


References
Francis R. Bach and Michael I. Jordan. 2006. A probabilistic interpretation of canonical correlation analysis. Technical report, University of California, Berkeley. Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1994. The mathematic of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263­311. Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of arabic text: From raw text to base phrase chunks. In HLT-NAACL. Pascale Fung. 1995. Compiling bilingual lexicon entries from a non-parallel english-chinese corpus. In Third Annual Workshop on Very Large Corpora. Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic translation models. In COLING-ACL. David R. Hardoon, Sandor Szedmak, and John ShaweTaylor. 2003. Canonical correlation analysis an overview with application to learning methods. Technical Report CSD-TR-03-02, Royal Holloway University of London. Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of ACL Workshop on Unsupervised Lexical Acquisition. P. Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proceedings of AMTA 2004. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit. H. W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistic Quarterly. P. Liang, D. Klein, and M. I. Jordan. 2008. Agreementbased learning. In NIPS. Reinhard Rapp. 1995. Identifying word translation in non-parallel texts. In ACL. Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing. N. Smith and J. Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In ACL. L. G. Valiant. 1979. The complexity of computing the permanent. Theoretical Computer Science, 8:189­ 201. D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In ACL.

779