Learning Bigrams from Unigrams

Xiaojin Zhu†, Andrew B. Goldberg†, Michael Rabbat‡, and Robert Nowak§
† Department of Computer Sciences, University of Wisconsin-Madison
‡ Department of Electrical and Computer Engineering, McGill University
§ Department of Electrical and Computer Engineering, University of Wisconsin-Madison
{jerryzhu, goldberg}@cs.wisc.edu, michael.rabbat@mcgill.ca, nowak@ece.wisc.edu

Abstract

Traditional wisdom holds that once documents are turned into bag-of-words (unigram count) vectors, word orders are completely lost. We introduce an approach that, perhaps surprisingly, is able to learn a bigram language model from a set of bag-of-words documents. At its heart, our approach is an EM algorithm that seeks a model which maximizes the regularized marginal likelihood of the bag-of-words documents. In experiments on seven corpora, we observed that our learned bigram language models: i) achieve better test set perplexity than unigram models trained on the same bag-of-words documents, and are not far behind "oracle bigram models" trained on the corresponding ordered documents; ii) assign higher probabilities to sensible bigram word pairs; iii) improve the accuracy of ordered-document recovery from a bag-of-words. Our approach opens the door to novel phenomena, for example, privacy leakage from index files.

1 Introduction

A bag-of-words (BOW) is a basic document representation in natural language processing. In this paper, we consider a BOW in its simplest form, i.e., a unigram count vector or word histogram over the vocabulary. When performing the counting, word order is ignored. For example, the phrases "really neat" and "neat really" contribute equally to a BOW. Obviously, once a set of documents is turned into a set of BOWs, the word order information within them is completely lost. Or is it?

In this paper, we show that one can in fact partly recover the order information. Specifically, given a set of documents in unigram-count BOW representation, one can recover a non-trivial bigram language model (LM)¹, which has part of the power of a bigram LM trained on ordered documents. At first glance this seems impossible: How can one learn bigram information from unigram counts? However, we will demonstrate that multiple BOW documents enable us to recover some higher-order information.

¹ A trivial bigram LM is a unigram LM which ignores history: P(v|u) = P(v).

Our results have implications in a wide range of natural language problems, in particular document privacy. With the wide adoption of natural language applications like desktop search engines, software programs are increasingly indexing computer users' personal files for fast processing. Most index files include some variant of the BOW. As we demonstrate in this paper, if a malicious party gains access to BOW index files, it can recover more than just unigram frequencies: (i) the malicious party can recover a higher-order LM; (ii) with the LM it may attempt to recover the original ordered document from a BOW by finding the most-likely word permutation². Future research will quantify the extent to which such a privacy breach is possible in theory, and will find solutions to prevent it.

² It is possible to use a generic higher-order LM, e.g., a trigram LM trained on standard English corpora, for this purpose. However, incorporating a user-specific LM helps.

There is a vast literature on language modeling; see, e.g., (Rosenfeld, 2000; Chen and Goodman, 1999; Brants et al., 2007; Roark et al., 2007).
However, to the best of our knowledge, none addresses this reverse direction of learning higher-order LMs from lower-order data. This work is inspired by recent advances in inferring network structure from co-occurrence data, for example, for computer networks and biological pathways (Rabbat et al., 2007).

2 Problem Formulation and Identifiability

We assume that a vocabulary of size W is given. For notational convenience, we include in the vocabulary a special "begin-of-document" symbol ⟨d⟩ which appears only at the beginning of each document. The training corpus consists of a collection of n BOW documents {x_1, ..., x_n}. Each BOW x_i is a vector (x_i1, ..., x_iW) where x_iu is the number of times word u occurs in document i. Our goal is to learn a bigram LM θ, represented as a W × W transition matrix with θ_uv = P(v|u), from the BOW corpus. Note P(v|⟨d⟩) corresponds to the initial state probability for word v, and P(⟨d⟩|u) = 0 for all u.

It is worth noting that traditionally one needs ordered documents to learn a bigram LM. A natural question that arises in our problem is whether or not a bigram LM can be recovered from the BOW corpus with any guarantee. Let X denote the space of all possible BOWs. As a toy example, consider W = 3 with the vocabulary {⟨d⟩, A, B}. Assuming all documents have equal length |x| = 4 (including ⟨d⟩), then X = {(⟨d⟩:1, A:3, B:0), (⟨d⟩:1, A:2, B:1), (⟨d⟩:1, A:1, B:2), (⟨d⟩:1, A:0, B:3)}. Our training BOW corpus, when sufficiently large, provides the marginal distribution p̂(x) for x ∈ X. Can we recover a bigram LM θ from p̂(x)?

To answer this question, we first need to introduce a generative model for the BOWs. We assume that the BOW corpus is generated from a bigram LM θ in two steps: (i) An ordered document is generated from the bigram LM θ; (ii) The document's unigram counts are collected to produce the BOW x. Therefore, the probability of a BOW x being generated by θ can be computed by marginalizing over unique orderings z of x:

    P(x|θ) = Σ_{z ∈ σ(x)} P(z|θ) = Σ_{z ∈ σ(x)} Π_{j=2}^{|x|} θ_{z_{j-1}, z_j},

where σ(x) is the set of unique orderings, and |x| is the document length. For example, if x = (⟨d⟩:1, A:2, B:1) then σ(x) = {z_1, z_2, z_3} with z_1 = "⟨d⟩ A A B", z_2 = "⟨d⟩ A B A", z_3 = "⟨d⟩ B A A". Bigram LM recovery then amounts to finding a θ that satisfies the system of marginal-matching equations

    P(x|θ) = p̂(x),  for all x ∈ X.    (1)

As a concrete example where one can exactly recover a bigram LM from BOWs, consider our toy example again. We know there are only three free variables in our 3 × 3 bigram LM θ: r = θ_⟨d⟩A, p = θ_AA, q = θ_BB, since the rest are determined by normalization. Suppose the documents are generated from a bigram LM with true parameters r = 0.25, p = 0.9, q = 0.5. If our BOW corpus is very large, we will observe that 20.25% of the BOWs are (⟨d⟩:1, A:3, B:0), 37.25% are (⟨d⟩:1, A:2, B:1), and 18.75% are (⟨d⟩:1, A:0, B:3). These numbers are computed using the definition of P(x|θ). We solve the reverse problem of finding r, p, q from the system of equations (1), now explicitly written as

    rp² = 0.2025
    rp(1 - p) + r(1 - p)(1 - q) + (1 - r)(1 - q)p = 0.3725
    (1 - r)q² = 0.1875.

The above system has only one valid solution, which is the correct set of bigram LM parameters (r, p, q) = (0.25, 0.9, 0.5).
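To make the marginalization concrete, here is a minimal Python sketch (ours, not the paper's code) that enumerates the unique orderings σ(x) of a small BOW and sums their bigram probabilities; with the toy parameters r = 0.25, p = 0.9, q = 0.5 it reproduces the marginals 0.2025, 0.3725, and 0.1875 quoted above. The dictionary-based representation of θ and the helper names are our own assumptions.

```python
from itertools import permutations

# Toy 3-word vocabulary {<d>, A, B}; theta[u][v] = P(v | u).
# Free parameters r, p, q as in the running example.
r, p, q = 0.25, 0.9, 0.5
theta = {
    "<d>": {"A": r,     "B": 1 - r},
    "A":   {"A": p,     "B": 1 - p},
    "B":   {"A": 1 - q, "B": q},
}

def bow_probability(bow, theta):
    """P(x | theta): sum of bigram products over the unique orderings of the BOW."""
    words = [w for w, c in bow.items() if w != "<d>" for _ in range(c)]
    total = 0.0
    for z in set(permutations(words)):          # sigma(x), unique orderings
        doc = ("<d>",) + z                      # every ordering starts with <d>
        prob = 1.0
        for prev, cur in zip(doc, doc[1:]):     # product of theta_{z_{j-1}, z_j}
            prob *= theta[prev][cur]
        total += prob
    return total

for bow in ({"<d>": 1, "A": 3, "B": 0},
            {"<d>": 1, "A": 2, "B": 1},
            {"<d>": 1, "A": 0, "B": 3}):
    print(bow, round(bow_probability(bow, theta), 4))
# Prints 0.2025, 0.3725, 0.1875, matching the percentages in the text.
```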
However, if the true parameters were (r, p, q) = (0.1, 0.2, 0.3) with proportions of BOWs being 0.4%, 19.8%, 8.1%, respectively, it is easy to verify that the system would have multiple valid solutions: (0.1, 0.2, 0.3), (0.8819, 0.0673, 0.8283), and (0.1180, 0.1841, 0.3030). In general, if p̂(x) is known from the training BOW corpus, when can we guarantee to uniquely recover the bigram LM θ? This is the question of identifiability, which means the transition matrix θ satisfying (1) exists and is unique. Identifiability is related to finding unique solutions of a system of polynomial equations, since (1) is such a system in the elements of θ. The details are beyond the scope of this paper, but applying the technique in (Basu and Boston, 2000), it is possible to show that for W = 3 (including ⟨d⟩) we need longer documents (|x| ≥ 5) to ensure identifiability. The identifiability of more general cases is still an open research question.

3 Bigram Recovery Algorithm

In practice, the documents are not truly generated from a bigram LM, and the BOW corpus may be small. We therefore seek a maximum likelihood estimate of θ or a regularized version of it. Equivalently, we no longer require equality in (1), but instead find θ that makes the distribution P(x|θ) as close to p̂(x) as possible. We formalize this notion below.

3.1 The Objective Function

Given a BOW corpus {x_1, ..., x_n}, its normalized log likelihood under θ is ℓ(θ) ≡ (1/C) Σ_{i=1}^n log P(x_i|θ), where C = Σ_{i=1}^n (|x_i| - 1) is the corpus length excluding ⟨d⟩'s. The idea is to find θ that maximizes ℓ(θ). This also brings P(x|θ) closest to p̂(x) in the KL-divergence sense. However, to prevent overfitting, we regularize the problem so that θ prefers to be close to a "prior" bigram LM φ. The prior is also estimated from the BOW corpus, and is discussed in Section 3.4.

We define the regularizer to be an asymmetric dissimilarity D(φ, θ) between the prior φ and the learned model θ. The dissimilarity is 0 if φ = θ, and increases as they diverge. Specifically, the KL-divergence between two word distributions conditioned on the same history u is KL(φ_u· ‖ θ_u·) = Σ_{v=1}^W φ_uv log(φ_uv / θ_uv). We define D(φ, θ) to be the average KL-divergence over all histories:

    D(φ, θ) ≡ (1/W) Σ_{u=1}^W KL(φ_u· ‖ θ_u·),

which is convex in θ (Cover and Thomas, 1991). We will use the following derivative later: ∂D(φ, θ)/∂θ_uv = -φ_uv / (W θ_uv).

We are now ready to define the regularized optimization problem for recovering a bigram LM θ from the BOW corpus:

    max_θ  ℓ(θ) - λ D(φ, θ)
    subject to  θ1 = 1,  θ ≥ 0.    (2)

The weight λ controls the strength of the prior. The constraints ensure that θ is a valid bigram matrix, where 1 is an all-one vector, and the non-negativity constraint is element-wise. Equivalently, (2) can be viewed as the maximum a posteriori (MAP) estimate of θ, with independent Dirichlet priors for each row of θ: p(θ_u·) = Dir(θ_u· | α_u·) and hyperparameters α_uv = (λC/W) φ_uv + 1.

The summation over hidden ordered documents z in P(x|θ) couples the variables and makes (2) a non-concave problem. We optimize θ using an EM algorithm.
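As a concrete reading of (2), the following minimal Python sketch (ours, practical only for tiny vocabularies and very short documents) evaluates the regularized objective O(θ) = ℓ(θ) - λD(φ, θ) by exhaustively enumerating orderings; the function names are our assumptions, and strictly positive θ entries are assumed wherever φ is positive.

```python
import math
from itertools import permutations

def bow_marginal(bow, theta):
    """P(x|theta): sum over unique orderings (starting with <d>) of bigram products."""
    words = [w for w, c in bow.items() if w != "<d>" for _ in range(c)]
    total = 0.0
    for z in set(permutations(words)):
        doc, prob = ("<d>",) + z, 1.0
        for prev, cur in zip(doc, doc[1:]):
            prob *= theta[prev][cur]
        total += prob
    return total

def avg_kl(phi, theta):
    """D(phi, theta): average over the W histories u of KL(phi_u. || theta_u.).
    Assumes theta_uv > 0 wherever phi_uv > 0."""
    return sum(phi[u][v] * math.log(phi[u][v] / theta[u][v])
               for u in phi for v in phi[u]) / len(phi)

def objective(bows, theta, phi, lam=1.0):
    """O(theta) = (1/C) sum_i log P(x_i|theta) - lambda * D(phi, theta), as in (2),
    where C is the corpus length excluding the <d> symbols."""
    C = sum(sum(c for w, c in bow.items() if w != "<d>") for bow in bows)
    loglik = sum(math.log(bow_marginal(bow, theta)) for bow in bows) / C
    return loglik - lam * avg_kl(phi, theta)
```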
3.2 The EM Algorithm

We derive the EM algorithm for the optimization problem (2). Let O(θ) ≡ ℓ(θ) - λD(φ, θ) be the objective function. Let θ^(t-1) be the bigram LM at iteration t - 1. We can lower-bound O as follows:

    O(θ) = (1/C) Σ_{i=1}^n log Σ_{z ∈ σ(x_i)} P(z|θ^(t-1), x_i) · [ P(z|θ) / P(z|θ^(t-1), x_i) ] - λ D(φ, θ)
         ≥ (1/C) Σ_{i=1}^n Σ_{z ∈ σ(x_i)} P(z|θ^(t-1), x_i) log [ P(z|θ) / P(z|θ^(t-1), x_i) ] - λ D(φ, θ)
         ≡ L(θ, θ^(t-1)).

We used Jensen's inequality above since log() is concave. The lower bound L involves P(z|θ^(t-1), x), the probability of hidden orderings of the BOW under the previous iteration's model. In the E-step of EM we compute P(z|θ^(t-1), x), which will be discussed in Section 3.3. One can verify that L(θ, θ^(t-1)) is concave in θ, unlike the original objective O(θ). In addition, the lower bound "touches" the objective at θ^(t-1), i.e., L(θ^(t-1), θ^(t-1)) = O(θ^(t-1)).

The EM algorithm iteratively maximizes the lower bound, which is now a concave optimization problem: max_θ L(θ, θ^(t-1)), subject to θ1 = 1. The non-negativity constraints turn out to be automatically satisfied. Introducing Lagrange multipliers β_u for each history u = 1 ... W, we form the Lagrangian Λ:

    Λ ≡ L(θ, θ^(t-1)) - Σ_{u=1}^W β_u ( Σ_{v=1}^W θ_uv - 1 ).

Taking the partial derivative with respect to θ_uv and setting it to zero, ∂Λ/∂θ_uv = 0, we arrive at the following update:

    θ_uv^(t) ∝ Σ_{i=1}^n Σ_{z ∈ σ(x_i)} P(z|θ^(t-1), x_i) c_uv(z) + (λC/W) φ_uv.    (3)

The normalization is over v = 1 ... W. We use c_uv(z) to denote the number of times the bigram "uv" appears in the ordered document z. This is the M-step of EM. Intuitively, the first term counts how often the bigram "uv" occurs, weighing each ordering by its probability under the previous model; the second term pulls the parameter towards the prior φ. If the weight of the prior λ → ∞, we would have θ_uv = φ_uv. The update is related to the MAP estimate for a multinomial distribution with a Dirichlet prior, where we use the expected counts. We initialize the EM algorithm with θ^(0) = φ. The EM algorithm is summarized in Table 1.

    Input: BOW documents {x_1, ..., x_n}, a prior bigram LM φ, weight λ.
    1. t = 1. Initialize θ^(0) = φ.
    2. Repeat until the objective O(θ) converges:
       (a) (E-step) Compute P(z|θ^(t-1), x_i) for z ∈ σ(x_i), i = 1, ..., n.
       (b) (M-step) Compute θ^(t) using (3). Let t = t + 1.
    Output: The recovered bigram LM θ.

    Table 1: The EM algorithm
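The following minimal sketch (ours, not the authors' implementation) spells out one iteration of Table 1 with an exact E-step: each ordering z is weighted by P(z|θ^(t-1), x), and the M-step applies update (3) with the (λC/W)φ_uv pseudocounts. Exact enumeration is practical only for very short documents; Section 3.3 below replaces it with importance sampling. Helper names are assumptions, and the rows of φ are assumed to cover the full vocabulary.

```python
from collections import defaultdict
from itertools import permutations

def orderings(bow):
    """sigma(x): unique orderings of the BOW, each starting with <d>."""
    words = [w for w, c in bow.items() if w != "<d>" for _ in range(c)]
    return [("<d>",) + z for z in set(permutations(words))]

def doc_prob(doc, theta):
    prob = 1.0
    for prev, cur in zip(doc, doc[1:]):
        prob *= theta[prev][cur]
    return prob

def em_step(bows, theta, phi, lam=1.0):
    """One exact E-step plus M-step (update (3)); returns the new transition matrix."""
    C = sum(sum(c for w, c in bow.items() if w != "<d>") for bow in bows)
    W = len(phi)
    counts = defaultdict(float)                     # expected bigram counts
    for bow in bows:
        docs = orderings(bow)
        probs = [doc_prob(z, theta) for z in docs]
        total = sum(probs)
        for z, p in zip(docs, probs):               # P(z | theta, x) = p / total
            for prev, cur in zip(z, z[1:]):
                counts[(prev, cur)] += p / total
    new_theta = {}
    for u in phi:                                   # add prior pseudocounts, normalize each row
        row = {v: counts[(u, v)] + lam * C * phi[u][v] / W for v in phi[u]}
        norm = sum(row.values())
        new_theta[u] = {v: row[v] / norm for v in row}
    return new_theta
```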
3.3 Approximate E-step

The E-step needs to compute the expected bigram counts of the form

    Σ_{z ∈ σ(x)} P(z|θ, x) c_uv(z).    (4)

However, this poses a computational problem. The summation is over unique ordered documents. The number of unique ordered documents can be on the order of |x|!, i.e., all permutations of the BOW. For a short document of length 15, this number is already 10^12. Clearly, brute-force enumeration is only feasible for very short documents. Approximation is necessary to handle longer ones.

A simple Monte Carlo approximation to (4) would involve sampling ordered documents z_1, z_2, ..., z_L according to z_i ~ P(z|θ, x), and replacing (4) with (1/L) Σ_{i=1}^L c_uv(z_i). This estimate is unbiased, and the variance decreases linearly with the number of samples, L. However, sampling directly from P is difficult. Instead, we sample ordered documents z_i ~ R(z_i|θ, x) from a distribution R which is easy to generate, and construct an approximation using importance sampling (see, e.g., (Liu, 2001)). With each sample z_i we associate a weight w_i ∝ P(z_i|θ, x) / R(z_i|θ, x). The importance sampling approximation to (4) is then given by (Σ_{i=1}^L w_i c_uv(z_i)) / (Σ_{i=1}^L w_i). Re-weighting the samples in this fashion accounts for the fact that we are using a sampling distribution R which is different from the target distribution P, and guarantees that our approximation is asymptotically unbiased. The quality of an importance sampling approximation is closely related to how closely R resembles P; the more similar they are, the better the approximation, in general.

Given a BOW x and our current bigram model estimate θ, we generate one sample (an ordered document z_i) by sequentially drawing words from the bag, with probabilities proportional to θ, but properly normalized to form a distribution based on which words remain in the bag. For example, suppose x = (⟨d⟩:1, A:2, B:1, C:1). Then we set z_i1 = ⟨d⟩, and sample z_i2 = A with probability 2θ_⟨d⟩A / (2θ_⟨d⟩A + θ_⟨d⟩B + θ_⟨d⟩C). Similarly, if z_i(j-1) = u and v is a word in the original BOW that hasn't been sampled yet, then we set the next word in the ordered document z_ij equal to v with probability proportional to c_v θ_uv, where c_v is the count of v in the remaining BOW. For this scheme, one can verify (Rabbat et al., 2007) that the importance weight corresponding to a sampled ordered document z_i = (z_i1, ..., z_i|x|) is given by

    w_i = Π_{t=2}^{|x|} Σ_{τ=t}^{|x|} θ_{z_i(t-1), z_iτ}.

In our implementation, the number of importance samples used for a document x is 10|x|² if the length of the document |x| > 8; otherwise we enumerate σ(x) without importance sampling.
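To make the proposal R and its weights concrete, here is a minimal Python sketch (ours, with assumed names) of the sequential sampler just described: at each step a word is drawn from the remaining bag with probability proportional to its remaining count times θ, and the running product of normalizers gives the importance weight up to a constant that cancels in the self-normalized estimate. Strictly positive transition probabilities are assumed.

```python
import random
from collections import Counter, defaultdict

def sample_ordering(bow, theta, rng=random):
    """Draw one ordered document z ~ R(z | theta, x) and return (z, importance weight)."""
    remaining = Counter({w: c for w, c in bow.items() if w != "<d>"})
    z, weight, prev = ["<d>"], 1.0, "<d>"
    while sum(remaining.values()) > 0:
        # Unnormalized proposal: remaining count of v times theta_{prev, v}.
        words = [v for v, c in remaining.items() if c > 0]
        scores = [remaining[v] * theta[prev][v] for v in words]
        weight *= sum(scores)      # product of normalizers = w_i (up to a constant factor)
        choice = rng.choices(words, weights=scores, k=1)[0]
        z.append(choice)
        remaining[choice] -= 1
        prev = choice
    return tuple(z), weight

def expected_bigram_counts(bow, theta, num_samples):
    """Self-normalized importance-sampling estimate of the expected bigram counts in (4)."""
    counts, weight_sum = defaultdict(float), 0.0
    for _ in range(num_samples):
        z, w = sample_ordering(bow, theta)
        weight_sum += w
        for prev, cur in zip(z, z[1:]):
            counts[(prev, cur)] += w
    return {uv: c / weight_sum for uv, c in counts.items()}
```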
3.4 Prior Bigram LM φ

The quality of the EM solution θ can depend on the prior bigram LM φ. To assess bigram recoverability from a BOW corpus alone, we consider only priors estimated from the corpus itself³. Like θ, φ is a W × W transition matrix with φ_uv = P(v|u). When appropriate, we set the initial probability φ_⟨d⟩v proportional to the number of times word v appears in the BOW corpus. We consider three prior models:

³ Priors based on general English text or domain-specific knowledge could be used in specific applications.

Prior 1: Unigram φ^unigram. The most naïve φ is a unigram LM which ignores word history. The probability for word v is estimated from the BOW corpus frequency of v, with add-1 smoothing: φ^unigram_uv ∝ 1 + Σ_{i=1}^n x_iv. We should point out that the unigram prior is an asymmetric bigram, i.e., in general φ^unigram_uv ≠ φ^unigram_vu.

Prior 2: Frequency of Document Co-occurrence (FDC) φ^fdc. Let δ(u, v|x) = 1 if words u ≠ v co-occur (regardless of their counts) in BOW x, and 0 otherwise. In the case u = v, δ(u, u|x) = 1 only if u appears at least twice in x. Let c^fdc_uv = Σ_{i=1}^n δ(u, v|x_i) be the number of BOWs in which u, v co-occur. The FDC prior is φ^fdc_uv ∝ c^fdc_uv + 1. The co-occurrence counts c^fdc are symmetric, but φ^fdc is asymmetric because of normalization. FDC captures some notion of potential transitions from u to v. FDC is in spirit similar to Kneser-Ney smoothing (Kneser and Ney, 1995) and other methods that accumulate indicators of document membership.

Prior 3: Permutation-Based (Perm) φ^perm. Recall that c_uv(z) is the number of times the bigram "uv" appears in an ordered document z. We define c^perm_uv = Σ_{i=1}^n E_{z ∈ σ(x_i)}[c_uv(z)], where the expectation is with respect to all unique orderings of each BOW. We make the zero-knowledge assumption of uniform probability over these orderings, rather than P(z|θ) as in the EM algorithm described above. EM will refine these estimates, though, so this is a natural starting point. Space precludes a full discussion, but it can be proven that c^perm_uv = Σ_{i=1}^n x_iu x_iv / |x_i| if u ≠ v, and c^perm_uu = Σ_{i=1}^n x_iu (x_iu - 1) / |x_i|. Finally, φ^perm_uv ∝ c^perm_uv + 1.

3.5 Decoding Ordered Documents from BOWs

Given a BOW x and a bigram LM θ, we formulate document recovery as the problem z* = argmax_{z ∈ σ(x)} P(z|θ). In fact, we can generate the top N candidate ordered documents in terms of P(z|θ). We use A* search to construct such an N-best list (Russell and Norvig, 2003). Each state is an ordered, partial document. Its successor states append one more unused word in x to the partial document. The actual cost g from the start (empty document) to a state is the log probability of the partial document under bigram θ. We design a heuristic cost h from the state to the goal (complete document) that is admissible: the idea is to over-use the best bigram history for the remaining words in x. Let the partial document end with word w_e. Let the count vector for the remaining BOW be (c_1, ..., c_W). One admissible heuristic is h = log Π_{u=1}^W P(u|bh(u); θ)^{c_u}, where the "best history" for word type u is bh(u) = argmax_v θ_vu, and v ranges over the word types with non-zero counts in (c_1, ..., c_W), plus w_e. It is easy to see that h is an upper bound on the bigram log probability that the remaining words in x can achieve.

We use a memory-bounded A* search similar to (Russell, 1992), because long BOWs would otherwise quickly exhaust memory. When the priority queue grows larger than the bound, the worst states (in terms of g + h) in the queue are purged. This necessitates a double-ended priority queue that can pop either the maximum or minimum item. We use an efficient implementation with Splay trees (Chong and Sahni, 2000). We continue running A* after popping the goal state from its priority queue. Repeating this N times gives the N-best list.
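As an illustration of the decoder (not the authors' implementation), here is a compact A* sketch for 1-best reconstruction using the admissible heuristic just described. It omits the memory bound, the double-ended priority queue, and the N-best extension; all names are ours, and strictly positive transition probabilities are assumed.

```python
import heapq
import math

def heuristic(remaining, last_word, theta):
    """Admissible bound from Section 3.5: score each remaining word type against its
    best allowed history (a remaining type or the last word of the partial document)."""
    histories = [w for w, c in remaining.items() if c > 0] + [last_word]
    h = 0.0
    for u, c in remaining.items():
        if c > 0:
            h += c * math.log(max(theta[v][u] for v in histories))
    return h

def decode_bow(bow, theta):
    """1-best A* reconstruction of an ordered document from a BOW."""
    start_remaining = {w: c for w, c in bow.items() if w != "<d>"}
    # Heap entries: (-(g + h), g, partial document, remaining counts); heapq pops max f first.
    heap = [(-heuristic(start_remaining, "<d>", theta), 0.0, ("<d>",), start_remaining)]
    while heap:
        neg_f, g, doc, remaining = heapq.heappop(heap)
        if all(c == 0 for c in remaining.values()):
            return doc                 # with an admissible h, the first goal popped is optimal
        for w, c in remaining.items():
            if c > 0:
                new_g = g + math.log(theta[doc[-1]][w])
                new_remaining = dict(remaining)
                new_remaining[w] -= 1
                f = new_g + heuristic(new_remaining, w, theta)
                heapq.heappush(heap, (-f, new_g, doc + (w,), new_remaining))
    return None
```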
4 Experiments

We show experimentally that the proposed algorithm is indeed able to recover reasonable bigram LMs from BOW corpora. We observe:

1. Good test set perplexity: Using test (heldout) set perplexity (PP) as an objective measure of LM quality, we demonstrate that our recovered bigram LMs are much better than naïve unigram LMs trained on the same BOW corpus. Furthermore, they are not far behind the "oracle" bigram LMs trained on the ordered documents that correspond to the BOWs.

2. Sensible bigram pairs: We inspect the recovered bigram LMs and find that they assign higher probabilities to sensible bigram pairs (e.g., "i mean", "oh boy", "that's funny"), and lower probabilities to nonsense pairs (e.g., "i yep", "you let's", "right lot").

3. Document recovery from BOW: With the bigram LMs, we show improved accuracy in recovering ordered documents from BOWs.

We describe these experiments in detail below.

4.1 Corpora and Protocols

We note that although in principle our algorithm works on large corpora, the current implementation does not scale well (Table 3, last column). We therefore experimented on seven corpora with relatively small vocabulary sizes, and with short documents (mostly one sentence per document). Table 2 lists statistics describing the corpora. The first six contain text transcripts of conversational telephone speech from the small vocabulary "SVitchboard 1" data set. King et al. constructed each corpus from the full Switchboard corpus, with the restriction that the sentences use only words in the corresponding vocabulary (King et al., 2005). We refer to these corpora as SV10, SV25, SV50, SV100, SV250, and SV500. The seventh corpus comes from the SumTime-Meteo data set (Sripada et al., 2003), which contains real weather forecasts for offshore oil rigs in the North Sea. For the SumTime corpus, we performed sentence segmentation to produce documents, removed punctuation, and replaced numeric digits with a special token.

    Corpus    |V|   # Docs   # Tokens   mean |x|
    SV10       10     6775      7792       1.2
    SV25       25     9778     13324       1.4
    SV50       50    12442     20914       1.7
    SV100     100    14602     28611       2.0
    SV250     250    18933     51950       2.7
    SV500     500    23669     89413       3.8
    SumTime   882     3341     68815      20.6

    Table 2: Corpora statistics: vocabulary size, document count, total token count, and mean document length.

For each of the seven corpora, we perform 5-fold cross validation. We use the four folds other than the k-th fold as the training set to train (recover) bigram LMs, and the k-th fold as the test set for evaluation. This is repeated for k = 1 ... 5, and we report the average cross validation results. We distinguish the original ordered documents (training set z_1, ..., z_n, test set z_{n+1}, ..., z_m) and the corresponding BOWs (training set x_1, ..., x_n, test set x_{n+1}, ..., x_m). In all experiments, we simply set the weight λ = 1 in (2). Given a training set and a test set, we perform the following steps:

1. Build prior LMs φ^X from the training BOW corpus x_1, ..., x_n, for X = unigram, fdc, perm.

2. Recover the bigram LMs θ^X with the EM algorithm in Table 1, from the training BOW corpus x_1, ..., x_n and using the prior φ^X from step 1.

3. Compute the MAP bigram LM from the ordered training documents z_1, ..., z_n. We call this the "oracle" bigram LM because it uses order information (not available to our algorithm), and we use it as a lower bound on perplexity.

4. Test all LMs on z_{n+1}, ..., z_m by perplexity.
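For clarity, here is a minimal sketch (ours, not the paper's evaluation code) of the perplexity convention used in step 4: per-word perplexity of the ordered test documents under a bigram LM, where the word count C excludes the ⟨d⟩ symbols. The function names are assumptions.

```python
import math

def bigram_logprob(doc, theta):
    """Log probability of an ordered document (a list of words starting with '<d>')."""
    return sum(math.log(theta[prev][cur]) for prev, cur in zip(doc, doc[1:]))

def perplexity(test_docs, theta):
    """Per-word test-set perplexity; C counts all tokens except the <d> symbols."""
    total_logprob = sum(bigram_logprob(doc, theta) for doc in test_docs)
    C = sum(len(doc) - 1 for doc in test_docs)
    return math.exp(-total_logprob / C)
```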
4.2 Good Test Set Perplexity

Table 3 reports the 5-fold cross validation mean test-set PP values for all corpora, and the run time per EM iteration. Because of the long running time, we adopt the rule-of-thumb stopping criterion of "two EM iterations". First, we observe that all recovered bigram LMs θ^X perform better than the unigram LMs φ^unigram, even though they are trained on the same BOW corpus. Second, all recovered bigram LMs θ^X improved upon their corresponding baselines φ^X. The difference across every row is statistically significant according to a two-tailed paired t-test with p < 0.05. The differences among PP(θ^X) for the same corpus are also significant (except between θ^unigram and θ^perm for SV500). Finally, we observe that θ^perm tends to be best for the smaller vocabulary corpora, whereas θ^fdc dominates as the vocabulary grows.

    Corpus    X          PP(φ^X)   PP(θ^X)   Time/Iter
    SV10      unigram       7.48      6.95     < 1s
              fdc           6.52      6.47     < 1s
              perm          6.50      6.45     < 1s
    SV25      unigram       16.4      12.8     0.1s
              fdc           12.3      11.8     0.1s
              perm          12.2      11.7     0.1s
    SV50      unigram       29.1      19.7     2s
              fdc           19.6      17.8     4s
              perm          19.5      17.7     5s
    SV100     unigram       45.4      27.8     7s
              fdc           29.5      25.3     11s
              perm          30.0      25.6     11s
    SV250     unigram       91.8      51.2     5m
              fdc           60.0      47.3     8m
              perm          65.4      49.7     8m
    SV500     unigram      149.1      87.2     3h
              fdc          104.8      80.1     3h
              perm         123.9      87.4     3h
    SumTime   unigram      129.7      81.8     4h
              fdc          103.2      77.7     4h
              perm         187.9      85.4     3h

    Table 3: Mean test set perplexities of prior LMs and bigram LMs recovered after 2 EM iterations.

To see how much better we could do if we had ordered training documents z_1, ..., z_n, we present the mean test-set PP of "oracle" bigram LMs in Table 4. We used three smoothing methods to obtain oracle LMs: absolute discounting using a constant of 0.5 (we experimented with other values, but 0.5 worked best), Good-Turing, and interpolated Witten-Bell as implemented in the SRILM toolkit (Stolcke, 2002). We see that our recovered LMs θ (trained on unordered BOW documents), especially for small vocabulary corpora, are close to the oracles (trained on ordered documents). For the larger datasets, the recovery task is more difficult, and the gap between the oracle LMs and the θ LMs widens. Note that the oracle LMs do much better than the recovered LMs on the SumTime corpus; we suspect the difference is due to the larger vocabulary and significantly higher average sentence length (see Table 2).

    Corpus    Absolute Discount   Good-Turing   Witten-Bell   Best θ^X
    SV10           6.27               6.28          6.27         6.45
    SV25           10.5               10.6          10.5         11.7
    SV50           14.8               14.9          14.8         17.7
    SV100          20.0               20.1          20.0         25.3
    SV250          33.7               33.7          33.8         47.3
    SV500          50.9               50.9          51.3         80.1
    SumTime        10.8               10.5          10.6         77.7

    Table 4: Mean test set perplexities for oracle bigram LMs trained on z_1, ..., z_n and tested on z_{n+1}, ..., z_m. For reference, the rightmost column lists the best result using a recovered bigram LM (θ^perm for the first three corpora, θ^fdc for the latter four).

4.3 Sensible Bigram Pairs

The next set of experiments compares the recovered bigram LMs to their corresponding prior LMs in terms of how they assign probabilities to word pairs. One naturally expects probabilities for frequently occurring bigrams to increase, while rare or nonsensical bigrams' probabilities should decrease. For a prior-bigram pair (φ, θ), we evaluate the change in probabilities by computing the ratio ρ_hw = P(w|h, θ) / P(w|h, φ) = θ_hw / φ_hw. For a given history h, we sort words w by this ratio rather than by actual bigram probability, because the bigrams with the highest and lowest probabilities tend to stay the same, while the changes accounting for differences in PP scores are more noticeable by considering the ratio. Due to space limitation, we present one specific result (FDC prior, fold 1) for the SV500 corpus in Table 5. Other results are similar. The table lists a few most frequent unigrams as history words h (left), and the words w with the smallest (center) and largest (right) ρ_hw ratio.

    h        w (smallest ρ_hw)                      w (largest ρ_hw)
    i        yep, bye-bye, ah, goodness, ahead      mean, guess, think, bet, agree
    you      let's, us, fact, such, deal            thank, bet, know, can, do
    right    as, lot, going, years, were            that's, all, right, now, you're
    oh       thing, here, could, were, doing        boy, really, absolutely, gosh, great
    that's   talking, home, haven't, than, care     funny, wonderful, true, interesting, amazing
    really   now, more, yep, work, you're           sad, neat, not, good, it's

    Table 5: The recovered bigram LM θ^fdc decreases nonsense bigram probabilities (center column) and increases sensible ones (right column) compared to the prior φ^fdc on the SV500 corpus.

Overall we see that our EM algorithm is forcing meaningless bigrams (e.g., "i goodness", "oh thing") to have lower probabilities, while assigning higher probabilities to sensible bigram pairs (e.g., "really good", "that's funny"). Note that the reverse of some common expressions (e.g., "right that's") also rises in probability, suggesting the algorithm detects that the two words are often adjacent, but lacks sufficient information to nail down the exact order.
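A tiny sketch (ours, with assumed names) of the ranking just described: for a fixed history h, words are sorted by ρ_hw = θ_hw / φ_hw and the extremes are reported, as in Table 5.

```python
def ratio_extremes(history, theta, phi, k=5):
    """Rank words w by rho_hw = theta[h][w] / phi[h][w] for a fixed history h,
    returning the k smallest and k largest ratios (cf. Table 5)."""
    ratios = {w: theta[history][w] / phi[history][w] for w in theta[history]}
    ranked = sorted(ratios, key=ratios.get)
    return ranked[:k], ranked[-k:]
```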
4.4 Document Recovery from BOW

We now play the role of the malicious party mentioned in the introduction. We show that, compared to their corresponding prior LMs, our recovered bigram LMs are better able to reconstruct ordered documents out of test BOWs x_{n+1}, ..., x_m. We perform document recovery using 1-best A* decoding. We use "document accuracy" and "n-gram accuracy" (for n = 2, 3) as our evaluation criteria. We define document accuracy (Acc_doc) as the fraction of documents⁴ for which the decoded document matches the true ordered document exactly. Similarly, n-gram accuracy (Acc_n) measures the fraction of all n-grams in test documents (with n or more words) that are recovered correctly.

⁴ We omit single-word documents from these computations.

For this evaluation, we compare models built for the SV500 corpus. Table 6 presents 5-fold cross validation average test-set accuracies. For each accuracy measure, we compare the prior LM with the recovered bigram LM. It is interesting to note that the FDC and Perm priors reconstruct documents surprisingly well, but we can always improve them by running our EM algorithm. The accuracies obtained by θ are statistically significantly better (via two-tailed paired t-tests with p < 0.05) than their corresponding priors in all cases except Acc_doc for θ^perm versus φ^perm. Furthermore, θ^fdc and θ^perm are significantly better than all other models in terms of all three reconstruction accuracy measures.

    X          Acc_doc            Acc_2              Acc_3
               φ^X      θ^X       φ^X      θ^X       φ^X      θ^X
    unigram    11.1     31.0      17.7     32.8       2.7     11.8
    fdc        26.8     30.9      33.0     35.1      11.4     13.3
    perm       30.2     31.5      32.7     34.8      11.5     13.1

    Table 6: Percentage of correctly reconstructed documents, 2-grams and 3-grams from test BOWs in SV500, 5-fold cross validation. The same trends continue for 4-grams and 5-grams (not shown).

We conclude our experiments with a closer look at some BOWs for which φ and θ reconstruct differently. As a representative example, we compare φ^perm to θ^perm on one test set of the SV500 corpus. There are 92 documents that are correctly reconstructed by θ^perm but not by φ^perm. In contrast, only 65 documents are accurately reordered by φ^perm but not by θ^perm. Table 7 presents a subset of these documents with six or more words. Overall, we conclude that the recovered bigram LMs do a better job at reconstructing BOW documents.

    φ^perm reconstructions of test BOWs            θ^perm reconstructions of test BOWs
    just it's it's it's just going                 it's just it's just it's going
    it's probably out there else something         it's probably something else out there
    the the have but it doesn't                    but it doesn't have the the
    you to talking nice was it yes                 yes it was nice talking to you
    that's well that's what i'm saying             well that's that's what i'm saying
    a little more here home take                   a little more take home here
    and they can very be nice too                  and they can be very nice too
    i think well that's great                      well i think that's great
    i'm but was he because only always             i'm but only because he was always
    that's think i don't i no                      no i don't i think that's
    that in and it it's interesting                and it it's interesting that in
    that's right that's right that's difficult     right that's that's right that's difficult
    so just not quite a year                       so just not a quite year
    well it is a big dog                           well it is big a dog
    so do you have a car                           so you do have a car

    Table 7: Subset of SV500 documents that only φ^perm or θ^perm (but not both) reconstructs correctly. In each row exactly one of the two reconstructions is correct (shown in bold in the original table).

5 Conclusions and Future Work

We presented an algorithm that learns bigram language models from BOWs. We plan to: i) investigate ways to speed up our algorithm; ii) extend it to trigram and higher-order models; iii) handle the mixture of BOW documents and some ordered documents (or phrases) when available; iv) adapt a general English LM to a special domain using only BOWs from that domain; and v) explore novel applications of our algorithm.
Acknowledgments

We thank Ben Liblit for tips on double-ended priority queues, and the anonymous reviewers for valuable comments. This work is supported in part by the Wisconsin Alumni Research Foundation, NSF CCF-0353079 and CCF-0728767, and the Natural Sciences and Engineering Research Council (NSERC) of Canada.

References

Samit Basu and Nigel Boston. 2000. Identifiability of polynomial systems. Technical report, University of Illinois at Urbana-Champaign.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Stanley F. Chen and Joshua T. Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4):359-393.

Kyun-Rak Chong and Sartaj Sahni. 2000. Correspondence-based data structures for double-ended priority queues. The ACM Journal of Experimental Algorithmics, 5(2).

Thomas M. Cover and Joy A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons, Inc.

Simon King, Chris Bartels, and Jeff Bilmes. 2005. SVitchboard 1: Small vocabulary tasks from Switchboard 1. In Interspeech 2005, Lisbon, Portugal.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for M-gram language modeling. In ICASSP.

Jun S. Liu. 2001. Monte Carlo Strategies in Scientific Computing. Springer.

Michael Rabbat, Mário Figueiredo, and Robert Nowak. 2007. Inferring network structure from co-occurrences. In Advances in Neural Information Processing Systems (NIPS) 20.

Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373-392.

Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8).

Stuart Russell and Peter Norvig. 2003. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, second edition.

Stuart Russell. 1992. Efficient memory-bounded search methods. In The 10th European Conference on Artificial Intelligence.

Somayajulu G. Sripada, Ehud Reiter, Jim Hunter, and Jin Yu. 2003. Exploiting a parallel TEXT-DATA corpus. In Proceedings of Corpus Linguistics, pages 734-743, Lancaster, U.K.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of International Conference on Spoken Language Processing, Denver, Colorado.