Good-Turing Language Model Smoothing

We briefly discuss Good-Turing smoothing and the effects of binning and of smoothing the N_r counts. Code to do this is available at the end of this page.

The code works as follows. Given some piped-in text, we first generate unigram and bigram tables:

-- calculate unigram and bigram tables
type UG = FiniteMap String Int
type BG = FiniteMap (String,String) Int

createTables :: [[String]] -> (UG, BG)
createTables sents = 
    (foldr buildTable emptyFM sents', 
     foldr buildTable emptyFM 
        (map (\s -> zip s (tail s)) sents'))
  where
    sents' = map ((["<s>"]++) . (++["</s>"])) sents
    buildTable l t = foldr (\w t -> addToFM_C (+) t w 1) t l
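
For concreteness, here is a small usage sketch (the sentences are made up; the real program builds the tables from whatever text is piped in):

-- usage sketch with hypothetical sentences
exampleTables :: (UG, BG)
exampleTables = createTables [ ["the","dog","barks"]
                             , ["the","cat","sleeps"] ]
-- lookupFM (fst exampleTables) "the"          gives Just 2
-- lookupFM (snd exampleTables) ("<s>","the")  gives Just 2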

If we're going to do binning, we can separate out k-many bins of unigram/bigram tables: bigrams are sorted by the product of their component unigram counts (a rough proxy for expected frequency) and then split into k roughly equal-sized bins:

-- do the binning; we strive to have k-many bins with equal
-- numbers of elements
computeBins :: Int -> UG -> BG -> [(UG,BG)]
computeBins k ug bg = map mkUGBG bins
  where
    l = sortBy (\a b -> snd a `compare` snd b)
          [((x,y), p_x * p_y)
          | (x,y) <- keysFM bg
          , let p_x = lookupWithDefaultFM ug 0 x
          , let p_y = lookupWithDefaultFM ug 0 y
          ]
    n    = ceiling (genericLength l / fromIntegral k)
    bins = splitOff n l
    mkUGBG bin =
      (listToFM $ concatMap (\ ((w1,w2),_) -> [(w1, lookupWithDefaultFM ug 0 w1),
                                               (w2, lookupWithDefaultFM ug 0 w2)]) bin,
       listToFM $ map (\ (b,_) -> (b, lookupWithDefaultFM bg 0 b)) bin)
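
Continuing the made-up example from above, a call might look like this (three bins is arbitrary):

-- usage sketch: three bins, ordered from lowest to highest
-- product of unigram counts
exampleBins :: [(UG, BG)]
exampleBins = computeBins 3 ug bg
  where (ug, bg) = exampleTables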

Now, given a bigram table (possibly from a bin), we can generate our N_r table:

-- calculate the N_r table
calculateN_r :: BG -> FiniteMap Int Int
calculateN_r = foldFM (\_ r fm -> addToFM_C (+) fm r 1) emptyFM
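
To make the N_r notation concrete (N_r is the number of bigram types observed exactly r times), here is a tiny hypothetical example:

-- hypothetical example: three bigrams seen once, one seen twice
exampleN_r :: FiniteMap Int Int
exampleN_r = calculateN_r (listToFM [ (("a","b"),1), (("b","c"),1)
                                    , (("c","d"),1), (("a","c"),2) ])
-- fmToList exampleN_r gives [(1,3),(2,1)]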

And given this N_r table, we can smooth it using neighborhood smoothing: each r between the smallest and largest observed counts gets the average of its own N_r (if present) and the nearest N_r values below and above it, and the result is renormalized to preserve the original total:

-- smooth the N_r table using binning and renormalize
smoothN_r :: FiniteMap Int Int -> FiniteMap Int Double
smoothN_r n_r = normalize (foldr addSmoothed emptyFM [low..high])
  where
    low    = foldFM (\r _ -> min r) 10000 n_r
    high   = foldFM (\r _ -> max r) 0     n_r + 1
    oldSum = fromIntegral $ foldFM (const (+)) 0     n_r
    smooth i = 
      let fm_l = reverse $ eltsFM $ filterFM (\r _ -> r < i) n_r
          fm_g =           eltsFM $ filterFM (\r _ -> r > i) n_r
      in  average (take 1 fm_l ++ take 1 fm_g ++ maybeToList (lookupFM n_r i))
    addSmoothed i fm = addToFM fm i (smooth i)
    normalize fm = mapFM (const (*s)) fm
      where s = oldSum / foldFM (const (+)) 0 fm
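
As a small worked example with made-up counts: if N_1 = 100, N_2 = 40 and N_3 = 10, then the smoothed value at r = 2 (before renormalization) is the average of 100, 10 and 40 (the nearest N_r below, the nearest one above, and N_2 itself), which is 50. The normalize step then rescales the smoothed values so that they again sum to 150.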

Finally, given N (the total number of observed bigram tokens), N_1 (the number of bigram types seen exactly once) and a smoothed N_r table, we can compute the smoothed r-star table. This uses the usual Good-Turing estimate r* = (r+1) * N_{r+1} / N_r; the entry for r = 0 is set to N_1 / N, and the largest r is dropped since it has no N_{r+1} above it:

-- compute r* given N, N_1 and the smoothed n_r table
compute_rstar :: Int -> Int -> FiniteMap Int Double -> FiniteMap Int Double
compute_rstar n n_1 n_r_table = 
  removeMax $ addToFM (mapFM calc_rstar n_r_table) 0 r0
  where
    r0 = fromIntegral n_1 / fromIntegral n
    calc_rstar r n_r = 
        (fromIntegral r + 1) * lookupWithDefaultFM n_r_table 0 (r+1) / n_r
    removeMax fm = filterFM (\e _ -> e < mx) fm
      where mx = foldFM (\e _ -> max e) 0 fm
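
Putting the pieces together, a hypothetical end-to-end driver (ignoring binning and IO, which the full program handles) might look like:

-- sketch of how the pieces fit together, without binning:
-- n is the total number of bigram tokens, n_1 the number of
-- bigram types seen exactly once
goodTuring :: BG -> FiniteMap Int Double
goodTuring bg = compute_rstar n n_1 (smoothN_r n_r)
  where
    n_r = calculateN_r bg
    n   = foldFM (const (+)) 0 bg
    n_1 = lookupWithDefaultFM n_r 0 1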

The rest of the code in the program is just IO functions and helper functions.
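
Two of those helpers, splitOff and average, are used above but not shown; plausible definitions (not necessarily the ones in GoodTuring.hs) would be:

-- split a list into chunks of at most n elements
splitOff :: Int -> [a] -> [[a]]
splitOff _ [] = []
splitOff n xs = take n xs : splitOff n (drop n xs)

-- mean of a list of integer counts, as a Double
average :: [Int] -> Double
average [] = 0
average xs = fromIntegral (sum xs) / fromIntegral (length xs)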

Now, some results (mostly related to binning). We use 107k sentences (tokenized, lowercased) from news data (2.6 million words). A plot of the N_r table (without binning), with a log-scaled x-axis, appears below; it follows the expected Zipf-like distribution:

When binning is performed and we look individually at the first, middle and last bins, we get:

Zoomed in on the lower-left, it looks like:

The red dots are the whole dataset, the blue dots are the middle bin, the green are the first bin and the yellow are the last. Here we can see that the blue (middle) bin follows the same-looking distribution as the whole dataset, whereas the green (first) and yellow (last) bins are very flat. Moreover, the green (first) drops off dramatically after r=2. This means that there aren't really any "poor" bigrams left to give their mass to the "poorer" ones.

Download the source code: GoodTuring.hs.

questions, comments? email me