Good-Turing Language Model Smoothing

We briefly discuss Good-Turing smoothing and the effects of binning and of smoothing the N_r counts. Code to do this is available at the end of this page.

The code works as follows. Given some piped-in text, we first generate unigram and bigram tables:

-- calculate unigram and bigram tables
type UG = FiniteMap String Int
type BG = FiniteMap (String,String) Int

createTables :: [[String]] -> (UG, BG)
createTables sents = 
    (foldr buildTable emptyFM sents', 
     foldr buildTable emptyFM 
        (map (\s -> zip s (tail s)) sents'))
  where
    sents' = map ((["<s>"]++) . (++["</s>"])) sents
    buildTable l t = foldr (\w t -> addToFM_C (+) t w 1) t l
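
For concreteness, here is a small usage sketch (the sentences are made up; the real program builds the tables from whatever text is piped in):

-- usage sketch with hypothetical sentences
exampleTables :: (UG, BG)
exampleTables = createTables [ ["the","dog","barks"]
                             , ["the","cat","sleeps"] ]
-- lookupFM (fst exampleTables) "the"          gives Just 2
-- lookupFM (snd exampleTables) ("<s>","the")  gives Just 2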

If we're going to do binning, we can separate out k-many bins of unigram/bigram tables: bigrams are sorted by the product of their component unigram counts (a rough proxy for expected frequency) and then split into k roughly equal-sized bins:

-- do the binning; we strive to have k-many bins with equal
-- numbers of elements
computeBins :: Int -> UG -> BG -> [(UG,BG)]
computeBins k ug bg = map mkUGBG bins
  where
    l = sortBy (\a b -> snd a `compare` snd b)
          [((x,y), p_x * p_y)
          | (x,y) <- keysFM bg
          , let p_x = lookupWithDefaultFM ug 0 x
          , let p_y = lookupWithDefaultFM ug 0 y
          ]
    n    = ceiling (genericLength l / fromIntegral k)
    bins = splitOff n l
    mkUGBG bin =
      (listToFM $ concatMap (\ ((w1,w2),_) -> [(w1, lookupWithDefaultFM ug 0 w1),
                                               (w2, lookupWithDefaultFM ug 0 w2)]) bin,
       listToFM $ map (\ (b,_) -> (b, lookupWithDefaultFM bg 0 b)) bin)
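
Continuing the made-up example from above, a call might look like this (three bins is arbitrary):

-- usage sketch: three bins, ordered from lowest to highest
-- product of unigram counts
exampleBins :: [(UG, BG)]
exampleBins = computeBins 3 ug bg
  where (ug, bg) = exampleTables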

Now, given a bigram table (possibly from a bin), we can generate our N_r table:

-- calculate the N_r table
calculateN_r :: BG -> FiniteMap Int Int
calculateN_r = foldFM (\_ r fm -> addToFM_C (+) fm r 1) emptyFM
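
To make the N_r notation concrete (N_r is the number of bigram types observed exactly r times), here is a tiny hypothetical example:

-- hypothetical example: three bigrams seen once, one seen twice
exampleN_r :: FiniteMap Int Int
exampleN_r = calculateN_r (listToFM [ (("a","b"),1), (("b","c"),1)
                                    , (("c","d"),1), (("a","c"),2) ])
-- fmToList exampleN_r gives [(1,3),(2,1)]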

And given this N_r table, we can smooth it using neighborhood smoothing: each r between the smallest and largest observed counts gets the average of its own N_r (if present) and the nearest N_r values below and above it, and the result is renormalized to preserve the original total:

-- smooth the N_r table using binning and renormalize
smoothN_r :: FiniteMap Int Int -> FiniteMap Int Double
smoothN_r n_r = normalize (foldr addSmoothed emptyFM [low..high])
  where
    low    = foldFM (\r _ -> min r) 10000 n_r
    high   = foldFM (\r _ -> max r) 0     n_r + 1
    oldSum = fromIntegral $ foldFM (const (+)) 0     n_r
    smooth i = 
      let fm_l = reverse $ eltsFM $ filterFM (\r _ -> r < i) n_r
          fm_g =           eltsFM $ filterFM (\r _ -> r > i) n_r
      in  average (take 1 fm_l ++ take 1 fm_g ++ maybeToList (lookupFM n_r i))
    addSmoothed i fm = addToFM fm i (smooth i)
    normalize fm = mapFM (const (*s)) fm
      where s = oldSum / foldFM (const (+)) 0 fm
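
As a small worked example with made-up counts: if N_1 = 100, N_2 = 40 and N_3 = 10, then the smoothed value at r = 2 (before renormalization) is the average of 100, 10 and 40 (the nearest N_r below, the nearest one above, and N_2 itself), which is 50. The normalize step then rescales the smoothed values so that they again sum to 150.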

Finally, given N (the total number of observed bigram tokens), N_1 (the number of bigram types seen exactly once) and a smoothed N_r table, we can compute the smoothed r-star table. This uses the usual Good-Turing estimate r* = (r+1) * N_{r+1} / N_r; the entry for r = 0 is set to N_1 / N, and the largest r is dropped since it has no N_{r+1} above it:

-- compute r* given N, N_1 and the smoothed n_r table
compute_rstar :: Int -> Int -> FiniteMap Int Double -> FiniteMap Int Double
compute_rstar n n_1 n_r_table = 
  removeMax $ addToFM (mapFM calc_rstar n_r_table) 0 r0
  where
    r0 = fromIntegral n_1 / fromIntegral n
    calc_rstar r n_r = 
        (fromIntegral r + 1) * lookupWithDefaultFM n_r_table 0 (r+1) / n_r
    removeMax fm = filterFM (\e _ -> e < mx) fm
      where mx = foldFM (\e _ -> max e) 0 fm
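
Putting the pieces together, a hypothetical end-to-end driver (ignoring binning and IO, which the full program handles) might look like:

-- sketch of how the pieces fit together, without binning:
-- n is the total number of bigram tokens, n_1 the number of
-- bigram types seen exactly once
goodTuring :: BG -> FiniteMap Int Double
goodTuring bg = compute_rstar n n_1 (smoothN_r n_r)
  where
    n_r = calculateN_r bg
    n   = foldFM (const (+)) 0 bg
    n_1 = lookupWithDefaultFM n_r 0 1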

The rest of the code in the program is just IO functions and helper functions.
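
Two of those helpers, splitOff and average, are used above but not shown; plausible definitions (not necessarily the ones in GoodTuring.hs) would be:

-- split a list into chunks of at most n elements
splitOff :: Int -> [a] -> [[a]]
splitOff _ [] = []
splitOff n xs = take n xs : splitOff n (drop n xs)

-- mean of a list of integer counts, as a Double
average :: [Int] -> Double
average [] = 0
average xs = fromIntegral (sum xs) / fromIntegral (length xs)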

Now, some results (mostly related to binning). We use 107k sentences (tokenized, lowercased) from news data (2.6 million words). A plot of the N_r table (without binning), with a log-scaled x-axis, appears below; it follows the expected Zipf-like distribution:

When binning is performed and we look individually at the first, middle and last bins, we get:

Zoomed in on the lower-left, it looks like:

The red dots are the whole dataset, the blue dots are the middle bin, the green are the first bin and the yellow are the last. Here we can see that the blue (middle) bin follows the same-looking distribution as the whole dataset, whereas the green (first) and yellow (last) bins are very flat. Moreover, the green (first) drops off dramatically after r=2. This means that there aren't really any "poor" bigrams left to give their mass to the "poorer" ones.

Download the source code: GoodTuring.hs.

questions, comments? email me