Language ID using n-gram models
Language identification
Language ID is the problem of taking a document in an unknown language
and determining what language it's written in. This is frequently a
necessary step before processing the document in other ways. It turns
out that n-gram models are a simple and very effective way to perform
language identification. You will write a program that does language
ID using a bigram model.
The program (or top-level LISP function, if you're writing in LISP)
should take three arguments, infile1, infile2, and
testfile, which will be filenames. The first two files will
contain text in two languages, e.g. infile1 could contain
English text and infile2 could contain French text. (We'll call
these language1 and language2 just to keep things general.) The third
argument identifies a file containing text in an unknown language.
Next I describe what your program or function should do with its
inputs.
What the program computes
- Using infile1, create a bigram frequency table for
language1. How you represent bigrams and what data structure you use
for the frequency table are up to you, but one easy way is to
create a hash table where each key is a character bigram
(e.g. represented as a dotted pair in LISP) and the value for that key
is the number of times you've seen that bigram; a sketch of this
appears after this list. For example, if the
file contained "there is the dog" then the character bigram
<t,h> would have frequency 2, and the character bigram
<SPACE,d> would have frequency 1. (LISP programmers will find
the read-char function very helpful here.) You'll also need
to record the total number of bigrams seen in infile1.
- Do the same thing for language2 using infile2.
- Compute the probability for the string of characters in testfile,
using a maximum likelihood estimate based on the bigram
frequencies from infile1. That is, compute
Pr(testfile|language1). (Or, if you prefer, compute the
log probability instead.) So, for example, testfile might
contain "Mary ran to the store.\nJohn likes Mary.\n", in which
case you would be estimating the bigram-model probability of
seeing that 40-character string based on the frequencies
from infile1. (Here \n represents the newline character.)
- Compute the probability for the string of characters in testfile,
using a maximum likelihood estimate based on the bigram
frequencies from infile2. That is, compute Pr(testfile|language2).
(Or, if you prefer, compute the log probability instead.)
- Output "Guess LANGUAGE1" or "Guess LANGUAGE2" depending on
which conditional probability is greater, or "UNDECIDED" if
the two are equal. A sketch of these scoring and comparison steps
also appears after this list.
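Here is a minimal sketch, in Common LISP, of one way to implement the
table-building step in the first item above. The function name
count-bigrams, the dotted-pair hash keys, and the choice to return
the table and the total as two values are illustrative assumptions,
not part of the assignment.

  (defun count-bigrams (filename)
    "Return a hash table of bigram counts and the total number of bigrams."
    (let ((table (make-hash-table :test #'equal)) ; EQUAL compares dotted-pair keys by contents
          (total 0)
          (prev  #\space))  ; treat the first character as preceded by a space
      (with-open-file (in filename)
        (loop for ch = (read-char in nil nil)     ; NIL when end of file is reached
              while ch
              do (incf (gethash (cons prev ch) table 0))
                 (incf total)
                 (setf prev ch)))
      (values table total)))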
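Continuing the sketch, here is one way to implement the scoring and
comparison steps from the last three items, assuming the
count-bigrams function above. It works in log space, which is safe
because the log is monotonic: whichever language gives the greater
probability also gives the greater log probability. It uses one
reading of the estimate described above (the count of each test
bigram divided by the total number of training bigrams); the names
log-probability and guess-language are again just illustrative.

  (defun log-probability (testfile table total)
    "Return the log MLE probability of TESTFILE, or NIL if it contains an unseen bigram."
    (let ((logprob 0.0)
          (prev    #\space))
      (with-open-file (in testfile)
        (loop for ch = (read-char in nil nil)
              while ch
              do (let ((count (gethash (cons prev ch) table 0)))
                   (when (zerop count)   ; unseen bigram: the whole product is zero
                     (return-from log-probability nil))
                   (incf logprob (log (/ count total)))
                   (setf prev ch))))
      logprob))

  (defun guess-language (infile1 infile2 testfile)
    "Train on INFILE1 and INFILE2, then print a guess for TESTFILE."
    (multiple-value-bind (table1 total1) (count-bigrams infile1)
      (multiple-value-bind (table2 total2) (count-bigrams infile2)
        (let ((lp1 (log-probability testfile table1 total1))
              (lp2 (log-probability testfile table2 total2)))
          (cond ((and (null lp1) (null lp2)) (format t "UNDECIDED~%"))
                ((null lp2)                  (format t "Guess LANGUAGE1~%"))
                ((null lp1)                  (format t "Guess LANGUAGE2~%"))
                ((> lp1 lp2)                 (format t "Guess LANGUAGE1~%"))
                ((> lp2 lp1)                 (format t "Guess LANGUAGE2~%"))
                (t                           (format t "UNDECIDED~%")))))))

A top-level call would then look like, for example,
(guess-language "GEN.EN" "GEN.FR" "test.1").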
Data to work with
For developing and testing your program, you can work with the following
data files:
- ftp://umiacs.umd.edu/pub/resnik/ling645/bible/GEN.EN (The Book of Genesis in English)
- ftp://umiacs.umd.edu/pub/resnik/ling645/bible/GEN.FR (The Book of Genesis in French)
- ftp://umiacs.umd.edu/pub/resnik/ling645/bible/21999.1 (a long test document in English)
- ftp://umiacs.umd.edu/pub/resnik/ling645/bible/21999.2 (a long test document in French)
- ftp://umiacs.umd.edu/pub/resnik/ling645/bible/test.1 (a short test document in English)
- ftp://umiacs.umd.edu/pub/resnik/ling645/bible/test.2 (a short test document in French)
Notes and recommendations
- For your own sanity, do not do your development using the
complete GEN.EN and GEN.FR files or the long test documents! Create
(much) shorter files to use as infile1 and infile2 so you can make
sure your code works. On my fairly fast machine, running my program
using GEN.EN and GEN.FR can take a good minute or so. (The
head command in Unix is useful for such purposes.)
- An alternative, if you're clever, would be to write one program
to create the two frequency tables and write them to disk files,
and then have your language ID program read in the frequency
tables rather than recomputing them each time. (A rough sketch of
this appears after these notes.) But that's not required.
- When creating a bigram frequency table based on a file, the first
character is obviously special since there is no preceding
character. The easiest thing to do is write your code so that
the first character in a file is treated as if it were preceded by
a space (character #\space in LISP).
- If you run into underflow problems, work with log probabilities
(logprobs) instead of raw probabilities.
- If you're up for doing something more realistic, modify your
program so it uses smoothed probability estimates rather than
maximum likelihood estimates. (Add-one smoothing is feasible,
but Good-Turing would probably be too much to attempt given the
time constraints unless you're very ambitious.) A small add-one
sketch appears after these notes.
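For the optional idea in the second note above (computing the tables
once and saving them), here is a rough sketch of one possible
approach, assuming the count-bigrams sketch shown earlier. The file
format, one line for the total followed by one (prev next count) list
per bigram, and the names save-bigram-table and load-bigram-table are
just illustrative; prin1 is used so that character objects read back
correctly with read.

  (defun save-bigram-table (table total filename)
    "Write TOTAL and the bigram counts in TABLE to FILENAME."
    (with-open-file (out filename :direction :output :if-exists :supersede)
      (prin1 total out)
      (terpri out)
      (maphash (lambda (bigram count)
                 (prin1 (list (car bigram) (cdr bigram) count) out)
                 (terpri out))
               table)))

  (defun load-bigram-table (filename)
    "Read back a table written by SAVE-BIGRAM-TABLE; return the table and the total."
    (with-open-file (in filename)
      (let ((total (read in))
            (table (make-hash-table :test #'equal)))
        (loop for entry = (read in nil nil)
              while entry
              do (setf (gethash (cons (first entry) (second entry)) table)
                       (third entry)))
        (values table total))))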
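If you do try add-one smoothing (last note above), one simple
version, applied to the count/total estimate used in the sketches
above, replaces count/total with (count + 1)/(total + V), where V is
the number of possible bigram types. Taking V = 256 * 256 (all pairs
of 8-bit character codes) is just one defensible choice; the point is
that no bigram gets probability zero, so the log probabilities stay
finite.

  (defun smoothed-log-prob (bigram table total &optional (v (* 256 256)))
    "Return the log of the add-one-smoothed probability of BIGRAM."
    (log (/ (+ 1 (gethash bigram table 0))
            (+ total v))))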
Turn in
- Program listing.
- Printout showing the program running successfully.
It's ok if it's only on the small data and/or test files, but what
I'd really like to see are runs on testfiles test.1, test.2,
21999.1 and 21999.2, using infile1=GEN.EN and infile2=GEN.FR.
- Answer this question: When you test on files 21999.1 and
21999.2 (with GEN.EN and GEN.FR as the infiles), why does the
program report "UNDECIDED", with probabilities both equal
to zero (or logprobs both equal to negative infinity)? You
should be able to answer this question even without running the
program, by looking at the training and test data. (Note that in
some LISP implementations you might get tiny, minuscule
probabilities rather than probabilities exactly equal to zero.)
For less experienced programmers
If you really have no idea where to start, you can take a look at the
following LISP code, a skeleton for one way of doing language ID
using bigram models. Completing all the "insert code here" sections
is one way to solve the problem. I recommend against using this code
skeleton unless you need to, since my solution matches the way I
think, not necessarily the way you think, and it runs the risk of
confusing you more than it helps!