High Accuracy Parsing of Name Internal Structure (HAPNIS)
A while ago I wanted a parser for identifying the internal structure of
people's names. I couldn't find one, so I started tagging data with the
goal of building a tagger to do this. In the process of tagging, I
realized that the problem is really quite easy, so I instead wrote a
short rule-based Perl script for it, called HAPNIS. I took the 220
data points I annotated and randomly chose 100 of them as development
data and 120 as test data. After a few rounds of development, I got
100% on the development data. I then ran the script on the test data,
where it gets 99.1% accuracy; the two errors it makes are:
Truth: Queen_Role Latifah_Surname
Hyp: Queen_Forename Latifah_Surname
Truth: Lee_Forename Ann_Continue Womack_Surname
Hyp: Lee_Forename Ann_Middle Womack_Surname
The first error would be trivial to fix, but I didn't want to cheat.
The second error would be a little harder, but given context (i.e.,
the document source), you could probably fix errors of this kind, too.
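To make the error listing above concrete, here is a minimal Python sketch of how accuracy over "Token_Tag" lines could be computed. It is not the scoring script offered for download: the function names are hypothetical, and it assumes accuracy is counted per token, which this page does not state explicitly.

```python
def parse_tagged(line):
    """Split 'Queen_Role Latifah_Surname' into [('Queen', 'Role'), ('Latifah', 'Surname')]."""
    pairs = []
    for item in line.split():
        # Split on the LAST underscore so tokens containing "_" stay intact.
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


def token_accuracy(truth_lines, hyp_lines):
    """Fraction of tokens whose hypothesized tag matches the true tag.

    Assumption: the unit of scoring is the token, not the whole name.
    """
    correct = 0
    total = 0
    for truth, hyp in zip(truth_lines, hyp_lines):
        truth_pairs = parse_tagged(truth)
        hyp_pairs = parse_tagged(hyp)
        if len(truth_pairs) != len(hyp_pairs):
            raise ValueError("token count mismatch: %r vs %r" % (truth, hyp))
        for (_, truth_tag), (_, hyp_tag) in zip(truth_pairs, hyp_pairs):
            total += 1
            if truth_tag == hyp_tag:
                correct += 1
    return correct / total


truth = ["Queen_Role Latifah_Surname",
         "Lee_Forename Ann_Continue Womack_Surname"]
hyp = ["Queen_Forename Latifah_Surname",
       "Lee_Forename Ann_Middle Womack_Surname"]
print(token_accuracy(truth, hyp))  # 3 of 5 tokens match: 0.6
```

On just these two names the score is low because both contain an error; on the full test set the same two mistakes are spread over many more tokens.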
The script has a single option, "-names", which toggles whether a list
of common first (but not last) names is used to help disambiguate
single-token names. Without this option, the system scores 96.8% on
the test data.
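As a rough illustration of what such an option might do, here is a small Python sketch. It is not the logic of the HAPNIS Perl script: the name list, the function name and the lowercase matching are all assumptions for illustration. The Surname default reflects the bias toward surnames for single-word entries noted at the end of this page.

```python
# Hypothetical stand-in for a list of common first names; the real list
# used by the script is not reproduced here.
COMMON_FORENAMES = {"john", "mary", "lee"}


def tag_single_token(token, use_names=False, forenames=COMMON_FORENAMES):
    """Tag a one-token name as Forename or Surname.

    Assumption: without the name list, a lone token defaults to Surname;
    with it, tokens found in the list are tagged Forename instead.
    """
    if use_names and token.lower() in forenames:
        return "Forename"
    return "Surname"


print(tag_single_token("Mary"))                  # Surname
print(tag_single_token("Mary", use_names=True))  # Forename
```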
The tag set I use is:
- Surname: Last (family) names.
- Forename: Given name.
- Middle: Given middle names (i.e., not first names).
- Link: A link between two names of the same kind. Used for
conjoined names, and Arabic names like "Al - Jones", where the "-"
will be tagged as a link.
- Role: Mr., Dr., etc.
- Suffix: Name suffixes (like the one on my name): "III", "Jr",
etc.
- Continue: When a token is part of a multi-word unit, but is not
a link, I use Continue. The only example is in the test data, where
"Lee Ann" is really a whole first name that just happens to contain a
space.
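To show how these tags might be consumed downstream, here is a short Python sketch that groups a tagged name into fields. The function name is hypothetical, and it assumes that Link and Continue tokens attach to the most recent name part; in particular, tagging "Al - Jones" as Surname, Link, Surname is my reading of the description above, not something taken from the data.

```python
TAGS = {"Surname", "Forename", "Middle", "Link", "Role", "Suffix", "Continue"}
# Tags that extend the preceding name part instead of starting a new one.
ATTACHING = {"Link", "Continue"}


def group_fields(tagged):
    """Turn 'Lee_Forename Ann_Continue Womack_Surname' into a dict of fields."""
    fields = {}
    current = None  # tag of the most recent non-attaching token
    for item in tagged.split():
        token, _, tag = item.rpartition("_")
        if tag not in TAGS:
            raise ValueError("unknown tag: %r" % tag)
        if tag in ATTACHING:
            if current is None:
                raise ValueError("%s tag with nothing to attach to" % tag)
            key = current
        else:
            key = current = tag
        fields.setdefault(key, []).append(token)
    return {key: " ".join(tokens) for key, tokens in fields.items()}


print(group_fields("Lee_Forename Ann_Continue Womack_Surname"))
# {'Forename': 'Lee Ann', 'Surname': 'Womack'}
```

Keying fields by tag name means two Surname tokens joined by a Link naturally end up in the same field.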
You can download the development data, the test data, the scoring
script and the HAPNIS Perl script (you will have to rename the scripts
to *.pl; unfortunately my web server doesn't like to serve up .pl
files). If you find this useful, or find serious bugs, please email me.
Note that HAPNIS was developed on the ACE 2004 training data and is
mostly based on news text, so it is biased toward calling single-word
entries surnames rather than first names.