TagChunk: A Joint POS Tagger and Syntactic Chunker

Hal Daume III

This is a preliminary release of the joint part-of-speech (POS) tagger and syntactic chunker described in the original ICML 2005 Learning as Search Optimization paper. It is similar but not identical to the system described in that paper: it was trained on a different subset of the data, and less care was taken to tune hyperparameters against dev data. There will be a subsequent release based on the new learning technique described in the NIPS 2005 Search-based Structured Prediction paper, but that is not yet ready for mass consumption. The version released here is also significantly more efficient.

Downloading:

For now I'm not going to bother telling you how to train it because it's not all that user friendly. However, you can freely use it for prediction. You will need the following materials in order to run it:

The executable, available for Linux (200k).
The resources file which you will have to unpack into a local directory (11mb).
A weights file; the following are pre-built for you:
Optional: a perl script that wraps the executable.

New release (Oct 11): now supports n-best outputs (at most the size of the beam), and weighted examples (put a single line "$W <float>" before each sentence that should have non-unit weight). Also improved training efficiency for weight averaging.
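
As a rough illustration of the weighted-example format, the Python sketch below writes a "$W <float>" line before each sentence whose weight is not 1. It assumes one whitespace-tokenized sentence per line, as in the example input further down; the function name format_weighted_input is hypothetical, and the exact float formatting the tagger accepts is an assumption.

```python
def format_weighted_input(sentences):
    """Render (tokens, weight) pairs as input lines.

    A "$W <float>" line is emitted only before sentences whose
    weight differs from 1 (the default, unit weight).
    """
    lines = []
    for tokens, weight in sentences:
        if weight != 1.0:
            lines.append(f"$W {weight}")
        lines.append(" ".join(tokens))
    return "\n".join(lines) + "\n"


example = format_weighted_input([
    (["The", "man", "saw", "me", "."], 1.0),      # unit weight: no $W line
    (["I", "saw", "the", "street", "."], 2.5),    # non-unit weight
])
print(example, end="")
```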

Running:

To run the executable, you say:

% tagchunk.i686 -predict . (weights file) (test file) (resource directory)

You should replace (weights file) with the name of the weights file you wish to use, (test file) with the name of the file you wish to tag, and (resource directory) with the name of the directory into which you unpacked the resources file (the lists). The program writes the output to stdout.
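
If you would rather call the tagger from a script, something like the following Python sketch may help. It only assembles the command line shown above and captures stdout; the function names build_predict_command and run_tagger are hypothetical, and the binary is assumed to be on your PATH under the name tagchunk.i686.

```python
import subprocess


def build_predict_command(weights_file, test_file, resource_dir,
                          binary="tagchunk.i686"):
    """Return the argument list for a prediction run."""
    # The "." after -predict is copied verbatim from the command line above.
    return [binary, "-predict", ".", weights_file, test_file, resource_dir]


def run_tagger(weights_file, test_file, resource_dir, binary="tagchunk.i686"):
    """Run the tagger and return whatever it writes to stdout as text."""
    cmd = build_predict_command(weights_file, test_file, resource_dir, binary)
    result = subprocess.run(cmd, stdout=subprocess.PIPE, text=True, check=True)
    return result.stdout


# Placeholder arguments, for illustration only.
print(build_predict_command("weights", "test", "/path/to/resources"))
```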

Example:

Here is a run on a one-sentence input file named test:

% cat test
The man with the telescope saw me across the street .

% tagchunk.i686 -predict . w-1 test ~/projects/chunking/ > test.out
Loading lists...list-locations1...list-locations2...list-locations3...list-locations4...list-locations5...list-names1...list-names2...list-names3...list-namesA...list-namesB...list-nes...list-positions1...list-positions2...list-positions3...list-verbs1...list-verbs2...list-verbs3...list-tags-wsj...list-ulfreq...list-tags-all...list-tags-all2...list-mp-address...list-mp-adj...list-mp-aux...list-mp-beforeorg...list-mp-begwords...list-mp-dist...list-mp-nn...list-mp-nn1...list-mp-noun...list-mp-subj...list-mp-units...

% cat test.out
The_DT_B-NP man_NN_I-NP with_IN_B-PP the_DT_B-NP telescope_NN_I-NP saw_VBD_B-VP me_PRP_B-NP across_IN_B-PP the_DT_B-NP street_NN_I-NP ._._B-O

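You can ignore the "Loading lists..." line. The output should be fairly clear if you know what POS tags and chunk labels look like: each token has the form word_POS_chunk, so The_DT_B-NP is the word "The", tagged DT, beginning an NP chunk.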
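
For downstream processing, a minimal Python sketch for reading this output follows. It is not part of the release; the function names are hypothetical, and it assumes every token looks like the ones in the example above (word, POS tag, and chunk label joined by underscores, with B- starting a chunk and I- continuing one).

```python
def parse_tagged_line(line):
    """Split one output line into (word, pos, chunk) triples."""
    triples = []
    for token in line.split():
        # Split from the right so a word that itself contains "_" stays intact.
        word, pos, chunk = token.rsplit("_", 2)
        triples.append((word, pos, chunk))
    return triples


def group_chunks(triples):
    """Group consecutive tokens into (chunk_type, [words]) spans."""
    spans = []
    for word, _pos, chunk in triples:
        prefix, _, chunk_type = chunk.partition("-")
        chunk_type = chunk_type or prefix  # label without a "-" is its own type
        if prefix == "I" and spans and spans[-1][0] == chunk_type:
            spans[-1][1].append(word)      # I- label extends the current span
        else:
            spans.append((chunk_type, [word]))  # anything else starts a new span
    return spans


line = ("The_DT_B-NP man_NN_I-NP with_IN_B-PP the_DT_B-NP telescope_NN_I-NP "
        "saw_VBD_B-VP me_PRP_B-NP across_IN_B-PP the_DT_B-NP street_NN_I-NP ._._B-O")
triples = parse_tagged_line(line)
chunks = group_chunks(triples)
for chunk_type, words in chunks:
    print(chunk_type, " ".join(words))
```

Splitting from the right is a deliberate choice here: the POS tag and chunk label are always the last two fields, so the word is whatever is left over.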

Perl Script:

If you wish to use the included Perl script, you need to modify it so that the $BIN variable points to the directory where the binary lives and the $RES variable points to a directory containing both the lists and the weights files. Usage:

% tc.pl [-faster|-lc] file1 ... fileN

Here -faster means to use a beam of size 1 and -lc means to use the lower-case weights. The files (file1, ..., fileN) are tagged and the outputs are written to file1.tc, ..., fileN.tc.

Utah people:

You can run ~hal/bin/tc.pl on any Linux machine to run this tagger.

Terms of Use:

Use this for whatever you want. Please cite the ICML paper, or at least put a footnote in any paper that you use it for. As usual, I make no guarantees that it won't destroy your computer.