SVMsequel Documentation
Hal Daume III
First Release 31 March 2004
Introduction
SVMsequel is a complete environment for training and using support
vector machines. Some familiarity with kernel methods will be helpful
(see here for class notes I've used on SVMs for natural language
processing, if you need a refresher).
Documentation
There is far too much to document here, so please see svmsequel.ps or svmsequel.pdf for relevant documentation on
how to use SVMsequel.
I will say that there is really no reason to ever use SVMseq again. You really should switch over to
this new program. It's orders of magnitude faster.
Currently SVMsequel doesn't support ranking (likely to come soon) or
regression (unlikely to come soon). It's very fast (did I say that
yet?) and handles enormous datasets very nicely. Additionally, it
supports multiclass classification and probabilistic
classification.
Kernels available include:
- Linear, Polynomial, RBF, Sigmoid
- Information Diffusion on discrete manifolds
- Information Diffusion on the n-simplex
- String kernels (based on dynamic programming) -- O(n*m) (see the sketch below)
- String kernels (based on suffix-trees) -- O(n+m)
- Tree kernels (as above)
You can also use your own kernel matrix if these don't satisfy your
needs.
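To give a feel for what the dynamic-programming string kernel computes,
here is a minimal OCaml sketch. It assumes the all-common-substrings
kernel of Vishwanathan and Smola with geometric decay lambda^|substring|;
the function name and exact weighting are illustrative, not necessarily
what SVMsequel implements.

  (* O(n*m) all-common-substrings kernel with decay 0 < lambda < 1:
     every pair of matching substrings contributes lambda^length. *)
  let dp_string_kernel lambda s t =
    let n = String.length s and m = String.length t in
    (* lce.(i).(j) = length of the longest common extension of s at
       position i and t at position j, filled from the bottom right. *)
    let lce = Array.make_matrix (n + 1) (m + 1) 0 in
    for i = n - 1 downto 0 do
      for j = m - 1 downto 0 do
        if s.[i] = t.[j] then lce.(i).(j) <- 1 + lce.(i + 1).(j + 1)
      done
    done;
    (* A pair (i, j) with extension length L contributes the geometric
       sum lambda + lambda^2 + ... + lambda^L. *)
    let k = ref 0.0 in
    for i = 0 to n - 1 do
      for j = 0 to m - 1 do
        let l = float_of_int lce.(i).(j) in
        k := !k +. lambda *. (1.0 -. (lambda ** l)) /. (1.0 -. lambda)
      done
    done;
    !k

The suffix-tree version computes the same quantity in O(n+m) without the
quadratic table (see the FAQ below).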
Download
You can download
the source, or binaries for
i686 Linux or
Sun4 Solaris. To compile it you will need an O'Caml compiler.
Bugs
If you observe any bugs (things that say "Internal error"), whether
replicable or not, please send me the relevant files and the command
you last executed. If possible, run 'save environment' and just send
me the environment and the command. Also let me know what
architecture/OS you are using.
Frequently Asked Questions
Any question that I've received by email is posted here (senders
remain anonymous). So far most of these have to do with the
nonstandard kernels.
- Q: do both dpstring and ststring compute the same 'standard'
string kernel? if so, why no n parameter for ststring?
A: Yeah, more or less. See Vishy's 2002 NIPS paper, "Fast String and Tree
Kernels". Basically, if n > |longest string|, then they are equivalent.
Otherwise, the ststring kernel is strictly more powerful, since it doesn't
have this cutoff. Of course, if lambda is, say, 0.5, then after about 4
or 5 exponentiations, it's gone to zero (0.5^5 is about 0.03). There's a
way to reintroduce maxn into the suffix trees, but I don't really need
that and it's kind of a hassle, so it's not implemented.
- Q: what does the scan command actually do?
A: It just gives you information about the file. Basically, you can specify
arguments to 'load' of the form '[noclasses] [low n] [high m] [points k]',
where noclasses means there are no classes in the file, n is the lowest
feature number, m is the highest feature number and k is the number of
datapoints. If you specify all of these, then it can load the data in one
pass. If you don't, it runs 'scan' internally to calculate them, seeks
back to the beginning of the file, and then loads the data. If you run
'scan' yourself, it will give you this information. It's not terribly
important, but if you have enormous files and you don't want to read
through them twice, and you know those numbers, you can give them to
'load' to speed things up, as in the sketch below.
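For example (a sketch only: the option names come from the answer above,
but their placement in the 'load' command is my guess, and huge.svm and
the numbers are made up; check svmsequel.pdf for the real syntax):

> load svm file huge.svm to train clump 0 from 1 low 1 high 1000 points 50000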
- Q: do you have an example of using the discrete kernel?
A: This isn't well tested, but the following should work:
+1 discrete:1:red discrete:2:happy
+1 discrete:1:blue discrete:2:happy
-1 discrete:1:green discrete:2:sad
-1 discrete:1:yellow discrete:2:sad
then
> new data train
> add discrete clump to train cardinality 4
> add discrete clump to train cardinality 2
> load svm file foo.svm to train clump 0 from 1 clump 1 from 2
The discrete kernel is the least well-tested thing, though, so you might
encounter bugs.
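For intuition about the discrete kernel itself: information diffusion on
a discrete set of cardinality n is the heat kernel on the complete graph
K_n (Kondor and Lafferty, 2002). The OCaml sketch below is my reading,
not necessarily SVMsequel's exact implementation, and the diffusion time
t is a parameter I'm assuming:

  (* Heat kernel on the complete graph over n symbols: values are
     compared only by equality, and the kernel decays toward the
     uniform value 1/n as the diffusion time t grows. *)
  let discrete_diffusion ~cardinality:n ~time:t x y =
    let nf = float_of_int n in
    if x = y then (1.0 +. (nf -. 1.0) *. exp (-. nf *. t)) /. nf
    else (1.0 -. exp (-. nf *. t)) /. nf

So with cardinality 4 (the color feature above), matching values score
strictly higher than mismatching ones, with the gap controlled by t.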
- Q: is there a way to display the models in readable form?
A: Here is some code to do that. Build it with:
ocamlc suffixTree.ml suffixTreeKernel.ml vocabulary.ml util.ml binaryIO.ml
datapoint.ml dumpModelInfo.ml -o dumpModelInfo
It's nothing fancy, and doesn't dump some information (normalization
information, in particular), but should probably suffice for what you
want.
- Q: Can I control how svmsequel combines multiple string
features? Can I specify different lambdas for the different
parameters?
A: Ah, okay, so this isn't clear.
A single string clump can only hold one string (this probably should
elicit a warning). So what's happening when you say "all" is that first
feature 1 is loaded into the string clump, then feature 2 overwrites it,
then 3 overwrites it, etc., so you're left with the lemmatized version.
What you want to do is create a different clump for each string feature.
Then, you can give each one a different lambda. You can also adjust (or
better yet, optimize) the weights for the features. Something like:
new data train
add ststring clump to train lambda 0.5
add ststring clump to train lambda 0.25
add ststring clump to train lambda 0.75
add ststring clump to train lambda 0.9
load svm file test.svm to train clump 0 from 1 clump 1 from 2
clump 2 from 3 clump 3 from 4
cross-validate train 10 folds
optimize model m from train 0:weight:log2:-2:1:2 1:weight:log2:-2:1:2
2:weight:log2:-2:1:2 3:weight:log2:-2:1:2
(the relevant data is:
3 string:1:Who#was#the#22nd#President#of#the#US#? string:2:WP#VBD#DT#JJ#NN#IN#DT#NP#SENT string:3:who#be#the#22nd#president#of#the#US#? string:4:Who#WP#was#VBD#the#DT#22nd#JJ#President#NN#of#IN#the#DT#US#NP#?#SENT
5 string:1:What#is#the#money#they#use#in#Zambia#? string:2:WP#VBZ#DT#NN#PP#VVP#IN#NP#SENT string:3:What#be#the#money#they#use#in#Zambia#? string:4:What#WP#is#VBZ#the#DT#money#NN#they#PP#use#VVP#in#IN#Zambia#NP#?#SENT
1 string:1:How#many#feet#in#a#mile#? string:2:WRB#JJ#NNS#IN#DT#NN#SENT string:3:How#many#foot#in#a#mile#? string:4:How#WRB#many#JJ#feet#NNS#in#IN#a#DT#mile#NN#?#SENT
5 string:1:What#is#the#birthstone#of#October#? string:2:WP#VBZ#DT#NN#IN#NP#SENT string:3:What#be#the#birthstone#of#October#? string:4:What#WP#is#VBZ#the#DT#birthstone#NN#of#IN#October#NP#?#SENT
4 string:1:What#is#e-coli#? string:2:WP#VBZ#NNS#SENT string:3:What#be##? string:4:What#WP#is#VBZ#e-coli#NNS#?#SENT
)
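Each 0:weight:log2:-2:1:2 term, by analogy with the lambda grid in the
next question, presumably sweeps that clump's weight over a log_2 grid
from 2^-2 to 2^2 in steps of 1 (i.e. 0.25, 0.5, 1, 2, 4). As far as I
can tell, the per-clump weights just act as coefficients in a weighted
sum of the per-clump kernels, as in this OCaml sketch (illustrative, not
SVMsequel's actual code):

  (* Combine one kernel per clump into a single kernel by a weighted
     sum; ks and ws must have equal length. *)
  let combined_kernel ks ws x y =
    List.fold_left2 (fun acc k w -> acc +. w *. k x y) 0.0 ks ws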
- Q: can you cross-validate parameters for string kernels? what values are suggested?
A: Typically, people use 0.75 or 0.9. This is the lambda parameter. You
specify it by, e.g.:
add ststring clump to train lambda 0.9
You *can* cross-validate it, for example by:
optimize model m from train 0:lambda:0.5:0.1:1
which will try lambda = 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0 (the grid format
appears to be clump:parameter:start:step:end).
- Q: can i use ssv or csv files with string kernels?
A: Hrm. To be honest, I've never tried using string kernels with ssv or csv
data. There's obviously a bug there. Just convert your data to svm file
format and it should be fine. I'll take a look at it though; thanks for
pointing it out.
- Q: why am I only getting 50% accuracy on binary problems?
A: There is
currently a small bug in estimating the bias parameter that affects binary
classifiers (since most things I do are multiclass, I didn't notice this
immediately). My guess is that the predictions svmsequel is making for
you are either all positive or all negative. If this is the case, you
might try using the probabilistic classifier option (estimate sigmoid for
<model> from <dataset>) and see if this fixes the problem (estimating
probabilistic parameters essentially re-estimates the bias). If my
guess is wrong, or this doesn't work, please let me know.
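For example, with a model m built from dataset train, as in the earlier
examples:

> estimate sigmoid for m from train

The name suggests Platt-style sigmoid fitting: the SVM margin f(x) is
mapped through a sigmoid to P(y = +1 | x), and the point where that
probability crosses 0.5 effectively supplies a corrected bias. A
one-line OCaml sketch of the mapping (a and b are whatever the
estimation fits; this is my reading, not SVMsequel's documented
internals):

  (* Platt (1999)-style probabilistic output from an SVM margin f. *)
  let prob_positive ~a ~b f = 1.0 /. (1.0 +. exp (a *. f +. b))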