SVMseq Documentation
Hal Daume III

SVMseq is now defunct!
New version 1.2 available! (14 Dec 2003)

Development of SVMseq has been discontinued; it has been replaced by SVMsequel. Please use that instead from now on. This page will remain here for memory's sake.


Contents:

  Introduction
  Why SVMseq?
  Using SVMseq
  Using String Kernels
  Using Tree Kernels
  One-Class SVMs
  Download
  Version Changes
  References

Introduction

SVMseq is a Haskell implementation of the gradient descent optimization algorithm developed by Vijayakumar and Wu (1999) [1] for computing the Lagrangian multipliers for support vector machines (that is, it is an implementation of a machine learning algorithm). Furthermore, SVMseq implements sample selection, as described by Schohn and Cohn (2000) [2]. Sample selection is very convenient to implement in the sequential optimization framework and makes optimization very fast.

Why SVMseq?

Why not just use SVM-Light for all your SVM needs? Well, SVMseq offers certain advantages. (Of course, SVM-Light also offers certain advantages.) I've listed some of those here:

Pros of SVMseq:

Pros of SVM-Light:

Using SVMseq

SVMseq comes with two main programs, SVMseqLearn and SVMseqClassify, which, not surprisingly, perform learning and classification, respectively. SVMseqClassify usage is straightforward. SVMseqLearn is a bit more complicated. Running it with no arguments gives you the help screen, which is only moderately helpful (though it does list all options):

SVMseq Learner Version 1.1

Usage: SVMseqLearn [options] example-file model-file
   or: SVMseqLearn stdin < config-file
            where config-file just writes out options

  -?        --help                 this help
  -f v                             pretend we are SVM-Light version v
  -c float                         C: trade-off between training error and margin (1)
  -t int                           type of kernel: 0-linear, 1-poly, 2-rbf, 3-sigmoid (0)
  -d int                           poly:     degree (2)
  -g float                         rbf:      gamma (1/sigma^2) (1)
  -s float                         sig/poly: scaling factor (1)
  -r float                         sig/poly: additive factor (1)
  -i int                           maximum iterations (1000), -1 for infinite
  -h float                         change in min h to stop (0.000001), 0 for none
  -v int                           compute training error every v iterations (0), 0 for never
  -l float                         lambda (affects bias)
  -m int                           max memory to use for kernel cache in mb (40)
            --sigmoid              fit a sigmoid (for probabilistic classification)
            --rank                 do preference ranking
            --rank-margin=float    margin for ranking (0)
            --rank-highmem         use lots of memory for ranking
            --string-lambda=float  lambda value for string kernel (0.9)
            --string-maxn=int      gram length for string kernel (1)
            --tree-lambda=float    lambda value for tree kernel (0.5)
            --tree-maxd=int        max depth for tree kernel (4)
            --test=file            file containing test data
            --mu=float             momentum coefficient for gradient descent, 0=no momentum (0)
            --noread               leave training data on disk
            --dump-h-freq=int      dump lagrangians every k iterations, 0 for none (0)
            --dump-h-file=file     dump lagrangians to this file (dump-h)
            --read-h-iter=int      read iteration number from lagrangian file, 0=none (0)
            --read-h-file=file     read lagrangians from this file (dump-h)
            --ss                   use sample selection
            --ss-0=int             initial # of samples selected (32)
            --ss-batch=int         # of samples to add each time (8)
            --ss-i=int             iteration frequency to add samples (20), 0 for none
            --ss-h=float           change in min h to add new samples (0.00001), 0 for none
            --ss-x=int             max sample size, 0 for none (0)
            --rand-delta=int       combinatorial dimension of the data => use randomized training, 0=no randomized (0)
            --rand-mini=int        minimum number of iterations of randomized selection to run (1)
            --rand-maxi=int        maximum number of iterations of randomized selection to run (6 delta ln |examples|)

You can either put your arguments on the command line or, if you prefer, write them to a file and pipe that file in, giving stdin as the only argument; an example of both is shown below. Other than that, most arguments are very similar to those of SVM-Light. Defaults are written in parentheses at the end of each line. Input and output files are in exactly the same format as SVM-Light; see that page for more documentation.
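
For example, a run with an RBF kernel and C = 10 could be started in either of the following ways. The file names are only illustrative, and I am assuming the config file simply lists the options (including the file names) exactly as they would appear on the command line:

SVMseqLearn -t 2 -g 0.5 -c 10 train.dat model.dat

echo "-t 2 -g 0.5 -c 10 train.dat model.dat" > my-options
SVMseqLearn stdin < my-options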

-f SVMseq outputs models in an identical format to SVM-Light, except that it prints a different identifier at the top. If you wish to use SVMseqLearn with SVM-Light's classifier, you'll need to either change this manually or specify "-f 5.00" to output the model with an SVM-Light header. Default is to not pretend we are SVM-Light.
-c Specifies the cost for misclassification for soft margins. If you don't know what this means, read up on an introduction to SVMs, for instance [3]. Default is 1.
-t Specifies the kernel type, one of:
   0: linear       K(x,x') = x*x'
   1: polynomial   K(x,x') = (r + s x*x')^d
   2: radial       K(x,x') = exp(-g ||x-x'||^2)
   3: sigmoid      K(x,x') = tanh(r + s x*x')
In all of these, * means dot product. Default is linear. (A small, purely illustrative Haskell sketch of these four kernels is given after this option list.)
Note, of course, that a sigmoid "kernel" doesn't satisfy Mercer's condition and so really isn't officially a kernel. It's provided since people have historically found it useful.
-d Degree for polynomial kernel. Default is 2.
-g Inverse variance for radial kernel. Default is 1.
-s Scaling factor for polynomial and sigmoidal kernels. Default is 1.
-r Additive factor for polynomial and sigmoidal kernels. Default is 1.
-i Specify the maximum number of iterations to run or, if you want it to run until the change of lagrangians is sufficiently low, specify -1 for infinite iterations. Default is 1000, though this is likely substantially larger than necessary.
-h Minimal change in lagrangians to keep going. The default value is 0.000001, which means that it will stop once the change is strictly less than 0.000001. Obviously a value of 0 will cause this to never stop. Warning: If you specify -i -1 -h 0, the program will not terminate.
-v How often to compute training (and testing) error, in terms of number of iterations. With 0, this is never computed (except at the end). Default is 0 (never).
-l Specifies the lambda value for the optimization. Choosing 1 should be fine (this is the default). If you want more information about lambda, read [1].
-m Specify the maximum amount of memory (in megabytes) to use for the kernel cache. The larger, the faster, but the more memory intensive. I'd suggest making this large and using --noread (see later), but it's up to you. Default is 40 (i.e., 40 megabytes).
--sigmoid Compute the A and B terms for probabilistic outputs; that is, fit P(y=1|x) = 1 / (1 + exp(A*f(x) + B)) to the classifier outputs f(x), following Platt. If you want to get probabilities out of the classifier, you must train these terms. Then use the -p or --sigmoid option to SVMseqClassify. See [8] for more information.
--rank Do preference ranking; see [4] for more information.
--rank-margin Only consider rank constraints where the difference between the two values is at least this value (causes fewer constraints and hence faster optimization). Default is 0 (i.e., consider all).
--rank-highmem Usually when doing ranking we don't actually construct the ranking constraints in memory; this means we have to do extra work to compute dot products. Specifying this will cause the program to store all the constraints and will hence take more memory, but will cause the optimization to complete faster (unless it caches all the kernel evaluations -- see -m).
--string-lambda Specifies the lambda value used to back off the string kernel value when tokens are too far separated (see "Using String Kernels" below). Should be in the range from zero to one, inclusive. Default is 0.9.
--string-maxn Specifies the maximum length of substrings to consider. Should be strictly greater than zero. Default is 1.
--tree-lambda Specifies the lambda value used to downweight big subtrees (see "Using Tree Kernels" below). Should be in the range from zero to one, inclusive. Default is 0.5.
--tree-maxd Specifies the maximum depth of subtrees to consider. Should be strictly greater than zero. Default is 4.
--test Read examples from this file for computing test error rates.
--mu In the gradient descent, we can use momentum. If you want to (you may converge faster, but it's not guaranteed), specify the coefficient here; 0 means no momentum. Default is zero.
--noread This leaves the training data on disk and does not read the examples into memory. For enormous training problems (especially when you're using sample selection), this is probably wise. You can compensate for the slow access time by using a large kernel cache (see -m).
--dump-h-freq You can dump the lagrangians every n iterations, allowing you to kill the process at some point and then restart it. You can also use this as approximate solutions. 0 means don't dump. Default is 0.
--dump-h-file Specifies the file to dump lagrangians to. Default is dump-h.
--read-h-iter If you've previously dumped lagrangians, you can read them back. This specifies the iteration number to read. 0 means don't read. Default: 0.
--read-h-file Specifies the file to read the lagrangians from. Default is dump-h. Note that when used in conjunction with sample selection, you must specify that the initial number of samples is the same as the number of samples specified in the lagrangian file at the given iteration. (Of course, we'll tell you that you've got it wrong if you do.)
--ss Do sample selection. See the remaining arguments for how to tune sample selection to your needs. You probably want to use this in conjunction with --noread and a large value for -m; an example invocation is given after this option list.
--ss-0 Specify the number of samples to start out with (these are chosen randomly). Default is 32.
--ss-batch Specify the number of samples to add every time samples are added. Remember [2] that every time samples are added, each unadded example has to be classified using the current hyperplane, which can be expensive. So you should specify a fairly large number here. Default is 8, which is too low for practical purposes.
--ss-i Add samples after this many iterations, 0 means never add samples by counting iterations. You should give the hyperplane enough time to settle a bit before adding more samples, otherwise it will just get confused. Default is 20 and should probably be higher for practical purposes.
--ss-h Instead of (or in addition to) adding samples based on iteration counts, you can add them according to minimal changes in the lagrangians, just like -h. 0 means none. Default is 0.00001.
--ss-x Specify the maximum sample size. This is useful so that the program can converge after adding a certain number of samples without having to worry about adding any more. 0 means none. Default is 0.
--rand-delta Specify the combinatorial dimension of the data. If this is a nonzero value, then we use quasi-linear randomized training [5]. If you don't know the combinatorial dimension of the data, then simply choose the value r you would like for the training set sizes (note that r is then an upper bound on the number of SVs) and then choose delta as sqrt(r) / 6; for example, r = 900 gives delta = sqrt(900) / 6 = 5. Default is 0.
--rand-mini Specify the minimum number of randomized iterations to run. Default is 1.
--rand-maxi Specify the maximum number of randomized iterations to run. A value of 0 means only to stop when the SVs generated by the randomized examples correctly classify all the other data points (and should only be used when data is linearly separable). Default is 0.
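
As a rough illustration of the sample-selection options, a run might look like the following (the file names and the particular numbers are only an example; the right values depend on your data):

SVMseqLearn --ss --ss-0=64 --ss-batch=64 --ss-i=50 --noread -m 200 train.dat model.dat

This starts from 64 randomly chosen examples, adds 64 more every 50 iterations, leaves the training data on disk, and allows a 200 megabyte kernel cache.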
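
For concreteness, here is a small Haskell sketch of the four kernels selected by -t. It is purely illustrative: the function names, the Vec type, and the dense-vector representation are my own and are not SVMseq's internals.

-- Illustrative sketch of the -t kernels; not SVMseq's actual code.
type Vec = [Double]

dot :: Vec -> Vec -> Double
dot x x' = sum (zipWith (*) x x')

linearK :: Vec -> Vec -> Double                           -- -t 0: x*x'
linearK = dot

polyK :: Double -> Double -> Int -> Vec -> Vec -> Double  -- -t 1: (r + s x*x')^d
polyK r s d x x' = (r + s * dot x x') ^ d

rbfK :: Double -> Vec -> Vec -> Double                    -- -t 2: exp(-g ||x-x'||^2)
rbfK g x x' = exp (negate g * sum (map (^ (2 :: Int)) (zipWith (-) x x')))

sigmoidK :: Double -> Double -> Vec -> Vec -> Double      -- -t 3: tanh(r + s x*x')
sigmoidK r s x x' = tanh (r + s * dot x x')

main :: IO ()
main = print (rbfK 1 [1,0,0] [0,1,0])                     -- prints exp(-2), about 0.1353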

Using String Kernels

String kernels enable you to compute features over substrings of arbitrary length (set by --string-maxn). The lambda value specifies how much penalty there is for splitting a string up. A value of 1 means that you don't care how far away words are while a value of 0.001 means you care a lot. Of course this is irrelevant if --string-maxn is 1.

Strings can be specified in feature files as long as their id begins with the string "string:". Note that you can mix string features with non-string features (this is possible since linear combinations of kernels are again kernels). You can also have multiple string features (in theory you could use a different maxn and lambda for each, but this isn't implemented). Note that string ids still need to be distinct from non-string ids. Strings should be tokenized by the pound (#) character (that is, spaces are replaced by #). Here is an example which mixes string features with non-string features:

+1 string:1:the#man#is#nice   2:5 string:5:svmseq#rocks
+1 string:1:the#woman#is#nice 3:1 string:5:so#does#svmlight
-1 string:1:i#eat#hamburgers  2:1 string:5:yay
Here, there are two string features (1 and 5) and two non-string features (2 and 3). You can combine these using any kernel you wish.
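
For example, to train on such a file using substrings of up to length 3 with a moderate decay, you might run something like this (the file names are illustrative):

SVMseqLearn --string-maxn=3 --string-lambda=0.5 -c 1 strings.dat strings.model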

Using Tree Kernels

Tree kernels enable you to compute features over subtrees of arbitrary depth (set by --tree-maxd). The lambda value (--tree-lambda) specifies how much deeper trees are discounted (see [7]). A value of 1 means no discounting. Of course this is irrelevant if --tree-maxd is 1.

Like string features, tree features are prefixed by the string "tree:". Trees are in Penn Treebank format, with the exception that spaces are replaced by pounds. Pounds themselves should therefore be replaced by some unique token which you can identify (for instance, perhaps the string "*pound*"). For example, for the Penn-style tree:

(S (NP (DT  "The")
       (NN  "man"))
   (VP (VBD "ate")
       (NP  (DT "a")
            (NN "sandwich")))
   (PERIOD "."))
We would have the following instance (if it were positive):
+1 tree:1:(S#(NP#(DT#"The")#(NN#"man"))#(VP#(VBD#"ate")#(NP#(DT#"a")#(NN#"sandwich")))#(PERIOD#"."))
Of course, as with strings, these can be combined with other features.

One-Class SVMs

To perform learning with one-class SVMs (or 1SVMs, see [6]), you needn't do anything special. Simply specify all instances as being from class "1" and the minimization will proceed as in [6]. The only warning is that C should be set to at least 1.
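
As a minimal illustration (the feature values and file names are made up), label every example with class 1:

+1 1:0.8 3:0.2
+1 2:1.0 3:0.4
+1 1:0.5 2:0.3

and train as usual, for instance with "SVMseqLearn -c 1 oneclass.dat oneclass.model".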

Download

You can download the source code, or an executable compiled using GHC 6.0.1 for Linux, Windows, Sparc Solaris or Mac OS X (thanks to Markus Schnell).

If you need support for tree kernels and will be compiling this yourself, you'll also need a few support libraries which deal with parse trees. If you don't need this, you can compile using -DNOTREEKERNEL and it will not use tree kernel features. The libraries you need are (note, you will need to ensure that the directory structure matches the module names when compiling): Util.DynamicMap, NLP.FiniteMap, NLP.PennParser, NLP.TreePos, PennUtil.DShow, PennUtil.Util.
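
For instance, to build without tree-kernel support yourself, a GHC invocation along the following lines should work (the name of the main source file here is my assumption; use whatever the distribution actually calls it):

ghc --make -cpp -DNOTREEKERNEL SVMseqLearn.hs -o SVMseqLearn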

You are free to use this program (though I don't promise anything about it) for whatever you want. If you do use it, and your use ends up being published, I'd greatly appreciate that (1) you email me and let me know -- I'm always interested when people use my software -- and that (2) you put a note (footnote is fine) in your publication that you used this software, together with a link to this page.

For those of you here at ISI, you can find executables in /nfs/isd3/hdaume/bin.i686/ (for Linux), /nfs/isd3/hdaume/bin.sun4/ (for Solaris).

Version Changes from 1.1 to 1.2

Version Changes from 1.02 to 1.1

References

[1] Sethu Vijayakumar and Si Wu (1999), Sequential Support Vector Classifiers and Regression. Proc. International Conference on Soft Computing (SOCO'99), Genoa, Italy, pp. 610-619.

[2] G. Schohn and D. Cohn (2000), Less is more: Active learning with support vector machines. Proc. International Conference on Machine Learning (ICML'00), Stanford University, USA, pp. 839-846.

[3] C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition". Data Mining and Knowledge Discovery, Vol. 2, Number 2, pp. 121-167, 1998.

[4] T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.

[5] J. Balcazar, Y. Dai and O. Watanabe, Provably Fast Training Algorithms for Support Vector Machines, 2001 IEEE International Conference on Data Mining (ICDM'01), San Jose, California.

[6] B. Scholkopf, J. Platt, J. Shawe-Taylor, A. Smola and R. Williamson, Estimating the support of a high-dimensional distribution, Technical Report 99-87, Microsoft Research, 1999.

[7] M. Collins and N. Duffy, Convolution Kernels for Natural Language, Advances in Neural Information Processing Systems 14, 2002.

[8] J.C. Platt, Probabilities for SV Machines, Advances in Large Margin Classifiers, MIT Press, 1999.


Questions, comments, or suggestions? Please email me.