Introduction
This is a brief note to explain multiclass classification in VW,
ending with a description of the label-dependent-features (ldf) format,
which is likely to seem somewhat strange to some people.
Basic Multiclass Classification
In standard multiclass classification, you have a fixed number of
labels (K) and want to drop inputs into one of those K buckets. Here
is an example of a basic multiclass dataset with four labels in VW format
(do NOT copy the line numbers):
1. 1 | a b c
2. 3 | b c d
3. 1 | a c e
4. 4 | b d f
5. 2 | d e f
This consists of five examples, and if you know the normal VW format you
should be able to figure out what this does.
You can train a one-against-all model by running:
% vw --oaa 4 < mc.txt
This says to do one-against-all classification with 4 labels (you have
to tell VW how many labels there are). If you specify too few, you'll
get a warning. If you specify too many, nothing will really go wrong
except you'll waste representation power.
If you want to make the third example ten times as important as the
others, you can add weights to the examples:
1. 1 | a b c
2. 3 | b c d
3. 1 10 | a c e
4. 4 | b d f
5. 2 | d e f
OAA classification proceeds by turning each multiclass example into K
binary examples. For example, internally, the second example in the
above dataset (the one with label "3") gets turned into:
-1 | 1_b 1_c 1_d
-1 | 2_b 2_c 2_d
+1 | 3_b 3_c 3_d
-1 | 4_b 4_c 4_d
Note that these examples don't explicitly get constructed: internally
VW does some clever feature index trickery to get different features
for the different labels (result: it's fast).
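To make the expansion concrete, here is a small Python sketch (purely
illustrative: VW never materializes these strings, and oaa_expand is a
name made up for this note):

def oaa_expand(label, features, num_labels):
    # Turn one multiclass example into num_labels binary examples,
    # prefixing each feature with the label id it belongs to.
    out = []
    for k in range(1, num_labels + 1):
        sign = "+1" if k == label else "-1"
        prefixed = " ".join("%d_%s" % (k, f) for f in features)
        out.append("%s | %s" % (sign, prefixed))
    return out

# The second example above, "3 | b c d", with 4 labels:
for line in oaa_expand(3, ["b", "c", "d"], 4):
    print(line)

Running this prints exactly the four binary examples shown above.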
Test examples look just like normal VW test examples, for instance:
1. | b d e
Cost-Sensitive Multiclass Classification
Slightly more complex than basic multiclass classification is
cost-sensitive multiclass. Here, instead of just having one correct
label (and all others incorrect), you can have different costs for
each of the K different labels. You can also have different
"allowable" labels for each example. A cost-sensitive training set
equivalent to the one above is:
1. 1:0 2:1 3:1 4:1 | a b c
2. 1:1 2:1 3:0 4:1 | b c d
3. 1:0 2:1 3:1 4:1 | a c e
4. 1:1 2:1 3:1 4:0 | b d f
5. 1:1 2:0 3:1 4:1 | d e f
The first line means that label 1 has a cost of zero and labels 2, 3
and 4 all have a cost of one. You can train this by:
% vw --csoaa 4 < cs.txt
Here, csoaa means "cost-sensitive one against all". You can also train
weighted all pairs by:
% vw --wap 4 < cs.txt
Test examples are different from standard VW test examples
because you have to tell VW which labels are allowed. The following is
equivalent to the above test example:
1. 1 2 3 4 | b d e
At training time, if there's an example with label 2 that you know
(for whatever reason) will never be label 4, you could specify it as:
1. 1:1 2:0 3:1 | example...
This means that labels 1 and 3 have a cost of 1, label 2 has a cost of
zero, and no other labels are allowed. You can do the same at test
time:
1. 1 2 3 | example...
VW will never predict anything other than the provided "possible" labels.
Internally, csoaa does something different than you might expect. If
there are K possible labels, it constructs K regression
problems, where the input is a (label-based) version of the features
and the output is the cost of that label. For instance, the second
example in the above cs.txt gets converted to the following
four regression problems:
1 | 1_b 1_c 1_d
1 | 2_b 2_c 2_d
0 | 3_b 3_c 3_d
1 | 4_b 4_c 4_d
And then at test time, the label whose regressor gives the lowest
value (predicted cost) is chosen.
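Put differently, prediction is just an argmin over per-label regressors.
Here is a rough Python sketch, assuming one linear regressor per label
stored as a feature-to-weight dict (the names are made up for this note):

def csoaa_predict(regressors, features):
    # regressors: {label: {feature: weight}}
    # Return the label whose regressor predicts the lowest cost.
    def predicted_cost(weights):
        return sum(weights.get(f, 0.0) for f in features)
    return min(regressors, key=lambda label: predicted_cost(regressors[label]))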
If you use WAP, it does what you might expect. It constructs (K choose
2) training examples, where the correct class is the one from the pair
with the lower cost, and the weight is (a function of) the difference
of costs. In the case that it's just a standard multiclass dataset
(one label with cost zero, the rest with cost one) you only get K-1
training examples, as follows (for the second example):
-1 | 1vs3_b 1vs3_c 1vs3_d
-1 | 2vs3_b 2vs3_c 2vs3_d
-1 | 4vs3_b 4vs3_c 4vs3_d
Again, this is all done with feature index manipulation to be
efficient.
At test time, you don't actually have to run all (K choose 2)
classifiers; you can just make K predictions (taking advantage of
linearity: if you're training a non-linear model this trick won't
work).
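For a concrete picture of the pair construction, here is an illustrative
Python sketch. The naming and weighting conventions are simplified (the
real reduction uses a slightly more careful function of the costs), so
treat this as a cartoon rather than VW's implementation:

from itertools import combinations

def wap_pairs(costs, features):
    # costs: {label: cost}. Yield one weighted binary example per pair of
    # labels with unequal cost; pairs with equal cost carry no signal.
    for i, j in combinations(sorted(costs), 2):
        weight = abs(costs[i] - costs[j])
        if weight == 0:
            continue
        sign = "+1" if costs[i] < costs[j] else "-1"  # does label i beat label j?
        prefixed = " ".join("%dvs%d_%s" % (i, j, f) for f in features)
        yield "%s %g | %s" % (sign, weight, prefixed)

# The second example above, with costs {1:1, 2:1, 3:0, 4:1}:
for line in wap_pairs({1: 1, 2: 1, 3: 0, 4: 1}, ["b", "c", "d"]):
    print(line)

This prints K-1 = 3 weighted examples, such as "-1 1 | 1vs3_b 1vs3_c 1vs3_d".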
Label-Dependent Features Format
One thing you may have noticed is that all of the above examples
assume that the way features get "exploded" from the multiclass
problem to the binary problems is deterministic and
simplistic. Basically, any feature "f" just gets prefixed with the
label id (e.g., "1_f", "2_f", etc.) for making predictions. Another
way to think about this is that you're training K completely separate
classifiers, each with its own feature space.
Sometimes you actually know something about your labels that makes
this less ideal. For example, suppose you have four document labels
(inspired by a dataset due to Andrew McCallum) that talk about
automobiles or airplanes. But some are talking about real automobiles
(or airplanes) and some are talking about toy automobiles (or
airplanes). Thus the four labels are: RealAuto ToyAuto RealAir ToyAir
(call them 1, 2, 3, 4 in that order). You might believe that it's
actually useful to share parameters across the classes. For example,
when you have a basic feature "f" on an example with label RealAuto,
you might want to have three versions of "f" available for
classification: "Real_f", "Auto_f" and "RealAuto_f". If you use the
normal support for cost-sensitive classification, you'd only get the
last of these.
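As a sketch of what sharing parameters means here, the following Python
snippet builds the three feature versions from the label structure (the
label table and the helper name are hypothetical, just for this example):

LABEL_PARTS = {1: ("Real", "Auto"), 2: ("Toy", "Auto"),
               3: ("Real", "Air"),  4: ("Toy", "Air")}

def ldf_features(label_id, base_feature):
    # Produce the shared-parameter versions of one base feature for one label.
    a, b = LABEL_PARTS[label_id]
    return ["%s_%s" % (a, base_feature),       # shared with the other label of the same realness
            "%s_%s" % (b, base_feature),       # shared with the other label of the same vehicle type
            "%s%s_%s" % (a, b, base_feature)]  # specific to this exact label

print(ldf_features(1, "f"))   # ['Real_f', 'Auto_f', 'RealAuto_f']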
The ldf format was invented to cope with problems like this. The
easiest way to specify ldf examples is in a multiline format. This
means that each label gets its own line in the input file, and
blank lines separate examples. For instance, you could have something
like the following as a valid two-example dataset for the above
problem:
1. 1:0 | Real_f Auto_f RealAuto_f
2. 2:1 | Toy_f Auto_f ToyAuto_f
3. 3:1 | Real_f Air_f RealAir_f
4. 4:1 | Toy_f Air_f ToyAir_f
5.
6. 1:1 | Real_g Auto_g RealAuto_g
7. 2:1 | Toy_g Auto_g ToyAuto_g
8. 3:0 | Real_g Air_g RealAir_g
9. 4:1 | Toy_g Air_g ToyAir_g
10.
(Note the trailing newline: it is necessary, and it's also a good idea
in case you want to concatenate two files later.)
As before, the labels don't need to be in order and you need not
include options for all of them. They can also have costs other than
zero/one, if you like.
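A minimal reader for this blank-line-separated format might look like
the following Python sketch (this is not VW's parser, just an
illustration of how the blocks are delimited):

def read_multiline(lines):
    # Yield one example per block: a run of non-blank lines ended by a blank line.
    block = []
    for line in lines:
        if line.strip():
            block.append(line.rstrip("\n"))
        elif block:
            yield block
            block = []
    if block:            # tolerate a missing final blank line
        yield block

Each yielded block is a list of per-label lines, like the two four-line
examples above.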
If this data is in csldf.txt, you can train with:
% vw --csoaa_ldf multiline < csldf.txt
Note that with the ldf format you don't need to specify the number of
labels ahead of time, basically because playing with feature ids is
now your job, not VW's. (You can also say "m" instead of "multiline"
for brevity; yes, there's also a single-line format, but it's not
recommended.)
This really does just turn these examples into regression problems,
essentially stripping out the "label:" on the lines and training a
model. However, the loss function reported during training is
classification loss.
In this case, it actually might make sense to have features
that appear for every class, as in:
1. 1:0 | g Real_f Auto_f RealAuto_f
2. 2:1 | g Toy_f Auto_f ToyAuto_f
3. 3:1 | g Real_f Air_f RealAir_f
4. 4:1 | g Toy_f Air_f ToyAir_f
5.
If there are a lot of such features, you can share them across a
single example (blank-line-separated block) by:
1. shared | g
2. 1:0 | Real_f Auto_f RealAuto_f
3. 2:1 | Toy_f Auto_f ToyAuto_f
4. 3:1 | Real_f Air_f RealAir_f
5. 4:1 | Toy_f Air_f ToyAir_f
6.
In this case, this doesn't seem obviously helpful, but when "g" is a
lot of features, this can save both disk space and prediction time
(because dot products with these shared features can be cached).
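For a linear model the saving is easy to see: the score of each label
line decomposes into a shared part plus a label-specific part, so the
shared part only needs to be computed once per block. A tiny sketch
(names made up, dict-based weights assumed):

def score(weights, features):
    return sum(weights.get(f, 0.0) for f in features)

def predict_block(weights, shared_features, per_label_features):
    # per_label_features: {label: [features]}
    shared_score = score(weights, shared_features)  # computed once, reused for every label
    return min(per_label_features,
               key=lambda k: shared_score + score(weights, per_label_features[k]))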
You can also do WAP-ldf as:
% vw --wap_ldf m < csldf.txt
This trains a classifier in the same way as WAP does above (except
that the trick for going from O(K^2) to O(K) no longer applies,
unfortunately).
You might find it weird that csoaa_ldf trains a regression model. If
you want it to do the more obvious thing and really train K
classifiers in a one-against-all framework, you can do this by
running:
% vw --csoaa_ldf mc < csldf.txt
Here, we've called csoaa_ldf with argument "mc" rather than just
"m". "mc" means multiline-classifier, and tells it to train
classifiers rather than regression models.
Test examples look the same, except they have no costs, as in:
1. shared | g
2. 1 | Real_f Auto_f RealAuto_f
3. 2 | Toy_f Auto_f ToyAuto_f
4. 4 | Toy_f Air_f ToyAir_f
5.
(Note the trailing newline.) Here, we've said that we know a priori
that label 3 is invalid, so VW should not predict it.
Additional Notes
If you get a warning about ring size when running in ldf mode, it
means that you have a multiline example that contains more lines than
VW is willing to hold in memory at once. You can fix this by adding
"--ring_size 1024" or something like that (the default is 256). Just
make sure this number is greater than the maximum number of labels for
any given example.
In ldf mode, having shared features is useless for WAP or
classifier-based OAA: the predictions are all based on differences of
feature vectors, and the shared features simply cancel out.
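You can check the cancellation directly with the same kind of linear
score as in the earlier sketches (the weights here are arbitrary, chosen
to be exact in binary so the equality holds without rounding issues):

def score(weights, features):
    return sum(weights.get(f, 0.0) for f in features)

w = {"g": 3.0, "Real_f": 2.0, "Toy_f": -1.0}
shared, a, b = ["g"], ["Real_f"], ["Toy_f"]
# The shared feature "g" contributes equally to both sides of the comparison,
# so it vanishes from the difference that drives the prediction.
assert score(w, shared + a) - score(w, shared + b) == score(w, a) - score(w, b)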
Be careful when changing the loss function. If you're doing one of the
regression-based models (csoaa) you need to use a regression loss and
should probably stick to squared loss. If you're doing one of the
classification-based models (oaa, wap, csoaa_ldf mc or wap_ldf) then
you might want to use a classification loss, like logistic or
hinge.