Fast Boosted/Bagged Decision Tree Learner
Hal Daume III

I've written a short, very fast, very simple decision tree learner that supports both bagging and boosting (incidentally, boosted decision trees are becoming known as one of the best generic learning techniques). The code is in O'Caml and is terribly simple to compile. First, download FastDT.ml. Now, run:

ocamlopt str.cmxa bigarray.cmxa FastDT.ml -o FastDT
If you want faster code, add -unsafe -inline 100 to that command line.
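For example, with those optimization flags added, the full build command looks like this:

ocamlopt -unsafe -inline 100 str.cmxa bigarray.cmxa FastDT.ml -o FastDT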

Alternatively, precompiled binaries are available for download.

Usage: An example input file is available here. Each line is one example: the first column is the class (0 or 1), and each remaining column is a feature name. Features need not be numeric; any string is fine. FastDT can be run on this file by saying:

FastDT test-input > test-tree
This uses default options. Neither bagging nor boosting is done. The tree is limited to a depth of 10. All features are used.
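For concreteness, a small made-up input file in this format (class label first, followed by arbitrary string feature names) might look like:

1 contains_free contains_money all_caps_subject
1 contains_free many_exclamation_marks
0 contains_meeting contains_agenda
0 contains_meeting from_known_sender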

Options: You can specify -bag # or -boost # to bag or boost #-many trees. You can change the depth limit by saying -maxd 50. You can prune all features that occur fewer than 5 times by saying -minfc 5. You can have it stop expanding nodes that already belong almost surely (to within probability 0.01) to one of the classes by saying -rho 0.01.
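For example, to boost 100 trees of depth at most 4 while dropping features that occur fewer than 5 times, you could run something like:

FastDT -boost 100 -maxd 4 -minfc 5 test-input > test-tree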

Prediction: To predict, run:

FastDT -load test-tree test-input
The error will be printed to stderr and the predictions (line by line) will be written to stdout.
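Since the predictions go to stdout and the error to stderr, you can save the predictions to a file while still seeing the error report on the terminal, e.g. (test-predictions is just an example filename):

FastDT -load test-tree test-input > test-predictions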

Tree format: The file format for the trees is simple. The first line is the number of trees (for simple DTs this is one; for bagged or boosted DTs it is the size of the committee). Then, for each tree, there is an initial line giving the tree's weight in the committee, followed by the tree itself. The tree is printed one node per line, internal nodes prefixed by N and leaves by L. Internal nodes then list the feature they split on. Leaves list two values: the number of times the training data was class one versus class zero at this point in the tree. The trees are printed in a depth-first, left-to-right manner (i.e., an internal node is printed first by printing the node line with the split feature, then the entire left (true) child is printed, then the entire right (false) child is printed).
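As an illustration (with made-up feature names and counts, and not necessarily the exact number formatting FastDT uses), a single small tree might be written as:

1
1
N contains_free
N all_caps_subject
L 40 2
L 7 5
L 3 51

The first line says there is one tree and the second gives its weight; the remaining lines are the root, then its entire left (true) subtree (an internal node with two leaves), then its right (false) leaf.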

Efficiency: The code is as efficient as I could make it without bad approximations. It tends to eat a bit of memory, but that's hard to get around without switching to online learning. It uses information gain as the splitting criterion, but doesn't actually compute entropies, because computing log is slow. Instead, I use a quadratic approximation (Taylor series) to the entropy term, which is quite accurate and significantly faster. Much computation is cached, and at this point a significant fraction of the run time is spent in I/O rather than computation. As it stands, it takes roughly 20 minutes to train a committee of 20 bagged trees of depth 20 on 14k instances with 200k features. Boosting the same data takes roughly 30 minutes. (On reasonably fast machines.)
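To make the entropy trick concrete, here is a minimal O'Caml sketch of one such quadratic (second-order Taylor) approximation, expanded around p = 1/2; it illustrates the idea but is not necessarily the exact approximation used in FastDT.ml:

(* ln 2 computed once, so the approximation needs no log calls in the inner loop *)
let ln2 = log 2.0

(* exact binary entropy, for comparison *)
let entropy_exact p =
  if p <= 0.0 || p >= 1.0 then 0.0
  else -. p *. log p -. (1.0 -. p) *. log (1.0 -. p)

(* H(1/2) = ln 2, H'(1/2) = 0, H''(1/2) = -4,
   so H(p) is approximately ln 2 - 2 (p - 1/2)^2 *)
let entropy_approx p =
  let d = p -. 0.5 in
  ln2 -. 2.0 *. d *. d

let () =
  List.iter
    (fun p ->
       Printf.printf "p=%.2f  exact=%.4f  approx=%.4f\n"
         p (entropy_exact p) (entropy_approx p))
    [0.05; 0.25; 0.5; 0.75; 0.95]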

Conditions of use: If you find this code useful, great! Please use it for whatever purposes you want. If you use it for research purposes, and publish a paper, please include a footnote citing this program.