The GDC software attempts to mimick the C4.5 system, except that it implements unsupervised clustering algorithms, rather than supervised methods. Currently supported clustering algorithms are: k-means, k-mediods and agglomerative clustering.
To run GDC on a Solaris 2.7 machine at ISI, run /nfs/isd/hdaume/bin/gdc; for other systems or external to ISI, download the source code (GDC.tar.gz) and compile it with GHC.
E1 0.5 1 1.2 0.2 C1
E2 1 0.75 1 1.3 C2
E3 1 1 1 1.4 C2
In this instance, there are three entries (E1, E2 and E3) with four entries in the feature vector and two classes (C1 and C2). The class names must begin with an alphanumeric and feature values must beging with a digit (i.e., you must write "0.2" not just ".2"). The entity name is also optional, in which case entities will be named En for n beginning at 1 and going down.
We allow Haskell-style comments; single line comments begin with -- and comment spands are enclosed in {- and -}.
% gdc test -o test --k-means
This would create several file, which I will explain briefly:
k chosen by slope statistic = 2
1: [["E1","E2","E3"]]
2: [["E2","E3"],["E1"]]
3: [["E3"],["E2"],["E1"]]
Which basically means that if we want one cluster, put them all in one
cluster; if we want two clusters (which is what it thinks is the best
choice), put E2 and E3 together and leave E1 by itself; finally if we
want three clusters, put each in its own cluster.
If you don't include true classification information (i.e., the last column in the input file), it won't create the pr and se files, since we need to know the "right answer" in order to calculate these.
You can specify the prefix of all the output files with -o and the titles of the graphs with --title=STRING.
You can specify the number of clusters (if this isn't specified, it clusters for all possible values) by specifying -k3 to only do three clusters or -k3-8 to cluster for values of k from 3 the 8 (inclusive).
Finally, you can control the output more precisely. You can suppress the creation of any of the output files by specifying --no-output-output-extension, for instance, --no-output-graph. Furthermore, you can rename the output of any of these files by specifying --output-output-extension =FILE, for instance, --output-se=my-se-file
You can get a list of all the options by running gdc --help, but this doesn't include any information about which options go with which other options. gdc should warn you if you try to do something dumb (like specifying --cosine with k-means (which must use squared euclidean distance), but it might not all the time. Use your judgment.
If you have any questions or encounter any bugs, please let me know. Finally, if something seems to be running too slowly, it may simply be because I didn't spend too much time optimizing the algorithm (ahem, k-means -- k-mediods might be a better bet); if you really want a better version, let me know and I'll put it on my to-do list.