Bayesian Clustering for Email Campaign Detection

Peter Haider haider@cs.uni-potsdam.de
Tobias Scheffer scheffer@cs.uni-potsdam.de
University of Potsdam, Department of Computer Science, August-Bebel-Strasse 89, 14482 Potsdam, Germany

Abstract

We discuss the problem of clustering elements according to the sources that have generated them. For elements that are characterized by independent binary attributes, a closed-form Bayesian solution exists. We derive a solution for the case of dependent attributes that is based on a transformation of the instances into a space of independent feature functions. We derive an optimization problem that produces a mapping into a space of independent binary feature vectors; the features can reflect arbitrary dependencies in the input space. This problem setting is motivated by the application of spam filtering for email service providers. Spam traps deliver a real-time stream of messages known to be spam. If elements of the same campaign can be recognized reliably, entire spam and phishing campaigns can be contained. We present a case study that evaluates Bayesian clustering for this application.

1. Introduction

In model-based clustering, elements $X = \{x^{(1)}, \ldots, x^{(n)}\}$ have been created by an unknown number of sources; each of them creates elements according to its specific distribution. We study the problem of finding the most likely clustering of elements according to their source,

$$C^* = \arg\max_C P(X|C), \qquad (1)$$

where $C$ consistently partitions the elements of $X$ into clusters $C = \{c_1, \ldots, c_m\}$ that are mutually exclusive and cover each element.

Computing the likelihood of a set $X$ under a clustering hypothesis $C$ requires the computation of the joint likelihood of mutually dependent elements $X_c$ within a partition $c$. This is usually done by assuming latent mixture parameters $\theta$ that generate the elements of each cluster and imposing a prior over the mixture parameters. The joint likelihood is then an integral over the parameter space, where individual likelihoods are independent given the parameters:

$$P(X_c) = \int \prod_{x \in X_c} P(x|\theta)\, P(\theta)\, d\theta. \qquad (2)$$

For suitable choices of $P(x|\theta)$ and $P(\theta)$, this integral has an analytic solution. In many cases, however, the space $\mathcal{X}$ of elements is very high-dimensional, and the choice of likelihood and prior involves a trade-off between expressiveness of the generative model and tractability regarding the number of parameters. If the elements are binary vectors, $\mathcal{X} = \{0,1\}^D$, one extreme would be to model $P(x|\theta)$ as a full multinomial distribution over $\mathcal{X}$, involving $2^D$ parameters. The other extreme is to make an independence assumption on the dimensions of $\mathcal{X}$, reducing the model parameters to a vector of $D$ Bernoulli probabilities. No remedy for this dichotomy is known that preserves the existence of an analytic solution to the integral in Equation 2.

Our problem setting is motivated by the application of clustering messages according to campaigns; this will remain our application focus throughout the paper. Filtering spam and phishing messages reliably remains a hard problem. Email service providers operate Mail Transfer Agents which observe a stream of incoming messages, most of which have been created in bulk by a generator. A generator can be an application that dispatches legitimate, possibly customized newsletters, or a script that creates spam or phishing messages and disseminates them from the nodes of a botnet. Mail Transfer Agents typically blacklist known spam and phishing messages. Messages known to be spam can be collected by tapping into botnets, and by harvesting emails in spam traps.
Spam traps are email addresses published invisibly on the web that have no legitimate owner and can therefore not receive legitimate mail. In order to avoid blacklisting, spam dissemination tools produce emails according to probabilistic templates. This motivates our problem setting: if all elements that are generated in a joint campaign can be identified reliably, then all instances of that campaign can be blacklisted as soon as one element reaches a spam trap, or is delivered from a known node of a botnet. Likewise, all instances of a newsletter can be whitelisted as soon as one instance is confirmed to be legitimate.

While text classification methods are frequently reported to achieve extremely high accuracy for spam filtering under laboratory conditions, their practical contribution to the infrastructure of email services is smaller: they are often applied to decide whether accepted emails are to be delivered to the inbox or the spam folder. The vast majority of all spam delivery attempts, however, is turned down by the provider based on known message and IP blacklists. Text classifiers are challenged by continuously shifting distributions of spam and legitimate messages; their risk of false positives does not approach zero sufficiently closely to constitute a satisfactory solution to the spam problem.

This paper makes three major contributions. Firstly, we develop a generative model for a clustering of binary feature vectors, based on a transformation of the input vectors into a space in which an independence assumption incurs minimal approximation error. The transformations can capture arbitrary dependencies in the input space while the number of parameters stays reasonable and full Bayesian inference remains tractable. Secondly, we derive the optimization problem and algorithm that generate the feature transformation. Finally, we present a large-scale case study that explores properties of the Bayesian clustering solution for email campaign detection.

The paper is structured as follows. We present the Bayesian clustering model in Section 2, and an optimization problem and algorithm for transforming dependent features into independent features in Section 3. Section 4 discusses the estimation of prior parameters; Section 5 develops a sequential clustering algorithm based on Bayesian decisions. Section 6 reports on empirical results in our motivating application. We review related work in Section 7. Section 8 concludes.
2. Bayesian Clustering for Binary Features

In general, a clustering hypothesis $C$ entails that the likelihood of the dataset factorizes into the likelihoods of the subsets $X_c$ of elements in the clusters $c \in C$. Elements within a cluster are dependent, so the likelihood of each element depends on the preceding elements in its cluster, as in Equation 3.

$$P(X|C) = \prod_{c \in C} P(X_c) = \prod_{c \in C} \prod_{i:\, x^{(i)} \in c} P\big(x^{(i)} \,\big|\, \{x^{(j)} \in c : j < i\}\big) \qquad (3)$$

The crucial part is modeling the probability $P(x|X')$ of a binary feature vector $x$ given a set of elements $X'$. A natural way is to introduce latent model parameters $\theta$ and integrate over them as in Equation 4.

$$P(x|X') = \int P(x|\theta)\, P(\theta|X')\, d\theta \qquad (4)$$

Modeling $\theta$ as a full joint distribution over all $2^D$ possible feature vectors, with $D$ being the dimensionality of the input space $\mathcal{X}$, is intractable. Let $\phi_e$ be independent binary features and let the vector $\Phi(x)$ be a representation of $x$ in the space of independent features $\Phi$. In order to streamline the presentation of the clustering model, we postpone the rationale and construction of the feature transformation to Section 3. Under the assumption that attributes in the space $\Phi$ are independent, the model parameters can be represented as a vector of Bernoulli probabilities, $\theta \in (0,1)^E$, and we can compute $P(x|\theta)$ as $\prod_{e=1}^{E} P(\phi_e(x)|\theta_e)$. Furthermore, we impose a Beta prior on every component $\theta_e$ with parameters $\alpha_e$ and $\beta_e$. Since the Beta distribution is conjugate to the Bernoulli distribution, we can compute the posterior over the model parameters analytically as in Equation 5, where $\#_e = |\{x' \in X' : \phi_e(x') = 1\}|$.

$$P(\theta|X') = \frac{P(X'|\theta)\, P(\theta)}{P(X')} = \frac{\prod_{e=1}^{E} \prod_{x \in X'} P(\phi_e(x)|\theta_e)\, P_{\mathrm{Beta}}(\theta_e|\alpha_e, \beta_e)}{\int P(X'|\theta')\, P(\theta')\, d\theta'} = \prod_{e=1}^{E} P_{\mathrm{Beta}}\big(\theta_e \,\big|\, \alpha_e + \#_e,\; \beta_e + |X'| - \#_e\big) \qquad (5)$$

The integral in Equation 4 then has the analytic solution of Equation 6:

$$P(x|X') = \prod_{e:\, \phi_e(x)=1} \frac{\alpha_e + \#_e}{\alpha_e + \beta_e + |X'|} \;\prod_{e:\, \phi_e(x)=0} \frac{\beta_e + |X'| - \#_e}{\alpha_e + \beta_e + |X'|}. \qquad (6)$$

For a single element, independent of all others, the probability term simplifies to

$$P(x) = \prod_{e:\, \phi_e(x)=1} \frac{\alpha_e}{\alpha_e + \beta_e} \;\prod_{e:\, \phi_e(x)=0} \frac{\beta_e}{\alpha_e + \beta_e}. \qquad (7)$$

Furthermore, the joint probability of an interdependent set $X'$ in one cluster can be computed as

$$P(X') = \prod_{i:\, x^{(i)} \in X'} P\big(x^{(i)} \,\big|\, \{x^{(j)} \in X' : j < i\}\big) = \prod_{e=1}^{E} \frac{\prod_{k=1}^{\#_e} (\alpha_e + k - 1)\; \prod_{k=1}^{|X'| - \#_e} (\beta_e + k - 1)}{\prod_{k=1}^{|X'|} (\alpha_e + \beta_e + k - 1)} = \prod_{e=1}^{E} \frac{B(\alpha_e + \#_e,\; \beta_e + |X'| - \#_e)}{B(\alpha_e, \beta_e)},$$

where $B$ denotes the Beta function.
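To make these closed forms concrete, the following Python sketch computes Equations 6 and 7 and the Beta-function form of $P(X')$ from the sufficient statistics of a cluster. It is an illustrative implementation written for this presentation, not code from the paper; the function and variable names are our own, and log-space arithmetic via `scipy.special.betaln` is used to avoid underflow for large clusters.

```python
import numpy as np
from scipy.special import betaln

def log_prob_given_cluster(phi_x, counts, n, alpha, beta):
    """Log of Equation 6: P(x | X') for a transformed binary vector phi_x, given the
    per-feature one-counts of the cluster X' (counts), its size n, and the Beta
    prior parameters alpha, beta (all arrays of length E)."""
    p_one = (alpha + counts) / (alpha + beta + n)   # posterior predictive of phi_e(x) = 1
    return np.sum(np.where(phi_x == 1, np.log(p_one), np.log(1.0 - p_one)))

def log_prob_single(phi_x, alpha, beta):
    """Log of Equation 7: marginal probability of a single element."""
    return log_prob_given_cluster(phi_x, np.zeros_like(alpha), 0, alpha, beta)

def log_prob_cluster(counts, n, alpha, beta):
    """Log joint probability of a cluster: prod_e B(alpha_e + #_e, beta_e + n - #_e) / B(alpha_e, beta_e)."""
    return np.sum(betaln(alpha + counts, beta + n - counts) - betaln(alpha, beta))

# Minimal usage example with E = 3 transformed features.
alpha = np.array([1.0, 0.5, 2.0])
beta = np.array([10.0, 20.0, 5.0])
cluster = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 1]])   # three transformed emails
counts, n = cluster.sum(axis=0), len(cluster)
x_new = np.array([1, 0, 1])
print(log_prob_given_cluster(x_new, counts, n, alpha, beta), log_prob_single(x_new, alpha, beta))
```

In the sequential clustering algorithm of Section 5, only these quantities are needed to decide whether a new message joins an existing cluster or starts a new one.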
3. Feature Transformation

In this section we present a method of approximating a distribution over high-dimensional binary vectors that allows analytical integration over the induced model parameters. The idea is to find a mapping into another space of binary vectors whose dimensions are treated independently of each other, such that the divergence between the original distribution and the approximate distribution defined in terms of the mapped vectors is minimal.

A straightforward approach would be to employ a model that captures dependencies between small sets of attributes only and assumes independence otherwise. Instead of working with such independence assumptions, we construct a search space of transformations. This space is complete in the sense that any possible interaction of attributes can be reflected in the newly constructed attributes. These attributes are constructed such that their product approximates the true distribution as closely as possible.

The Bayesian clustering model introduced in Section 2 requires us to infer the probability of a feature vector $x$ given a set of feature vectors $X'$ it depends on. We therefore want to approximate $P(x|X')$ by a quantity $Q_\Phi(x|X')$, where $\Phi$ is a mapping from the original vector space $\mathcal{X} = \{0,1\}^D$ to the image space $\mathcal{Z} = \{0,1\}^E$. We define $Q_\Phi$ as a product over independent probabilities for each output dimension, as in Equation 8.

$$Q_\Phi(x|X') = \prod_{e=1}^{E} P\big(\phi_e(x) \,\big|\, \phi_e(X')\big) \qquad (8)$$

By its definition as a product over probabilities, the quantity $Q_\Phi(x|X)$ is always non-negative; however, it does not necessarily sum to one over the event space of possible inputs $x$. Since $Q_\Phi$ serves as an approximation of a probability in the Bayesian inference, it is desirable that $\sum_x Q_\Phi(x|X) \leq 1$ for all $X$. Moreover, the natural measure of approximation quality, the Kullback-Leibler divergence, is only motivated for measures that add up to at most one and may be maximized by trivial solutions otherwise. Note that the sum does not have to be exactly 1, since an extra element $\bar{x}$ with $P(\bar{x}) = 0$ can be added to $\mathcal{X}$ that absorbs the remaining probability mass, $Q_\Phi(\bar{x}|X) \stackrel{\mathrm{def}}{=} 1 - \sum_{x \in \mathcal{X}} Q_\Phi(x|X)$.

Normalization of $Q_\Phi(x|X)$ is intractable, since it would require explicit summation of Equation 8 over all $2^D$ possible input elements. We therefore have to define the space of possible transformations $\Phi$ such that after any transformation, $Q_\Phi$ is guaranteed to sum to at most one. By Theorem 1 (see Appendix), this holds for all injective transformations. Every mapping from $\mathcal{X} = \{0,1\}^D$ to $\mathcal{Z} = \{0,1\}^E$ can be represented as a set of $E$ Boolean functions, and every Boolean function can be constructed as a combination of elementary operations. Therefore we can define the search space as the set of all concatenations of elementary Boolean transformations that preserve injectiveness. The choice of which elementary transformations to use is driven by the practical goal of also preserving sparseness. The following two elementary transformations are injective, sufficient to generate any Boolean function, and preserve sparsity:

$$\tau^{x}_{ij}\big((\ldots, x_i, \ldots, x_j, \ldots)^\top\big) = (\ldots, x_i, \ldots, x_i \oplus x_j, \ldots)^\top$$
$$\tau^{a}_{ij}\big((\ldots, x_i, \ldots, x_j, \ldots)^\top\big) = (\ldots, x_i \wedge x_j, \ldots, x_i \wedge \neg x_j,\; \neg x_i \wedge x_j, \ldots)^\top.$$

Every $\tau$ replaces two features by Boolean combinations thereof, leaving every other feature untouched. For any set of elements $X$, the quantity $Q_\Phi(x|X)$ should minimize the Kullback-Leibler divergence from the true distribution $P(x|X)$. Hence, the optimization criterion (Equation 9) is the expected KL divergence between $Q_\Phi(x|X)$ and $P(x|X)$ over all $X$.

$$\Phi^* = \arg\min_\Phi \mathbb{E}_{X \sim P(x)}\big[\mathrm{KL}\big(P(\cdot|X)\,\|\,Q_\Phi(\cdot|X)\big)\big] \qquad (9)$$
$$= \arg\min_\Phi \mathbb{E}_{X \sim P(x)}\Big[\sum_x P(x|X) \log \frac{P(x|X)}{Q_\Phi(x|X)}\Big] \qquad (10)$$
$$= \arg\max_\Phi \mathbb{E}_{X \sim P(x)}\Big[\sum_x P(x|X) \log Q_\Phi(x|X)\Big] \qquad (11)$$
$$= \arg\max_\Phi \mathbb{E}_{X \sim P(x),\, x \sim P(x|X)}\big[\log Q_\Phi(x|X)\big]. \qquad (12)$$

Equation 10 expands the definition of the KL divergence; Equation 11 replaces minimization by maximization and drops the term $P(x|X) \log P(x|X)$, which is constant in $\Phi$. We approximate the expectation in Equation 12 by the sum over an empirical sample $S$, and obtain Optimization Problem 1.

Optimization Problem 1. Over the set of concatenations of elementary transformations, $\Phi \in \{\tau^{x}_{ij}, \tau^{a}_{ij}\}^*$, maximize $\sum_{x \in S} \log Q_\Phi(x \,|\, S \setminus \{x\})$.

The sum of log-probabilities can be calculated as

$$\sum_{x \in S} \log Q_\Phi(x|S \setminus \{x\}) = \sum_{e=1}^{E} \Big[ \#^S_e \log(\alpha_0 + \#^S_e - 1) + (|S| - \#^S_e) \log(\beta_0 + |S| - \#^S_e - 1) - |S| \log(\alpha_0 + \beta_0 + |S| - 1) \Big],$$

where $\#^S_e = |\{x' \in S : \phi_e(x') = 1\}|$, and $\alpha_0, \beta_0$ are the parameters of the Beta prior.

Optimization Problem 1 is non-convex, so we apply a greedy procedure that iteratively adds the next-best transformation, starting with the identity transformation, as detailed in Algorithm 1.

Algorithm 1 Greedy Transformation Composition
  Φ_0 ← id
  for t = 1, 2, ... do
    τ̂ ← arg max_τ Σ_{x∈S} log Q_{τ∘Φ_{t-1}}(x | S \ {x})
    if Σ_{x∈S} log Q_{τ̂∘Φ_{t-1}}(x | S \ {x}) < Σ_{x∈S} log Q_{Φ_{t-1}}(x | S \ {x}) then
      return Φ_{t-1}
    else
      Φ_t ← τ̂ ∘ Φ_{t-1}
    end if
  end for
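The sketch below illustrates, with our own naming and several simplifications, how the two elementary transformations and the greedy composition of Algorithm 1 could be realized on a small dense sample matrix. It scores candidate transformations with the count-based sum of log-probabilities given above; a production implementation would operate on sparse vectors and restrict the candidate pairs (i, j), which we do not attempt here.

```python
import numpy as np

def objective(Z, alpha0, beta0):
    """Sum over x in S of log Q_Phi(x | S without x), computed from the column counts
    of the transformed sample Z. Assumes alpha0, beta0 > 1 so all log arguments are positive."""
    n, counts = len(Z), Z.sum(axis=0)
    return np.sum(counts * np.log(alpha0 + counts - 1)
                  + (n - counts) * np.log(beta0 + n - counts - 1)
                  - n * np.log(alpha0 + beta0 + n - 1))

def tau_x(Z, i, j):
    """XOR transformation: keep column i, replace column j by x_i XOR x_j."""
    Z = Z.copy()
    Z[:, j] = Z[:, i] ^ Z[:, j]
    return Z

def tau_a(Z, i, j):
    """AND transformation: replace columns i, j by x_i&x_j, x_i&~x_j, ~x_i&x_j."""
    a = (Z[:, i] & Z[:, j]) & 1
    b = (Z[:, i] & ~Z[:, j]) & 1
    c = (~Z[:, i] & Z[:, j]) & 1
    rest = np.delete(Z, [i, j], axis=1)
    return np.column_stack([rest, a, b, c])

def greedy_composition(Z, alpha0=1.1, beta0=100.0, max_steps=10):
    """Greedily apply the single best elementary transformation until no candidate improves the objective."""
    best = objective(Z, alpha0, beta0)
    for _ in range(max_steps):
        E = Z.shape[1]
        candidates = []
        for i in range(E):
            for j in range(E):
                if i == j:
                    continue
                candidates.append(tau_x(Z, i, j))
                if i < j:
                    candidates.append(tau_a(Z, i, j))
        scores = [objective(C, alpha0, beta0) for C in candidates]
        if max(scores) <= best:
            return Z
        best, Z = max(scores), candidates[int(np.argmax(scores))]
    return Z

S = np.random.randint(0, 2, size=(200, 6))   # toy sample of 200 binary vectors with 6 attributes
S_transformed = greedy_composition(S)
```

Since both elementary transformations are injective and injectivity is preserved under composition, every transformation reachable by this loop keeps $Q_\Phi$ sub-normalized, as required by Theorem 1.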
4. Parameter Estimation

In the following section, we derive a closed-form estimator for parts of the parameters of the prior $P(\theta)$. The decisions whether to merge an element $x$ with a set $X'$ depend strongly on the prior parameters, via $P(x)$ in Equation 7 and $P(x|X')$ in Equation 6. Heller and Ghahramani (2005) derive an EM-like algorithm that maximizes the data likelihood by iteratively finding the best clustering and then performing gradient descent on the prior parameters. This approach is computationally very expensive, since the likelihood function is not convex and the entire dataset needs to be re-clustered in each iteration.

We overcome these problems by using an alternative parametrization of the Beta distribution. This allows us to estimate half of the parameters from an unclustered set of training examples $S$; the other half of the parameters is pooled into a single value and adjusted by a grid search on tuning data. We re-parametrize the Beta priors as $\alpha_e = \sigma \mu_e$ and $\beta_e = \sigma (1 - \mu_e)$, where the $\mu_e$ are the prior means and $\sigma$ is the common precision parameter. (The precision parameter of a Beta distribution is inversely related to its variance.) The probability of an element that is not conditioned on any other elements of the same cluster does not depend on the prior precision, only on the means. Hence, the means have a stronger impact on the resulting partitioning.

Imposing a Beta-distributed hyperprior on $\mu_e$ with parameters $\alpha_0 > 1$ and $\beta_0 > 1$, we can compute the maximum-a-posteriori estimate of the means as

$$\mu^*_e = \arg\max_\mu P(\mu|S) = \arg\max_\mu \prod_{x \in S} P(\phi_e(x)|\mu)\, P(\mu) = \arg\max_\mu P_{\mathrm{Beta}}\big(\mu \,\big|\, \alpha_0 + |\{x \in S : \phi_e(x) = 1\}|,\; \beta_0 + |\{x \in S : \phi_e(x) = 0\}|\big) = \frac{\alpha_0 + |\{x \in S : \phi_e(x) = 1\}| - 1}{\alpha_0 + \beta_0 + |S| - 2}.$$
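As a small illustration, the MAP estimate of the prior means reduces to a smoothed relative frequency per transformed feature, which can be computed in one pass over an unclustered sample. The function names and default hyperprior values below are our own and purely illustrative; the common precision would be tuned by grid search as described above.

```python
import numpy as np

def map_prior_means(Z, alpha0=1.1, beta0=1.1):
    """MAP estimate of mu_e under a Beta(alpha0, beta0) hyperprior, given the
    transformed sample Z (rows: emails, columns: features phi_e). Assumes alpha0, beta0 > 1."""
    n, counts = len(Z), Z.sum(axis=0)
    return (alpha0 + counts - 1.0) / (alpha0 + beta0 + n - 2.0)

def prior_parameters(mu, sigma):
    """Recover alpha_e, beta_e from the estimated means and a common precision sigma."""
    return sigma * mu, sigma * (1.0 - mu)

Z = np.random.randint(0, 2, size=(1000, 5))   # toy unclustered training sample
mu = map_prior_means(Z)
alpha, beta = prior_parameters(mu, sigma=10.0)
```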
5. Sequential Bayesian Clustering

In this section we discuss the task of inferring the most likely partitioning of a set of emails and present our model-based sequential clustering algorithm. Brute-force search over the entire space of possible partitionings in Equation 1 requires the evaluation of exponentially many clusterings and is therefore intractable for reasonable numbers of emails. A more efficient approach would be to perform Markov chain Monte Carlo sampling as in (Williams, 2000), which yields not only the maximum-a-posteriori partitioning but samples from the posterior distribution. Approximate agglomerative clustering algorithms (Heller & Ghahramani, 2005) are more efficient. Since in practice emails have to be processed sequentially, and decisions whether an email belongs to a spam campaign cannot be revised after delivering it, we adopt the sequential clustering algorithm of Haider et al. (2007). This greedy incremental algorithm has the advantage of approximately finding the best campaign association of a new email in $O(n)$, where $n$ is the number of previously seen emails, instead of taking $O(n^3)$ operations for performing a full agglomerative re-clustering.

Instead of a weighted similarity measure as in (Haider et al., 2007), our clustering model is based on a generative model. We replace the weighted sum over pairwise features by an integral over the model parameters of a cluster. This gives us the model-based sequential clustering in Algorithm 2. In every step, the algorithm compares the hypotheses that the new element belongs to one of the existing clusters with the hypothesis that it forms its own cluster. The likelihoods of these hypotheses are calculated according to Equations 6 and 7.

Algorithm 2 Model-based Sequential Clustering
  C ← {}
  for t = 1, ..., n do
    c_j ← arg max_{c∈C} P(x^(t) | X_c)
    if P(x^(t) | X_{c_j}) < P(x^(t)) then
      C ← C ∪ {{x^(t)}}
    else
      C ← (C \ {c_j}) ∪ {c_j ∪ {x^(t)}}
    end if
  end for
  return C

This greedy algorithm can be straightforwardly extended to a non-uniform prior over clustering hypotheses, by replacing $P(x^{(t)})$ with $P(x^{(t)})\, P(C \cup \{\{x^{(t)}\}\})$ and $P(x^{(t)}|X_{c_j})$ with $P(x^{(t)}|X_{c_j})\, P(C \setminus \{c_j\} \cup \{c_j \cup \{x^{(t)}\}\})$.
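Algorithm 2 can be written in a few lines of Python; the following is an illustrative rendering with our own data structures (each cluster kept as a vector of feature one-counts, a size, and the indices of its members), not the authors' implementation. Comparisons are done in log space, which leaves the decisions unchanged.

```python
import numpy as np

def _log_pred(x, counts, n, alpha, beta):
    """Log posterior-predictive probability of x given cluster counts (Equation 6; Equation 7 when n = 0)."""
    p_one = (alpha + counts) / (alpha + beta + n)
    return np.sum(np.where(x == 1, np.log(p_one), np.log(1.0 - p_one)))

def sequential_clustering(Z, alpha, beta):
    """Model-based sequential clustering (Algorithm 2) of transformed binary vectors, one row of Z at a time."""
    clusters = []
    for t, x in enumerate(Z):
        scores = [_log_pred(x, c["counts"], c["n"], alpha, beta) for c in clusters]
        lone = _log_pred(x, np.zeros_like(alpha), 0, alpha, beta)
        if not clusters or max(scores) < lone:
            clusters.append({"counts": x.astype(float), "n": 1, "members": [t]})
        else:
            c = clusters[int(np.argmax(scores))]
            c["counts"] = c["counts"] + x
            c["n"] += 1
            c["members"].append(t)
    return clusters

# Toy usage: cluster 50 random 5-dimensional transformed vectors under a weakly informative prior.
Z = np.random.randint(0, 2, size=(50, 5))
alpha, beta = np.full(5, 0.5), np.full(5, 5.0)
print(len(sequential_clustering(Z, alpha, beta)), "clusters found")
```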
6. Email Campaign Detection

In this section, we explore the behavior of the feature transformation procedure and conduct a case study of the Bayesian clustering method for spam filtering. In unsupervised clustering, there is no ground truth available that the output of the clustering algorithm can be compared with. Fortunately, our motivating application scenario, email spam containment, has a natural evaluation criterion: the contribution of the produced partitionings to accurate filtering.

We design an experimental setting that evaluates the Bayesian clustering solution and the feature transformation technique for the problem of detecting spam campaigns at an email service provider. Benchmark data sets such as the SpamTREC corpus are not suitable for our evaluation. A fair evaluation relies on a stream that contains realistic proportions of messages of mailing campaigns, in the correct chronological order. Benchmark corpora contain messages that have been received by users, and have therefore passed unknown filtering rules employed by the receiving server. Furthermore, benchmark data sets do not contain reliable time stamps, whereas the actual chronological order is crucial.

Our experimental setting relies on a stream of spam messages received by the mail transfer agent of an email service provider. Between July and November 2008, we recorded a small fraction of spam messages, a total of 139,250 spam messages in correct chronological order. The messages have been tagged as spam because the delivering agent was listed on the Spamhaus IP block list, which is maintained manually. We simulate the practical setting in which one has a stream of verified spam and a stream of unknown emails by taking every other email from the set as a training example. The rest is split into test examples (90%) and tuning examples. In order to maintain the users' privacy, we blend the stream of spam messages with an additional stream of 41,016 non-spam messages from public sources. The non-spam portion contains newsletters and mailing lists in correct chronological order, as well as Enron emails and personal mails from public corpora which are not necessarily in chronological order. Every email is represented by a binary vector of 1,911,517 attributes that indicate the presence or absence of a word. The feature transformation technique introduces an additional 101,147 attributes.

6.1. Feature Transformation

In order to assess the capability of our feature transformation technique for approximating a high-dimensional probability distribution, we train the transformation on an additional set $S_1$ of 10,000 older emails including spam and ham (i.e., non-spam) messages in equal parts, and test on another set $S_2$ of emails of the same size. Since we cannot measure the Kullback-Leibler divergence from the true distribution directly, we measure the quantity $-|S_i|^{-1} \sum_{x \in S_i} \log Q_\Phi(x|S_i \setminus \{x\})$, which is the average entropy of an email, given all other emails of the set. We compare the entropies on the training and test sets for the transformation found by the Greedy Transformation Composition algorithm to the entropy of the identity transformation. The identity transformation corresponds to an assumption of independent attributes in the input space. In addition to the overall optimal transformation $\Phi^*$, we compute the optimal transformations $\Phi^a$ and $\Phi^x$ composed of only elementary transformations of the forms $\tau^a_{ij}$ or $\tau^x_{ij}$, respectively. Preliminary experiments showed that the choice of prior parameters $\alpha_0$ and $\beta_0$ has negligible influence within reasonable ranges, so we report the results for $\alpha_0 = 1.1$ and $\beta_0 = 100$ in Table 1.

Table 1. Comparison of training and test entropies using different feature transformations.

| Transformation | Training entropy | Test entropy |
|---|---|---|
| id | 1013.7 | 1007.1 |
| Φ* | 574.1 | 687.5 |
| Φ^a | 585.3 | 672.0 |
| Φ^x | 632.4 | 720.2 |

We can see that including elementary transformations of the form $\tau^x_{ij}$ decreases training entropy, but increases test entropy. The best transformation reduces test entropy compared to the identity transformation by about 33%. This shows that factorizing the probability over the dimensions in the image space yields a much better approximation than factorizing over the dimensions in the original feature space.
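A sketch of this evaluation measure, again with our own names: the average per-email entropy is the negative mean leave-one-out log-probability over a transformed set, computable from feature counts alone.

```python
import numpy as np

def average_entropy(Z, alpha0=1.1, beta0=100.0):
    """Negative mean of log Q_Phi(x | S_i without x) over the transformed set Z, as used in Section 6.1."""
    n, counts = len(Z), Z.sum(axis=0)
    total = 0.0
    for x in Z:
        c = counts - x                                   # leave-one-out counts for this email
        p_one = (alpha0 + c) / (alpha0 + beta0 + n - 1)
        total += np.sum(np.where(x == 1, np.log(p_one), np.log(1.0 - p_one)))
    return -total / n
```

Comparing this value for the identity transformation and for a learned transformation on held-out data reproduces the kind of comparison reported in Table 1.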
6.2. Bayesian Clustering for Spam Filtering

Our evaluation protocol is as follows. We use a training window of 5,000 known spam messages, corresponding to a history of approximately 11 days. The training messages are partitioned using Algorithm 2. In each step, the clustering algorithm adds the chronologically next 100 known spam emails to the partitioning and removes the 100 oldest. We then classify the next 100 test messages. We use the odds ratio $\max_{X_{\mathrm{spam}}} \frac{P(x|X_{\mathrm{spam}})}{P(x)}$ as classification score, where the maximum is over all spam clusters in the training window. Test messages are not added to the window of partitioned training messages.

A main difficulty in spam filtering is that ham emails can be very diverse, and it is unrealistic that one has training examples available from every region of the true distribution. We conduct experiments in two different settings that assess performance when ham emails from the test distribution are available, and performance without access to such ham emails. In setting A, we train the feature transformation and the parameters $\mu_e$ with 10,000 ham emails from the test distribution; in setting B, we train on 10,000 spam messages instead.

As baseline for setting A, we use a Support Vector Machine that is trained in every step on the history of the last 5,000 spam and on the same 10,000 ham emails as the clustering method. Hence, the SVM baseline receives the same training data. In setting B, we use a one-class SVM, trained on the history of 5,000 spam messages. Additionally, we evaluate the benefit of the feature transformation by comparing with a clustering algorithm that uses the identity transformation. An EM clustering procedure that uses point estimates for the model parameters serves as an additional reference; we use a MAP estimate based on the same model prior used for the Bayesian model. EM requires the number of clusters to be known; we use the number of clusters that the Bayesian model identifies as input to the EM clustering.

We use two evaluation measures. Firstly, we measure the area under the ROC curve (AUC). Secondly, we use an evaluation measure that reflects the characteristics of the application more closely. An MTA has to be extremely confident when deciding to refuse a message for delivery from a contacting agent. We therefore measure the rate of true positives (spam messages identified as such) at a false positive rate of zero. We adjust the hyperparameters (the precision of the clustering model, and the regularization parameters of the standard SVM and one-class SVM, respectively) on the tuning set. We tune the parameters separately for optimal AUC, and for an optimal rate of true positives at a false positive rate of zero.

Figure 1 shows ROC curves. We can see that in the setting with ham emails available for training, the SVM outperforms the clustering-based filter in terms of AUC. In terms of the true positive rate at a false positive rate of zero, the clustering method outperforms the SVM classifier, achieving a true positive rate of $0.945 \pm 9.1 \times 10^{-4}$ compared to $0.938 \pm 9.6 \times 10^{-4}$ for the SVM.

The cluster-based filter shows its full strength in setting B, in the absence of non-spam training messages from the test distribution. Here, it achieves an AUC value of $0.992 \pm 2.6 \times 10^{-4}$ and a true positive rate of $0.749 \pm 1.7 \times 10^{-3}$, whereas the one-class SVM attains an AUC of $0.770 \pm 1.4 \times 10^{-3}$ and a true positive rate of $0.102 \pm 1.2 \times 10^{-3}$. That is, the Bayesian clustering method increases the true positive rate at zero false positives almost sevenfold in the setting where no training emails from the distribution of the test hams are available.

Clustering with the identity transformation as well as clustering with the EM algorithm performs worse in all settings than Bayesian clustering with the feature transformation. In setting A (ham messages from the test distribution available), with the parameters tuned for a high true positive rate at a false positive rate of zero, the EM algorithm achieves a true positive rate of only 0.04.

Figure 1. Evaluation of spam filtering performance. (ROC curves for four panels: with and without training hams, tuned for AUC and for the true positive rate at a false positive rate of zero; the compared methods are clustering with Φ = Φ^a, clustering with Φ = id, the SVM or one-class SVM, and EM clustering.)
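The classification score of Section 6.2 is a simple ratio of the quantities computed earlier; the following sketch (our own naming, reusing the cluster representation of the Section 5 sketch and operating in log space) scores a new message against the spam clusters in the current training window.

```python
import numpy as np

def log_odds_score(x, spam_clusters, alpha, beta):
    """Log of max_c P(x | X_c) / P(x) over the spam clusters of the current training window."""
    def log_pred(counts, n):
        p_one = (alpha + counts) / (alpha + beta + n)
        return np.sum(np.where(x == 1, np.log(p_one), np.log(1.0 - p_one)))
    lone = log_pred(np.zeros_like(alpha), 0)                            # Equation 7
    best = max(log_pred(c["counts"], c["n"]) for c in spam_clusters)    # Equation 6, best-matching campaign
    return best - lone   # larger means x resembles a known spam campaign more than the background model
```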
7. Related Work

Previous work on Bayesian clustering explored in great detail the use of hierarchical priors for the cluster structure and algorithms for inference under such priors, for example (Williams, 2000), (Heller & Ghahramani, 2005), and (Lau & Green, 2007). These approaches focus on modeling hierarchical dependencies between elements, while modeling only low-level dependencies between the attributes within elements, such as Gaussian covariances. By contrast, we assume a uniform prior over the cluster structure, and instead focus on modeling arbitrary dependencies between binary attributes. We find that a non-uniform prior over partitionings is in fact not necessary, because properly taking the prior over mixture parameters $P(\theta)$ into account also prevents the trivial solution of assigning every element to its own cluster from being optimal.

Haider et al. (2007) devise a technique for mailing campaign detection that relies on training data that are manually clustered by campaigns. We find that the effort of manually partitioning training data into clusters is prohibitive in practice. Note that the effort of partitioning data is much higher than the effort of labeling data for classification, because pairs of examples have to be considered.

Multi-way dependencies between attributes have been considered, for instance, by Zheng and Webb (2000) and Webb et al. (2005). They model the probability of an attribute vector as a product of conditional probabilities, such that each attribute can depend on multiple other attributes. If these approaches were to be used for Bayesian clustering, the number of mixture parameters would grow exponentially in the degree of dependencies. For our application, the high number of attributes renders these approaches infeasible.

Remedies for the problem of constantly changing distributions in the spam filtering domain have been proposed in the area of adversarial learning. Teo et al. (2008) developed a formulation that allows modeling test emails as modified versions of the training emails and optimizing the classifier against the worst-case scenario of modifications. This approach leads to classifiers that are more robust against changes of the distribution of spam emails, but it still requires the availability of recent spam and ham training data.

8. Conclusion

We devised a model for Bayesian clustering of binary feature vectors. The model is based on a closed-form Bayesian solution of the data likelihood in which the model parameters are integrated out. It allows for arbitrary dependencies between the input features, by transforming them into a space in which treating them as independent incurs minimal approximation error. We derived an optimization problem for learning such a transformation as a concatenation of elementary Boolean operations. In order to estimate the parameters of the prior from unlabeled data, we rewrite the parameters of the Beta distribution in terms of mean values and a common precision. The mean values can be inferred in closed form from unlabeled data efficiently; the common precision constitutes a parameter that is adjusted on tuning data. We adapted a sequential clustering algorithm to use it with Bayesian clustering decisions.

In a case study, we observed that the Bayesian clustering solution achieves higher true positive rates at a false positive rate of zero than an SVM. The benefit of the clustering solution is particularly visible when no non-spam training messages from the test distribution are available.

Acknowledgments

We gratefully acknowledge support from STRATO Rechenzentrum AG.

Appendix

Theorem 1. Let the implicit model parameter $\theta_e$ of the distribution $P(z_e)$ be Beta-distributed with parameters $\alpha_e$ and $\beta_e$ for each $e \in \{1, \ldots, E\}$. Then the quantity $Q_\Phi(x|X')$ as defined in Equation 8 sums to at most 1 for all $X'$ iff $\Phi$ is injective.

Proof. First we show indirectly that from

$$\forall X':\; \sum_{z \in \mathcal{Z}} |\{x : \Phi(x) = z\}| \prod_{e=1}^{E} P(z_e|\phi_e(X')) \leq 1$$

it follows that $\forall z: |\{x : \Phi(x) = z\}| \leq 1$.
Assume that there exists a $z' \in \mathcal{Z}$ with $|\{x : \Phi(x) = z'\}| \geq 2$. Then choose an $x'$ with $\Phi(x') = z'$. Without loss of generality, let $\forall e: z'_e = 1$. Then set

$$n = \max_e \left\lceil \frac{\sqrt[E]{0.5}\,(\alpha_e + \beta_e) - \alpha_e}{1 - \sqrt[E]{0.5}} \right\rceil + 1$$

and $X' = \{x', \ldots, x'\}$ with $|X'| = n$. It follows that

$$\prod_{e=1}^{E} P(z'_e|\phi_e(X')) = \prod_{e=1}^{E} \int P(z'_e|\theta_e)\, P(\theta_e|\phi_e(X'))\, d\theta_e = \prod_{e=1}^{E} \frac{\alpha_e + n}{\alpha_e + \beta_e + n} > \prod_{e=1}^{E} \sqrt[E]{0.5} = 0.5,$$

and thus

$$\sum_{z \in \mathcal{Z}} |\{x : \Phi(x) = z\}| \prod_{e=1}^{E} P(z_e|\phi_e(X')) \geq |\{x : \Phi(x) = z'\}| \prod_{e=1}^{E} P(z'_e|\phi_e(X')) > 1.$$

The opposite direction follows from the fact that $\forall X': \sum_{z \in \mathcal{Z}} \prod_{e=1}^{E} P(z_e|\phi_e(X')) \leq 1$, because $P(z_e|\phi_e(X'))$ is not an approximation, but a true Bayesian probability. Now we have

$$\forall X': \sum_{x \in \mathcal{X}} Q_\Phi(x|X') \leq 1 \;\iff\; \forall X': \sum_{x \in \mathcal{X}} \prod_{e=1}^{E} P(\phi_e(x)|\phi_e(X')) \leq 1 \;\iff\; \forall X': \sum_{z \in \mathcal{Z}} |\{x : \Phi(x) = z\}| \prod_{e=1}^{E} P(z_e|\phi_e(X')) \leq 1 \;\iff\; \forall z: |\{x : \Phi(x) = z\}| \leq 1 \;\iff\; \Phi \text{ is injective.}$$

References

Haider, P., Brefeld, U., & Scheffer, T. (2007). Supervised clustering of streaming data for email batch detection. Proceedings of the 24th International Conference on Machine Learning (pp. 345-352).

Heller, K. A., & Ghahramani, Z. (2005). Bayesian hierarchical clustering. Proceedings of the 22nd International Conference on Machine Learning (pp. 297-304).

Lau, J., & Green, P. (2007). Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics, 16, 526-558.

Teo, C., Globerson, A., Roweis, S., & Smola, A. (2008). Convex learning with invariances. Advances in Neural Information Processing Systems, 20, 1489-1496.

Webb, G., Boughton, J., & Wang, Z. (2005). Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning, 58, 5-24.

Williams, C. (2000). A MCMC approach to hierarchical mixture modelling. Advances in Neural Information Processing Systems, 12, 680-686.

Zheng, Z., & Webb, G. (2000). Lazy learning of Bayesian rules. Machine Learning, 41, 53-84.