Tailoring Density Estimation via Reproducing Kernel Moment Matching

Le Song (le.song@nicta.com.au), Xinhua Zhang (xinhua.zhang@nicta.com.au), Alex Smola (alex.smola@nicta.com.au)
Statistical Machine Learning, NICTA, 7 London Circuit, Canberra, ACT 2601, Australia

Arthur Gretton (arthur.gretton@tuebingen.mpg.de), Bernhard Schölkopf (bernhard.schoelkopf@tuebingen.mpg.de)
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tuebingen, Germany

Abstract

Moment matching is a popular means of parametric density estimation. We extend this technique to nonparametric estimation of mixture models. Our approach works by embedding distributions into a reproducing kernel Hilbert space, and performing moment matching in that space. This allows us to tailor density estimators to a function class of interest (i.e., one for which we would like to compute expectations). We show that our density estimation approach is useful in applications such as message compression in graphical models, and image classification and retrieval.

1. Introduction

Density estimation is a key element of the statistician's toolbox, yet it remains a challenging problem. A popular class of methods relies on mixture models, such as Parzen windows (Parzen, 1962; Silverman, 1986) or mixtures of Gaussians or other basis functions (McLachlan & Basford, 1988). These models are normally learned by maximizing the likelihood. However, density estimation is often not the ultimate goal but rather an intermediate step in solving another problem.
For instance, we may ultimately want to compute the expectation of a random variable or of functions thereof. In this case it is not clear whether the likelihood is the ideal objective, especially when the training sample size is small. A second class of density estimators employs exponential family models and is based on the duality between maximum entropy and maximum likelihood estimation (Barndorff-Nielsen, 1978; Dudík et al., 2004; Altun & Smola, 2006). These methods match the moments of the estimators to those of the data, which helps focus the models on certain aspects of the data for particular applications. However, these parametric moment-based methods can be too limited in terms of the class of distributions. Furthermore, exponential families tend to require highly nontrivial integration of high-dimensional distributions to ensure proper normalization. We aim to overcome these drawbacks and extend this technique to a larger class of models.

In this paper, we generalize moment matching to nonparametric mixture models. Our major aim is to tailor these density estimators to a particular function class, and to provide uniform convergence guarantees for approximating the function expectations. The key idea is that if we have good knowledge of the function class, we can tightly couple the density estimation with this knowledge. Rather than performing a full density estimation where we leave the function class and subsequent operations arbitrary, we restrict our attention to a smaller set of functions and the expectation operator. By exploiting this kind of domain knowledge, we make the hard density estimation problem easier.

Our approach is motivated by the fact that distributions can be represented as points in the marginal polytope in reproducing kernel Hilbert spaces (RKHS) (Wainwright & Jordan, 2003; Smola et al., 2007). By projecting data and density estimators into the RKHS via kernel mean maps, we match them in that space (also referred to as the feature space). The choice of kernel determines how much information about the density is retained by the kernel mean map, and thus which aspects (e.g., moments) of a density are considered important in the matching process. The matching process, and thus our density estimation procedure, amounts to the solution of a convex quadratic program. We demonstrate our approach in experiments, and show that it can lead to improvements in more complicated applications such as particle filtering and image processing.

2. Background

Let $\mathcal{X}$ be a compact domain and $X = \{x_1, \ldots, x_m\}$ be a sample of size $m$ drawn independently and identically distributed (iid.) from a distribution $p$ over $\mathcal{X}$. We aim to find an approximation $\hat{p}$ of $p$ based on the sample $X$.

Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) on $\mathcal{X}$ with kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and associated feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. By design $\mathcal{H}$ has the reproducing property, that is, for any $f \in \mathcal{H}$ we have $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$. A kernel $k$ is called universal if $\mathcal{H}$ is dense in the space of bounded continuous functions $C^0(\mathcal{X})$ on the compact domain $\mathcal{X}$ in the $L_\infty$ norm (Steinwart, 2002). Examples of such kernels include the Gaussian kernel $\exp(-\|x - x'\|^2 / 2\sigma^2)$ and the Laplace kernel $\exp(-\|x - x'\| / 2\sigma^2)$.

The marginal polytope is the range of the expectation of the feature map under all distributions in a set $\mathcal{P}$, i.e., $\mathcal{M} := \{\mu[p] \mid p \in \mathcal{P},\ \mu[p] := \mathbf{E}_{x \sim p}[\phi(x)]\}$ (Wainwright & Jordan, 2003). The map $\mu : \mathcal{P} \to \mathcal{H}$ associates a distribution with an element of the RKHS. For universal kernels, the elements of the marginal polytope uniquely determine distributions:

Theorem 1 (Gretton et al. (2007)) Let $k$ be universal and $\mathcal{P}$ denote the set of Borel probability measures $p$ on $\mathcal{X}$ with $\mathbf{E}_{x \sim p}[k(x, x)] < \infty$. Then the map $\mu$ is injective.

3. Kernel Moment Matching

Given a finite sample $X$ from $p$, $\mu[p]$ can be approximated by the empirical mean map $\mu[X] := \frac{1}{m} \sum_{i=1}^{m} \phi(x_i)$. This suggests that a good estimate $\hat{p}$ of $p$ should be chosen such that $\mu[\hat{p}]$ matches $\mu[X]$: this is the key idea of the paper.
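As a concrete numerical illustration of this idea (ours, not part of the original paper), the squared RKHS distance between the empirical mean map of a sample and the mean map of a weighted set of Dirac prototypes expands into plain kernel evaluations; the quantities Q and l appearing below are the discrete-prototype expressions of Eq. (7) in Section 5. The function names (rbf_kernel, mean_map_distance2) and the toy data are our own.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix K[a, b] = exp(-||A[a] - B[b]||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mean_map_distance2(X, Z, alpha, sigma=1.0):
    """||mu[X] - mu[p_hat]||_H^2 for p_hat = sum_i alpha_i * delta_{Z[i]}.

    Expanding the squared RKHS norm gives three kernel sums; Q and l below are
    the discrete-prototype quantities of Eq. (7).
    """
    K_XX = rbf_kernel(X, X, sigma)
    K_XZ = rbf_kernel(X, Z, sigma)
    Q = rbf_kernel(Z, Z, sigma)        # Q_ij = k(z_i, z_j)
    l = K_XZ.mean(axis=0)              # l_j = (1/m) sum_i k(x_i, z_j)
    return K_XX.mean() + alpha @ Q @ alpha - 2 * l @ alpha

# toy usage: uniform weights over five prototypes taken from the sample itself
X = np.random.randn(100, 2)
Z, alpha = X[:5], np.full(5, 0.2)
print(mean_map_distance2(X, Z, alpha, sigma=1.0))
```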
The flow of reasoning works as follows:

density $p$ $\rightarrow$ sample $X$ $\rightarrow$ empirical mean $\mu[X]$ $\rightarrow$ density estimation via $\mu[\hat{p}] \approx \mu[X]$. (1)

The first step of this chain was established in (Altun & Smola, 2006, Theorem 15). Let $R_m(\mathcal{H}, p)$ be the Rademacher average (Bartlett & Mendelson, 2002) associated with $p$ and $\mathcal{H}$ via

$R_m(\mathcal{H}, p) := \frac{1}{m} \mathbf{E}_X \mathbf{E}_\sigma \Big[ \sup_{\|f\|_{\mathcal{H}} \le 1} \sum_{i=1}^{m} \sigma_i f(x_i) \Big]$,

where $\sigma_i \in \{\pm 1\}$ is uniformly random. We use it to bound the deviation between empirical means and expectations:

Theorem 2 (Altun & Smola (2006)) Assume $\|f\|_\infty \le R$ for all $f \in \mathcal{H}$ with $\|f\|_{\mathcal{H}} \le 1$. Then for $\epsilon > 0$, with probability at least $1 - \exp(-\epsilon^2 m R^{-2}/2)$ we have $\|\mu[p] - \mu[X]\|_{\mathcal{H}} \le 2 R_m(\mathcal{H}, p) + \epsilon$.

This ensures that $\mu[X]$ is a good proxy for $\mu[p]$. To carry out the last step of (1) we assume the density estimator $\hat{p}$ is a mixture of a set of candidate densities $p_i$ (or prototypes):

$\hat{p} = \sum_{i=1}^{n} \alpha_i p_i$ where $\alpha^\top \mathbf{1} = 1$ and $\alpha_i \ge 0$, (2)

where $\mathbf{1}$ is a vector of all ones. Here the goal is to obtain good estimates of the coefficients $\alpha_i$, together with performance guarantees which specify how well $\hat{p}$ is capable of estimating $p$. This can be cast as an optimization problem:

$\min_\alpha \|\mu[X] - \mu[\hat{p}]\|^2_{\mathcal{H}}$ s.t. $\alpha^\top \mathbf{1} = 1$, $\alpha_i \ge 0$. (3)

To prevent overfitting, we add a regularizer $\Omega[\alpha]$, such as $\frac{1}{2}\|\alpha\|_2^2$, and weight it by a regularization constant $\lambda > 0$. Using the expansion of $\hat{p}$ in (2) we obtain a quadratic program (QP) for $\alpha$:

$\min_\alpha \ \frac{1}{2} \alpha^\top (Q + \lambda I) \alpha - l^\top \alpha$ s.t. $\alpha^\top \mathbf{1} = 1$, $\alpha_i \ge 0$, (4)

where $I$ is the identity matrix. $Q \in \mathbb{R}^{n \times n}$ and $l \in \mathbb{R}^{n}$ are given by

$Q_{ij} = \langle \mu[p_i], \mu[p_j] \rangle_{\mathcal{H}} = \mathbf{E}_{x \sim p_i,\, x' \sim p_j}[k(x, x')]$, (5)
$l_j = \langle \mu[X], \mu[p_j] \rangle_{\mathcal{H}} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{E}_{x \sim p_j}[k(x_i, x)]$. (6)

By construction $Q \succeq 0$ is positive semidefinite, hence the program (4) is convex. We will discuss examples of kernels $k$ and prototypes $p_i$ for which $Q$ and $l$ have closed form in Section 5. In many cases, the prototypes $p_i$ also contain tunable parameters, which we can optimize via gradient methods. Before doing so, we first explain our theoretical basis for tailoring density estimators.

4. Tailoring Density Estimators

Given functions $f \in \mathcal{H}$, a key question is to bound how well the expectations of $f$ with respect to $p$ can be approximated by $\hat{p}$. We have the following lemma:

Lemma 3 Let $\epsilon > 0$ and $\bar{\epsilon} := \|\mu[X] - \mu[\hat{p}]\|_{\mathcal{H}}$. Under the assumptions of Theorem 2 we have, with probability at least $1 - \exp(-\epsilon^2 m R^{-2}/2)$,

$\sup_{\|f\|_{\mathcal{H}} \le 1} |\mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{x \sim \hat{p}}[f(x)]| \le 2 R_m(\mathcal{H}, p) + \epsilon + \bar{\epsilon}$.

Proof In the RKHS, we have $\mathbf{E}_{x \sim p}[f(x)] = \langle f, \mu[p] \rangle_{\mathcal{H}}$ and $\mathbf{E}_{x \sim \hat{p}}[f(x)] = \langle f, \mu[\hat{p}] \rangle_{\mathcal{H}}$. Hence the left-hand side of the bound equals $\sup_{\|f\|_{\mathcal{H}} \le 1} |\langle \mu[p] - \mu[\hat{p}], f \rangle|$, which is given by $\|\mu[p] - \mu[\hat{p}]\|_{\mathcal{H}}$. Using the triangle inequality, our assumption on $\mu[\hat{p}]$, and Theorem 2 completes the proof.

This means that we have good control over the behavior of the expectations, as long as the function class is "smooth" on $\mathcal{X}$ in terms of the Rademacher average. It also means that $\|\mu[X] - \mu[\hat{p}]\|_{\mathcal{H}}$ is a sensible objective to minimize if we are only interested in approximating the expectations over functions $f$ well. This bound also provides the basis for tailoring density estimators. Essentially, if we have good knowledge of the function class used in an application, we can choose the corresponding RKHS or mean map. This is equivalent to filtering the data and extracting only certain moments. The density estimator $\hat{p}$ can then focus on matching $p$ only up to these moments.

5. Examples

We now give concrete examples of density estimation. A number of existing methods are special cases of our setting.

Table 1. Expansions for $Q_{ij}$ and $l_j$ when using Gaussian prototypes and various kernels in combination. Here $x_i$ and $x_j$ denote prototype means in $Q_{ij}$ (in $l_j$, the sum runs over the data points $x_i$), $c := \langle x_i, x_j \rangle + 1$, and $\|a\|^2_{\Sigma} := a^\top \Sigma^{-1} a$.

Linear kernel $\langle x, x' \rangle$:
  $Q_{ij} = \langle x_i, x_j \rangle$;  $l_j = \frac{1}{m} \sum_{i=1}^{m} \langle x_i, x_j \rangle$.
Degree-2 polynomial kernel $(\langle x, x' \rangle + 1)^2$:
  $Q_{ij} = c^2 + \mathrm{tr}(\Sigma_i \Sigma_j) + x_i^\top \Sigma_j x_i + x_j^\top \Sigma_i x_j$;  $l_j = \frac{1}{m} \sum_{i=1}^{m} (c^2 + x_i^\top \Sigma_j x_i)$.
Degree-3 polynomial kernel $(\langle x, x' \rangle + 1)^3$:
  $Q_{ij} = c^3 + 6\, x_i^\top \Sigma_i \Sigma_j x_j + 3c\,(\mathrm{tr}(\Sigma_i \Sigma_j) + x_i^\top \Sigma_j x_i + x_j^\top \Sigma_i x_j)$;  $l_j = \frac{1}{m} \sum_{i=1}^{m} (c^3 + 3c\, x_i^\top \Sigma_j x_i)$.
Gaussian RBF kernel $e^{-\frac{1}{2\sigma^2}\|x - x'\|^2}$:
  $Q_{ij} = |\sigma^{-2}(\Sigma_i + \Sigma_j) + I|^{-1/2}\, e^{-\frac{1}{2}\|x_i - x_j\|^2_{\Sigma_i + \Sigma_j + \sigma^2 I}}$;  $l_j = \frac{1}{m} \sum_{i=1}^{m} |\sigma^{-2}\Sigma_j + I|^{-1/2}\, e^{-\frac{1}{2}\|x_i - x_j\|^2_{\Sigma_j + \sigma^2 I}}$.

Discrete Prototype or Discrete Kernel The simplest case is to represent $\hat{p}$ by a convex combination of Dirac measures $p_i(x) = \delta_{x_i}(x)$. Particle filters (Doucet et al., 2001) use this choice when approximating distributions. For instance, we could choose the $x_i$ to be the set of training points. In this case $Q$ defined in (5) equals the kernel matrix and $l$ is the vector of empirical kernel averages:

$Q_{ij} = k(x_i, x_j)$ and $l_j = \frac{1}{m} \sum_{i=1}^{m} k(x_i, x_j)$. (7)
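For readers who want to reproduce the estimation step, here is a minimal sketch (ours, not the authors' code) of how the quadratic program (4) can be solved once Q and l are available, using an off-the-shelf constrained optimizer; solve_kmm_weights is a hypothetical name, and the paper does not prescribe a particular QP solver.

```python
import numpy as np
from scipy.optimize import minimize

def solve_kmm_weights(Q, l, lam=1e-10):
    """Solve the KMM quadratic program (4):
        min_a 0.5 * a'(Q + lam*I)a - l'a   s.t.  sum(a) = 1, a >= 0."""
    n = len(l)
    A = Q + lam * np.eye(n)
    objective = lambda a: 0.5 * a @ A @ a - l @ a
    gradient = lambda a: A @ a - l
    constraints = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)
    bounds = [(0.0, None)] * n
    a0 = np.full(n, 1.0 / n)                       # start from uniform weights
    res = minimize(objective, a0, jac=gradient, method='SLSQP',
                   bounds=bounds, constraints=constraints)
    return res.x

# usage with the discrete-prototype Q and l of Eq. (7):
#   alpha = solve_kmm_weights(rbf_kernel(Z, Z), rbf_kernel(X, Z).mean(axis=0))
```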
Furthermore, Jebara et al. (2004) introduced kernels on probability distributions which effectively use definition (5). While they were not motivated by the connection between kernels and density estimation, their results for rich classes of densities, such as HMMs, can be used directly to compute our $Q$ and $l$. The key difference between an unweighted sample set as used in particle filtering and our setting is that our expansion is specifically optimized towards good estimates with respect to functions drawn from $\mathcal{H}$.

The problem of data squashing (DuMouchel et al., 1999) can likewise be seen as a special case of kernel mean matching. Here one aims to approximate a potentially large set $X$ by a smaller set $X' = \{(x'_1, \alpha_1), \ldots, (x'_n, \alpha_n)\}$ of weighted observations, discarding $X$ and retaining only $X'$ for all further processing. If $\|\mu[X] - \mu[X']\|_{\mathcal{H}}$ is small, we expect $X'$ to be a good proxy for $X$.

Instead of using generic kernels $k$ and discrete measures $\delta_{x_i}$ as prototypes for density estimation, we may reverse their roles. That is, we may pick generic densities $p_i$ and a Dirac kernel $k(x, x') = \delta(x = x')$. Note that this is only well defined for discrete domains $\mathcal{X}$.¹ In this case the mean operator simply maps a distribution into itself and we obtain $\langle \mu[p], \mu[p'] \rangle_{\mathcal{H}} = \int_{\mathcal{X}} p(x) p'(x)\, dx$. Using (5) we have

$Q_{ij} = \int_{\mathcal{X}} p_i(x) p_j(x)\, dx$ and $l_j = \frac{1}{m} \sum_{i=1}^{m} p_j(x_i)$. (8)

¹ On continuous domains such a kernel does not correspond to an RKHS, since point evaluation is not continuous.

Gaussian Prototype In general we will pick neither discrete prototypes nor discrete kernels for density estimation. We now give explicit expressions for Gaussian prototypes

$p_i(x) = (2\pi)^{-d/2} |\Sigma_i|^{-1/2} \exp\!\big(-\tfrac{1}{2} \|x - x_i\|^2_{\Sigma_i}\big)$, (9)

where $d$ is the dimension of the data, $\Sigma_i \succeq 0$ is a covariance matrix, and $\|x - x'\|^2_{\Sigma_i} := (x - x')^\top \Sigma_i^{-1} (x - x')$ is the squared Mahalanobis distance. When used in conjunction with different kernels, we obtain the expansions in Table 1.

Other Prototypes and Kernels Other combinations of kernels and prototypes also lead to closed-form expansions. For instance, similar expressions hold for a Laplacian kernel; however, this involves more tedious integrals of the form $\int e^{-\lambda(|x| + |x - a|)}\, dx = (\lambda^{-1} + |a|)\, e^{-\lambda |a|}$. Another example is to use indicator functions on unit intervals centered at $x_i$ as the $p_i$, together with a Gaussian RBF kernel. In this case, both $Q$ and $l$ can be expressed using the error function (erf).
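To make Table 1 concrete, the sketch below (ours, and dependent on our reconstruction of the Gaussian RBF row being correct) evaluates Q and l in closed form for Gaussian prototypes N(x_i, Σ_i) and the kernel exp(-||x - x'||²/2σ²); the function name gaussian_rbf_Q_l is hypothetical. The result can be passed directly to the QP (4).

```python
import numpy as np

def gaussian_rbf_Q_l(means, covs, X, sigma):
    """Closed-form Q and l for Gaussian prototypes p_i = N(means[i], covs[i]) and
    the kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) (Gaussian RBF row of Table 1):

      Q_ij = |I + (S_i + S_j)/sigma^2|^{-1/2} exp(-0.5 d' (S_i + S_j + sigma^2 I)^{-1} d),  d = m_i - m_j
      l_j  = (1/m) sum_i |I + S_j/sigma^2|^{-1/2} exp(-0.5 (x_i - m_j)' (S_j + sigma^2 I)^{-1} (x_i - m_j))
    """
    n, d = means.shape
    I = np.eye(d)
    Q = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            S = covs[i] + covs[j] + sigma ** 2 * I
            diff = means[i] - means[j]
            Q[i, j] = (np.linalg.det(I + (covs[i] + covs[j]) / sigma ** 2) ** -0.5
                       * np.exp(-0.5 * diff @ np.linalg.solve(S, diff)))
    l = np.empty(n)
    for j in range(n):
        S = covs[j] + sigma ** 2 * I
        diffs = X - means[j]
        quad = np.einsum('id,id->i', diffs, np.linalg.solve(S, diffs.T).T)
        l[j] = np.linalg.det(I + covs[j] / sigma ** 2) ** -0.5 * np.mean(np.exp(-0.5 * quad))
    return Q, l
```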
6. Related Work

Our work is related to the density estimators of Vapnik & Mukherjee (2000) and Shawe-Taylor & Dolia (2007). The main difference lies in the function space chosen to measure the approximation properties. The former uses the Banach space of functions of bounded variation, while the latter uses the space $L_1(q)$, where $q$ denotes a distribution over test functions. For spherically invariant distributions over test functions our approach and the latter approach are identical, with the key difference (to our advantage) that our optimization is a simple QP which does not require constraint sampling to make the optimization feasible.

Support Vector Density Estimation The model of Vapnik & Mukherjee (2000) can be summarized as follows: let $F[p]$ be the cumulative distribution function of $p$ and let $F[X]$ be its empirical counterpart. Assume $\hat{p}$ is given by (2), and that we have a regularizer $\Omega[\alpha]$ as previously discussed. In this case the Support Vector Density Estimation problem can be written as

$\min_{\alpha\ \text{feasible}} \ \frac{1}{m} \sum_{i=1}^{m} |F[\hat{p}](x_i) - F[X](x_i)| + \Omega[\alpha]$. (10)

That is, we minimize the $\ell_1$ distance between the empirical and estimated cumulative distribution functions evaluated on the set of observations $X$.

To integrate this into our framework we need to extend our setting from Hilbert spaces to Banach spaces. Denote by $\mathcal{B}$ a Banach space, let $\mathcal{X}$ be a domain furnished with probability measures $p, p'$, and let $\phi : \mathcal{X} \to \mathcal{B}$ be a feature map into $\mathcal{B}$. Analogously, we define the mean map $\mu : \mathcal{P} \to \mathcal{B}$ as $\mu[p] := \mathbf{E}_{x \sim p}[\phi(x)]$. Moreover, we define a distance between distributions $p$ and $p'$ via $D(p, p') := \|\mu[p] - \mu[p']\|_{\mathcal{B}}$. If we choose $\phi(x) = (\chi_{(-\infty, x]}(x_1), \ldots, \chi_{(-\infty, x]}(x_m))$, where $\chi$ is the indicator function, and use the $\ell_1$ norm on $\mathbb{R}^m$, we recover SV density estimation as a special case.

Expected Deviation Estimation Shawe-Taylor & Dolia (2007) defined a distance between distributions as follows: let $\mathcal{H}$ be a set of functions on $\mathcal{X}$ and $q$ be a probability distribution over $\mathcal{H}$. Then the distance between two distributions $p$ and $p'$ is given by

$D(p, p') := \mathbf{E}_{f \sim q(f)} \big[ |\mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{x \sim p'}[f(x)]| \big]$. (11)

That is, we compute the average distance between $p$ and $p'$ with respect to a distribution of test functions.

Lemma 4 Let $\mathcal{H}$ be a reproducing kernel Hilbert space, $f \in \mathcal{H}$, and assume $q(f) = q(\|f\|_{\mathcal{H}})$ with finite $\mathbf{E}_{f \sim q}[\|f\|_{\mathcal{H}}]$. Then $D(p, p') = C \|\mu[p] - \mu[p']\|_{\mathcal{H}}$ for some constant $C$ which depends only on $\mathcal{H}$ and $q$.

Proof Note that by definition $\mathbf{E}_{x \sim p}[f(x)] = \langle \mu[p], f \rangle_{\mathcal{H}}$. Using linearity of the inner product, Equation (11) equals

$\int |\langle \mu[p] - \mu[p'], f \rangle_{\mathcal{H}}|\, dq(f) = \|\mu[p] - \mu[p']\|_{\mathcal{H}} \int \Big| \Big\langle \tfrac{\mu[p] - \mu[p']}{\|\mu[p] - \mu[p']\|_{\mathcal{H}}}, f \Big\rangle_{\mathcal{H}} \Big|\, dq(f)$,

where the remaining integral is independent of $p, p'$. To see this, note that for any $p, p'$, the vector $\frac{\mu[p] - \mu[p']}{\|\mu[p] - \mu[p']\|_{\mathcal{H}}}$ is a unit vector which can be turned into, say, the first canonical basis vector by a rotation which leaves the integral invariant, bearing in mind that $q$ is rotation invariant.

The above result covers a large number of interesting function classes. To go beyond Hilbert spaces, let $\phi : \mathcal{X} \to \mathcal{B}$ be the transformation from $x$ into $(f(x))_{f \in \mathcal{H}}$ and let $\|z\|_{\mathcal{B}} := \mathbf{E}_{f \sim q(f)}[|z_f|]$ be the $L_1(q)$ norm. Then (11) can also be written as $\|\mu[p] - \mu[p']\|_{\mathcal{B}}$, where $\mu$ is the mean map into the Banach space. The main drawback of this approach is the nontrivial computation required for constraint sampling (de Farias & Roy, 2004) and the additional uniform convergence reasoning it requires. In Hilbert spaces no such operations are needed.

[Figure 1. Left: Negative log-likelihood (roughly 3.8–4.2) using a mixture of 3 Gaussians, as a function of training sample size (50–450), for GMM, PZ, RSDE and KMM. The height of the bars represents the median of the scores from 100 repeats, and the whiskers correspond to the quartiles. We mark the best method with a black dot above the whiskers. Right: Sparsity of the KMM solution. Blue dots are data points; red circles represent prototypes selected by KMM, with circle size proportional to the magnitude of the weight $\alpha_i$.]

7. Experiments

We focus on two aspects: first, our method performs well as a density estimator per se; and second, it can be tailored towards the expectation over a particular function class.

7.1. Methods for Comparison

Gaussian Mixture Model (GMM)² The density was represented as a convex sum of Gaussians. GMM was initialized with k-means clustering. The centers, covariances and mixing proportions of the Gaussians were optimized using the EM algorithm. We used diagonal covariances in all our experiments. We always employed 50 random restarts for k-means, and returned the results from the best restart.

Parzen Windows (PZ)³ The density was represented as an average of a set of normalized RBF functions, each centered on a data point. The bandwidths of the RBF functions were identical and tuned via the likelihood using leave-one-out cross validation.

Reduced Set Density Estimation (RSDE)³ Girolami & He (2003) compressed a Parzen window estimator using RBF functions of larger bandwidths. The reduced representation was produced by minimizing an integrated squared distance between the two densities.

Kernel Moment Matching (KMM) In applying our method, we used Gaussians with diagonal covariances as our prototypes $p_i$. The regularization parameter $\lambda$ was fixed at $10^{-10}$ throughout. Since KMM may be tailored to different RKHS, we instantiated it with the four different kernels in Table 1, denoted LIN, POL2, POL3 and RBF, respectively. Our choice of kernel corresponded to the function class on which we evaluated the expectations. The initialization of the prototypes is discussed further below.

² GMM code from: http://www.datalab.uci.edu/resources/gmm/
³ PZ and RSDE code from: http://ttic.uchicago.edu/~ihler/code/
7.2. Evaluation Criterion

We compared the various density estimators in terms of two criteria: negative log-likelihood and the discrepancy between function expectations on test data. Since different algorithms are optimized using different criteria, we expect each to win with respect to the criterion it employs. The benefit of our density estimator is that we can explicitly tailor it to different classes of functions. For this reason, we focus on the second criterion.

Given a function $f$, the discrepancy between function expectations is computed as follows: (i) evaluate the function expectation using the test points, i.e., $\frac{1}{m} \sum_{i=1}^{m} f(x_i)$; (ii) evaluate the function expectation using the estimated density $\hat{p}$, i.e., $\mathbf{E}_{x \sim \hat{p}}[f(x)]$; (iii) calculate $|\frac{1}{m} \sum_{i=1}^{m} f(x_i) - \mathbf{E}_{x \sim \hat{p}}[f(x)]|$ and normalize it by $|\frac{1}{m} \sum_{i=1}^{m} f(x_i)|$.

We compare the various methods either by repeated random instantiation or by random splits of the data (made clear in context). In both cases, we perform paired sign tests at the significance level of 0.01 on the results obtained from the randomizations. We always report the median of the results in the subsequent tables, and highlight in boldface those statistically equivalent methods that are significantly better than the rest.
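The following sketch (ours) implements steps (i)–(iii) for RBF test functions of the kind used in Section 7.4. For simplicity it estimates E_{x~p̂}[f(x)] by Monte Carlo sampling from p̂, whereas for Gaussian-mixture estimates and the kernels of Table 1 this expectation is available in closed form. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(Z, x, sigma):
    return np.exp(-((Z - x) ** 2).sum(-1) / (2 * sigma ** 2))

def random_test_function(X_test, sigma):
    """f = sum_{i=1}^{m0} w_i k(z_i, .) with m0 ~ U{1,...,m}, w_i ~ U[-1,1],
    and the z_i drawn uniformly from the test points (as in Section 7.4)."""
    m = len(X_test)
    m0 = rng.integers(1, m + 1)
    Z = X_test[rng.choice(m, size=m0, replace=True)]
    w = rng.uniform(-1.0, 1.0, size=m0)
    return lambda x: float(w @ rbf(Z, x, sigma))

def expectation_discrepancy(f, X_test, sample_from_phat, n_mc=10000):
    """Steps (i)-(iii): |mean_test f - E_phat[f]| / |mean_test f|.
    E_phat[f] is approximated here by Monte Carlo samples from the estimate."""
    emp = np.mean([f(x) for x in X_test])
    est = np.mean([f(x) for x in sample_from_phat(n_mc)])
    return abs(emp - est) / abs(emp)
```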
7.3. Synthetic Dataset

In this experiment, we use synthetic datasets to compare the various methods as the sample size changes. We also show that KMM leads to sparse solutions.

Data Generation We generated 2-dimensional mixtures of 3 Gaussians with centers $(0, 0)^\top$, $(3, 3)^\top$ and $(-6, 4)^\top$, and covariances $0.8^2 I$, $1.2^2 I$ and $I$, respectively. The mixing proportions were 0.2, 0.3 and 0.5, respectively. We varied the training sample size while always testing on samples of size 1000. For each fixed training size, we randomly instantiated the experiment 100 times.

Experimental Protocol All training data points were used as prototypes for RSDE and KMM. Their initial covariances were set to be identical, and were initialized in both cases using the approach of RSDE. We used the RBF instance of KMM and set the bandwidth of the kernel to be the same as that of the prototypes. GMM used 3 centers.

Negative Log Likelihood The results are plotted in Figure 1. GMM performs best in general, while KMM is superior for small sample sizes. This is not surprising, since we used a correct generative model of the data for GMM. When the sample size is small (fewer than 30 data points per cluster), GMM is susceptible to local maxima and does not produce good estimates.

Sparsity of the Solution KMM also leads to sparse solutions (e.g., Figure 1). When using all data points as candidate prototypes, KMM automatically prunes away most of them, resulting in a much reduced representation of the data. In terms of both likelihood and sparsity, KMM is superior to other reduction methods such as RSDE (Table 2).

Table 2. Sparsity of the solutions of RSDE and KMM. We show the median number of retained prototypes over 100 random initializations, together with the median percentage of retained prototypes after optimization.

Sample Size   RSDE           KMM
50            12 (25.0%)     10 (20.0%)
150           35 (23.3%)     21 (14.0%)
250           62 (24.8%)     30 (12.0%)
350           90 (25.9%)     36 (10.3%)
450           124 (27.5%)    42 (9.3%)

7.4. UCI Datasets

We used 15 UCI datasets to compare the various methods based on the discrepancies between function expectations.

Data Description We only included the real-valued dimensions of the data, and normalized each dimension separately to zero mean and unit variance. For each dataset, we randomly shuffled the data points 50 times. In each shuffle, we used the first half of the data for training and the remaining data for testing. In each shuffle, we also randomly generated 100 functions $f$ to evaluate the discrepancy criterion, i.e., $f = \sum_{i=1}^{m_0} w_i k(x_i, \cdot)$, where $m_0 \in \mathbb{N}$ was uniformly random in $[1, m]$, $w_i \in \mathbb{R}$ was uniformly random in $[-1, 1]$, and the $x_i$ were uniformly sampled from the test points. Thus, each method produced 5000 numbers for each dataset.

Experimental Protocol Both GMM and KMM used 10 prototypes and diagonal covariances, and both were initialized using k-means clustering.

Table 3. Negative log-likelihood on test points as computed by the various density estimators over randomizations.

Data         PZ       GMM      RSDE     LIN      POL2      POL3     RBF
covertype    11.48    11.22    14.97    11.64    208.29    45.23    62.37
ionosphere   28.09    36.58    56.68    29.55    69.92     68.79    46.49
sonar        78.29    119.16   122.35   78.13    129.81    92.35    112.21
australian   3.32     5.82     8.82     3.40     4.64      22.73    7.27
specft       43.01    42.61    43.16    42.90    231.76    87.28    105.04
wdbc         25.17    42.97    48.44    25.98    248.73    63.61    88.61
wine         19.68    21.17    22.95    19.43    70.15     47.99    48.94
satimage     18.49    39.10    59.88    20.18    158.31    121.27   52.41
segment      -1.43*   5.71     36.74    -1.07    154.25    128.74   28.38
vehicle      10.98    11.99    32.85    11.34    170.66    200.22   83.35
svmguide2    27.85    39.67    40.07    27.92    204.30    59.22    36.08
vowel        11.75    6.24     25.59    11.77    108.43    47.45    26.18
housing      3.68     7.44     15.53    3.81     16.07     90.51    39.88
bodyfat      16.38    20.06    21.96    16.59    87.23     171.53   53.33
abalone      2.53     2.57     10.17    2.75     19.15     21.77    16.29
mix3 100     2.42     2.09     2.12     2.55     2.49      2.43     2.41
mix3 500     1.91     1.92     1.92     1.94     1.93      1.91     1.91
mix3 1000    1.86     1.87     1.88     1.88     1.87      1.87     1.86

* Some numbers are negative, which is possible since, unlike a probability mass function, a density can take values greater than 1.
We used all four instances of KMM, namely LIN, POL2, POL3 and RBF, in the experiments, depending on the function class on which we evaluated the expectations. When we used the RBF instance of KMM, we set the bandwidth of the kernel to the median of the distances between data points. Besides optimizing the mixing proportions of the prototypes of KMM, we also used conjugate gradient descent to optimize the center positions and covariances of the prototypes.

Negative Log Likelihood Kernel moment matching can be a very different objective from the likelihood (Table 3). Except for the LIN instance, KMM results in a much larger negative log-likelihood. This suggests that if the purpose of density estimation is to approximate function expectations, likelihood is no longer a good criterion. We confirm this observation in our next experiment.

Discrepancy between Function Expectations We used four classes of functions corresponding to the RKHS of the LIN, POL2, POL3 and RBF instances of KMM. For nonlinear functions KMM clearly outperforms the other density estimators, while for linear functions KMM has performance equivalent to PZ and GMM (Table 4). These results are not surprising, since KMM is explicitly optimized for approximating the function expectations well. Note that PZ is the second best for polynomial functions. This is reasonable, since PZ retains all training points in the density, and should perform better than compressed representations such as GMM and RSDE.

We also applied this experimental protocol to the synthetic mixture of 3 Gaussians from the previous section, instantiated with 3 different sample sizes: 100, 500 and 1000. The results are shown in the last three rows of Tables 3 and 4, and are consistent with those for the UCI data. A closer view of the difference between GMM and KMM on the "covertype" dataset is shown in Figure 2. We chose to compare GMM and KMM because they are initialized similarly.

[Figure 2. Scatter plot of the discrepancies between function expectations (in log scale) for the 'covertype' dataset, with GMM discrepancies on the horizontal axis and KMM discrepancies on the vertical axis. Left: polynomial functions (d = 2); Right: RBF functions. The distribution of the points is skewed below the red diagonal line, which means KMM is better. The numbers near the corners count, respectively, the number of points falling above and below the red diagonal (left: 1777 above, 3223 below; right: 1989 above, 3011 below).]

As an aside, we remark that PZ and GMM also match the empirical mean $\frac{1}{m} \sum_{j=1}^{m} x_j$. This is obvious for PZ. For GMM, in each EM iteration, the centers $\mu_i$ and weights $\alpha_i$ of each $p_i$ are updated via $\mu_i \leftarrow \sum_{j=1}^{m} \gamma_j^i x_j / \sum_{j=1}^{m} \gamma_j^i$ and $\alpha_i \leftarrow \frac{1}{m} \sum_{j=1}^{m} \gamma_j^i$,
where $\gamma_j^i$ is the probability of $x_j$ being generated by $p_i$. It follows that $\mathbf{E}_{x \sim \hat{p}}[x] = \sum_i \alpha_i \mu_i$ also matches $\frac{1}{m} \sum_{j=1}^{m} x_j$.

8. Applications

In this section, we employ KMM in two different applications: message compression in graphical models, and image processing. The common feature of these two applications is that they involve density estimation for computing the expectation of a function, which is the relevant setting for KMM.

8.1. Message Compression

We use density estimation to compress messages in graphical models. This is of particular interest for applications such as distributed inference in sensor networks, where we wish to compress the messages to cater for limited power supply and communication bandwidth. We use a particle filtering example to compare GMM and KMM only, since they performed best in our earlier experiments.

We model a one-dimensional time series $y_t$ ($t = 1 \ldots 100$) as being conditionally independent given an unobserved state $s_t$, which is itself Markovian. This system evolves as follows:

$s_t = f(s_{t-1}) = 1 + \sin(0.04\pi t) + 0.5 s_{t-1} + v$, (12)
$y_t = g(s_t) = \begin{cases} 0.2 s_t^2 + w, & \text{if } t < 50, \\ 0.5 s_t - 2 + w, & \text{otherwise.} \end{cases}$ (13)

The random variables $v$ and $w$ represent process and measurement noise, respectively, and are modeled as mixtures of Gaussians,

$v \sim \frac{1}{5} \sum_{i=1}^{5} \mathcal{N}(\mu_i, \sigma^2)$, $\quad w \sim \mathcal{N}(0, \sigma^2)$. (14)

Note that our setting is a modification of de Freitas's demo⁴ in which we only change the process noise from a unimodal gamma distribution to a more complicated mixture model.

⁴ Throughout this experiment, we fix $\sigma$ to 0.2 and choose the $\mu_i$ to be $\{-1.5, -0.5, 0.5, 1.5, 2\}$. We initialize $s_0$ with ... (http://www.cs.ubc.ca/~nando/software/upf_demos.tar.gz)

The task of particle filtering (Doucet et al., 2001) is to infer the hidden state given past and current observations. This can be carried out by estimating the filtering density $p(s_t | Y_t) := p(s_t | y_1, \ldots, y_t)$ recursively in a two-stage procedure. First, the current filtering density $p(s_t | Y_t)$ is propagated into the future via the transition density $p(s_{t+1} | s_t)$ to produce the prediction density $p(s_{t+1} | Y_t)$, i.e.,

$\mathbf{E}_{s_t \sim p(s_t | Y_t)}[p(s_{t+1} | s_t)] := \int p(s_{t+1} | s_t)\, p(s_t | Y_t)\, ds_t$. (15)

Second, $p(s_{t+1} | Y_t)$ is updated via Bayes' law,

$p(s_{t+1} | Y_{t+1}) \propto p(y_{t+1} | s_{t+1})\, p(s_{t+1} | Y_t)$. (16)

The integral in (15) is usually intractable since the filtering density $p(s_t | Y_t)$ can take a complicated form. Therefore, $p(s_t | Y_t)$ is often approximated by a set of samples called particles. For distributed inference, it is these samples that need to be passed around. We want to compress the samples using density estimation such that we still do well in computing $\mathbf{E}_{s_t \sim p(s_t | Y_t)}[p(s_{t+1} | s_t)]$. In our example, $p(s_{t+1} | s_t)$ takes the form

$p(s_{t+1} | s_t) \propto \sum_{i=1}^{5} \exp\!\Big( -\frac{(s_{t+1} - f(s_t) - \mu_i)^2}{2\sigma^2} \Big)$. (17)

As a function of the variable $s_t$, $p(s_{t+1} | s_t)$ lies in the RKHS with kernel $k(x, x') = \exp\!\big( -\frac{(x - x')^2}{2(2\sigma)^2} \big)$. We can customize KMM using this kernel, and compress messages by targeting a good approximation of $\mathbf{E}_{s_t \sim p(s_t | Y_t)}[p(s_{t+1} | s_t)]$.

We use 5 centers for both GMM and KMM to compress the messages, and compare the filtering results with the true states. The error is measured as the root mean square of the deviations. The results for compressing different numbers of particles are reported in Table 5. We find that the filtering results after compression even outperform those obtained from the full set of particles (PF). In particular, the results for KMM are slightly better than those for GMM. Through compression, we have extracted the information most essential to statistical inference, and actually made the inference more robust. If the compression is targeted to $\mathbf{E}_{s_t \sim p(s_t | Y_t)}[p(s_{t+1} | s_t)]$ (as we do in KMM), we simply get better results.

The shortcomings of general-purpose density estimation also arise in the more general settings of message passing and belief propagation. This is due to the way messages are constructed: given a clique, the incoming messages are multiplied by the clique potential and all variables not in the receiver are integrated out. In most cases, this makes the outgoing messages very complicated, causing significant computational problems. Popular methods include particle filtering, which uses a discrete approximation of the messages, and expectation propagation, which uses a single Gaussian approximation of the messages (Minka, 2001). We plan to further investigate KMM in these general settings. Our key benefit is that we can customize the approximation properties for a particular graphical model.
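To illustrate the setting, below is a minimal bootstrap particle filter for the model (12)–(14) (our sketch, not the authors' code). We assume the initial state s_0 = 0 and a shared noise scale σ = 0.2; the message-compression step (fitting 5 GMM or KMM centers to the particles before they are communicated) is indicated only as a hook.

```python
import numpy as np

rng = np.random.default_rng(1)
MU = np.array([-1.5, -0.5, 0.5, 1.5, 2.0])      # process-noise means (footnote 4)
SIGMA = 0.2                                      # noise scale (footnote 4); assumed shared by v and w

def f(s, t):                                     # state transition, Eq. (12)
    return 1 + np.sin(0.04 * np.pi * t) + 0.5 * s

def g(s, t):                                     # observation model, Eq. (13)
    return 0.2 * s ** 2 if t < 50 else 0.5 * s - 2

def simulate(T=100, s0=0.0):                     # s0 = 0 is our assumption
    s, states, obs = s0, [], []
    for t in range(1, T + 1):
        s = f(s, t) + rng.normal(MU[rng.integers(5)], SIGMA)
        states.append(s)
        obs.append(g(s, t) + rng.normal(0.0, SIGMA))
    return np.array(states), np.array(obs)

def bootstrap_pf(obs, n_particles=500, s0=0.0):
    """Plain bootstrap particle filter for (12)-(14)."""
    particles = np.full(n_particles, s0)
    estimates = []
    for t, y in enumerate(obs, start=1):
        noise = rng.normal(MU[rng.integers(5, size=n_particles)], SIGMA)
        particles = f(particles, t) + noise                        # propagate via (12)
        w = np.exp(-0.5 * ((y - g(particles, t)) / SIGMA) ** 2)    # likelihood from (13)-(14)
        w = w + 1e-300                                             # guard against underflow
        particles = particles[rng.choice(n_particles, size=n_particles, p=w / w.sum())]
        # <-- message compression (GMM / KMM with 5 centers) would be applied here
        estimates.append(particles.mean())
    return np.array(estimates)

states, obs = simulate()
print(np.sqrt(np.mean((bootstrap_pf(obs) - states) ** 2)))         # RMSE as reported in Table 5
```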
Table 4. Discrepancy between function expectations over randomizations. Smaller numbers are not necessarily statistically significant.

Data         |  Linear functions              |  Polynomials (d=2)             |  Polynomials (d=3)             |  RBF functions
             |  PZ     GMM    RSDE    LIN     |  PZ     GMM    RSDE    POL2    |  PZ     GMM    RSDE    POL3    |  PZ     GMM    RSDE    RBF
covertype    |  2.003  2.003  10.280  2.003   |  0.185  0.194  0.396   0.150   |  0.418  0.539  1.240   0.412   |  0.073  0.023  0.071   0.020
ionosphere   |  2.006  2.006  17.995  2.006   |  0.159  0.232  0.383   0.169   |  0.615  0.664  1.659   0.626   |  0.120  0.024  0.142   0.022
sonar        |  2.000  2.000  12.288  2.000   |  0.971  0.354  0.933   0.242   |  0.691  0.745  2.558   0.673   |  0.857  0.030  0.873   0.029
australian   |  2.000  2.000  14.217  2.000   |  0.369  0.380  0.587   0.380   |  0.832  0.837  1.031   0.833   |  0.089  0.028  0.106   0.024
specft       |  2.000  2.000  3.594   2.000   |  0.891  0.515  0.522   0.488   |  0.922  0.878  1.265   0.867   |  0.903  0.067  0.904   0.062
wdbc         |  2.004  2.004  16.447  2.004   |  0.209  0.233  0.406   0.166   |  0.519  0.612  1.362   0.512   |  0.482  0.027  0.456   0.023
wine         |  2.017  2.017  9.489   2.017   |  0.822  0.236  1.027   0.211   |  0.679  0.718  2.782   0.682   |  0.471  0.040  0.545   0.039
satimage     |  2.000  2.000  27.561  2.000   |  0.146  0.126  0.533   0.122   |  0.260  0.281  1.230   0.256   |  0.307  0.028  0.359   0.026
segment      |  2.003  2.003  23.388  2.003   |  0.258  0.245  0.803   0.263   |  0.590  0.572  1.021   0.588   |  0.053  0.025  0.247   0.022
vehicle      |  2.005  2.005  26.331  2.005   |  0.126  0.135  0.780   0.119   |  0.496  0.478  1.686   0.493   |  0.095  0.028  0.325   0.027
svmguide2    |  2.005  2.005  7.248   2.005   |  3.468  0.247  3.341   0.183   |  0.866  0.782  2.603   0.729   |  0.798  0.019  0.808   0.018
vowel        |  2.000  2.000  12.913  2.000   |  0.131  0.150  0.642   0.131   |  0.348  0.394  1.741   0.352   |  0.028  0.019  0.111   0.018
housing      |  2.000  2.000  7.668   2.000   |  0.117  0.126  0.399   0.121   |  0.393  0.421  0.890   0.391   |  0.044  0.027  0.091   0.025
bodyfat      |  2.000  2.000  7.295   2.000   |  0.288  0.243  0.595   0.242   |  1.029  1.017  1.200   1.015   |  0.430  0.038  0.432   0.037
abalone      |  2.005  2.005  17.010  2.005   |  0.105  0.101  0.234   0.103   |  0.629  0.636  3.308   0.628   |  0.049  0.044  0.294   0.043
mix3 100     |  2.000  2.000  2.164   2.000   |  0.153  0.152  0.164   0.152   |  0.248  0.242  0.271   0.244   |  0.046  0.044  0.046   0.044
mix3 500     |  2.000  2.000  2.069   2.000   |  0.064  0.062  0.064   0.062   |  0.094  0.092  0.097   0.091   |  0.020  0.019  0.020   0.019
mix3 1000    |  2.000  2.000  2.035   2.000   |  0.052  0.051  0.051   0.050   |  0.082  0.081  0.082   0.080   |  0.015  0.014  0.015   0.014

Table 5. Root mean square error and standard deviation of the filtering results before and after particle compression. We randomly instantiated the system 50 times and concatenated the time series to produce the results. Statistical tests are performed by treating each time point as a data point.

Particle #   PF            GMM           KMM
100          0.683±0.114   0.558±0.084   0.546±0.072
500          0.679±0.111   0.556±0.076   0.530±0.070
1000         0.685±0.111   0.556±0.082   0.526±0.070

8.2. Image Retrieval and Categorization

Following the work of (Rubner et al., 2000; Greenspan et al., 2002), we use density estimation as an intermediate step for image retrieval and categorization.

8.2.1. Image Retrieval

Image retrieval is the task of finding, from a given database, the set of images similar to a given query image. An image is normally characterized by the distribution over features (e.g., color, texture) of its pixels, patches, etc. It is thus helpful to compress the distribution by density estimation into a more compact form (e.g., a mixture of Gaussians), on which the query is based.
In particular, the advantage is that density estimation can be computed offline before the query takes place, thus offering computational and storage savings.

Method Greenspan et al. (2002) used GMM for density estimation; we propose KMM as an alternative. After density estimation, the dissimilarity between two distributions needs to be measured, and the Earth Mover's Distance (EMD) is a state-of-the-art measure. Given two distributions represented by sets of weighted prototypes, EMD regards one collection as a mass of earth spread in the feature space, while the other is a collection of holes. The EMD is defined as the least amount of work needed to fill the holes with earth, where a unit of work corresponds to the ground distance between two prototypes. If we represent the distributions by mixtures of Gaussians, then a sensible ground distance $D(p_i, p'_j)$ between two Gaussians $p_i = \mathcal{N}(\mu_i, \Sigma_i)$ and $p'_j = \mathcal{N}(\mu'_j, \Sigma'_j)$ is the Fréchet distance used in (Greenspan et al., 2002),

$D^2(p_i, p'_j) := \|\mu_i - \mu'_j\|^2 + \mathrm{tr}\big( \Sigma_i + \Sigma'_j - 2 (\Sigma_i \Sigma'_j)^{1/2} \big)$.

Based on $D(p_i, p'_j)$, if $p = \sum_i \alpha_i p_i$ and $p' = \sum_j \beta_j p'_j$, where each $p_i$ is a Gaussian and $\alpha_i$ is its weight (and similarly for $p'$), then the EMD between $p$ and $p'$ is

$\mathrm{EMD}(p, p') := \min_{\gamma\ \text{feasible}} \sum_{ij} \gamma_{ij} D(p_i, p'_j)$,

where $\gamma_{ij} \ge 0$ is the flow between $p_i$ and $p'_j$, and feasibility means $\sum_j \gamma_{ij} \le \alpha_i$ and $\sum_i \gamma_{ij} \le \beta_j$ for all $i$ and $j$ (a small computational sketch is given at the end of this subsection).

Settings In this experiment, the distance measure is fixed to EMD. We plug the densities estimated by GMM and KMM into EMD⁵ and compare the retrieval results. Parameters for KMM and GMM were chosen in the same way as in Section 7.4. Here KMM used POL3. For each image, we sampled $10^3$ pixels, and each pixel's feature vector was the CIE-Lab value of a 5×5 window centered on it.

Results We collected L = 10537 images from various sources including FIRE and CIRES⁶. The dataset included 10 labeled categories, such as horse and beach, and each category has 100 images. For each image $I_c(i)$ from class $c$ ($c \in \{1, \ldots, 10\}$, $i \in \{1, \ldots, 100\}$), we retrieved the $r$ ($r \in \{1, \ldots, L\}$) closest images (in terms of EMD) from the whole database and counted how many among them are also from class $c$, denoted $g_c(i, r)$ for GMM and $k_c(i, r)$ for KMM. For each $c$ and $r$, we performed a paired sign test between $\{g_c(i, r)\}_{i=1}^{100}$ and $\{k_c(i, r)\}_{i=1}^{100}$. Since the p-value is always in (0, 1], we report in Figure 3 the log p-value if the median of $\{k_c(i, r) - g_c(i, r)\}_{i=1}^{100}$ is higher than 0; otherwise, we plot the negative log p-value. Negative values are in favor of KMM. In Figure 3, the performance of KMM is superior to or competitive with GMM in 8 categories and for most values of $r$ (the number of retrieved images).

⁵ EMD code from http://ai.stanford.edu/~rubner/emd
⁶ FIRE: http://www-i6.informatik.rwth-aachen.de/~deselaers/fire.html, CIRES: http://cires.matthewriley.com

[Figure 3. Log sign test p-value (vertical axis) vs. the number of retrieved images (horizontal axis, 0–10000) for the ten categories: horse, beach, bus, architecture, snow, dinosaur, elephant, flower, food, aboriginal. Negative if KMM is better than GMM, and positive otherwise; ±2 marks the significance level 0.01.]
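As a computational sketch (ours), the Fréchet ground distance and the EMD between two weighted prototype sets can be evaluated as follows. Since both mixtures' weights sum to one, the inequality-constrained EMD above reduces to a balanced transportation linear program. The function names are hypothetical.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog

def frechet_distance(mu1, S1, mu2, S2):
    """Frechet ground distance between N(mu1, S1) and N(mu2, S2):
       D^2 = ||mu1 - mu2||^2 + tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    d2 = ((mu1 - mu2) ** 2).sum() + np.trace(S1 + S2 - 2 * sqrtm(S1 @ S2).real)
    return np.sqrt(max(d2, 0.0))

def emd(w1, w2, D):
    """EMD between two weighted prototype sets with ground-distance matrix D.
    With both weight vectors summing to one, this is a balanced transportation LP."""
    n, m = D.shape
    A_eq, b_eq = [], []
    for i in range(n):                                  # row sums: sum_j f_ij = w1_i
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(w1[i])
    for j in range(m):                                  # column sums: sum_i f_ij = w2_j
        col = np.zeros(n * m); col[j::m] = 1.0
        A_eq.append(col); b_eq.append(w2[j])
    res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    return res.fun
```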
It is important to note that the Fréchet distance is not in the class of functions used in KMM, and KMM still performs reasonably well. In the next section, we learn an image classifier using the same kernel as used in KMM.

8.2.2. Image Categorization

A closely related but different task is learning to categorize images using multi-class classification, in particular with an SVM. Here all we need is a kernel between pairs of image densities $p$ and $q$, which is readily given by $\langle \mu[p], \mu[q] \rangle_{\mathcal{H}}$. The SVM classifier takes the form $f(p_j) = \sum_i \alpha_i \langle \mu[p_i], \mu[p_j] \rangle_{\mathcal{H}} = \mathbf{E}_{x \sim p_j}\big[\sum_i \alpha_i \mu[p_i](x)\big]$ for some $\alpha_i \in \mathbb{R}$. Since $\sum_i \alpha_i \mu[p_i] \in \mathcal{H}$, KMM ensures that $p_j$ is estimated such that this expectation is well approximated.

Our 10-class classification used 1000 images from the 10 categories. We randomly divided each category into 70 images for training and 30 images for testing. We used LibSVM to train a multi-class SVM with the one-against-one criterion on the combined 700 training images. The loss and regularization tradeoff parameter was determined by an inner-loop 10-fold cross validation on the training data. Finally, we tested the accuracy of the learned model on the 300 test images. The whole process was repeated 1500 times. We used POL3 for both KMM and the SVM, because for both GMM and KMM, POL3 significantly outperforms POL2 and RBF in practice.⁷ By a paired sign test, KMM yields a lower error rate than GMM at significance level 0.01. Figure 4 shows a scatter plot of the resulting error rates.

[Figure 4. Error rate of image categorization using KMM (vertical axis) versus GMM (horizontal axis). The counts of points on either side of the diagonal and the ties are labeled in the plot (871 dots, 498 dots, tie 131).]

⁷ The error rate of GMM using POL3 is 23.3±2.24%, outperforming POL2 (23.8±2.14%) at significance level 0.01. KMM has 22.9±2.19% error using POL3, beating POL2 (24.3±2.14%) at significance level 0.01. Using an RBF kernel, where the fixed bandwidth was tested over 0.01, 0.1, 1, 10, and 100 times the median distance between the 1000 images, both GMM and KMM incur over 50% error.
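As a small sketch (ours) of the kernel between image densities used here: with Gaussian-mixture estimates and the degree-3 polynomial kernel, ⟨µ[p], µ[q]⟩_H follows from the POL3 row of Table 1 (as reconstructed above), and the resulting Gram matrix can be fed to an SVM through a precomputed-kernel interface. The function names are our own.

```python
import numpy as np

def poly3_Q(means_p, covs_p, means_q, covs_q):
    """Cross matrix Q_ij = E_{x~p_i, x'~q_j}[(<x, x'> + 1)^3] for Gaussian
    prototypes, using the degree-3 polynomial row of Table 1 (as reconstructed)."""
    Q = np.empty((len(means_p), len(means_q)))
    for i, (xi, Si) in enumerate(zip(means_p, covs_p)):
        for j, (xj, Sj) in enumerate(zip(means_q, covs_q)):
            c = xi @ xj + 1.0
            Q[i, j] = (c ** 3 + 6 * xi @ Si @ Sj @ xj
                       + 3 * c * (np.trace(Si @ Sj) + xi @ Sj @ xi + xj @ Si @ xj))
    return Q

def mean_map_kernel(p, q):
    """<mu[p], mu[q]>_H for two mixture densities given as (weights, means, covs)
    triples; the Gram matrix over a set of images can be used as a precomputed
    SVM kernel (e.g. LibSVM's precomputed-kernel mode)."""
    (a, mp, cp), (b, mq, cq) = p, q
    return float(a @ poly3_Q(mp, cp, mq, cq) @ b)
```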
Acknowledgements

NICTA is funded by the Australian Government's Backing Australia's Ability and the Centre of Excellence programs. This work is also supported by the IST Program of the European Community, under the FP7 Network of Excellence, ICT-216886-NOE.

References

Altun, Y., & Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In COLT 2006.

Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory.

Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 463–482.

de Farias, N., & Roy, B. (2004). On constraint sampling in the linear programming approach to approximate dynamic programming. Math. Oper. Res., 29(3).

Doucet, A., de Freitas, N., & Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer-Verlag.

Dudík, M., Phillips, S., & Schapire, R. (2004). Performance guarantees for regularized maximum entropy density estimation. In COLT 2004.

DuMouchel, W., Volinsky, C., Cortes, C., Pregibon, D., & Johnson, T. (1999). Squashing flat files flatter. In KDD 1999.

Girolami, M., & He, C. (2003). Probability density estimation from optimally condensed data samples. IEEE TPAMI, 25(10), 1253–1264.

Greenspan, H., Dvir, G., & Rubner, Y. (2002). Context-based image modeling. In ICPR 2002.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., & Smola, A. (2007). A kernel method for the two-sample-problem. In NIPS 2007.

Jebara, T., Kondor, R., & Howard, A. (2004). Probability product kernels. JMLR, 5, 819–844.

McLachlan, G., & Basford, K. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker.

Minka, T. (2001). Expectation Propagation for approximate Bayesian inference. Ph.D. thesis, MIT.

Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Stat., 33, 1065–1076.

Rubner, Y., Tomasi, C., & Guibas, L. (2000). The earth mover's distance as a metric for image retrieval. Intl. J. Computer Vision, 40(2), 99–121.

Shawe-Taylor, J., & Dolia, A. (2007). A framework for probability density estimation. In AISTATS 2007.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman and Hall.

Smola, A., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In ALT 2007.

Steinwart, I. (2002). The influence of the kernel on the consistency of support vector machines. JMLR, 2, 463–482.

Vapnik, V., & Mukherjee, S. (2000). Support vector method for multivariate density estimation. In NIPS 12.

Wainwright, M. J., & Jordan, M. I. (2003). Graphical models, exponential families, and variational inference. Tech. Rep. 649, UC Berkeley.