Injective Hilbert Space Embeddings of Probability Measures

Bharath K. Sriperumbudur¹, Arthur Gretton², Kenji Fukumizu³, Gert Lanckriet¹ and Bernhard Schölkopf²
¹ Department of ECE, UC San Diego, La Jolla, CA 92093, USA.
² MPI for Biological Cybernetics, Spemannstraße 38, 72076 Tübingen, Germany.
³ Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan.
bharathsv@ucsd.edu, {arthur,bernhard.schoelkopf}@tuebingen.mpg.de, fukumizu@ism.ac.jp, gert@ece.ucsd.edu

Abstract

A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). The embedding function has been proven to be injective when the reproducing kernel is universal. In this case, the embedding induces a metric on the space of probability distributions defined on compact metric spaces. In the present work, we consider more broadly the problem of specifying characteristic kernels, defined as kernels for which the RKHS embedding of probability measures is injective. In particular, characteristic kernels can include non-universal kernels. We restrict ourselves to translation-invariant kernels on Euclidean space, and define the associated metric on probability measures in terms of the Fourier spectrum of the kernel and the characteristic functions of these measures. The support of the kernel spectrum is important in determining whether a kernel is characteristic: in particular, the embedding is injective if and only if the kernel spectrum has the entire domain as its support. Characteristic kernels may nonetheless have difficulty in distinguishing certain distributions on the basis of finite samples, again due to the interaction of the kernel spectrum and the characteristic functions of the measures.

1 Introduction

The concept of distance between probability measures is a fundamental one and has many applications in probability theory and statistics.* In probability theory, this notion is used to metrize the weak convergence (convergence in distribution) of probability measures defined on a metric space. Formally, let S be the set of all Borel probability measures defined on a metric space (M, ρ) equipped with its Borel σ-algebra M, and let γ be a metric on S, i.e., (S, γ) is a metric space. Then Pₙ is said to converge weakly to P if and only if γ(Pₙ, P) → 0 as n → ∞, where P, {Pₙ}_{n≥1} ∈ S. When M is separable, examples of such γ include the Lévy-Prohorov distance and the dual-bounded Lipschitz distance (Dudley metric) [Dud02, Chapter 11]. Other popular examples of γ include the Monge-Wasserstein distance, the total variation distance and the Hellinger distance, which yield a stronger notion of convergence of probability measures [Sho00, Chapter 19]. In statistics, the notion of distance between probability measures is used in a variety of applications, including homogeneity tests (the two-sample problem), independence tests, and goodness-of-fit tests.

*The author wishes to acknowledge support from the Max Planck Institute (MPI) for Biological Cybernetics, the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. Part of this work was done while the author was an intern at MPI. The authors thank the anonymous reviewers for their comments, which improved the paper.
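Two of the distances just mentioned are easy to estimate directly from samples, which is how such distances are typically deployed in the statistical applications above. The following sketch is only a minimal illustration of my own (the distributions, sample sizes and binning are arbitrary choices, not taken from the paper): it computes plug-in estimates of the Kolmogorov distance (supremum difference of empirical CDFs) and of the total variation distance restricted to a fixed binning.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=2000)                 # sample from P
    Y = rng.laplace(0.0, 1.0 / np.sqrt(2), size=2000)   # sample from Q (same mean and variance as P)

    # Kolmogorov distance: sup_t |F_P(t) - F_Q(t)|, estimated from empirical CDFs on a grid.
    grid = np.linspace(-6, 6, 2001)
    F_P = np.searchsorted(np.sort(X), grid, side="right") / X.size
    F_Q = np.searchsorted(np.sort(Y), grid, side="right") / Y.size
    kolmogorov = np.max(np.abs(F_P - F_Q))

    # Crude total variation estimate: 0.5 * sum_i |P(B_i) - Q(B_i)| over a fixed set of bins B_i.
    bins = np.linspace(-6, 6, 61)
    p_hat, _ = np.histogram(X, bins=bins)
    q_hat, _ = np.histogram(Y, bins=bins)
    tv = 0.5 * np.sum(np.abs(p_hat / X.size - q_hat / Y.size))

    print(f"Kolmogorov distance estimate:     {kolmogorov:.3f}")
    print(f"Binned total variation estimate:  {tv:.3f}")

The binned quantity only lower-bounds the true total variation distance and depends on the chosen binning; it is shown here purely to make the notion of "distance between probability measures" concrete.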
The two-sample problem involves testing the null hypothesis H₀ : P = Q versus the alternative H₁ : P ≠ Q, using random samples {Xₗ}ₗ₌₁^m and {Yₗ}ₗ₌₁^n drawn i.i.d. from distributions P and Q on a measurable space (M, M). If γ is a metric (or, more generally, a semi-metric¹) on S, then γ(P, Q) can be used as a test statistic to address the two-sample problem, because γ(P, Q) takes the unique and distinctive value of zero only when P = Q. Thus, the two-sample problem can be reduced to testing H₀ : γ(P, Q) = 0 versus H₁ : γ(P, Q) > 0. The problems of testing independence and goodness-of-fit can be posed in an analogous form.

Several recent studies on kernel methods have focused on applications in distribution comparison, the advantage being that kernels represent a linear way of dealing with higher order statistics. For instance, in homogeneity testing, differences in higher order moments are encoded as mean differences computed in the right reproducing kernel Hilbert space (RKHS) [GBR+07]; in kernel ICA [BJ02, GHS+05], general nonlinear dependencies show up as linear correlations once they are computed in a suitable RKHS. Instrumental to these studies is the notion of a Hilbert space embedding for probability measures [SGSS07], which involves representing any probability measure as a mean element in an RKHS (H, k), where k is the reproducing kernel [Aro50, SS02]. For this reason, the RKHSs used have to be "sufficiently large" to capture all nonlinearities that are relevant to the problem at hand, so that differences in embeddings correspond to differences of interest in the distributions. The question of how to choose such RKHSs is the central focus of the present paper.

Recently, Fukumizu et al. [FGSS08] introduced the concept of a characteristic kernel, this being an RKHS kernel for which the mapping Π : S → H from the space of Borel probability measures S to the associated RKHS H is injective (H is then called a characteristic RKHS). Clearly, a characteristic RKHS is sufficiently large in the sense we have described: in this case γ(P, Q) = 0 implies P = Q, where γ is the metric induced on S by Π, defined as the RKHS distance between the mappings of P and Q. Under what conditions, then, is Π injective? As discussed in [GBR+07, SGSS07], when M is compact, the RKHS is characteristic when its kernel is universal in the sense of Steinwart [Ste02, Definition 4]: the induced RKHS should be dense in the Banach space of bounded continuous functions with respect to the supremum norm (examples include the Gaussian and Laplacian kernels). Fukumizu et al. [FGSS08, Lemma 1] considered injectivity for non-compact M, and showed Π to be injective if the direct sum of H and R is dense in the Banach space of p-power (p ≥ 1) integrable functions (we denote RKHSs satisfying this criterion as F-characteristic). In addition, for M = R^d, Fukumizu et al. provide sufficient conditions on the Fourier spectrum of a translation-invariant kernel for it to be characteristic [FGSS08, Theorem 2]. Using this result, popular kernels like the Gaussian and Laplacian can be shown to be characteristic on all of R^d.

¹Given a set M, a metric ρ for M is a function ρ : M × M → R₊ such that (i) ∀x, ρ(x, x) = 0, (ii) ∀x, y, ρ(x, y) = ρ(y, x), (iii) ∀x, y, z, ρ(x, z) ≤ ρ(x, y) + ρ(y, z), and (iv) ρ(x, y) = 0 ⇒ x = y [Dud02, Chapter 2]. A semi-metric only satisfies (i), (ii) and (iv).
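To make the reduction of the two-sample problem to testing H₀ : γ(P, Q) = 0 concrete, the following sketch shows one standard way such a statistic is used in practice: estimate γ from the two samples and calibrate the rejection threshold by permuting the pooled sample. This is a generic skeleton of my own, not a procedure prescribed in the paper; the placeholder statistic (an absolute difference of sample means) would be replaced by an estimate of whichever γ is adopted, such as the MMD introduced in §3.

    import numpy as np

    def mean_diff_statistic(X, Y):
        """Placeholder distance estimate; substitute an estimator of gamma(P, Q), e.g. the MMD of Section 3."""
        return abs(X.mean() - Y.mean())

    def permutation_two_sample_test(X, Y, statistic, n_perm=500, alpha=0.05, seed=0):
        """Reject H0: P = Q when the observed statistic is large relative to permuted copies."""
        rng = np.random.default_rng(seed)
        observed = statistic(X, Y)
        pooled = np.concatenate([X, Y])
        count = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            if statistic(pooled[:X.size], pooled[X.size:]) >= observed:
                count += 1
        p_value = (count + 1) / (n_perm + 1)
        return observed, p_value, p_value <= alpha

    rng = np.random.default_rng(1)
    X = rng.normal(0.0, 1.0, 200)
    Y = rng.normal(0.3, 1.0, 200)
    print(permutation_two_sample_test(X, Y, mean_diff_statistic))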
In the present study, we provide an alternative means of determining whether kernels are characteristic, for the case of translation-invariant kernels on R^d. This addresses several limitations of the previous work: in particular, it can be difficult to verify the conditions that a universal or F-characteristic kernel must satisfy, and universality is in any case an overly restrictive condition, because universal kernels assume M to be compact. In other words, they induce a metric only on the space of probability measures that are compactly supported on M. In addition, there are compactly supported kernels which are not universal, e.g. B_{2n+1}-splines, which can be shown to be characteristic. We provide simple verifiable rules in terms of the Fourier spectrum of the kernel that characterize the injective behavior of Π, and derive a relationship between the family of kernels and the family of probability measures for which γ(P, Q) = 0 implies P = Q. In particular, we show that a translation-invariant kernel on R^d is characteristic if and only if its Fourier spectrum has the entire domain as its support.

We begin our presentation in §2 with an overview of terminology and notation. In §3, we briefly describe the approach of Hilbert space embedding of probability measures. Assuming the kernel to be translation-invariant in R^d, in §4 we deduce conditions on the kernel and on the set of probability measures for which the RKHS is characteristic. We show that the support of the kernel spectrum is crucial: H is characteristic if and only if the kernel spectrum has the entire domain as its support. We note, however, that even using such a kernel does not guarantee that one can easily distinguish distributions based on finite samples. In particular, we provide two illustrations in §5 where interactions between the kernel spectrum and the characteristic functions of the probability measures can result in an arbitrarily small γ(P, Q) = ε > 0 for non-trivial differences in distributions P ≠ Q. Proofs of the main theorems and related lemmas are provided in §6. The results presented in this paper use tools from distribution theory and Fourier analysis; the related technical results are collected in Appendix A.

2 Notation

For M ⊆ R^d and µ a Borel measure on M, L^p(M, µ) denotes the Banach space of p-power (p ≥ 1) µ-integrable functions. We will also use L^p(M) for L^p(M, µ) and dx for dµ(x) if µ is the Lebesgue measure on M. C_b(M) denotes the space of all bounded, continuous functions on M. The space of all q-continuously differentiable functions on M is denoted by C^q(M), 0 ≤ q ≤ ∞. For x ∈ C, x̄ represents the complex conjugate of x. We denote by i the complex number √(−1). The set of all compactly supported functions in C^∞(R^d) is denoted by D_d, and the space of rapidly decreasing functions on R^d is denoted by S_d. For an open set U ⊆ R^d, D_d(U) denotes the subspace of D_d consisting of the functions with support contained in U. The space of linear continuous functionals on D_d (resp. S_d) is denoted by D′_d (resp. S′_d), and an element of such a space is called a distribution (resp. tempered distribution). m_d denotes the normalized Lebesgue measure defined by dm_d(x) = (2π)^{−d/2} dx. f̂ and f^∨ represent the Fourier transform and the inverse Fourier transform of f, respectively. For a measurable function f and a signed measure P, Pf := ∫ f dP = ∫_M f(x) dP(x). δ_x represents the Dirac measure at x. The symbol δ is overloaded to represent the Dirac measure, the Dirac delta function, and the Kronecker delta; the intended meaning should be clear from the context.
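The (2π)^{−d/2} normalization of m_d above makes the Fourier transform and its inverse symmetric. As a quick sanity check of this convention (an illustration of my own, not part of the paper), the following sketch verifies numerically in d = 1 that the standard Gaussian e^{−x²/2} is a fixed point of the transform, a fact used implicitly when kernel spectra are written down later.

    import numpy as np

    x = np.linspace(-12.0, 12.0, 24001)
    dx = x[1] - x[0]
    f = np.exp(-x**2 / 2.0)

    def fourier_md(fvals, w):
        """Fourier transform under the m_d convention: (2*pi)^(-1/2) * integral exp(-i*w*x) f(x) dx."""
        return np.sum(np.exp(-1j * w * x) * fvals) * dx / np.sqrt(2.0 * np.pi)

    for w in (0.0, 0.5, 1.0, 2.0):
        print(f"w = {w:3.1f}: numerical {fourier_md(f, w).real:.6f}, exact {np.exp(-w**2 / 2.0):.6f}")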
3 Maximum Mean Discrepancy

We briefly review the theory of RKHS embeddings of probability measures proposed by Smola et al. [SGSS07]. We lead up to these embeddings by first introducing the maximum mean discrepancy (MMD), which is based on the following result [Dud02, Lemma 9.3.2], related to the weak convergence of probability measures on metric spaces.

Lemma 1 ([Dud02]) Let (M, ρ) be a metric space with Borel probability measures P and Q defined on M. Then P = Q if and only if Pf = Qf for all f ∈ C_b(M).

Originally, Gretton et al. [GBR+07] defined the maximum mean discrepancy as follows.

Definition 2 (Maximum Mean Discrepancy) Let F = {f | f : M → R} and let P, Q be Borel probability measures defined on (M, ρ). Then the maximum mean discrepancy is defined as

γ_F(P, Q) = sup_{f∈F} |Pf − Qf|.   (1)

With this definition, one can derive the various metrics (mentioned in §1) that are used to define the weak convergence of probability measures on metric spaces. To start with, it is easy to verify that, independent of F, γ_F in Eq. (1) is a pseudometric² on S. Therefore, the choice of F determines whether or not γ_F(P, Q) = 0 implies P = Q; in other words, F determines whether γ_F is a metric on S. By Lemma 1, γ_F is a metric on S when F = C_b(M). When F is the set of bounded, ρ-uniformly continuous functions on M, by the Portmanteau theorem [Sho00, Chapter 19, Theorem 1.1], γ_F is not only a metric on S but also metrizes the weak topology on S. γ_F is the Dudley metric [Sho00, Chapter 19, Definition 2.2] when F = {f : ‖f‖_BL ≤ 1}, where ‖f‖_BL = ‖f‖_∞ + ‖f‖_L, with ‖f‖_∞ := sup{|f(x)| : x ∈ M} and ‖f‖_L := sup{|f(x) − f(y)|/ρ(x, y) : x ≠ y in M}; ‖f‖_L is called the Lipschitz seminorm of a real-valued function f on M. By the Kantorovich-Rubinstein theorem [Dud02, Theorem 11.8.2], when (M, ρ) is separable, γ_F equals the Monge-Wasserstein distance for F = {f : ‖f‖_L ≤ 1}. γ_F is the total variation metric when F = {f : ‖f‖_∞ ≤ 1}, while it is the Kolmogorov distance when F = {1_{(−∞,t]} : t ∈ R^d}. If F = {e^{i⟨ω,·⟩} : ω ∈ R^d}, then γ_F(P, Q) reduces to finding the maximal difference between the characteristic functions of P and Q; by the uniqueness theorem for characteristic functions [Dud02, Theorem 9.5.1], γ_F(P, Q) = 0 ⇔ φ_P = φ_Q ⇔ P = Q, where φ_P and φ_Q represent the characteristic functions of P and Q, respectively.³ Therefore, the function class F = {e^{i⟨ω,·⟩} : ω ∈ R^d} induces a metric on S.

²A pseudometric only satisfies (i)-(iii) of the properties of a metric (see footnote 1). Unlike a metric space (M, ρ), points in a pseudometric space need not be distinguishable: one may have γ(x, y) = 0 for x ≠ y [Dud02, Chapter 2].
³The characteristic function of a probability measure P on R^d is defined as φ_P(ω) := ∫_{R^d} e^{iω^T x} dP(x), ω ∈ R^d.
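The last function class above, F = {e^{i⟨ω,·⟩} : ω ∈ R^d}, suggests a particularly simple estimate of γ_F from samples: replace the characteristic functions by empirical characteristic functions and take the supremum over a finite grid of frequencies. The sketch below is an illustration of my own (the grid, sample sizes and distributions are arbitrary, and a finite grid can only lower-bound the supremum); it does this in one dimension.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, 1000)          # sample from P
    Y = rng.standard_t(df=3, size=1000)     # sample from Q

    def ecf(sample, omegas):
        """Empirical characteristic function evaluated at each frequency in omegas."""
        return np.exp(1j * np.outer(omegas, sample)).mean(axis=1)

    omegas = np.linspace(-10.0, 10.0, 2001)
    gamma_hat = np.max(np.abs(ecf(X, omegas) - ecf(Y, omegas)))
    print(f"sup over the grid of |phi_P - phi_Q| (empirical): {gamma_hat:.3f}")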
Gretton et al. [GBR+07, Theorem 3] showed γ_F to be a metric on S when F is chosen to be the unit ball in a universal RKHS H. This choice of F yields an injective map Π : S → H, as proposed by Smola et al. [SGSS07]. A similar injective map can also be obtained by choosing F to be the unit ball in an RKHS induced by kernels satisfying the criteria in [FGSS08, Lemma 1, Theorem 2] (which we denote F-characteristic kernels). We henceforth assume F to be the unit ball in an RKHS (H, k) (not necessarily universal or F-characteristic) defined on (M, M) with k : M × M → R, i.e., F = {f ∈ H : ‖f‖_H ≤ 1}. The following result provides a different representation for γ_F defined in Eq. (1) by exploiting the reproducing property of H, and will be used later in deriving our main results.

Theorem 3 Let F be the unit ball in an RKHS (H, k) defined on a measurable space (M, M) with k measurable and bounded. Then

γ_F(P, Q) = ‖Pk − Qk‖_H,   (2)

where ‖·‖_H represents the RKHS norm.

Proof: Let T_P : H → R be the linear functional defined by T_P[f] := ∫_M f(x) dP(x), with ‖T_P‖ := sup_{f∈H} |T_P[f]| / ‖f‖_H. Consider

|T_P[f]| = |∫_M f(x) dP(x)| ≤ ∫_M |f(x)| dP(x) = ∫_M |⟨f, k(·, x)⟩_H| dP(x) ≤ C ‖f‖_H,

where we have exploited the reproducing property and the boundedness of the kernel to show that T_P is a bounded linear functional on H. Here, C > 0 is the bound on k, i.e., |k(x, y)| ≤ C < ∞, ∀x, y ∈ M. Therefore, by the Riesz representation theorem [RS72, Theorem II.4], there exists a unique λ_P ∈ H such that T_P[f] = ⟨f, λ_P⟩_H for all f ∈ H. Let f = k(·, u) for some u ∈ M. Then T_P[k(·, u)] = ⟨k(·, u), λ_P⟩_H = λ_P(u), which implies λ_P = T_P[k] = Pk = ∫_M k(·, x) dP(x). Therefore, with |Pf − Qf| = |⟨f, λ_P − λ_Q⟩_H|, we have

γ_F(P, Q) = sup_{‖f‖_H ≤ 1} |Pf − Qf| = ‖λ_P − λ_Q‖_H = ‖Pk − Qk‖_H.  ∎

The representation of γ_F in Eq. (2) yields the embedding

Π[P] = ∫_M k(·, x) dP(x),

as proposed in [SGSS07, FGSS08], which is injective when k is characteristic. While the representation of γ_F in Eq. (2) holds irrespective of the characteristic property of k, γ_F need not be a metric on S, as Π is not guaranteed to be injective. The obvious question to ask is: "For what class of kernels is Π injective?" To understand this in detail, we are interested in the following questions, which we address in this paper.

Q1. Let D ⊂ S be a set of Borel probability measures defined on (M, M), and let K be a family of positive definite kernels defined on M. What are the conditions on D and K for which Π : D → H_k, P ↦ ∫_M k(·, x) dP(x), is injective, i.e., γ_F(P, Q) = 0 ⇒ P = Q for P, Q ∈ D? Here, H_k represents the RKHS induced by k ∈ K.

Q2. What are the conditions on K so that Π is injective on S?

Note that Q1 is a restriction of Q2 to D. The idea is that kernels which do not make γ_F a metric on S may still make it a metric on some restricted class of probability measures, D ⊂ S. Our next step, therefore, is to characterize the relationship between classes of kernels and classes of probability measures, which is addressed in the following section.
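Eq. (2) makes γ_F directly computable from samples: replacing P and Q by empirical measures turns ‖Pk − Qk‖_H into a double sum of kernel evaluations. The sketch below is my own illustration (the Gaussian kernel, its bandwidth, the distributions and the sample sizes are arbitrary choices); it computes this plug-in estimate together with the unbiased U-statistic variant of [GBR+07, Lemma 7] that is used in §5. The two distributions share mean and variance, so the difference lies only in higher-order moments.

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        """k(x, y) = exp(-(x - y)^2 / (2 sigma^2)) for 1-d samples A, B."""
        d2 = (A[:, None] - B[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma**2))

    def mmd2_plugin(X, Y, sigma=1.0):
        """Plug-in (biased) estimate of gamma_F(P, Q)^2 = ||Pk - Qk||_H^2, following Eq. (2)."""
        Kxx = gaussian_kernel(X, X, sigma)
        Kyy = gaussian_kernel(Y, Y, sigma)
        Kxy = gaussian_kernel(X, Y, sigma)
        return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()

    def mmd2_unbiased(X, Y, sigma=1.0):
        """Unbiased U-statistic estimate gamma_{F,u}^2(m, m); requires len(X) == len(Y)."""
        m = X.size
        Kxx = gaussian_kernel(X, X, sigma)
        Kyy = gaussian_kernel(Y, Y, sigma)
        Kxy = gaussian_kernel(X, Y, sigma)
        h = Kxx + Kyy - Kxy - Kxy.T      # h(Z_l, Z_j), used on the off-diagonal terms only
        np.fill_diagonal(h, 0.0)
        return h.sum() / (m * (m - 1))

    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, 1000)                    # P = N(0, 1)
    Y = rng.laplace(0.0, 1.0 / np.sqrt(2.0), 1000)    # Q: Laplace with the same mean and variance
    print("plug-in  MMD^2:", mmd2_plugin(X, Y))
    print("unbiased MMD^2:", mmd2_unbiased(X, Y))

With samples of this size both estimates come out small but positive, consistent with the earlier remark that RKHS mean differences pick up discrepancies beyond the first two moments; a rejection threshold for an actual test can be obtained by permutation, as in the skeleton sketched in §1.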
4 Characteristic Kernels & Main Theorems

In this section, we present our main results related to the behavior of MMD. We start with the following definition of characteristic kernels, which was recently introduced by Fukumizu et al. [FGSS08] in the context of measuring conditional (in)dependence using positive definite kernels.

Definition 4 (Characteristic kernel) A positive definite kernel k is characteristic to a set D of probability measures defined on (M, M) if γ_F(P, Q) = 0 ⇔ P = Q for P, Q ∈ D.

Remark 5 Equivalently, k is said to be characteristic to D if the map Π : D → H, P ↦ ∫_M k(·, x) dP(x), is injective. When M = R^d, the notion of a characteristic kernel is a generalization of the characteristic function, φ_P(ω) = ∫_{R^d} e^{iω^T x} dP(x), ω ∈ R^d, which is the expectation of the complex-valued positive definite kernel k(ω, x) = e^{iω^T x}. Thus, the definition of a characteristic kernel generalizes the well-known property of the characteristic function that φ_P uniquely determines a Borel probability measure P on R^d. See [FGSS08] for more details.

It is obvious from Definition 4 that universal kernels defined on a compact M and F-characteristic kernels on M are characteristic to the family of all probability measures defined on (M, M). The characteristic property of the kernel relates the family of positive definite kernels and the family of probability measures. We would like to characterize the positive definite kernels that are characteristic to S. Among the kernels that are not characteristic to S, we would like to determine those kernels that are characteristic to some appropriately chosen subset D of S. Intuitively, the smaller the set D, the larger the family of kernels that are characteristic to D. To this end, we make the following assumption.

Assumption 1 k(x, y) = ψ(x − y), where ψ is a bounded, continuous, real-valued positive definite function⁴ on M = R^d.

The above assumption means that k is translation-invariant on R^d. The whole family of such kernels can be generated as Fourier transforms of finite nonnegative Borel measures, by the following result due to Bochner, which we quote from [Wen05, Theorem 6.6].

Theorem 6 (Bochner) A continuous function ψ : R^d → C is positive definite if and only if it is the Fourier transform of a finite nonnegative Borel measure Λ on R^d, i.e.,

ψ(x) = ∫_{R^d} e^{−i x^T ω} dΛ(ω),  x ∈ R^d.   (3)

Since the translation-invariant kernels in R^d are characterized by Bochner's theorem, it is theoretically interesting to ask which subsets in the Fourier image give characteristic kernels. Before we describe the kernels k that are characteristic to S, in the following example we show that there exist kernels that are not characteristic to S. Here, S represents the family of all Borel probability measures defined on (R^d, B(R^d)), where B(R^d) is the Borel σ-algebra generated by the open sets in R^d (see Assumption 1).

Example 1 (Trivial kernel) Let k(x, y) = ψ(x − y) = C, ∀x, y ∈ R^d, with C > 0. It can be shown that ψ is the Fourier transform of Λ = C δ₀, with support {0}. Consider Pk = ∫_{R^d} k(·, x) dP(x) = C ∫_{R^d} dP(x) = C. Since Pk = C irrespective of P ∈ S, the map Π is not injective. In addition, γ_F(P, Q) = 0 for any P, Q ∈ S. Therefore, the trivial kernel k is not characteristic to S.

4.1 Main theorems

The following theorem characterizes all translation-invariant kernels in R^d that are characteristic to S.

Theorem 7 Let F be the unit ball in an RKHS (H, k) defined on R^d. Suppose k satisfies Assumption 1. Then k is a characteristic kernel to the family S of all probability measures defined on R^d if and only if supp(Λ) = R^d.

We provide a sketch of the proof of the above theorem, which is proved in §6.2.1 using a number of intermediate lemmas. The first step is to derive an alternate representation for γ_F in Eq. (2) under Assumption 1. Lemma 13 provides the Fourier representation of γ_F in terms of the kernel spectrum and the characteristic functions of P and Q. The advantage of this representation over the one in Eq. (2) is that it is easy to obtain necessary and sufficient conditions for the existence of P ≠ Q, P, Q ∈ S, such that γ_F(P, Q) = 0; these are captured in Lemma 15. We then show that if supp(Λ) = R^d, the conditions mentioned in Lemma 15 are violated, meaning there is no P ≠ Q such that γ_F(P, Q) = 0, thereby proving the sufficiency in Theorem 7. Proving the converse is equivalent to proving that k is not characteristic to S when supp(Λ) ⊊ R^d. So, when supp(Λ) ⊊ R^d, the result is proved using Lemma 19, which shows the existence of P ≠ Q such that γ_F(P, Q) = 0.

Theorem 7 shows that the embedding function Π associated with a positive definite translation-invariant kernel in R^d is injective if and only if the kernel spectrum has the entire domain as its support. This result therefore provides a simple verifiable rule for Π to be injective, unlike the results in [SGSS07, FGSS08], where the universality and F-characteristic properties of a given kernel are not easy to verify. In addition, the universality and F-characteristic properties are only sufficient conditions for a kernel to induce an injective map Π, whereas Theorem 7 provides supp(Λ) = R^d as the necessary and sufficient condition. Therefore, we have answered question Q2 posed in §3. Examples of kernels that are characteristic to S include the Gaussian, Laplacian and B_{2n+1}-splines. In fact, the whole family of compactly supported translation-invariant kernels on R^d is characteristic to S, as shown by the following corollary of Theorem 7.

Corollary 8 Let F be the unit ball in an RKHS (H, k) defined on R^d. Suppose k satisfies Assumption 1 and supp(ψ) is compact. Then k is a characteristic kernel to S.

Proof: Since supp(ψ) is compact in R^d, by Lemma 25, which is a corollary of the Paley-Wiener theorem (see also [GW99, Theorem 31.5.2, Proposition 31.5.4]), we deduce that supp(Λ) = R^d. Therefore, the result follows from Theorem 7.  ∎

The above result is interesting in practice because of the computational advantage of dealing with compactly supported kernels.

⁴Let M be a nonempty set. A function ψ : M → R is called positive definite if and only if Σ_{j,l=1}^n c_j c_l ψ(x_j − x_l) ≥ 0 for all x_j ∈ M, c_j ∈ R, n ∈ N.
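Theorem 7 turns the question "is k characteristic?" into inspecting the support of Λ in Eq. (3). For several common one-dimensional kernels the spectrum is known in closed form, and the sketch below tabulates it on a grid. This is a rough illustration of my own: normalization constants are omitted, since only the support matters here, and a finite grid can only suggest, not prove, what the support is. The sinc kernel included for contrast is the one discussed after Theorem 9 below.

    import numpy as np

    omega = np.linspace(-30.0, 30.0, 6001)

    # Spectral densities lambda(omega), up to positive constants:
    spectra = {
        # Gaussian psi(x) = exp(-x^2): Gaussian spectrum, strictly positive on all of R.
        "gaussian":   np.exp(-omega**2 / 4.0),
        # Laplacian psi(x) = exp(-|x|): Cauchy/Lorentzian spectrum, strictly positive on all of R.
        "laplacian":  1.0 / (1.0 + omega**2),
        # B1-spline (triangle) psi(x) = max(0, 1 - |x|): sinc^2 spectrum with zeros at 2*pi*Z \ {0},
        # but its support is still all of R.
        "b1_spline":  np.sinc(omega / (2.0 * np.pi)) ** 2,   # np.sinc(t) = sin(pi t) / (pi t)
        # Sinc kernel psi(x) = sin(5 x) / x: spectrum proportional to the indicator of [-5, 5],
        # hence compactly supported and, by Theorem 7, not characteristic to S.
        "sinc_kernel": (np.abs(omega) <= 5.0).astype(float),
    }

    for name, lam in spectra.items():
        print(f"{name:11s}: min on grid = {lam.min():.2e}, value at grid edge = {lam[-1]:.2e}")

Only the last spectrum fails the supp(Λ) = R criterion of Theorem 7; the B1-spline's isolated zeros do not shrink its support, although §5 shows they do have finite-sample consequences.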
By Theorem 7, it is clear that kernels with supp(Λ) ⊊ R^d are not characteristic to S. However, they can be characteristic to some D ⊊ S (see Q1 in §3). The following result addresses this setting.

Theorem 9 Let F be the unit ball in an RKHS (H, k) defined on R^d. Let D be the set of all compactly supported probability measures on R^d with characteristic functions in L¹(R^d) ∪ L²(R^d). Suppose k satisfies Assumption 1 and supp(Λ) ⊊ R^d has a non-empty interior. Then k is a characteristic kernel to D.

The proof is given in §6.2.2, and the strategy is similar to that of Theorem 7: the Fourier representation of γ_F (see Lemma 13) is used to derive necessary and sufficient conditions for the existence of P ≠ Q, P, Q ∈ D, such that γ_F(P, Q) = 0 (see Lemma 17). We then show that if supp(Λ) ⊊ R^d has a non-empty interior, the conditions mentioned in Lemma 17 are violated, which means there is no P ≠ Q, P, Q ∈ D, such that γ_F(P, Q) = 0, thereby proving the result.

Although, by Theorem 7, kernels with supp(Λ) ⊊ R^d are not characteristic to S, Theorem 9 shows that there exists D ⊂ S to which a subset of these kernels is characteristic. This type of result is not available for the methods studied in [SGSS07, FGSS08]. An example of a kernel that satisfies the conditions in Theorem 9 is the Sinc kernel, ψ(x) = sin(σx)/x, which has supp(Λ) = [−σ, σ]. The condition that supp(Λ) ⊊ R^d has a non-empty interior is important for Theorem 9 to hold. If supp(Λ) has an empty interior (examples include periodic kernels), then one can construct P ≠ Q, P, Q ∈ D, such that γ_F(P, Q) = 0. See §6.2.2 for the related discussion and an example.

We have shown that the support of the Fourier spectrum of a positive definite translation-invariant kernel in R^d characterizes the injective or non-injective behavior of Π. In particular, supp(Λ) = R^d is the necessary and sufficient condition for the map Π to be injective on S, which answers question Q2 posed in §3. We also showed that kernels with supp(Λ) ⊊ R^d can be characteristic to some D ⊊ S even though they are not characteristic to S, which in turn answers question Q1 in §3. A summary of these results is given in Table 1.

Table 1: k satisfies Assumption 1 and ψ is the Fourier transform of a finite nonnegative Borel measure Λ on R^d. S is the set of all probability measures defined on (R^d, B(R^d)). P represents a probability measure on R^d and φ_P is its characteristic function. If k is characteristic to S, then (S, γ_F) is a metric space, where F is the unit ball in an RKHS (H, k).

  ψ(x), Λ                                   | D                                                        | Characteristic | γ_F          | Reference
  supp(Λ) = R^d                             | S                                                        | Yes            | Metric       | Theorem 7
  supp(ψ) is compact                        | S                                                        | Yes            | Metric       | Corollary 8
  supp(Λ) ⊊ R^d has a non-empty interior    | {P : supp(P) is compact, φ_P ∈ L¹(R^d) ∪ L²(R^d)}        | Yes            | Metric       | Theorem 9
  supp(Λ) ⊊ R^d                             | S                                                        | No             | Pseudometric | Theorem 7
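The classification in Table 1 is ultimately driven by the Fourier representation of γ_F developed in §6 (Lemma 13). For intuition, the following sketch — my own illustration, not a computation from the paper — uses the equivalent spectral form γ²_F(P, Q) = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω), which follows from Eq. (2), Eq. (3) and Fubini's theorem (stated here without proof), and checks it numerically against a plug-in sample estimate for a Gaussian kernel and two Gaussian distributions. All constants, bandwidths and sample sizes are arbitrary choices.

    import numpy as np

    s = 1.0    # kernel bandwidth: k(x, y) = exp(-(x - y)^2 / (2 s^2)), i.e. psi(x) = exp(-x^2 / (2 s^2))
    # Spectral density such that psi(x) = integral exp(-i x w) lam(w) dw:
    w = np.linspace(-20.0, 20.0, 40001)
    dw = w[1] - w[0]
    lam = s / np.sqrt(2.0 * np.pi) * np.exp(-(s * w) ** 2 / 2.0)

    # Characteristic functions of P = N(0, 1) and Q = N(1, 1).
    phi_P = np.exp(-w**2 / 2.0)
    phi_Q = np.exp(1j * w - w**2 / 2.0)
    gamma2_spectral = np.sum(np.abs(phi_P - phi_Q) ** 2 * lam) * dw

    # Plug-in estimate of Eq. (2) from samples of P and Q.
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, 2000)
    Y = rng.normal(1.0, 1.0, 2000)
    def K(A, B):
        return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2.0 * s**2))
    gamma2_plugin = K(X, X).mean() - 2.0 * K(X, Y).mean() + K(Y, Y).mean()

    print(f"spectral-domain gamma^2: {gamma2_spectral:.4f}")
    print(f"plug-in sample  gamma^2: {gamma2_plugin:.4f}")

The two numbers should roughly agree (around 0.18 for these choices), illustrating how the kernel spectrum weights the discrepancy between characteristic functions.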
4.2 A result on periodic kernels and discrete probability measures

Proposition 10 Let F be the unit ball in an RKHS (H, k) defined on R^d, where k satisfies Assumption 1. Let D = {P : P = Σ_{n=1}^∞ β_n δ_{x_n}, Σ_{n=1}^∞ β_n = 1, β_n ≥ 0, ∀n} be the set of probability measures defined on M = {x₁, x₂, ...} ⊂ R^d. Then there exist P ≠ Q, P, Q ∈ D, such that γ_F(P, Q) = 0 if the following conditions hold: (i) ψ is t₀-periodic⁵ in R^d, i.e., ψ(x) = ψ(x + a ∘ t₀), a ∈ Z^d, t₀ ∈ R^d₊; (ii) x_s − x_t = l_st ∘ t₀, l_st ∈ Z^d, ∀s, t, where ∘ represents the Hadamard (elementwise) multiplication.

Proof: Let ψ be t₀-periodic in R^d and x_s − x_t = l_st ∘ t₀, l_st ∈ Z^d, ∀s, t. Consider P, Q ∈ D given by P = Σ_{n=1}^∞ p̃_n δ_{x_n} and Q = Σ_{n=1}^∞ q̃_n δ_{x_n}, such that p̃_n, q̃_n ≥ 0, ∀n, Σ_{n=1}^∞ p̃_n = 1 and Σ_{n=1}^∞ q̃_n = 1. Then

γ_F(P, Q) = ‖Pk − Qk‖_H = ‖∫_{R^d} ψ(· − x) d(P − Q)(x)‖_H = ‖Σ_{n=1}^∞ (p̃_n − q̃_n) ψ(· − x_n)‖_H = ‖Σ_{n=1}^∞ (p̃_n − q̃_n) ψ(· − x₁ − l_{n1} ∘ t₀)‖_H = ‖ψ(· − x₁) Σ_{n=1}^∞ (p̃_n − q̃_n)‖_H = 0.

This holds for any P, Q ∈ D.  ∎

The converse of Proposition 10, if true, would make the result more interesting: any non-periodic translation-invariant kernel on R^d would then be characteristic to the set of discrete probability measures on R^d. In order to prove the converse, we would need to show that (i) and (ii) in Proposition 10 hold whenever γ_F(P, Q) = 0 for some P ≠ Q, P, Q ∈ D. However, this is not true, as the trivial kernel yields γ_F(P, Q) = 0 for any P, Q ∈ S, and not just P, Q ∈ D. Let us consider γ_F(P, Q) = 0 for P, Q ∈ D. This is equivalent to ‖Σ_{n=1}^∞ (p̃_n − q̃_n) ψ(· − x_n)‖_H = 0. Squaring both sides and using the reproducing property of k, we get Σ_{s,t=1}^∞ r̃_t r̃_s ψ(x_s − x_t) = 0, where {r̃_n = p̃_n − q̃_n}_{n=1}^∞ satisfy Σ_{s=1}^∞ r̃_s = 0 and {r̃_s}_{s=1}^∞ ⊂ [−1, 1]. So, to prove the converse, we need to characterize all ψ, {r̃_n}_{n=1}^∞ and {x_n}_{n=1}^∞ that satisfy R := {Σ_{s,t=1}^∞ r̃_t r̃_s ψ(x_s − x_t) = 0 : Σ_{s=1}^∞ r̃_s = 0, {r̃_s}_{s=1}^∞ ⊂ [−1, 1]}, which is not easy. However, choosing some ψ, {r̃_n}_{n=1}^∞ and {x_n}_{n=1}^∞ is easy, as shown in Proposition 10. Suppose there exists a class K of positive definite translation-invariant kernels in R^d with supp(Λ) ⊊ R^d and a class E ⊂ D of probability measures that jointly violate R; then any k ∈ K is characteristic to E. A numerical illustration of Proposition 10 is sketched below.

⁵A t₀-periodic ψ in R is the Fourier transform of Λ = Σ_{n=−∞}^∞ α_n δ_{2πn/t₀}, where δ_{2πn/t₀} is the Dirac measure at 2πn/t₀, n ∈ Z, with α_n ≥ 0 and Σ_{n=−∞}^∞ α_n < ∞. Thus, supp(Λ) = {2πn/t₀ : α_n > 0, n ∈ Z} ⊊ R. The {α_n} are the Fourier series coefficients of ψ.
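The failure mode of Proposition 10 is easy to reproduce numerically. The sketch below is my own check, with ψ(x) = cos x as the 2π-periodic positive definite function and support points spaced by the period (these specific choices are not from the paper). It evaluates γ²_F(P, Q) = Σ_{s,t} r̃_s r̃_t ψ(x_s − x_t), the squared form used in the discussion above, for the periodic kernel and, for contrast, for a Gaussian kernel on the same discrete measures.

    import numpy as np

    # Support points spaced by the period t0 = 2*pi, and two different discrete distributions on them.
    x = 2.0 * np.pi * np.arange(5)
    p = np.array([0.4, 0.1, 0.2, 0.2, 0.1])
    q = np.array([0.1, 0.3, 0.3, 0.1, 0.2])
    r = p - q                                # differences p~_n - q~_n; they sum to zero

    def mmd2_discrete(psi, x, r):
        """gamma_F^2 = sum_{s,t} r_s r_t psi(x_s - x_t) for discrete measures supported on the points x."""
        D = x[:, None] - x[None, :]
        return r @ psi(D) @ r

    print("periodic kernel psi = cos :", mmd2_discrete(np.cos, x, r))                   # ~ 0 (Proposition 10)
    print("Gaussian kernel           :", mmd2_discrete(lambda d: np.exp(-d**2), x, r))  # strictly positive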
[Figure 1: Behavior of the empirical estimate γ²_{F,u}(m, m) of γ²_F(P, Q) as a function of θ for (a) the B1-spline kernel and (b) the Gaussian kernel. P is constructed from Q as defined in Eq. (4). "Uniform" corresponds to Q = U[−1, 1] and "Gaussian" corresponds to Q = N(0, 2). m = 1000 samples are generated from P and Q to estimate γ²_F(P, Q) through γ²_{F,u}(m, m). See Example 2 for details.]

5 Dissimilar Distributions with Small Mean Discrepancy

So far, we have studied the behavior of γ_F and have shown that it depends on the support of the spectrum of the kernel. As mentioned in §1, applications like homogeneity testing exploit the metric property of γ_F to distinguish between probability distributions. Since the metric nature of γ_F is guaranteed only for kernels with supp(Λ) = R^d, tests based on other kernels can fail to distinguish between different probability distributions. However, in the following, we show that characteristic kernels, while guaranteeing γ_F to be a metric on S, may nonetheless have difficulty in distinguishing certain distributions on the basis of finite samples. Before proving the result, we motivate it through the following example.

Example 2 Let P be defined by the density

p(x) = q(x) + A q(x) sin(θx),   (4)

where q is a symmetric probability density function, A ∈ R and θ ∈ R∖{0}. Consider a B1-spline kernel on R given by k(x, y) = ψ(x − y), where

ψ(x) = 1 − |x| for |x| ≤ 1, and ψ(x) = 0 otherwise,   (5)

with Fourier transform λ(ω) ∝ sin²(ω/2)/ω² (see footnote 10 for the definition of λ). Since supp(Λ) = R, k is characteristic to S and therefore γ_F(P, Q) > 0 (see Theorem 7). However, it is of interest to study the behavior of γ_F(P, Q) as a function of θ. We do this through an unbiased, consistent estimator⁶ of γ²_F(P, Q), as proposed by Gretton et al. [GBR+07, Lemma 7]. Figure 1(a) shows the behavior of the empirical estimate of γ²_F(P, Q) as a function of θ for q = U[−1, 1] and q = N(0, 2), using the B1-spline kernel in Eq. (5). Since the Gaussian kernel, k(x, y) = e^{−(x−y)²}, is also a characteristic kernel, its effect on the behavior of γ²_{F,u}(m, m) is shown in Figure 1(b), in comparison to that of the B1-spline kernel. From Figure 1, we observe two circumstances under which the mean discrepancy may be small. First, γ²_{F,u}(m, m) decays with increasing |θ|, and can be made as small as desired by choosing a sufficiently large |θ|. Second, in Figure 1(a), γ²_{F,u}(m, m) has troughs at θ = θ₀, where θ₀ ∈ {ω : λ(ω) = 0}. Since γ²_{F,u}(m, m) is a consistent estimate of γ²_F(P, Q), one would expect similar behavior from γ²_F(P, Q). This means that although the B1-spline kernel is characteristic to S, in practice it becomes harder to distinguish between P and Q with finite samples when P is constructed as in Eq. (4) with θ = θ₀. In fact, one can observe from a straightforward spectral argument that the troughs in γ²_F(P, Q) can be made arbitrarily deep by widening q, when q is Gaussian.

For characteristic kernels, although γ_F(P, Q) > 0 when P ≠ Q, Example 2 demonstrates that one can construct distributions such that γ²_{F,u}(m, m) is indistinguishable from zero with high probability, for a given sample size m. Below, in Theorem 12, we investigate the decay mode of MMD for large |θ| (see Example 2) by explicitly constructing P ≠ Q such that |Pφ_l − Qφ_l| is large for some large l, but γ_F(P, Q) is arbitrarily small, making it hard to detect a non-zero value of the population MMD on the basis of a finite sample. Here, the φ_l ∈ L²(M) are the bounded orthonormal eigenfunctions of a positive definite integral operator⁷ associated with k.

Consider the formulation of MMD in Eq. (1). The construction of P for a given Q such that γ_F(P, Q) is small, though not zero, can be intuitively seen by re-writing Eq. (1) as

γ_F(P, Q) = sup_{f∈H} |Pf − Qf| / ‖f‖_H.   (6)

⁷See [SS02, Theorem 2.10] for the definition of a positive definite integral operator and its corresponding eigenfunctions.
⁶Starting from the expression for γ_F in Eq. (2), we get γ²_F(P, Q) = E_{X,X′∼P} k(X, X′) − 2 E_{X∼P, Y∼Q} k(X, Y) + E_{Y,Y′∼Q} k(Y, Y′), where X, X′ are independent random variables with distribution P and Y, Y′ are independent random variables with distribution Q.
An unbiased empirical estimate 2 2 2 of F , denoted as F ,u (m, m) is given by F ,u (m, m) = m 1 h(Zl , Zj ), which is a one-sample U -statistic with l=j m(m-1) h(Zl , Zj ) := k(Xl , Xj ) + k(Yl , Yj ) - k(Xl , Yj ) - k(Xj , Yl ), where Z1 , . . . , Zm are m i.i.d. random variables with Zj := (Xj , Yj ) (see [GBR+ 07, Lemma 7]). 2 F (P, Q) 6 When P = Q, |P f - Qf | can be large for some f H. However, F (P, Q) can be made small by selecting P such that the maximization of |P ff-Qf | over H requires an f with H large f H. More specifically, higher order eigenfunctions of the kernel (l for large l) have large RKHS norms, and so if they are prominent in P, Q (i.e., highly non-smooth distributions), one can expect F (P, Q) to be small even when there exists an l for which |P l - Ql | is large. To this end, we need the following lemma, which we quote from [GSB+ 04, Lemma 6]. Lemma 11 ([GSB+ 04]) Let F be a unit ball in an RKHS (H, k ) defined on compact M . Let l L2 (M ) be orthonormal eigenfunctions (assumed to be absolutely bounded), and l be the corresponding eigenvalues (arranged in a decreasing order for increasing l) of a positive definite integral operator associated with k . Assume -1 increases superlinl ~ early with l. Then for f F where f (x) := j =1 fj j (x), ~ we have {|fj |} 1 1 and for every > 0, l0 N such j= ~ that |fl | < if l > l0 . Theorem 12 (P = Q can give small MMD) Assume the conditions in Lemma 11 hold. Then there exists a probability distribution P = Q defined on M for which |P l - Ql | > - for some non-trivial and arbitrarily small > 0, yet for which F (P, Q) < for an arbitrarily small > 0. Proof: Let us construct p(x) = q (x) + l e(x) + l (x) where e(x) = 1M (x). For P to be a probability distribution, the following conditions need to be satisfied: M [l e(x) + l (x)] dx = 0, (7) xM , t, for any arbitrary > 0. Therefore, |P t - Qt | > - for t = l and |P t - Qt | < for t = l. By appealing to Lemma 1, we therefore establish that P = Q. In the following we prove that F (P, Q) can be arbitrarily small, though non-zero. Recall that F (P, Q) = sup f H1 |P f - Qf |. Substituting for l in Eq. (9), we have j ~2 j fj ~ F (P, Q) = sup j l fj : 1 , (12) j =1 =1 where we used the definition of RKHS norm as f H := ~2 fj j =1 j and j l := j l - j l . Eq. (12) is a convex quadratic ~ ~ program in {fj } 1 . Solving the Lagrangian yields fj = j= j l j 2 2 . Therefore, F (P, Q) = j =1 j l j = j =1 jl j 2 l - 2ll l + j =1 j l j 0 as l because (i) by choosing sufficiently large l, |j l | < , j, for any arbitrary > 0, (ii) l 0 as l [SS02, Theorem 2.10]. 6 Proofs of the Main Theorems In this section, we prove the main theorems in Section 4. 6.1 Preliminary lemmas Using the Fourier characterization of given by Eq. (3), under Assumption 1, we derive the following result that provides the Fourier representation of MMD. This result requires tools from distribution theory related to the Fourier transforms of distributions.9 We refer the reader to [Rud91, Chapters 6,7] for the detailed treatment of distribution theory. Another good and basic reference on distribution theory is [Str03]. Lemma 13 (Fourier representation of MMD) Let F be a unit ball in an RKHS (H, k ) defined on Rd with k satisfying Assumption 1. Let P and Q be the characteristic functions of probability measures P and Q defined on Rd . Then F (P, Q) = [(P - Q )] H, (13) min [q (x) + l e(x) + l (x)] 0. 
(8) Expanding e(x) and f (x) in the orthonormal basis {l } 1 , ~ l= el l (x) and f (x) = ~ we get e(x) = l=1 fl l (x), l=1 ~ where el := e, l M (M ) and fl := f, l L2 (M ) . There~ L2 fore, P f - Qf = f (x)[l e(x) + l (x)] dx reduces to P f - Qf = l 8 j =1 ~ ej f j + f l , ~~ (9) where we used the fact that j , t L2M ) = j t . Rewriting (M Eq. (7) and substituting for e(x) gives [l e(x)+ l (x)] dx M = e(x)[l e(x) + l (x)] dx = l j =1 e2 + el = 0, ~j ~ which implies el ~ l = - 2 . (10) ~ j =1 ej Now, let us consider P t - Qt = l et + tl . Substituting ~ for l gives P t - Qt = tl - et el ~~ where - represents complex conjugation, represents the inverse Fourier transform and represents the finite nonnegative Borel measure on Rd as defined in Eq. (3). (P - Q ) represents a finite Borel measure defined by Eq. (26). Proof: From Theorem 3, we have F (P, Q) = P k -Qk H. R R Consider P k = d k (·, x) dP (x) = d (·-x) dP (x). By R Eq. (23), d (· - x) dP (x) represents the convolution of and P , denoted as P . By appealing to the convolution the^ ^ orem (Theorem 22), we have ( P ) = P , where P ( ) = Here, the term distribution should not be confused with probability distributions. In short, distributions refer to generalized functions which cannot be treated as functions in the Lebesgue sense. Classical examples of distributions are the Dirac-delta function and Heaviside's function, for which derivatives and Fourier transforms do not exist in the usual sense. 9 et el ~~ ~2 j =1 e j = tl - tl , {|el |} 1 ~ l= (11) where tl := . By Lemma 11, 1 ~2 j =1 ej 2 ~ j =1 ej < , and choosing large enough l gives |tl | < 8 Here is used in the Kronecker sense. e-i x dP (x), Rd (by Lemma 20). Note that ^ = P . Therefore, F (P, Q) = P - Q H = P( H P ) - (Q ) . Using the linearity of the Fourier inverse, we get the desired result. d R T (i). If we relax this assumption, then the result is trivial as P = Q F (P, Q) = 0. For the results we derive later, it is important to understand the properties of , which we present in the following proposition. Proposition 16 (Properties of ) in Lemma 15 satisfies the following properties: (a) is a conjugate symmetric, bounded and uniformly continuous function on Rd . (b) (0) = 0. (c) supp() Rd \ where := supp(). In addition, if = {a1 , a2 , . . .}, then (aj ) = 0, aj . Proof: (a) From Lemma 15, we have = P - Q . Therefore, the result in (a) follows from Lemma 20, which shows that P , Q are conjugate symmetric, bounded, and uniformly continuous functions on Rd . (b) By Lemma 20, P (0) = Q (0) = 1. Therefore, (0) = P (0) - Q (0) = 0. (c) Let W := {x Rd | (x) = 0}. It suffices to show that W Rd \. Suppose W is not contained in Rd \. Then there is a non-empty open subset U such that U W ( ). Fix further a non-empty open subset V with V U . Since V , there is Dd (V ) with () = 0. Take h Dd (U ) such that h = 1 on V , and define a continuous function = h on Rd , which is welldefined from supp(h) U and = 0 on U . By (ii) of Lemma 15, = 0, where is a finite Borel measure on Rd as defined by Eq. (26). Therefore, R (x)(x) d(x) = 0. (15) d Remark 14 (a) If is the distributional derivative10 of , then Eq. (13) can also be written as F (P, Q) = [(P - Q )] H, (14) where the term inside the RKHS norm is the Fourier inverse of a tempered distribution. (b) By Assumption 1, is real-valued and symmetric in Rd . Therefore, by (ii) in Lemma 20, and are real-valued, symmetric tempered distributions. The representation of MMD in terms of the kernel spectrum as in Eq. 
(13) will be central to deriving our main theorems. It is easy to see that characteristic kernels can be described indirectly by deriving conditions for the existence of P = Q such that F (P, Q) = 0. Using the Fourier representation of F , the following result provides necessary and sufficient conditions for the existence of P = Q such that F (P, Q) = 0. Lemma 15 Let F be a unit ball in an RKHS (H, k ) defined on Rd , and let P, Q be probability distributions on Rd such that P = Q. Suppose that k satisfies Assumption 1 and supp() Rd . Then F (P, Q) = 0 if and only if there exists Sd that satisfies the following conditions: (i) p - q = , (ii) = 0, where p and q represent the distributional derivatives of P and Q respectively, and represents a finite Borel measure defined by Eq. (26). Proof: The proof follows directly from the formulation of F in Eq. (13). ( ) Let Sd satisfy (i) and (ii). Since Sd , we have ^ = = (p - q ) = p - q = - . Therefore, by (ii), ^^ P Q The left hand side of Eq. (15) simplifies to R U h(x)(x) (x)(x) d(x) = (x) d(x) (x) d U = (x) d(x) = () = 0, resulting in a contradiction. So, supp()a Rd \. , >0 If = {a1 , a2 , . . .}, then = R j j aj j j j < . = 0 implies d (x)(x) d(x) = and j j (aj )(aj ) = 0 for any continuous function in Rd . This implies (aj ) = 0, aj . Lemma 15 provides conditions under which F (P, Q) = 0 when P = Q. It shows that the kernel k cannot distinguish between P and Q if P is related to Q by condition (i). Condition (ii) in Lemma 15 says that has to be chosen such that its support is disjoint with that of the kernel spectrum. This is what is precisely captured by (c) in Proposition 16. So, for a given Q, one can construct P such that P = Q and F (P, Q) = 0 by choosing that satisfies the properties in Proposition 16. However, P should be a positive distribution so that it corresponds to a positive measure.11 Therefore, A positive distribution is defined to be as the one that takes nonnegative values on nonnegative test functions. So, D Dd (M ) 11 we have F (P, Q) = [(P - Q )] H = [] H = 0. ( ) Let F (P, Q) = [(P - Q )] H = 0, which implies [(P - Q )] = 0. Since (P - Q ) is a finite Borel measure as defined by Eq. (26), it is therefore a tempered distribution and so (P - Q ) = [[(P - Q )] ] = 0. Let := P - Q . Clearly Sd as by Lemma 20, P , Q Sd . So, p - q = (P ) - (Q ) = (P - Q ) = . = 0 trivially satisfies (ii) in Lemma 15. However, it violates our assumption of P = Q when it is used in condition 10 If is absolutely continuous w.r.t. the Lebesgue measure, then represents the Radon-Nikodym derivative of w.r.t. the Lebesgue measure. In such a case, is the Fourier transform of R T in the usual sense; i.e., (x) = d e-ix ( ) dmd ( ). On the other hand, if is the distributional derivative of , then is a symbolic representation of the derivative of and will make sense only under the integral sign. should also be such that q + is a positive distribution. Imposing such a constraint on is not straightforward, and therefore Lemma 15 does not provide a procedure to construct P = Q given Q. However, by imposing some conditions on P and Q, we obtain the following result wherein the conditions on can be explicitly specified, yielding a procedure to construct P = Q such that F (P, Q) = 0. Lemma 17 Let F be a unit ball in an RKHS (H, k ) defined on Rd . Let D be the set of probability measures on Rd with characteristic functions either absolutely integrable or square integrable, i.e., for any P D, P L1 (Rd ) R L2 (Rd ). 
Suppose that k satisfies Assumption 1 and supp() d . Then for any Q D, P = Q, P D given by p=q+ (16) P, Q D implies P , Q (L1 (Rd ) L2 (Rd )) Cb (Rd ) and p, q L1 (Rd ) (L2 (Rd ) Cb (Rd )). Therefore, = P - Q (L1 (Rd ) L2 (Rd )) Cb (Rd ) and = p - q L1 (Rd ) (L2 (Rd ) Cb (Rd )). By Lemma 20, P and Q are conjugate symmetric and so is . Therefore Rsatisfies (i) anR satisfies (ii). satisfies (iv) as (0) = d (x) dx = d (p(x) - q (x)) dx = 0. Non-negativity of d p yields (v). F (P, Q) = 0 implies (iii), with a proof similar to that of Lemma 15. Remark 18 Conditions (iii) and (iv) in Lemma 17 are the same as those of Proposition 16. Conditions (i) and (ii) are required to satisfy our assumption P, Q D and Eq. (16). Condition (v) ensures that P is a positive measure, which was the condition difficult to impose in Lemma 15. In the above result, we restricted ourselves to probability measures P with characteristic functions P in L1 (Rd ) L2 (Rd ). This ensures that the inverse Fourier transform of P exists in the L1 or L2 sense. Without this assumption, P is not guaranteed to have a Fourier transform in the L1 or L2 sense, and therefore has to be treated as a tempered distribution for the purpose of computing its Fourier transform. This implies = P - Q has to be treated as a tempered distribution, which is the setting in Lemma 15. Since we wanted to avoid dealing with distributions where the required positivity constraint is difficult to impose, we restricted ourselves to D.13 Though this result explicitly captures the conditions on , it is a very restricted result as it only deals with continuous (a.e.) probability measures. However, we use this result in Lemma 19 to construct P = Q such that F (P, Q) = 0. Lemmas 15 and 17 are the main results that provide conditions for the existence of P = Q such that F (P, Q) = 0. This means that if there exists a satisfying these conditions, then k cannot distinguish between P and Q where P is defined as in Eq. (16). Thus, the existence (resp. nonexistence) of results in a non-injective (resp. injective) map . It is clear from Lemmas 15 and 17 that the dependence of F on the kernel appears in the form of the support of the kernel spectrum. Therefore, two scenarios exist: (a) supp() = Rd and (b) supp() Rd . The d case of supp() = R is addressed by Theorem 7 while that of supp() Rd is addressed by Theorem 9. Using Lemma 17, the following result proves the existence of P = Q such that F (P, Q) = 0 while using a kernel with supp() Rd . Lemma 19 Let F be a unit ball in an RKHS (H, k ) defined on Rd . Let D be the set of all non-compactly supported probability measures on Rd with characteristic functions in L1 (Rd ) L2 (Rd ). Suppose k satisfies Assumption 1 and supp() Rd . Then P = Q, P, Q D such that F ( P , Q) = 0. Choosing D to be the set of all probability measures with characteristic functions in L1 (Rd ) L2 (Rd ) is the best possible restriction that avoids treating as a tempered distribution. The classical Fourier transforms on Rd are defined for functions in Lp (Rd ), 1 < p 2. For p > 2, the only reasonable way to define Fourier transforms on Lp (Rd ) is through distribution theory. 13 such that F (P, Q) = 0 if and only if there exists a non-zero function : Rd C that satisfies the following conditions: (i) (L1 (Rd ) L2 (Rd )) Cb (Rd ) is conjugate symmetric, (ii) L1 (Rd ) (L2 (Rd ) Cb (Rd )), (iii) = 0, (iv) (0) = 0, (v) inf xRd {(x) + q (x)} 0. Proof: ( ) Suppose there exists a non-zero function sat isfying (i) ­ (v). 
We need to show that p = q + is in D for q D and F (P, Q) = 0. For any Q D, Q (L1 (Rd ) L2 (Rd )) Cb (Rd ). When Q L1 (Rd )Cb (Rd ), the Riemann-Lebesgue lemma (Lemma 23) implies that q = [Q ] L1 (Rd ) Cb (Rd ). When Q L2 (Rd ) Cb (Rd ), the Fourier transform in the L2 sense12 implies that q = [Q ] L1 (Rd ) L2 (Rd ). Therefore, q L1 (Rd ) (L2 (Rd ) Cb (Rd )). Define p := q + . Clearly p L1 (Rd ) (L2 (Rd ) Cb (Rd )). In addition, P = p = q + = Q + (L1 (Rd ) L2 (Rd )) Cb (Rd ). ^^^ Since is conR gate symmetriR, is real valued and so is ju c R p. Consider d p(x) dx = d q (x) dx + d (x) dx = 1 + (0) = 1. (v) implies that p is non-negative. Therefore, P represents a probability measure such that P = Q and P D. Since P, Q are probability measures, F (P, Q) is computed as F (P, Q) = [(P - Q )] H = [] H = 0. ( ) Suppose that P, Q D and p = q + gives F (P, Q) = 0. We need to show that satisfies (i) ­ (v). is a positive distribution if D() 0 for 0 Dd (M ). If µ is M a positive measure that is locally finite, then Dµ () = dµ defines a positive distribution. Conversely, every positive distribution comes from a locally finite positive measure [Str03, §6.4]. 12 ^ If f L2 (Rd ), the Fourier transform [f ] := f of f is 2 ^ defined to be the limit, in the L -norm, of the sequence {fn } of Fourier transforms of any sequence {fn } of functions belonging to Sd , such that fn converges in the L2 -norm to the given function ^ f L2 (Rd ), as n . The function f is defined almost everyd 2 d where on R and belongs to L (R ). Thus, is a linear operator, mapping L2 (Rd ) into L2 (Rd ). Proof: We claim that there exists a non-zero function, satisfying (i) ­ (v) in Lemma 17 which therefore proves the result. Consider the following function, g ,0 C (Rd ) supported in [0 - , 0 + ], g ,0 ( ) = jd =1 1[-j ,j ] (j - 0,j ) e - 2 j 2 -(j -0,j )2 j , (17) where = (1 , . . . , d ), 0 = (0,1 , . . . , 0,d ) and = (1 , . . . , d ). Since supp() Rd , there exists an open d set U R on which is null. So, there exists and 0 = 0 with 0 > such that [0 - , 0 + ] U . Choose = (g ,0 + g ,-0 ), R\{0}, which implies supp() = [-0 - , -0 + ] [0 - , 0 + ] is compact. Therefore, by the Paley-Wiener theorem (The orem 24), is a rapidly decaying function, i.e., Sd . Since (0) = 0 (by construction), will take negative val ues. However, decays faster than some Q D of the d form q (x) j =1 1+|x1 |l+ , l N, > 0 where x = j (x1 , . . . , xd ). It can be verified that satisfies conditions (i) ­ (v) in Lemma 17. We conclude, there exists a non-zero as claimed earlier, which completes the proof. The above result shows that k with supp() Rd is not characteristic to the class of non-compactly supported probability measures on Rd with characteristic functions in either L1 (Rd ) or L2 (Rd ). 6.2 Main theorems: Proofs We are now in a position to prove Theorems 7 and 9. 6.2.1 Proof of Theorem 7 ( ) Let supp() = Rd . k is a characteristic kernel to S if F (P, Q) = 0 P = Q for P, Q S. We only need to show the implication F (P, Q) = 0 P = Q as the other direction is trivial. Assume that P = Q such that F (P, Q) = 0. Then by Lemma 15, satisfying (i) and (ii) given in Lemma 15. By Proposition 16, = 0 implies supp() Rd \supp(). Since supp() = Rd and is a uniformly continuous function in Rd , we have supp() = which means = 0 a.e. Therefore, by (i) of Theorem 15, we have P = Q, leading to a contradiction. Thus, P = Q such that F (P, Q) = 0. ( ) Suppose k is characteristic to S. We then need to show that supp() = Rd . 
This is equivalent to proving that k is not characteristic to S when supp() Rd . Let supp() Rd . Choose D S as the set of all non-compactly supported probability measures on Rd with characteristic functions in L1 (Rd ) L2 (Rd ). By Lemma 19, P = Q, P, Q D S such that F (P, Q) = 0. Therefore, k is not characteristic to S. 6.2.2 Proof of Theorem 9 Suppose P = Q, P, Q D S such that F (P, Q) = 0. Then by Lemma 15, there exists a Sd such that = p- q where p and q are the distributional derivatives of P and Q, respectively. Since P, Q D, we can apply Lemma 17 and so is a non-zero function that satisfies conditions (i) ­ (v) in Lemma 17. The condition = 0 implies supp() Rd \supp(). Since supp() has a non-empty interior, we have supp() Rd . Thus, there exists an open set, U Rd such that (x) = 0, x U . By Lemma 25, this means that Ris not compactly supported in Rd . Condition (iv) implies d (x) dx = 0, which means that takes negative values. Since q is compactly supported in Rd , q (x) + (x) < 0 for some x Rd \supp(Q), which violates condition (v) in Lemma 17. In other words, there does not exist a non-zero that satisfies conditions (i) ­ (v) in Lemma 17, thereby leading to a contradiction. As discussed in §4.1, the condition that supp() has a nonempty interior is important for Theorem 9 to hold. This is because if supp() has an empty interior, then supp() = Rd . In principle, one can construct such a by selecting Sd so that it satisfies conditions (i) ­ (iv) of Lemma 17 while satisfying the decay conditions (Eq. (29) and Eq. (30)) given in the Paley-Wiener theorem (see Theorem 24). Therefore, by the Paley-Wiener theorem, is a C function with compact support. If is chosen such that supp() supp(Q), then condition (v) of Theorem 17 will be satisfied. Thus, one can construct P = Q, P, Q D (D being defined in Theorem 9) such that F (P, Q) = 0. Note that conditions (i) and (ii) of Lemma 17 are automatically satisfied (except for conjugate symmetry) by choosing Sd . However, choosing such that it is also an entire function (so that the Paley-Wiener theorem can be applied) is not straightforward. In the following, we provide a simple example to show that P = Q, P, Q D can be constructed such that F (P, Q) = 0, where F corresponds to a unit ball in an RKHS (H, k ) induced by a periodic translation-invariant kernel for which supp() Rd has an empty interior. Example 3 Let Q be a uniform distribution on [- , ] R, i.e., q (x) = 21 1[- , ] (x) with its characteristic function, Q ( ) = sin( ) 1 2 in L2 (R). Let be the Dirichlet kersin (2l+1) x nel with period , where , i.e., (x) = sin x and 2 j w l ( ) = - ith supp() = { 2j , j j =-l {0, ±1, . . . , ±l}}. Clearly, supp() has an empty interior. Let be 2 sin 8 2 4 ( ) = sin , (18) 2 2 i with 21 . It is easy to verify that L1 (R) L2 (R) Cb (R) and so satisfies (i) in Lemma 17. Since ( ) = 0 at = 2 l , l Z, also satisfies (iii) and (iv ) in Lemma 17. is given by 2|x+ 2 | - , - x 0 2|x- | (19) (x) = 2 , 0x - 0, otherwise, where L1 (R)L2 (R)Cb (R) satisfies (ii) in Lemma 17. Now, consider p = q + which is given as 1 x [- , - ] [ , ] 2 , 2|x+ | 2 1 + 2 - , x [- , 0] p(x) = 2|x- | 2 + 1 - , x [0, ] 2 0, otherwise. R Clearly, p(x) 0, x and p(x) dx = 1. P = Q + = Q + iI where I = I m[] and P L2 (R). We have therefore constructed P = Q such that F (P, Q) = 0, where P and Q are compactly supported in R with characteristic functions in L2 (R). The condition of the compact support for probability measures mentioned in Theorem 9 is also critical for the result to hold. 
If this condition is relaxed, then k with supp() Rd is no longer characteristic to D, as shown in Lemma 19. 7 Concluding Remarks Previous works have studied the Hilbert space embedding for probability measures using universal kernels, which form a restricted family of positive definite kernels. These works showed that if the kernel is universal, then the embedding function from the space of probability measures to a reproducing kernel Hilbert space is injective. In this paper, we extended this approach to a larger family of kernels which are translation-invariant on Rd . We showed that the support of the Fourier spectrum of the kernel determines whether the embedding is injective. In particular, the necessary and sufficient condition for the embedding to be injective is that the Fourier spectrum of the kernel should have the entire domain as its support. Our study in this paper was limited to kernels and probability measures that are defined on Rd , and the results have been derived using Fourier analysis in Rd . Since Fourier theory is available for more general groups apart from Rd , one direction for future work is to extend the analysis to positive definite kernels defined on other groups. which proves Eq. (20). Clearly µ is bounded as |µ( )| 1. ^ ^ By Lebesgue's dominated convergence theorem, µ is uni^ formly continuous on Rd as limh0 |µ( + h) - µ( )| ^ ^ R T limh0 d |e-j h x - 1| dµ(x) = 0, for any Rd . R T (i) µ( ) = d ei x dµ(x) = µ(- ). ^ ^ R (iR) ( ) For Sd , Dµ () = Dµ () = d (x) dµ(x) = i ^ ^ ^ ^ ~ d µ(x)(x) dmd (x). Since Sd and Dµ () = Dµ (), R ~ Sd , we have Dµ () = Dµ () = d (-x) dµ(x). ^ ^ ^ Substituting for (-x), we get ^ R R µ(x)(x) dmd (x), ^ µ(-x)(x) dmd (x) = ^ Dµ () = ^ d d for every Sd , which implies µ(x) = µ(-x), x Rd . ^ ^ ( R ) For Sd , we have Dµ () = (Dµ ) () = Dµ () = R ^ ^ d µ(-x)(x) dmd (x). Applyd µ(x)(x) dmd (x) = ing Fubini's theorem after substituting for µ(-x) and (x) ^ gives RR (y + )(y ) dmd (y ) dµ( ) Dµ () = d d R = (- ) dµ( ) = Dµ (), ~ d for every Sd . Remark 21 (a) Property (i) in Lemma 20 shows that the Fourier transform of a finite Borel measure on Rd is "conjugate symmetric", which means that Re[µ] is an even function ^ and I m[µ] is an odd function. ^ (b) Property (ii) shows that real symmetric tempered distributions have real symmetric Fourier transforms. This can be easily understood when µ is absolutely continuous w.r.t. the Lebesgue measure. Suppose dµ = dmd . Then property (ii) implies that µ is real and symmetric if and only if is ^ real and symmetric. The following result is popularly known as the convolution theorem. Before providing the result, we first define convolution: if f and g are complex functions in Rd , their convolution f g is R (f g )(x) = f (y )g (x - y ) dy , (22) d Appendix A Supplementary Results We show five supplementary results used to prove the results in §4 and §6. The first two are basic, and deal with the Fourier transform of a measure and the convolution theorem. The remaining three (the Riemann-Lebesgue lemma, the Paley-Wiener theorem, and its corollary) are stated without proof. Lemma 20 (Fourier transform of a measure) Let µ be a finite Borel measure on Rd . The Fourier transform of µ is a tempered distribution given by R T µ( ) = ^ e-i x dµ(x), Rd (20) d which is a bounded, uniformly continuous function on Rd . 
In addition, µ satisfies the following properties: ^ (i) µ( ) = µ(- ), Rd , ^ ^ (ii) µ( ) = µ(- ), Rd if and only if Dµ () = ^ ^ Dµ (), Sd where Dµ is the tempered distribu~ tion defined by µ and (x) := (-x), x Rd . ~ Proof: Let Dµ denote a tempered distribution defined by µ. R For Sd , we have Dµ () = Dµ () = d ( ) dµ( ) = ^ ^ R R -iT x (x) dmd (x) dµ( ). From Fubini's theorem, de d R R T Dµ () = e-ix dµ( ) (x) dmd (x), (21) d d provided that the integral exists for almost all x Rd , in the Lebesgue sense. Let µ be a finite Borel measure on Rd and f be a bounded measurable function on Rd . The convolution of f and µ, f µ, which is a bounded measurable function, is defined by R (f µ)(x) = f (x - y ) dµ(y ). (23) d Theorem 22 (Convolution Theorem) Let µ be a finite Borel measure and f be a bounded function on Rd . Suppose f is written as R T f (x) = eix d( ), (24) d with a finite Borel measure on Rd . Then (f µ) = µ, ^ 14 References (25) [Aro50] [BJ02] [Dud02] [FGSS08] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337­404, 1950. F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1­48, 2002. R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002. K. Fukumizu, A. Gretton, X. Sun, and ¨ B. Scholkopf. Kernel measures of conditional dependence. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489­ 496, Cambridge, MA, 2008. MIT Press. A. Gretton, K. M. Borgwardt, M. Rasch, ¨ B. Scholkopf, and A. Smola. A kernel method ¨ for the two sample problem. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513­520. MIT Press, 2007. A. Gretton, R. Herbrich, A. Smola, O. Bousquet, ¨ and B. Scholkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075­2129, December 2005. A. Gretton, A. Smola, O. Bousquet, R. Herbrich, ¨ B. Scholkopf, and N. Logothetis. Behaviour and convergence of the constrained covariance. Technical Report 130, MPI for Biological Cybernetics, 2004. C. Gasquet and P. Witomski. Fourier Analysis and Applications. Springer-Verlag, New York, 1999. S. G. Mallat. A Wavelet Tour of Signal Processing. Academic Press, San Diego, 1998. M. Reed and B. Simon. Functional Analysis. Academic Press, New York, 1972. W. Rudin. Functional Analysis. McGraw-Hill, USA, 1991. A. J. Smola, A. Gretton, L. Song, and ¨ B. Scholkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13­31. Springer-Verlag, Berlin, Germany, 2007. G. R. Shorack. Probability for Statisticians. Springer-Verlag, New York, 2000. ¨ B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67­ 93, 2002. R. S. Strichartz. A Guide to Distribution Theory and Fourier Transforms. World Scientific Publishing, Singapore, 2003. H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005. where the right hand side is a finite Borel measure and the equality holds as a tempered distribution. Proof: Since the Fourier and inverse Fourier transform give one-to-one correspondence of Sd , it suffices to show f µ = ( µ ) . ^ For an arbitrary Sd , (µ) () = (µ)() = ^ ^ d (27) R (x)µ(x) d(x). 
^ (28) Substituting for µ in Eq. (28) and applying Fubini's theorem, ^ we have (µ) () = ^ R RR i( -y )T x e d(x) ( ) dmd ( ) dµ(y ), d d d [GBR+ 07] RR which reduces to d [ d f ( - y ) dµ(y )]( ) dmd ( ) = (f µ)() and therefore proves Eq. (27). The following result, called the Riemann-Lebesgue lemma, is quoted from [Rud91, Theorem 7.5]. ^ Lemma 23 (Riemann-Lebesgue) If f L1 (Rd ), then f d Cb (R ), and f f 1. ^ The following theorem is a version of the Paley-Wiener theorem for C functions, and is proved in [Str03, Theorem 7.2.2]. Theorem 24 (Paley-Wiener) Let f be a C function sup^ ported in [- , ]. Then f ( + i ) is a entire function of exponential type , i.e., C such that ^ f ( + i ) C e | | , (29) ^ and f ( ) is rapidly decreasing, i.e., cn such that ^ cn , n N. f ( ) (1 + | |)n [GHS+ 05] [GSB+ 04] [GW99] [Mal98] [RS72] [Rud91] [SGSS07] (30) Conversely, if F ( + i ) is an entire function of exponential ^ type , and F ( ) is rapidly decaying, then F = f for some such function f . The following lemma is a corollary of the Paley-Wiener theorem, and is proved in [Mal98, Theorem 2.6]. Lemma 25 ([Mal98]) If g = 0 has compact support, then its Fourier transform g cannot be zero on a whole interval. ^ Similarly, if g = 0 has compact support, then g cannot be ^ zero on a whole interval. Let µ be a finite Borel measure and f be a bounded measurable function on Rd . We then define a finite Borel measure f µ by R IE (x)f (x) dµ(x), (26) (f µ)(E ) = d [Sho00] [SS02] [Ste02] [Str03] [Wen05] 14 where E is an arbitrary Borel set and IE is its indicator function.