Kernel Measures of Conditional Dependence

Anonymous Author(s) Affiliation Address email

Abstract
We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments.

1 Introduction
Measuring dependence of random variables is one of the main concerns of statistical inference. A typical example is the inference of a graphical model, which expresses the relations among variables in terms of independence and conditional independence. Independent component analysis employs a measure of independence as the objective function, and feature selection in supervised learning looks for a set of features on which the response variable most depends. Kernel methods have been successfully used for capturing (conditional) dependence of variables [1, 5, 8, 9, 16]. With the ability to represent high order moments, mapping of variables into reproducing kernel Hilbert spaces (RKHSs) allows us to infer properties of the distributions, such as independence and homogeneity [7]. A drawback of previous kernel dependence measures, however, is that their value depends not only on the distribution of the variables, but also on the kernel, in contrast to measures such as mutual information. In this paper, we propose to use the Hilbert-Schmidt norm of the normalized conditional crosscovariance operator, and show that this operator encodes the dependence structure of random variables. Our criterion includes a measure of unconditional dependence as a special case. We prove in the limit of infinite data, under assumptions on the richness of the RKHS, that this measure has an explicit integral expression which depends only on the probability densities of the variables, despite being defined in terms of kernels. We also prove that its empirical estimate converges to the kernelindependent value as the sample size increases. Furthermore, we provide a general formulation for the "richness" of an RKHS, and a theoretically motivated kernel selection method. We successfully apply our measure in experiments on synthetic and real data.

2 Measuring conditional dependence with kernels
In this paper, the probability law of a random variable is denoted by È , and the functional space of the square integrable functions with probability È by Ä¾ ´È µ. The symbol indicates the independence of and , and indicates the conditional independence of and given . The null space and the range of an operator Ì are written Æ ´Ì µ and Ê´Ì µ, respectively. 1


2.1 Dependence measures with normalized cross-covariance operators Covariance operators on RKHSs have been successfully used for capturing dependence and conditional dependence of random variables, by incorporating high order moments [5, 8, 16]. We give a brief review here; see [5, 6, 2] for further detail. Suppose we have a random variable ´ µ on ¢ , and RKHSs À and À on and , respectively, with measurable positive definite kernels and . Throughout this paper, we assume the integrability ( A- 1 )

´

µ
ÓÚ

½

´

µ

½.

This assumption ensures À À Ä¾´È µ and À Ä¾ ´È µ. The cross-covariance operator ¦ À is defined by the unique bounded operator that satisfies for all ¾ À and ¾ À . If and positive. The operator ¦ naturally extends the covariance matrix and represents higher order correlations of and through ´ µ and ´

¦

À

´ µ ´ µ ´ ´ µ ´ µ   ´ µ ´ µµ (1 ) , ¦ is called the covariance operator, which is self-adjoint µ with nonlinear kernels.
on Euclidean spaces,

It is known [2] that the cross-covariance operator can be decomposed into the covariance of the marginals and the correlation; that is, there exists a unique bounded operator Î such that

Ê´Î

to 1. We call Î

µ

Ê´¦

Ê´¦ µ. The operator norm of Î is less than or equal and Æ ´Î µ the normalized cross-covariance operator (NOCCO, see also [4]).
µ,

¦

¦

½

¾

Î

¦

½

¾

(2 )

Suppose we have another random variable on and RKHS ´À µ, which satisfy the analog to (A-1). We then define the normalized conditional cross-covariance operator,

and as While the operator Î encodes the same information regarding the dependence of ¦ , the former rather expresses the information more directly than ¦ , with less influence of the marginals. This relation can be understood as an analogue to the difference between the covariance ÓÚ  and the correlation ÓÚ  ´Î Ö´ µÎ Ö´ µµ½ ¾ . Note also that kernel canonical correlation analysis [1] uses the largest eigenvalue of Î and its corresponding eigenfunctions [4]. Î
 ½

Î

 

Î

Î ar e

(3 )

for measuring the conditional dependence of and given , where Î and Î similarly to Eq. (2). The operator Î may be better understood by expressing it as
w

defined

Î  
¦

h er e ¦  ¦ ¦ tional covariance matrix

 ½

¦

¾

 

¦

 

¦

¦

 ½

¦

¡

¦ ½ ¾

¦

The operator ¦ and : roughly speaking, can be used to determine the independence of . Similarly, a relation between ¦ Ç if and only if and conditional independence, ´ µ an d ´ µ ar e , has been established in [5]: if the extended variables used, is equivalent to ¦ Ç . We will give a rigorous treatment in Section 2.2

 ½

can

be interpreted as a nonlinear extension of the condiof Gaussian random variables.

¦

Noting that the conditions ¦ Ç an d ¦ Ç an d Î Ç are equivalent to Î Ç, respectively, we propose to use the Hilbert-Schmidt norms of the latter operators as dependence measures. Recall that an operator À½ À¾ is called Hilbert-Schmidt if for complete orthonormal systems (CONSs) of À½ and of À¾ , the sum È À ¾ ¾ is finite (see À Ë is defined by [13]). For a Hilbert-Schmidt operator , the Hilbert-Schmidt (HS) norm ¾Ë È À À ¾ ¾ . It is easy to see that this sum is independent of the choice of CONSs. Provided that Î nd Î a are Hilbert-Schmidt, we propose the following measures:

Á

ÇÆ

´
Ç

µ

ÁÆÇ

´

µ

Î Î

¾Ë À

(4) (5)

¾Ë À

A sufficient condition that these operators are Hilbert-Schmidt will be discussed in Section 2.3. 2


It is easy to provide empirical estimates of the measures. Let ´ ½ ½ ½µ ´ÒÒ ´Òµ be an i.i.d. sample from the joint distribution. Using the empirical mean elements Ñ ½ ÈÒ ÈÒ ´ ¡ µ and Ñ´Òµ Ò ´ ¡ µ, an estimator of ¦ is ½ ½ ½ Ò

Òµ

¦

¦´Òµ
´Òµ

½ Ò

ÈÒ

½

´ ´¡
·

µ   Ñ´Òµ µ

ª

´¡

µ   Ñ´Òµ

«

¡

w

¦´Òµ re defined similarly. The estimators of Î an d Î are respectively an d a ¡ ½ ¾ ´Òµ ¡ ½ ¾ Î ´Òµ ¦´Òµ ¦ ¦´Òµ ÒÁ ÒÁ

À

h er e

Ò

¼ is a regularization constant used in the same way as [1, 5], and
Î ´Òµ
Î
´Òµ

 

 

·

 

Î ´Òµ

Î

´Òµ

(6 )

from Eq. (3). The HS norm of the finite rank operator Î

´Òµ

be the centered Gram matrices, such that ´¡ µ  Ñ µ Ñ À ´¡ and so on, and define Ê , Ê , and Ê as Ê ·Ò ÒÁÒ µ ½ Ê ·Ò Ò ÁÒ µ ½  ½ . The empirical depe´ndence measures are then ´ nd Ê · Ò ÒÁÒ µ ´ ¾ ÁÒ ÇÆ Î ´Òµ À Ë ÌÖ¢Ê Ê   ¾Ê Ê Ê · Ê Ê Ê Ê £  (7) 
a
´Òµ ´Òµ

is

easy to calculate. Let

,

, an d

where the extended variables are used for ÁÒ . These empirical estimators, and use of Ò, will be justified in Section 2.4 by showing the convergence to Á Æ Ç Ç and Á ÇÆ . With the incomplete Cholesky decomposition [17] of rank Ö, the complexity to compute ÁÒ ÇÆ is Ç´Ö¾ Òµ.
ÇÆ

Æ ÁÒ Ç

Ç

´

µ

 Î ´Òµ ¾ Ë ÌÖ Ê À

¢

Ê

£

(8)

2.2 Inference on probabilities by characteristic kernels To relate Á Æ Ç Ç and Á ÇÆ with independence and conditional independence, respectively, the RKHS should contain a sufficiently rich class of functions to represent all higher order moments. Similar notions have already appeared in the literature: universal kernel on compact domains [15] and Gaussian kernels on the entire ÊÑ characterize independence via the cross-covariance operator [8, 1]. We now discuss a unified class of kernels for inference on probabilities. L et ´ µ be a measurable space, a random variable on , and ´À µ an RKHS on satisfying assumption (A-1). The mean element of on À is defined by the unique element Ñ ¾ À such ´ µ for all ¾ À (see [7]). If the distribution of is È , we also use ÑÈ to that Ñ À denote Ñ . Letting È be the family of all probabilities on ´ µ, we define the map Å by

linear hull of ´Ù ¡µ Ù ¾ is dense in À. Thus, the definition of a characteristic kernel generalizes the well-known property of the characteristic function that È ´Ù µ uniquely determines a Borel probability È on ÊÑ . The next lemma is useful to show that a kernel is characteristic. Lemma 1. Let Õ m ½. Suppose that ´À µ is an RKHS on a measurable space ´ µ with easurable and bounded. If À · Ê (the direct sum of the two RKHSs) is dense in ÄÕ ´ Èµ for any probability È on ´ µ, the kernel is characteristic. Proof. Let È and É be probabilities with ÑÈ ÑÉ , and be any measurable set. By the assumpt aion, for any ¼ there is a function ¾ À and  ¾ Ê such that È ´ µ ·    È ´ µ nd É ´ µ ·    É´ µ . From ÑÈ ÑÉ , we have ´ µ ´ µ, and thus È´ µ   É´ µ ¾ . Since ¼ is arbitrary, this means È ´ µ É´ µ.
1 Although the same notion was called probability-determining in [5], we call it "characteristic" by analogy with the characteristic function.

ÅÈÀ È ÑÈ is said to be characteristic1 if the map Å is injective, or equivalently, if the condition È ´ µ É ´ µ ( ¾ À) implies È É. Ì The notion of a characteristic kernel is a generalization of the characteristic function È Ô ½Ù , which is the expectation of the (complex-valued) positive definite kernel ´Ü Ùµ Ô ½ÙÌ Ü. Note that the condition ÑÈ ÑÉ is equivalent to È ´Ù µ É ´Ù µ for all Ù ¾ , since the
T
he kernel

3


Many popular kernels are characteristic. For a compact metric space, it is easy to see that the RKHS given by a universal kernel [15] is dense in Ä¾ ´È µ for any È , and thus characteristic (see also [7] Theorem 3). It is also important to consider kernels on non-compact spaces, since many standard random variables, such as Gaussian variables, are defined on non-compact spaces. By the next theorem, it is easy to see that many kernels on the entire ÊÑ , including Gaussian and Laplacian, are characteristic. The proof is an extension of Theorem 2 in [1], and is given in the supplementary material. Theorem 2. Let ´Þµ be a continuous positive function on ÊÑ with the Fourier transform ´Ùµ, and be a kernel of the form ´Ü Ý µ ´Ü   Ýµ. If for any ¾ ÊÑ there exists ¼ such that

for all ÊÑ . Hence

´Ù · µµ¾ Ù½ ´Ùµ ¼, then the RKHS associated with is dense in Ä¾´È µ for any Borel probability È is characteristic with respect to the Borel -field.

´

on

(

The assumptions to relate the operators with independence are well described by using characteristic kernels and denseness. The next result is a generalization of Corollary 9 in [5]. It is easily proved for independence, and in the same manner as [5, Corollary 9] for conditional independence. We omit the proof: see [5, 6] for details. Theorem 3. (i) Assume (A-1) for the kernels. If the product is characteristic, then we have

ii) Denote µ and characteristic kernel on ´ ¢

´

Î

Ç

F

µ ¢ , and À · Ê is dense in Ä¾´È µ. Then, Î ´µ Ç

. In addition to (A-1), assume that the product

´µ

is a

rom the above results, we can guarantee that Î an d Î will detect independence and conditional independence, if we use a Gaussian or Laplacian kernel either on a compact set or the whole of ÊÑ . Note also that we can substitute Î fo r Î in Theorem 3 (ii). 2.3 Kernel-free integral expression of the measures A remarkable property of Á Æ Ç Ç and Á ÇÆ is that they do not depend on the kernels under some assumptions, having integral expressions containing only the probability density functions. The Ê p Þrobability È ª È  on ¢ is defined by È ª È ´ ¢ µ ´µ  ´µ Þ È ´Þ µ for ¾ and ¾ . Theorem 4. Let and be measures on and , respectively, and assume that the probabilities È and È ª È  are absolutely continuous with respect to ¢ with probability density functions Ô and Ô , respectively. If À · Ê and ´À ª À µ · Ê are dense in Ä¾ ´È µ and Ä¾ ´È ª È µ, respectively, and Î and Î Î are Hilbert-Schmidt, then we have

Á

ÇÆ

Î

¾Ë À

¢

Ü Ýµ Ô   Ô ÜµÔ ´Ü ÝÝµ ÔÔ ´ÜµÔ ´Ýµ ´ ´µ
´

¾

Ô ´ÜµÔ ´Ýµ
and

(

where Ô and Ô are the density functions of the marginal distributions È As a special case of , we have

È

,

9) respectively.

ÁÆÇ

Ç

Î

¾Ë À

¢

Ü Ýµ   ½ Ô ´ÜµÔ ´Ýµ ÔÔ ´ÜµÔ ´Ýµ
´

¾

(10)

Sketch of the proof (see the supplement for the complete proof). Since it is known [8] that ¦ is Hilbert-Schmidt under (A-1), there exist CONSs ½ ½ À and ½ ½ À consisting of the eigenfunctions of ¦ ¼µ and ¦ ( an d ¦ , respectively, with ¦ ( ¼). Then, Î ¾ Ë admits the expansion À
È½ Ò
½

Î

À  ¾
¾

Î

À

4

Î

Î

À

·

Î

Î

ÀÓ
¾


´  and Á· ¾Æ ¼ , and define ´ µµ Ô · and ¾ Á· . For simplicity, Ä¾ denotes Ä ´È ´  nd ´ µµ Ô for ¾ Á ¾ ª È µ. ¼ ½ and ¼ ½, it is easy to see that the class With the notations ¾Á· ¼ ¾Á· ¼ is a ¾ CONS of Ä . From Parseval's equality, the first term of the above expansion is rewritten as
Let a

Á·

¾Æ

¼

È

¾Á·

¾Á·

B  ÔÔ ´ÜµÔ 

 ´Ü Ýµ ¾

¦  

´Ý µ Ä¾

È

¾ À ¾Á·

¢ ´ µ  

È ¾Á·

¾Á·

È

£ ¢ ´ µ ´ µ¾

¾Á·

¢ ´ µ   ½

È ¾Á·
Ô 

Ô ´ÜµÔ ´Ý µ Ä¾

 ´Ü Ýµ ¾

¾Á·

 

Ô

 ½

ÔÔ

¡

Ä¾

y a similar argument, the second and third term of the expansion are rewritten as     ¡  ¾ ÔÔ Ô Ô Ô Ô Ä¾ · ¾ and  Ô Ô Ô ¾ ¾   ½ respectively. This completes the proof. Ä Many practical kernels, such as the Gaussian and Laplacian, satisfy the assumptions in the above theorem, as we saw in Theorems 2 and the remark after Lemma 1. While the empirical estimate from finite samples depends on the choice of kernels, it is a desirable property for the empirical dependence measure to converge to a value that depends only on the distributions of the variables. Eq. (10) shows that, under the assumptions, Á Æ Ç Ç is equal to the mean square contingency, a well-known dependence measure[14] commonly used for discrete variables. As we show in Section Æ 2.4, ÁÒ Ç Ç works as a consistent kernel estimator of the mean square contingency. The expression of Eq. (10) can be compared with the mutual information,

B

ÅÁ´

µ

¢

Ô

´

Ü Ýµ Ü Ýµ ÐÓ Ô ´ Ô ´ÜµÔ ´Ýµ

oth the mutual information and the mean square contingency are nonnegative, and equal to zero if and only if and are independent. Note also that from ÐÓ Þ Þ   ½, the inequality ÅÁ´ µ ÁÆÇ Ç´ µ holds under the assumptions of Theorem 4. While the mutual information is the best known dependence measure, its finite sample empirical estimate is not straightforward, especially for continuous variables. The direct estimation of a probability density function is infeasible if the joint space has even a moderate number of dimensions. In the same way as the above proof, we can show that Î r. h. s. of Eq. (10) and (9) are finite. The condition for Î 2.4 Consistency of the measures It is important to ask whether the empirical measures converge to the population value Á ÇÆ and Á Æ Ç Ç , since this provides a theoretical justification for the empirical measures. It is known [4] ´Òµ that Î in operator norm. The next theorem asserts convergence converges in probability to Î in HS norm, provided that Î is Hilbert-Schmidt. Although the proof is analogous to the case of operator norm, it is more involved to discuss the HS norm. We give it in the supplementary material. Theorem 5. Assume that constant Ò satisfies Ò
an d ÀË

Î

are Hilbert-Schmidt if the has been already obtained in [4].

Î

Î ´Òµ   Î
Æ In particular, ÁÒ Ç

ÀË Ç

¼ and Ò Ò ½. Then, we have the convergence in probability ¿ ¼ and Î ´Òµ   Î ¼ ´Ò ½µ ÀË Á Æ Ç Ç and ÁÒ ÇÆ Á ÇÆ (Ò ½) in probability.

,

Î

,

and

Î

are

Hilbert-Schmidt, and that the regularization (1 1 )

2.5 Choice of kernels
Æ As with all empirical measures, the sample estimates ÁÒ Ç Ç and ÁÒ ÇÆ are dependent on the kernel, and the problem of choosing a kernel has yet to be solved. Unlike supervised learning, there are no easy criteria to choose a kernel for dependence measures. We propose a method of choosing a kernel by considering the large sample behavior. We explain the method only briefly in this paper.

The basic idea is that a kernel should be chosen so that the covariance operator detects independence of variables as effectively as possible. It has been recently shown [10], under the independence of 5


4

4

1.6 1.4

2

2

1.2 N OC C O 1 0.8 0.6 0.4

0

0

-2

-2

I

-4 -4

-2

0

2

4

-4 -4

-2

0

2

4

0.2 0

0.2

0.4

0.6

0.8

Angle

Figure 1: Left and Middle: Examples of data ( ¼ an d µ. Right: The marks "o" and "+" Æ ÁÒ Ç Ç for each angle and the 95th percentile of the permutation test, respectively. show that the measure À Ë Á ¦ ¾ Ë ([8]) multiplied by Ò converges to an infinite À mixture of ¾ distributions with variance Î ÖÐ Ñ ÒÀËÁ  ¾ ¦ ¾Ë ¦ ¾ À À Ë . We choose a kernel so that the bootstrapped variance Î Ö ÒÀ Ë Á  of ÒÀ Ë Á is close to this theoretical limit variance. More precisely, we compare the ratio Ì Î Ö ÒÀ Ë Á  Î ÖÐ Ñ ÒÀËÁ  for various candidate kernels. In preliminary experiments for choosing the variance parameter of Gaussian kernels, we often observed the ratio decays and saturates below 1, as increases. Therefore, we use ´ starting the saturation by choosing the minimum of among all candidates that satisfy Ì   « ½ · Æµ Ñ Ò Ì   « for Æ ¼ « ¾ ´¼ ½. We always use Æ ¼ ½ and « ¼ . We can expect that the chosen kernel uses the data effectively. While there is no rigorous theoretical guarantee, in Æ the next section we see that the method gives a reasonable result for ÁÒ Ç Ç and ÁÒ ÇÆ .

and ,

´Òµ

3 Experiments
To evaluate the dependence measures, we use a permutation test of independence for data sets with various degrees of dependence. The test randomly permutes the order of ½ Ò to make many samples independent of ´ ½ Òµ, thus simulating the null distribution under independence. For the evaluation of ÁÒ ÇÆ , the range of is partitioned into ½ Ä with the same number ¾ within the -th bin is randomly permuted. The of data, and the sample ´ µ significance level is always set to ±. In the following experiments, we always use Gaussian kernels ½   ¾ ¾ Ü½  Ü¾ ¾ and choose by the method proposed in Section 2.5. Synthetic data for dependence. The random variables ´¼µ and ´¼µ are independent and uniformly distributed on  ¾ ¾ and      , respectively, so that ´ ´¼µ ´¼µ µ has a scalar µ µ µ is the rotation of ´ ´¼µ covariance matrix. ´ ´ ´ ´¼µ µ by ¾ ¼  (see Figure 1). ´ µ µ are always uncorrelated, but dependent for an d ´ ¼. We generate 100 sets of 200 data. Æ We perform permutation tests with ÁÒ Ç Ç , À Ë Á ¦´Òµ ¾ Ë , and the mutual information À (MI). For the empirical estimates of MI, we use the advanced method from [11], with no need for Æ explicit estimation of the densities. Since ÁÒ Ç Ç is an estimate of the mean square contingency, we also apply a relevant contingency-table-based independence test ([12]), partitioning the variables Æ into bins. Figure 1 shows the values of ÁÒ Ç Ç for a sample. In Table 1, we see that the results f ÁÆÇ Ç are stable w.r.t. the choice of Ò, provided it is sufficiently small. We fix Ò of Ò ½¼  ÆÇ Ç or all remaining experiments. While all the methods are able to detect the dependence, ÁÒ with the asymptotic choice of is the most sensitive to very small dependence. We also observe she chosen parameters for increase from 0.58 to 2.0 as increases. The small for small t eems reasonable, because the range of is split into two small regions. Chaotic time series. We evaluate a chaotic time series derived from the coupled Henon map. The ´ variables and are four dimensional: the components ½ ¾ ½, and ¾ follow the dynamics

´ ½´Ø · ½µ ¾´Ø · ½µµ ´½   ½´Øµ¾ · ¼ ¿ ¾´Øµ ½´Øµµ, ´ ½´Ø · ½µ ¾´Ø · ½µµ ´½    ½´Øµ ½´Øµ · ´½    µ ¾´Øµ¾ · ¼ ½ ¾ ´Øµ ½´Øµµ, and ¿ ¿ are independent noise with Æ ´¼ ´¼ µ¾µ. and are independent for  ¼, while they are synchronized chaos for  ¼
(see Figure 2 for examples). A sample consists of 100 data generated from this system. Table 2 6


Angle (degree) Ç( ÁÆÇ ½¼  , Median) Ò ÆÇ Ç ( Á ½¼  , Median) Ò Ç( ÁÆÇ ½¼  , Median) Ò Ç (Asymp. Var.) ÁÆÇ Ò HSIC (Median) HSIC (Asymp. Var.) MI (#Nearest Neighbors = 1) MI (#Nearest Neighbors = 3) MI (#Nearest Neighbors = 5) Conting. Table (#Bins ¿) Conting. Table (#Bins ) Conting. Table (#Bins )

0 94 92 93 94 93 93 93 96 97 100 98 98

4.5 23 20 15 11 92 44 62 43 49 96 29 82

9 0 1 0 0 63 1 11 0 0 46 0 5

13.5 0 0 0 0 5 0 0 0 0 9 0 0

18 0 0 0 0 0 0 0 0 0 1 0 0

22.5 0 0 0 0 0 0 0 0 0 0 0 0

27 0 0 0 0 0 0 0 0 0 0 0 0

31.5 0 0 0 0 0 0 0 0 0 0 0 0

36 0 0 0 0 0 0 0 0 0 0 0 0

40.5 0 0 0 0 0 0 0 0 0 0 0 0

45 0 0 0 0 0 0 0 0 0 0 0 0

Table 1: Comparison of dependence measures. The number of times independence is accepted out of 100 permutation tests is shown. "Asymp. Var." is the method in Section 2.5. "Median" is a heuristic method [8] which chooses as the median of pairwise distances of the data.
2

2

5 1

I(X

t+1

,Y |X )
t t

I(Yt+1,Xt|Yt) Thresh (=0.05)

1

1

0.8 0.6

Thresh (=0.05)

4 3 2 1 0 0

X2(t)

0

Y1(t)

0

0.4
-1

-1
-2 -2

0.2
-1 0 1 2

-1

0

1

2

-2

X1(t)

(a) Plot of Henon map (b) ´

Ø ½ - Ø ½ (

X1(t)

0 0

0.2

0.4

0.6

0.2

0.4

0.6

¼¾ )

( c) Á ´

Ø·½ Ø Øµ

(d ) Á ´

Ø·½

Ø Øµ

Figure 2: Chaotic time series. (a,b): examples of data. (c,d) examples of ÁÒ ÇÆ the threshholds of the permutation test with significance level ± (black "+").

(colored "o") and

Next, we apply ÁÒ ÇÆ to detect the causal structure of the same time series. Note that the series i s a cause of for  ¼, but there is no opposite causality, i.e., Ø·½ Ø Ø and Ø·½ Ø Ø. In Table 3, it is remarkable that ÁÒ ÇÆ detects the small causal influence from Ø to Ø·½ for  ¼ ½, while for  ¼ the result is close to the theoretical value of 95%. Graphical modeling from medical data. The next example is the inference of a graphical model from data with no time structure. The data consist of three variables, creatinine clearance (C), digoxin clearance (D), urine flow (U). These were taken from 35 patients, and analyzed with graphical models in [3, Section 3.1.4.]. From medical knowledge, D should be independent of U when controlling C. Table 4 shows the results of the permutation tests and a comparison with the linear method. The relation Í is strongly affirmed by ÁÒ ÇÆ , while the partial correlation does not find it.
 (strength of coupling)
ÁÆÇ

shows the results of permutation tests of independence for the instantaneous pairs ´ ´Øµ Æ The proposed ÁÒ Ç Ç outperforms the other methods in capturing small dependence.

´Øµµ½¼¼ . Ø½

HS I C MI ( ¿) MI ( ) MI ( )

Ò

Ç

0.0 97 75 87 87 87

0.1 66 70 91 88 86

0.2 21 58 83 75 75

0.3 1 52 73 67 64

0.4 0 13 23 23 21

0.5 1 1 6 5 5

0.6 0 0 0 0 0

Table 2: Results for the permutation test of independence for the chaotic time series. The number of times independence was accepted out of 100 permutation tests is shown.  ¼ implies independence. 7


 (coupling)
ÁÆÇ

Ò

Ç

HS I C

0.0 97 94

H¼ : Ø is not a cause of 0.1 0.2 0.3 0.4 96 93 85 81 94 92 81 60

0.5 68 73

Ø·½

0.6 75 66

0.0 96 93

H¼ : 0.1 0 95

Ø is not a cause of Ø·½
0.2 0 85 0.3 0 56 0.4 0 1 0.5 0 1

0.6 0 1

Table 3: Results of the permutation test of non-causality for the chaotic time series. The number of times non-causality was accepted out of 100 tests is shown.
Kernel measure
Á

Í Í Í

Ò

1.458
0.776
0.194 0.343

ÇÆ

È -value
0.924 0.117 0.023

0.001

È ÖÓÖÖ´ ÓÖÖ´ ÓÖÖ´ ÓÖÖ´

Í µ Íµ Íµ

Linear method (partial) correl. 0.4847 0.7754 0.3092 0.5309

µ

È -value
0.0037 0.0000 0.0707 0.0010

Table 4: Graphical modeling from the medical data. Higher È -values indicate (conditional) independence more strongly.

4 Concluding remarks
There are many dependence measures among variables, and further theoretical and experimental comparison is important. That said, one unambiguous strength of the kernel measure we propose is its kernel-free population expression. It is interesting to ask if other classical dependence measures, such as the mutual information, can be estimated by kernels (in a broader sense than the expansion about independence of [9]). A relevant measure is the kernel generalized variance (KGV [1]), which is based on a sum of the logarithm of the eigenvalues of Î , while Á Æ Ç Ç is their squared sum. It is also interesting to investigate whether the KGV has a kernel-free expression. Another important topic for further study is causal inference with the proposed measure, both with and without time information ([16]).

References
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] F. Bach and M. Jordan. Kernel independent component analysis. J. Machine Learning Res., 3:1­48, 2002. C. Baker. Joint measures and cross-covariance operators. Trans. Amer. Math. Soc., 186:273­289, 1973. D. Edwards. Introduction to graphical modelling. Springer verlag, New York, 2000. K. Fukumizu, F. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. J. Machine Learning Res., 8:361­383, 2007. K. Fukumizu, F. Bach, and M. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Machine Learning Res., 5:73­99, 2004. K. Fukumizu, F. Bach, and M. Jordan. Kernel dimension reduction in regression. Tech Report 715, Dept. Statistics, University of California, Berkeley, 2006. A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola. A kernel method for the two-sample¨ problem. Advances in NIPS 19. MIT Press, 2007. A. Gretton, O. Bousquet, A. Smola, and B. Scholkopf. Measuring statistical dependence with Hilbert¨ Schmidt norms. 16th Intern. Conf. Algorithmic Learning Theory, pp.63­77. Springer, 2005. A. Gretton, R. Herbrich, A. Smola, O. Bousquet and B. Scholkopf. Kernel Methods for Measuring ¨ Independence. J. Machine Learning Res., 6:2075­2129, 2005. Anonymous Authors. A Kernel Statistical Test of Independence. Submitted.. A. Kraskov, H. Stogbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69, ¨ 066138-1­16, 2004. T. Read and N. Cressie. Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, 1988. M. Reed and B. Simon. Functional Analysis. Academic Press, 1980. A. Renyi. Probability Theory. Horth-Holland, 1970. ´ I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Machine Learning Res., 2:67­93, 2001. X. Sun, D. Janzing, B. Scholkopf, and K. Fukumizu. A kernel-based causal learning algorithm. Proc. ¨ 24th Intern. Conf. Machine Learning, 2007 to appear. S. Fine and K. Scheinberg Efficient SVM Training using Low-Rank Kernel Representations J. Machine Learning Res., 2:243­264, 2001.

8