Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

XuanLong Nguyen (SAMSI & Duke University), Martin J. Wainwright (UC Berkeley), Michael I. Jordan (UC Berkeley)

Abstract

We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of $f$-divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature.

1 Introduction

An important class of "distances" between multivariate probability distributions P and Q are the Ali-Silvey or $f$-divergences [1, 6]. These divergences, to be defined formally in the sequel, are all of the form $D_\phi(P, Q) = \int \phi(dQ/dP)\, dP$, where $\phi$ is a convex function of the likelihood ratio. This family, which includes the Kullback-Leibler (KL) divergence and the variational distance as special cases, plays an important role in various learning problems, including classification, dimensionality reduction, feature selection and independent component analysis. For all of these problems, if $f$-divergences are to be used as criteria of merit, one has to be able to estimate them efficiently from data. With this motivation, the focus of this paper is the problem of estimating an $f$-divergence based on i.i.d. samples from each of the distributions P and Q.

Our starting point is a variational characterization of $f$-divergences, which allows our problem to be tackled via an $M$-estimation procedure. Specifically, the likelihood ratio function $dP/dQ$ and the divergence functional $D_\phi(P, Q)$ can be estimated by solving a convex minimization problem over a function class. In this paper, we estimate the likelihood ratio and the KL divergence by optimizing a penalized convex risk. In particular, we restrict the estimate to a bounded subset of a reproducing kernel Hilbert space (RKHS) [17]. The RKHS is sufficiently rich for many applications, and also allows for computationally efficient optimization procedures. The resulting estimator is nonparametric, in that it entails no strong assumptions on the form of P and Q, except that the likelihood ratio function is assumed to belong to the RKHS.

The bulk of this paper is devoted to the derivation of the algorithm and a theoretical analysis of the performance of our estimator. The key to our analysis is a basic inequality relating a performance metric (the Hellinger distance) of our estimator to the suprema of two empirical processes (with respect to P and Q) defined on a function class of density ratios. Convergence rates are then obtained using techniques for analyzing nonparametric $M$-estimators from empirical process theory [20].

Related work. The variational representation of divergences has been derived independently and exploited by several authors [5, 11, 14]. Broniatowski and Keziou [5] studied testing and estimation problems based on dual representations of $f$-divergences, but working in a parametric setting as opposed to the nonparametric framework considered here.
Nguyen et al. [14] established a one-to-one correspondence between the family of $f$-divergences and the family of surrogate loss functions [2], through which the (optimum) "surrogate risk" is equal to the negative of an associated $f$-divergence. Another link is to the problem of estimating integral functionals of a single density, with the Shannon entropy being a well-known example; this problem has been studied extensively, dating back to early work [9, 13] as well as the more recent work [3, 4, 12]. See also [7, 8, 10] for the problem of (Shannon) entropy functional estimation. In another branch of related work, Wang et al. [22] proposed an algorithm for estimating the KL divergence of continuous distributions, which exploits histogram-based estimation of the likelihood ratio by building data-dependent partitions of equivalent (empirical) Q-measure. The estimator was empirically shown to outperform direct plug-in methods, but no theoretical results on its convergence rate were provided.

This paper is organized as follows. Sec. 2 provides background on $f$-divergences. In Sec. 3, we describe an estimation procedure based on penalized risk minimization, together with our convergence rate analysis. In Sec. 4, we derive and implement efficient algorithms for solving these problems using an RKHS. Sec. 5 outlines the proof of the analysis. In Sec. 6, we illustrate the behavior of our estimator and compare it to other methods via simulations.

2 Background

We begin by defining $f$-divergences, and then provide a variational representation of the $f$-divergence, which we later exploit to develop an $M$-estimator. Consider two distributions P and Q, both assumed to be absolutely continuous with respect to Lebesgue measure $\mu$, with positive densities $p_0$ and $q_0$, respectively, on some compact domain $\mathcal{X} \subset \mathbb{R}^d$. The class of Ali-Silvey or $f$-divergences [1, 6] consists of "distances" of the form
$$D_\phi(P, Q) = \int p_0 \, \phi(q_0/p_0) \, d\mu, \qquad (1)$$
where $\phi: \mathbb{R} \to \bar{\mathbb{R}}$ is a convex function. Different choices of $\phi$ yield many divergences that play important roles in information theory and statistics, including the variational distance, Hellinger distance, KL divergence and so on (see, e.g., [19]). As an important example, the Kullback-Leibler (KL) divergence between P and Q is given by $D_K(P, Q) = \int p_0 \log(p_0/q_0) \, d\mu$, corresponding to the choice $\phi(t) = -\log(t)$ for $t > 0$ and $+\infty$ otherwise.

Variational representation: Since $\phi$ is a convex function, by Legendre-Fenchel convex duality [16] we can write $\phi(u) = \sup_{v \in \mathbb{R}} (uv - \phi^*(v))$, where $\phi^*$ is the convex conjugate of $\phi$. As a result,
$$D_\phi(P, Q) = \int p_0 \sup_f \big( f q_0/p_0 - \phi^*(f) \big) \, d\mu = \sup_f \Big( \int f \, dQ - \int \phi^*(f) \, dP \Big),$$
where the supremum is taken over all measurable functions $f: \mathcal{X} \to \mathbb{R}$, and $\int f \, dP$ denotes the expectation of $f$ under distribution P. Denoting by $\partial\phi$ the subdifferential [16] of the convex function $\phi$, it can be shown that the supremum is achieved for functions $f$ such that $q_0/p_0 \in \partial\phi^*(f)$, where $q_0$, $p_0$ and $f$ are evaluated at any $x \in \mathcal{X}$. By convex duality [16], this is true if $f \in \partial\phi(q_0/p_0)$ for any $x \in \mathcal{X}$. Thus, we have proved [15, 11]:

Lemma 1. Letting $\mathcal{F}$ be any class of functions $\mathcal{X} \to \mathbb{R}$, there holds:
$$D_\phi(P, Q) \ge \sup_{f \in \mathcal{F}} \int f \, dQ - \int \phi^*(f) \, dP, \qquad (2)$$
with equality if $\mathcal{F} \cap \partial\phi(q_0/p_0) \neq \emptyset$.

To illustrate this result in the special case of the KL divergence, here the function $\phi$ has the form $\phi(u) = -\log(u)$ for $u > 0$ and $+\infty$ for $u \le 0$. The convex conjugate of $\phi$ is $\phi^*(v) = \sup_u (uv - \phi(u)) = -1 - \log(-v)$ if $v < 0$ and $+\infty$ otherwise. By Lemma 1,
$$D_K(P, Q) = \sup_{f < 0} \int f \, dQ - \int \big( -1 - \log(-f) \big) \, dP = \sup_{g > 0} \int \log g \, dP - \int g \, dQ + 1. \qquad (3)$$
In addition, the supremum is attained at $g = p_0/q_0$.
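As a quick numerical illustration of (3), not part of the original text, the following Python sketch checks the variational lower bound by Monte Carlo for a pair of univariate Gaussians: any fixed positive function $g$ yields a lower bound on $D_K(P, Q)$, and the bound is (approximately) tight at the true ratio $g = p_0/q_0$. The particular distributions and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# P = N(0, 1), Q = N(1, 2^2); closed-form KL(P, Q) for reference.
mu_p, sd_p, mu_q, sd_q = 0.0, 1.0, 1.0, 2.0
kl_true = np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

x_q = rng.normal(mu_q, sd_q, n)   # samples from Q
y_p = rng.normal(mu_p, sd_p, n)   # samples from P

def lower_bound(g):
    """Objective of Eq. (3): E_P[log g] - E_Q[g] + 1 for a fixed positive function g."""
    return np.mean(np.log(g(y_p))) - np.mean(g(x_q)) + 1.0

def g0(x):
    """True density ratio p0/q0 (normalizing constants handled in log space)."""
    logp = -0.5 * ((x - mu_p) / sd_p) ** 2 - np.log(sd_p)
    logq = -0.5 * ((x - mu_q) / sd_q) ** 2 - np.log(sd_q)
    return np.exp(logp - logq)

print("KL(P, Q), closed form:", kl_true)
print("bound at g = p0/q0   :", lower_bound(g0))                          # approximately kl_true
print("bound at g = 1       :", lower_bound(lambda x: np.ones_like(x)))   # 0, a valid lower bound
```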
3 Penalized M-estimation of KL divergence and the density ratio

Let $X_1, \ldots, X_n$ be a collection of $n$ i.i.d. samples from the distribution Q, and let $Y_1, \ldots, Y_n$ be $n$ i.i.d. samples drawn from the distribution P. Our goal is to develop an estimator of the KL divergence and of the density ratio $g_0 = p_0/q_0$ based on the samples $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$.

The variational representation in Lemma 1 motivates the following estimator of the KL divergence. First, let $\mathcal{G}$ be a class of functions $\mathcal{X} \to \mathbb{R}_+$. We then compute
$$\hat{D}_K = \sup_{g \in \mathcal{G}} \int \log g \, dP_n - \int g \, dQ_n + 1, \qquad (4)$$
where $P_n$ and $Q_n$ denote the empirical measures of the samples from P and Q, respectively, and $\int \cdot \, dP_n$ and $\int \cdot \, dQ_n$ the corresponding empirical expectations. If the supremum is attained at $\hat{g}_n$, then $\hat{g}_n$ serves as an estimator of the density ratio $g_0 = p_0/q_0$.

In practice, the "true" size of $\mathcal{G}$ is not known. Accordingly, our approach in this paper is an alternative one, based on controlling the size of $\mathcal{G}$ through a penalty. More precisely, let $I(g)$ be a non-negative measure of complexity for $g$ such that $I(g_0) < \infty$. We decompose the function class $\mathcal{G}$ as follows:
$$\mathcal{G} = \bigcup_{1 \le M \le \infty} \mathcal{G}_M, \qquad (5)$$
where $\mathcal{G}_M := \{ g \in \mathcal{G} \mid I(g) \le M \}$ is a ball determined by $I(\cdot)$. The estimation procedure involves solving the following program:
$$\hat{g}_n = \operatorname*{argmin}_{g \in \mathcal{G}} \; \int g \, dQ_n - \int \log g \, dP_n + \frac{\lambda_n}{2} I^2(g), \qquad (6)$$
where $\lambda_n > 0$ is a regularization parameter. The minimizing argument $\hat{g}_n$ is plugged into (4) to obtain an estimate $\hat{D}_K$ of the KL divergence.

For the KL divergence, the difference $|\hat{D}_K - D_K(P, Q)|$ is a natural performance measure. For estimating the density ratio, various metrics are possible. Viewing $g_0 = p_0/q_0$ as a density function with respect to the Q measure, one useful metric is the (generalized) Hellinger distance:
$$h_Q^2(g_0, g) := \frac{1}{2} \int \big( g_0^{1/2} - g^{1/2} \big)^2 \, dQ. \qquad (7)$$

For the analysis, several assumptions are in order. First, assume that $g_0$ (though not all of $\mathcal{G}$) is bounded from above and below:
$$0 < \eta_0 \le g_0 \le \eta_1 \quad \text{for some constants } \eta_0, \eta_1. \qquad (8)$$
Next, the uniform norm on $\mathcal{G}_M$ is Lipschitz with respect to the penalty measure $I(g)$, i.e., for some constant $c$,
$$\sup_{g \in \mathcal{G}_M} \|g\|_\infty \le cM \quad \text{for any } M \ge 1. \qquad (9)$$
Finally, a condition on the bracket entropy of $\mathcal{G}$ [21]: for some $0 < \gamma < 2$,
$$H_B^\delta(\mathcal{G}_M, L_2(Q)) = O\big( (M/\delta)^\gamma \big) \quad \text{for any } \delta > 0. \qquad (10)$$

The following is our main theoretical result, whose proof is given in Section 5:

Theorem 2. (a) Under assumptions (8), (9) and (10), and letting $\lambda_n \to 0$ so that
$$\lambda_n^{-1} = O_P(n^{2/(2+\gamma)}) (1 + I(g_0)),$$
then under P:
$$h_Q(g_0, \hat{g}_n) = O_P(\lambda_n^{1/2}) (1 + I(g_0)), \qquad I(\hat{g}_n) = O_P(1 + I(g_0)).$$
(b) If, in addition to (8), (9) and (10), there holds $\inf_{g \in \mathcal{G}} g(x) \ge \eta_0$ for any $x \in \mathcal{X}$, then
$$|\hat{D}_K - D_K(P, Q)| = O_P(\lambda_n^{1/2}) (1 + I(g_0)). \qquad (11)$$

4 Algorithm: Optimization and dual formulation

$\mathcal{G}$ is an RKHS. Our algorithm involves solving program (6) for some choice of function class $\mathcal{G}$. In our implementation, the relevant function classes are taken to be a reproducing kernel Hilbert space induced by a Gaussian kernel. RKHS's are chosen because they are sufficiently rich [17] and, as in many learning tasks, they are quite amenable to efficient optimization procedures [18].

Let $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel function [17]. Then $K$ is associated with a feature map $\Phi: \mathcal{X} \to \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and, for all $x, x' \in \mathcal{X}$, $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$. As a reproducing kernel Hilbert space, any function $g \in \mathcal{H}$ can be expressed as an inner product $g(x) = \langle w, \Phi(x) \rangle$, where $\|g\|_\mathcal{H} = \|w\|_\mathcal{H}$. The kernel used in our simulations is the Gaussian kernel
$$K(x, y) := e^{-\|x - y\|^2 / \sigma},$$
where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^d$ and $\sigma > 0$ is a parameter of the function class.
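As a concrete illustration of the RKHS objects used above (not part of the original paper), the sketch below builds the Gaussian-kernel Gram matrix and evaluates a finite kernel expansion $g(\cdot) = \sum_i c_i K(x_i, \cdot)$ together with its squared RKHS norm $c^\top K c$, which is the quantity penalized through $I(g) = \|g\|_\mathcal{H}$ in (6). The anchor points, coefficients, and kernel width are hypothetical.

```python
import numpy as np

def gaussian_gram(X, Z, sigma):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / sigma) for the Gaussian kernel."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma)

# An RKHS element g(.) = sum_i c_i K(x_i, .) has squared norm ||g||_H^2 = c' K c.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # hypothetical anchor points in R^2
c = rng.normal(size=5)               # hypothetical expansion coefficients
K = gaussian_gram(X, X, sigma=1.0)
g_at_X = K @ c                       # evaluations g(x_1), ..., g(x_5)
rkhs_norm_sq = c @ K @ c             # I(g)^2 = ||g||_H^2, the penalized quantity
```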
Let $\mathcal{G} := \mathcal{H}$, and let the complexity measure be $I(g) = \|g\|_\mathcal{H}$. Thus, Eq. (6) becomes
$$\min_w J := \min_w \; \frac{1}{n} \sum_{i=1}^n \langle w, \Phi(x_i) \rangle - \frac{1}{n} \sum_{j=1}^n \log \langle w, \Phi(y_j) \rangle + \frac{\lambda_n}{2} \|w\|_\mathcal{H}^2, \qquad (12)$$
where $\{x_i\}$ and $\{y_j\}$ are the realizations of the empirical data drawn from Q and P, respectively. The log function is extended to take the value $-\infty$ for negative arguments.

Lemma 3. $\min_w J$ has the following dual form:
$$-\min_{\alpha > 0} \; \sum_{j=1}^n \Big( -\frac{1}{n} - \frac{1}{n} \log n\alpha_j \Big) + \frac{1}{2\lambda_n n^2} \sum_{i,j} K(x_i, x_j) + \frac{1}{2\lambda_n} \sum_{i,j} \alpha_i \alpha_j K(y_i, y_j) - \frac{1}{\lambda_n n} \sum_{i,j} \alpha_j K(x_i, y_j).$$

Proof. Let $\phi_i(w) := \frac{1}{n} \langle w, \Phi(x_i) \rangle$, $\psi_j(w) := -\frac{1}{n} \log \langle w, \Phi(y_j) \rangle$, and $\Omega(w) := \frac{\lambda_n}{2} \|w\|_\mathcal{H}^2$. We have
$$\min_w J = -\max_w \big( \langle 0, w \rangle - J(w) \big) = -J^*(0) = -\min_{u_i, v_j} \Big\{ \sum_{i=1}^n \phi_i^*(u_i) + \sum_{j=1}^n \psi_j^*(v_j) + \Omega^*\Big( -\sum_{i=1}^n u_i - \sum_{j=1}^n v_j \Big) \Big\},$$
where the last line is due to the inf-convolution theorem [16]. Simple calculations yield:
$$\psi_j^*(v) = -\frac{1}{n} - \frac{1}{n} \log n\alpha_j \ \text{ if } v = -\alpha_j \Phi(y_j), \text{ and } +\infty \text{ otherwise};$$
$$\phi_i^*(u) = 0 \ \text{ if } u = \frac{1}{n} \Phi(x_i), \text{ and } +\infty \text{ otherwise};$$
$$\Omega^*(v) = \frac{1}{2\lambda_n} \|v\|_\mathcal{H}^2.$$
So $\min_w J = -\min_{\alpha} \big\{ \sum_{j=1}^n \big( -\frac{1}{n} - \frac{1}{n} \log n\alpha_j \big) + \frac{1}{2\lambda_n} \big\| \sum_{j=1}^n \alpha_j \Phi(y_j) - \frac{1}{n} \sum_{i=1}^n \Phi(x_i) \big\|_\mathcal{H}^2 \big\}$, which implies the lemma immediately.

If $\hat{\alpha}$ is a solution of the dual formulation, it is not difficult to show that the optimum is attained at $\hat{w} = \frac{1}{\lambda_n} \big( \sum_{j=1}^n \hat{\alpha}_j \Phi(y_j) - \frac{1}{n} \sum_{i=1}^n \Phi(x_i) \big)$.

For an RKHS based on a Gaussian kernel, the entropy condition (10) holds for any $\gamma > 0$ [23]. Furthermore, (9) holds trivially via the Cauchy-Schwarz inequality: $|g(x)| = |\langle w, \Phi(x) \rangle| \le \|w\|_\mathcal{H} \|\Phi(x)\|_\mathcal{H} \le K(x, x)^{1/2} I(g)$. Thus, by Theorem 2(a), $\|\hat{w}\|_\mathcal{H} = \|\hat{g}_n\|_\mathcal{H} = O_P(\|g_0\|_\mathcal{H})$, so the penalty term $\lambda_n \|\hat{w}\|^2$ vanishes at the same rate as $\lambda_n$. We have arrived at the following estimator for the KL divergence:
$$\hat{D}_K = 1 + \sum_{j=1}^n \Big( -\frac{1}{n} - \frac{1}{n} \log n\hat{\alpha}_j \Big) = -\frac{1}{n} \sum_{j=1}^n \log n\hat{\alpha}_j.$$

$\log \mathcal{G}$ is an RKHS. Alternatively, we could set $\log \mathcal{G}$ to be the RKHS, letting $g(x) = \exp \langle w, \Phi(x) \rangle$ and $I(g) = \|\log g\|_\mathcal{H} = \|w\|_\mathcal{H}$. Theorem 2 is not applicable in this case, because condition (9) no longer holds, but this choice nonetheless seems reasonable and worth investigating: in effect we have a far richer function class, which might reduce the bias of our estimator when the true density ratio is not very smooth.

A derivation similar to the previous case yields the following convex program:
$$\min_w J := \min_w \; \frac{1}{n} \sum_{i=1}^n e^{\langle w, \Phi(x_i) \rangle} - \frac{1}{n} \sum_{j=1}^n \langle w, \Phi(y_j) \rangle + \frac{\lambda_n}{2} \|w\|_\mathcal{H}^2 = -\min_{\alpha > 0} \; \frac{1}{2\lambda_n} \Big\| \frac{1}{n} \sum_{j=1}^n \Phi(y_j) - \sum_{i=1}^n \alpha_i \Phi(x_i) \Big\|_\mathcal{H}^2 + \sum_{i=1}^n \big( \alpha_i \log(n\alpha_i) - \alpha_i \big).$$
Letting $\hat{\alpha}$ be the solution of the above convex program, the KL divergence can be estimated by
$$\hat{D}_K = 1 + \sum_{i=1}^n \Big( \hat{\alpha}_i \log \hat{\alpha}_i + \hat{\alpha}_i \log \frac{n}{e} \Big).$$
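To make the dual of Lemma 3 concrete, here is a minimal numerical sketch, not from the original paper, that solves the box-constrained dual in $\alpha$ with an off-the-shelf quasi-Newton solver and returns the plug-in estimate $\hat{D}_K = -\frac{1}{n} \sum_j \log n\hat{\alpha}_j$. It assumes equal sample sizes $n$ from P and Q (as in the paper) and inputs of shape (n, d); the Gaussian kernel width, the $\lambda_n = 1/n$ default, and the small positive lower bound on $\alpha$ are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(X, Z, sigma):
    """Gram matrix for the Gaussian kernel K(x, z) = exp(-||x - z||^2 / sigma)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma)

def kl_estimate_M1(x_q, y_p, sigma=1.0, lam=None):
    """Solve the dual of Lemma 3 (G an RKHS) and return (KL estimate, alpha)."""
    n = len(x_q)
    lam = 1.0 / n if lam is None else lam          # lambda_n ~ 1/n, as used in Sec. 6
    Kxx = gaussian_gram(x_q, x_q, sigma)
    Kyy = gaussian_gram(y_p, y_p, sigma)
    Kyx = gaussian_gram(y_p, x_q, sigma)
    const = Kxx.sum() / n**2                       # (1/n^2) sum_{i,j} K(x_i, x_j)
    mean_Kyx = Kyx.mean(axis=1)                    # (1/n) sum_i K(y_j, x_i)

    def dual(alpha):
        quad = alpha @ Kyy @ alpha - 2 * alpha @ mean_Kyx + const
        return -1.0 - np.log(n * alpha).mean() + quad / (2 * lam)

    def grad(alpha):
        return -1.0 / (n * alpha) + (Kyy @ alpha - mean_Kyx) / lam

    res = minimize(dual, np.full(n, 1.0 / n), jac=grad,
                   bounds=[(1e-10, None)] * n, method="L-BFGS-B")
    alpha = res.x
    return -np.log(n * alpha).mean(), alpha        # D_hat_K = -(1/n) sum_j log(n * alpha_j)
```

Given $\hat{\alpha}$, the fitted ratio can be evaluated as $\hat{g}_n(x) = \frac{1}{\lambda_n} \big( \sum_j \hat{\alpha}_j K(y_j, x) - \frac{1}{n} \sum_i K(x_i, x) \big)$, mirroring the expression for $\hat{w}$ above.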
5 Proof of Theorem 2

We now sketch the proof of the main theorem. The key to our analysis is the following lemma:

Lemma 4. If $\hat{g}_n$ is the estimate defined by (6), then
$$\frac{1}{4} h_Q^2(g_0, \hat{g}_n) + \frac{\lambda_n}{2} I^2(\hat{g}_n) \le -\int \frac{\hat{g}_n - g_0}{2} \, d(Q_n - Q) + \int \log \frac{\hat{g}_n + g_0}{2 g_0} \, d(P_n - P) + \frac{\lambda_n}{2} I^2(g_0).$$

Proof. Define $d_l(g_0, g) = \int (g - g_0) \, dQ - \int \log \frac{g}{g_0} \, dP$. Note that for $x > 0$, $\frac{1}{2} \log x \le \sqrt{x} - 1$. Thus, $\int \log \frac{g}{g_0} \, dP \le 2 \int \big( g^{1/2} g_0^{-1/2} - 1 \big) \, dP$. As a result, for any $g$, $d_l$ is related to $h_Q$ as follows:
$$d_l(g_0, g) \ge \int (g - g_0) \, dQ - 2 \int \big( g^{1/2} g_0^{-1/2} - 1 \big) \, dP = \int (g - g_0) \, dQ - 2 \int \big( g^{1/2} g_0^{1/2} - g_0 \big) \, dQ = \int \big( g^{1/2} - g_0^{1/2} \big)^2 \, dQ = 2 h_Q^2(g_0, g).$$
By the definition (6) of our estimator, we have
$$\int \hat{g}_n \, dQ_n - \int \log \hat{g}_n \, dP_n + \frac{\lambda_n}{2} I^2(\hat{g}_n) \le \int g_0 \, dQ_n - \int \log g_0 \, dP_n + \frac{\lambda_n}{2} I^2(g_0).$$
Both sides (modulo the regularization term $I^2$) are convex functionals of $g$. By Jensen's inequality, if $F$ is a convex function, then $F\big(\frac{u+v}{2}\big) - F(v) \le \frac{F(u) - F(v)}{2}$. We obtain
$$\int \frac{\hat{g}_n + g_0}{2} \, dQ_n - \int \log \frac{\hat{g}_n + g_0}{2} \, dP_n + \frac{\lambda_n}{4} I^2(\hat{g}_n) \le \int g_0 \, dQ_n - \int \log g_0 \, dP_n + \frac{\lambda_n}{4} I^2(g_0).$$
Rearranging,
$$\int \frac{\hat{g}_n - g_0}{2} \, d(Q_n - Q) - \int \log \frac{\hat{g}_n + g_0}{2 g_0} \, d(P_n - P) + \frac{\lambda_n}{4} I^2(\hat{g}_n) \le -\int \frac{\hat{g}_n - g_0}{2} \, dQ + \int \log \frac{g_0 + \hat{g}_n}{2 g_0} \, dP + \frac{\lambda_n}{4} I^2(g_0) = -d_l\Big( g_0, \frac{g_0 + \hat{g}_n}{2} \Big) + \frac{\lambda_n}{4} I^2(g_0) \le -2 h_Q^2\Big( g_0, \frac{g_0 + \hat{g}_n}{2} \Big) + \frac{\lambda_n}{4} I^2(g_0) \le -\frac{1}{8} h_Q^2(g_0, \hat{g}_n) + \frac{\lambda_n}{4} I^2(g_0),$$
where the last inequality is a standard result for the (generalized) Hellinger distance (cf. [20]).

Let us now proceed to part (a) of the theorem. Define $f_g := \log \frac{g + g_0}{2 g_0}$, and let $\mathcal{F}_M := \{ f_g \mid g \in \mathcal{G}_M \}$. Since $f_g$ is a Lipschitz function of $g$, conditions (8) and (10) imply that
$$H_B^\delta(\mathcal{F}_M, L_2(P)) = O\big( (M/\delta)^\gamma \big). \qquad (13)$$
Applying Lemma 5.14 of [20] with the distance metric $d_2(g_0, g) = \|g - g_0\|_{L_2(Q)}$, the following holds under Q (and so under P as well, since $dP/dQ$ is bounded from above):
$$\sup_{g \in \mathcal{G}} \frac{\big| \int (g - g_0) \, d(Q_n - Q) \big|}{n^{-1/2} d_2(g_0, g)^{1 - \gamma/2} (1 + I(g) + I(g_0))^{\gamma/2} \vee n^{-\frac{2}{2+\gamma}} (1 + I(g) + I(g_0))} = O_P(1). \qquad (14)$$
In the same vein, we obtain that under the P measure:
$$\sup_{g \in \mathcal{G}} \frac{\big| \int f_g \, d(P_n - P) \big|}{n^{-1/2} d_2(g_0, g)^{1 - \gamma/2} (1 + I(g) + I(g_0))^{\gamma/2} \vee n^{-\frac{2}{2+\gamma}} (1 + I(g) + I(g_0))} = O_P(1). \qquad (15)$$
By condition (9), we have
$$d_2(g_0, g) = \|g - g_0\|_{L_2(Q)} \le 2 c^{1/2} (1 + I(g) + I(g_0))^{1/2} h_Q(g_0, g).$$
Combining Lemma 4 and Eqs. (14), (15), we obtain
$$\frac{1}{4} h_Q^2(g_0, \hat{g}_n) + \frac{\lambda_n}{2} I^2(\hat{g}_n) - \frac{\lambda_n}{2} I^2(g_0) \le O_P\Big( n^{-1/2} h_Q(g_0, \hat{g}_n)^{1 - \gamma/2} (1 + I(\hat{g}_n) + I(g_0))^{1/2 + \gamma/4} \vee n^{-\frac{2}{2+\gamma}} (1 + I(\hat{g}_n) + I(g_0)) \Big). \qquad (16)$$
From this point, the proof involves simple algebraic manipulation of (16). To simplify notation, let $\hat{h} = h_Q(g_0, \hat{g}_n)$, $\hat{I} = I(\hat{g}_n)$, and $I_0 = I(g_0)$. There are four possibilities:

Case a. $\hat{h} \ge n^{-1/(2+\gamma)} (1 + \hat{I} + I_0)^{1/2}$ and $\hat{I} \ge 1 + I_0$. From (16), either
$$\hat{h}^2/4 + \lambda_n \hat{I}^2/2 \le O_P(n^{-1/2}) \, \hat{h}^{1 - \gamma/2} \, \hat{I}^{1/2 + \gamma/4} \quad \text{or} \quad \hat{h}^2/4 + \lambda_n \hat{I}^2/2 \le \lambda_n I_0^2/2,$$
which implies, respectively, either
$$\hat{h} \le \lambda_n^{-1/2} O_P(n^{-2/(2+\gamma)}), \quad \hat{I} \le \lambda_n^{-1} O_P(n^{-2/(2+\gamma)}) \qquad \text{or} \qquad \hat{I} \le O_P(I_0).$$
Both scenarios conclude the proof if we set $\lambda_n^{-1} = O_P(n^{2/(\gamma+2)}) (1 + I_0)$.

Case b. $\hat{h} \ge n^{-1/(2+\gamma)} (1 + \hat{I} + I_0)^{1/2}$ and $\hat{I} < 1 + I_0$. From (16), either
$$\hat{h}^2/4 + \lambda_n \hat{I}^2/2 \le O_P(n^{-1/2}) \, \hat{h}^{1 - \gamma/2} \, (1 + I_0)^{1/2 + \gamma/4} \quad \text{or} \quad \hat{h}^2/4 + \lambda_n \hat{I}^2/2 \le \lambda_n I_0^2/2,$$
which implies, respectively, either
$$\hat{h} \le (1 + I_0)^{1/2} O_P(n^{-1/(\gamma+2)}) \qquad \text{or} \qquad \hat{h} \le O_P(\lambda_n^{1/2} I_0), \quad \hat{I} \le 1 + I_0.$$
Both scenarios conclude the proof if we set $\lambda_n^{-1} = O_P(n^{2/(\gamma+2)}) (1 + I_0)$.

Case c. $\hat{h} \le n^{-1/(2+\gamma)} (1 + \hat{I} + I_0)^{1/2}$ and $\hat{I} \ge 1 + I_0$. From (16),
$$\hat{h}^2/4 + \lambda_n \hat{I}^2/2 \le O_P(n^{-2/(2+\gamma)}) \, \hat{I},$$
which implies that $\hat{h} \le O_P(n^{-1/(2+\gamma)}) \hat{I}^{1/2}$ and $\hat{I} \le \lambda_n^{-1} O_P(n^{-2/(2+\gamma)})$. This means that $\hat{h} \le O_P(\lambda_n^{1/2}) (1 + I_0)$ and $\hat{I} \le O_P(1 + I_0)$ if we set $\lambda_n^{-1} = O_P(n^{2/(2+\gamma)}) (1 + I_0)$.

Case d. $\hat{h} \le n^{-1/(2+\gamma)} (1 + \hat{I} + I_0)^{1/2}$ and $\hat{I} \le 1 + I_0$. Part (a) of the theorem is immediate.

Finally, part (b) is a simple consequence of part (a), using the same argument as in Thm. 9 of [15].

6 Simulation results

In this section, we describe the results of various simulations that demonstrate the practical viability of our estimators, as well as their convergence behavior. We experimented with our estimators using various choices of P and Q, including Gaussian, beta, mixture of Gaussians, and multivariate Gaussian distributions. Here we report results in terms of KL estimation error. For each of the eight estimation problems described here, we experiment with increasing sample sizes (the sample size $n$ ranges from 100 to $10^4$ or more). Error bars are obtained by replicating each set-up 250 times. For all simulations, we report our estimator's performance using the simple fixed rate $\lambda_n \sim 1/n$, noting that this may be a suboptimal rate.
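For readers who want to reproduce this kind of experiment, the following sketch, not part of the paper, shows one way such a run could be set up for a single univariate problem. It reuses the hypothetical kl_estimate_M1 helper from the earlier sketch in Section 4; the distributions, kernel width, and sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_p, sd_p, mu_q, sd_q = 0.0, 1.0, 1.0, 2.0
kl_true = np.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q)**2) / (2 * sd_q**2) - 0.5

for n in [100, 200, 500, 1000]:
    x_q = rng.normal(mu_q, sd_q, size=(n, 1))   # samples from Q
    y_p = rng.normal(mu_p, sd_p, size=(n, 1))   # samples from P
    # lambda_n ~ 1/n as in the reported simulations; the kernel width is an illustrative choice
    d_hat, _ = kl_estimate_M1(x_q, y_p, sigma=0.1, lam=1.0 / n)
    print(f"n = {n:5d}   KL estimate = {d_hat:.3f}   (true value {kl_true:.3f})")
```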
We set the kernel width $\sigma$ to be relatively small ($\sigma = .1$) for one-dimensional data, and larger for higher dimensions. We use M1 to denote the method in which $\mathcal{G}$ is the RKHS, and M2 for the method in which $\log \mathcal{G}$ is the RKHS. Our methods are compared to algorithm A in Wang et al. [22], which was shown empirically to be one of the best methods in the literature. Their method, denoted by WKV, is based on data-dependent partitioning of the covariate space. Naturally, the performance of WKV depends critically on the amount $s$ of data allocated to each partition; here we report results with $s = n^{1/3}, n^{1/2}, n^{2/3}$.

[Figure 1 appears here: eight panels plotting KL estimates against sample size, titled "Estimate of KL(Beta(1,2), Unif[0,1])", "Estimate of KL($\frac{1}{2}N_t(0,1) + \frac{1}{2}N_t(1,1)$, Unif[-5,5])", "Estimate of KL($N_t(0,1)$, $N_t(4,2)$)", "Estimate of KL($N_t(4,2)$, $N_t(0,1)$)", "Estimate of KL($N_t(0,I_2)$, Unif$[-3,3]^2$)", "Estimate of KL($N_t(0,I_2)$, $N_t(1,I_2)$)", "Estimate of KL($N_t(0,I_3)$, Unif$[-3,3]^3$)", and "Estimate of KL($N_t(0,I_3)$, $N_t(1,I_3)$)". Each panel shows the M1 and M2 estimates, the WKV estimates for several partition-size exponents, and the true divergence value.]

Figure 1. Results of estimating KL divergences for various choices of probability distributions. In all plots, the X-axis is the number of data points plotted on a log scale, and the Y-axis is the estimated value. Error bars are obtained by replicating each experiment 250 times. $N_t(a, I_k)$ denotes a truncated normal distribution in $k$ dimensions with mean $(a, \ldots, a)$ and identity covariance matrix.

The first four plots present results with univariate distributions. In the first two, our estimators M1 and M2 appear to have a faster convergence rate than WKV. The WKV estimator performs very well in the third example, but rather badly in the fourth. The next four plots present results with two- and three-dimensional data. Again, M1 has the best convergence rate in all examples. The M2 estimator does not converge in the last example, suggesting that the underlying function class exhibits very strong bias. The WKV methods have weak convergence rates despite different choices of the partition sizes. It is worth noting that as the number of dimensions increases, histogram-based methods such as WKV become increasingly difficult to implement, whereas increasing dimension has only a mild effect on our method.

References

[1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Royal Stat. Soc. Series B, 28:131–142, 1966.
[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
[3] P. Bickel and Y. Ritov. Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhyā Ser. A, 50:381–393, 1988.
[4] L. Birgé and P. Massart. Estimation of integral functionals of a density. Ann. Statist., 23(1):11–29, 1995.
[5] M. Broniatowski and A. Keziou. Parametric estimation and tests through divergences. Technical report, LSTA, Université Pierre et Marie Curie, 2004.
[6] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar., 2:299–318, 1967.
[7] L. Györfi and E. C. van der Meulen. Density-free convergence properties of various estimators of entropy. Computational Statistics and Data Analysis, 5:425–436, 1987.
[8] P. Hall and S. Morton. On estimation of entropy. Ann. Inst. Statist. Math., 45(1):69–88, 1993.
[9] I. A. Ibragimov and R. Z. Khasminskii. On the nonparametric estimation of functionals. In Symposium in Asymptotic Statistics, pages 41–52, 1978.
[10] H. Joe. Estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math., 41:683–697, 1989.
[11] A. Keziou. Dual representation of $\phi$-divergences and applications. C. R. Acad. Sci. Paris, Ser. I 336, pages 857–862, 2003.
[12] B. Laurent. Efficient estimation of integral functionals of a density. Ann. Statist., 24(2):659–681, 1996.
[13] B. Ya. Levit. Asymptotically efficient estimation of nonlinear functionals. Problems Inform. Transmission, 14:204–209, 1978.
[14] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On divergences, surrogate losses and decentralized detection. Technical Report 695, Department of Statistics, UC Berkeley, October 2005.
[15] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Nonparametric estimation of the likelihood ratio and divergence functionals. In International Symposium on Information Theory (ISIT), 2007.
[16] G. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.
[17] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman, Harlow, UK, 1988.
[18] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[19] F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46:1602–1609, 2000.
[20] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
[21] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, NY, 1996.
[22] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064–3074, 2005.
[23] D. X. Zhou. The covering number in learning theory. Journal of Complexity, 18:739–767, 2002.