Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

Masashi Sugiyama, Tokyo Institute of Technology, sugi@cs.titech.ac.jp
Hisashi Kashima, IBM Research, hkashima@jp.ibm.com
Shinichi Nakajima, Nikon Corporation, nakajima.s@nikon.co.jp
Motoaki Kawanabe, Fraunhofer FIRST, nabe@first.fhg.de
Paul von Bünau, Technical University Berlin, buenau@cs.tu-berlin.de

Abstract

When training and test samples follow different input distributions (i.e., the situation called covariate shift), the maximum likelihood estimator is known to lose its consistency. For regaining consistency, the log-likelihood terms need to be weighted according to the importance (i.e., the ratio of test and training input densities). Thus, accurately estimating the importance is one of the key tasks in covariate shift adaptation. A naive approach is to first estimate training and test input densities and then estimate the importance by the ratio of the density estimates. However, since density estimation is a hard problem, this approach tends to perform poorly, especially in high-dimensional cases. In this paper, we propose a direct importance estimation method that does not require the input density estimates. Our method is equipped with a natural model selection procedure, so tuning parameters such as the kernel width can be objectively optimized. This is an advantage over a recently developed method of direct importance estimation. Simulations illustrate the usefulness of our approach.

1 Introduction

A common assumption in supervised learning is that training and test samples follow the same distribution. However, this basic assumption is often violated in practice, and then standard machine learning methods do not work as desired. A situation where the input distribution $p(x)$ is different in the training and test phases but the conditional distribution of output values $p(y \mid x)$ remains unchanged is called covariate shift [7]. In many real-world applications such as robot control [6], bioinformatics [1], brain-computer interfacing [9], or econometrics [4], covariate shift is conceivable, and thus learning under covariate shift is gathering a lot of attention these days (e.g., the NIPS2006 workshop).

The influence of covariate shift could be alleviated by weighting the log-likelihood terms according to the importance $w(x) = p_{te}(x)/p_{tr}(x)$ [7, 10], where $p_{te}(x)$ and $p_{tr}(x)$ are the test and training input densities. Since the importance is usually unknown, the key issue of covariate shift adaptation is how to accurately estimate the importance from samples.¹

Footnote 1: Covariate shift matters in parameter learning only when the model used for function learning is misspecified (i.e., the model is so simple that the true learning target function cannot be expressed) [7]; when the model is correctly (or overly) specified, ordinary maximum likelihood estimation is still consistent. Following this fact, there is a criticism that importance weighting is not needed and that the use of a complex enough model can settle the problem. However, too complex models result in huge variance, and thus we practically need to choose a complex enough but not too complex model. For choosing such an appropriate model, we usually use a model selection technique such as cross validation (CV). However, the ordinary CV score is heavily biased due to covariate shift, and we need to also importance-weight the CV score (or any other model selection criterion) for unbiasedness [7, 10, 8]. For this reason, estimating the importance is indispensable when covariate shift occurs.
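To make the weighting idea above concrete, the following minimal numpy sketch contrasts an ordinary least-squares fit (maximum likelihood under Gaussian noise) with an importance-weighted fit for a deliberately misspecified constant model. The densities, the shift, and all numerical values are illustrative assumptions, not part of the paper's experiments.

    import numpy as np

    rng = np.random.default_rng(0)

    # Training and test input densities (assumed known here only for illustration).
    def p_tr(x):  # standard normal
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    def p_te(x):  # normal with mean 0.5
        return np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)

    def importance(x):
        # w(x) = p_te(x) / p_tr(x)
        return p_te(x) / p_tr(x)

    # Misspecified model: fit a constant to the non-constant target f(x) = x.
    x_tr = rng.normal(0.0, 1.0, 200)
    y_tr = x_tr + 0.1 * rng.normal(size=200)

    # Ordinary least squares vs. importance-weighted least squares: each
    # squared-error (log-likelihood) term is weighted by w(x_i).
    w = importance(x_tr)
    theta_ml = y_tr.mean()                    # unweighted fit
    theta_iw = np.sum(w * y_tr) / np.sum(w)   # importance-weighted fit
    print(theta_ml, theta_iw)

Because the constant model cannot represent the target, the unweighted fit concentrates on the training region, while the weighted fit is pulled toward the region emphasized by $p_{te}(x)$.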
A naive approach to importance estimation would be to first estimate the training and test densities separately from the training and test input samples, and then estimate the importance by taking the ratio of the estimated densities. However, density estimation is known to be a hard problem, particularly in high-dimensional cases. Therefore, this naive approach may not be effective; directly estimating the importance without going through density estimation would be more promising.

Following this spirit, the kernel mean matching (KMM) method has been proposed recently [5], which directly estimates the importance based on a special property of universal reproducing kernel Hilbert spaces. KMM is shown to work well if tuning parameters such as the kernel width are chosen appropriately. Intuitively, model selection of importance estimation algorithms seems possible by ordinary cross validation (CV) over the performance of subsequent learning algorithms. However, this is highly unreliable since CV is heavily biased under covariate shift; for an unbiased estimate of the performance of subsequent learning algorithms, the CV procedure itself needs to be importance-weighted [10, 8]. Since the importance weight has to have been fixed when model selection is carried out by importance-weighted CV, it cannot be used for model selection of importance estimation algorithms.²

Footnote 2: Once the importance weight has been fixed, importance-weighted CV can be used for model selection of subsequent learning algorithms.

The above fact implies that model selection of importance estimation algorithms should be performed within the importance estimation step in an unsupervised manner. However, since KMM can only estimate the values of the importance at the training input points, it cannot be directly applied in the CV framework; an out-of-sample extension is needed, but this seems to be an open research issue currently.

In this paper, we propose a new importance estimation method which can overcome the above problems, i.e., the proposed method directly estimates the importance without density estimation and is equipped with a natural model selection procedure. Our basic idea is to find $\widehat{w}(x)$ such that the Kullback-Leibler divergence from $p_{te}(x)$ to its estimate $\widehat{p}_{te}(x) = \widehat{w}(x)\,p_{tr}(x)$ is minimized. We propose an algorithm that can carry out this minimization without explicitly modeling $p_{tr}(x)$ and $p_{te}(x)$. We call the proposed method the Kullback-Leibler Importance Estimation Procedure (KLIEP). The optimization problem involved in KLIEP is convex, so the unique global solution can be obtained. Furthermore, the solution tends to be sparse, which contributes to reducing the computational cost in the test phase.

Since KLIEP is based on the minimization of the Kullback-Leibler divergence, its model selection can be naturally carried out through a variant of likelihood CV, which is a popular technique in density estimation [3]. A key advantage of our CV procedure is that not the training samples but the test input samples are cross-validated. This highly contributes to improving the accuracy since the number of training samples is typically limited while test input samples are abundantly available. The usefulness of KLIEP is demonstrated through simulations.

2 New Importance Estimation Method

In this section, we propose a new importance estimation method.

2.1 Formulation and Notation

Let $\mathcal{D}$ ($\subset \mathbb{R}^d$) be the input domain and suppose we are given i.i.d. training input samples $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ from a training input distribution with density $p_{tr}(x)$ and i.i.d. test input samples $\{x^{te}_j\}_{j=1}^{n_{te}}$ from a test input distribution with density $p_{te}(x)$. We assume that $p_{tr}(x) > 0$ for all $x \in \mathcal{D}$. Typically, the number $n_{tr}$ of training samples is rather small, while the number $n_{te}$ of test input samples is very large. The goal of this paper is to develop a method of estimating the importance $w(x)$ from $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ and $\{x^{te}_j\}_{j=1}^{n_{te}}$:³

  $w(x) = \dfrac{p_{te}(x)}{p_{tr}(x)}$.    (1)

Footnote 3: Importance estimation is a pre-processing step of supervised learning tasks where training output samples $\{y^{tr}_i\}_{i=1}^{n_{tr}}$ at the training input points $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ are also available [7, 5, 8]. However, we do not use $\{y^{tr}_i\}_{i=1}^{n_{tr}}$ here since they are irrelevant to the importance.
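As a concrete illustration of this setup and of Eq. (1), the following sketch draws synthetic training and test inputs from two Gaussians and evaluates the true importance at the training points; the particular densities are assumptions made only so that $w(x)$ is available in closed form.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n_tr, n_te = 2, 100, 1000

    # Synthetic setting in the spirit of Section 4.1 (an assumption for illustration):
    # p_tr = N(0, I), p_te = N((1, 0, ..., 0), I).
    mu = np.zeros(d); mu[0] = 1.0
    x_tr = rng.normal(size=(n_tr, d))
    x_te = rng.normal(size=(n_te, d)) + mu

    def log_gaussian(x, mean):
        """Log-density of N(mean, I) at the rows of x."""
        diff = x - mean
        return -0.5 * np.sum(diff**2, axis=1) - 0.5 * x.shape[1] * np.log(2 * np.pi)

    # True importance w(x) = p_te(x) / p_tr(x) at the training points (Eq. (1));
    # it is known here only because the densities are synthetic.
    w_true = np.exp(log_gaussian(x_tr, mu) - log_gaussian(x_tr, np.zeros(d)))
    print(w_true[:5])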
2.2 Kullback-Leibler Importance Estimation Procedure (KLIEP)

Let us model the importance $w(x)$ by the following linear model:

  $\widehat{w}(x) = \sum_{l=1}^{b} \alpha_l \varphi_l(x)$,    (2)

where $\{\alpha_l\}_{l=1}^{b}$ are parameters to be learned from data samples and $\{\varphi_l(x)\}_{l=1}^{b}$ are basis functions such that $\varphi_l(x) \ge 0$ for all $x \in \mathcal{D}$ and for $l = 1, 2, \ldots, b$. Note that $b$ and $\{\varphi_l(x)\}_{l=1}^{b}$ could be dependent on the samples $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ and $\{x^{te}_j\}_{j=1}^{n_{te}}$, i.e., kernel models are also allowed. We explain how the basis functions $\{\varphi_l(x)\}_{l=1}^{b}$ are chosen in Section 2.3.

Using the model $\widehat{w}(x)$, we can estimate the test input density $p_{te}(x)$ by

  $\widehat{p}_{te}(x) = \widehat{w}(x)\, p_{tr}(x)$.    (3)

We learn the parameters $\{\alpha_l\}_{l=1}^{b}$ in the model (2) so that the Kullback-Leibler divergence from $p_{te}(x)$ to $\widehat{p}_{te}(x)$ is minimized:⁴

  $KL[p_{te}(x)\,\|\,\widehat{p}_{te}(x)] = \int_{\mathcal{D}} p_{te}(x) \log \dfrac{p_{te}(x)}{\widehat{w}(x)\, p_{tr}(x)}\, dx = \int_{\mathcal{D}} p_{te}(x) \log \dfrac{p_{te}(x)}{p_{tr}(x)}\, dx - \int_{\mathcal{D}} p_{te}(x) \log \widehat{w}(x)\, dx$.    (4)

Since the first term in the last expression is independent of $\{\alpha_l\}_{l=1}^{b}$, we ignore it and focus on the second term, which we denote by $J$:

  $J = \int_{\mathcal{D}} p_{te}(x) \log \widehat{w}(x)\, dx \approx \dfrac{1}{n_{te}} \sum_{j=1}^{n_{te}} \log \widehat{w}(x^{te}_j) = \dfrac{1}{n_{te}} \sum_{j=1}^{n_{te}} \log \Big( \sum_{l=1}^{b} \alpha_l \varphi_l(x^{te}_j) \Big)$,    (5)

where the empirical approximation based on the test input samples $\{x^{te}_j\}_{j=1}^{n_{te}}$ is used. This is our objective function to be maximized with respect to the parameters $\{\alpha_l\}_{l=1}^{b}$, and it is convex [2]. Note that the above objective function only involves the test input samples $\{x^{te}_j\}_{j=1}^{n_{te}}$, i.e., we did not use the training input samples $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ yet. As shown below, $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ will be used in the constraint.

$\widehat{w}(x)$ is an estimate of the importance $w(x)$, which is non-negative by definition. Therefore, we need to guarantee that $\widehat{w}(x) \ge 0$ for all $x \in \mathcal{D}$; this can be achieved by restricting

  $\alpha_l \ge 0$ for $l = 1, 2, \ldots, b$.    (6)

In addition to the non-negativity, $\widehat{w}(x)$ should be normalized so that its integral with respect to $x$ is one, since $\widehat{p}_{te}(x)$ ($= \widehat{w}(x)\, p_{tr}(x)$) is a probability density function:

  $1 = \int_{\mathcal{D}} \widehat{w}(x)\, p_{tr}(x)\, dx \approx \dfrac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \widehat{w}(x^{tr}_i) = \dfrac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \sum_{l=1}^{b} \alpha_l \varphi_l(x^{tr}_i)$,    (7)

where the empirical approximation based on the training input samples $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ is used. Due to the approximation error, the normalization constraint (7) is not generally satisfied exactly. However, this may not be critical in practice since the scale of the importance is often irrelevant in subsequent learning procedures [7, 5, 8].

Footnote 4: One may also consider an alternative scenario where the inverse importance weight $w^{-1}(x)$ is parameterized and the parameters are learned so that the Kullback-Leibler divergence from $p_{tr}(x)$ to $\widehat{p}_{tr}(x)$ ($= \widehat{w}^{-1}(x)\, p_{te}(x)$) is minimized. However, this inverse approach (iKLIEP) turns out to be less attractive in the model selection stage (see Section 2.3 for detail).
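The following sketch shows how the empirical score of Eq. (5) and the normalization term of Eq. (7) can be computed for a candidate parameter vector, given matrices of basis-function values; the Gaussian basis, the centers, and the kernel width in the toy example are illustrative assumptions.

    import numpy as np

    def kliep_objective_and_constraint(alpha, Phi_te, Phi_tr):
        """Empirical KLIEP score (Eq. (5)) and normalization term (Eq. (7)).

        Phi_te: (n_te, b) matrix with entries phi_l(x_te_j)
        Phi_tr: (n_tr, b) matrix with entries phi_l(x_tr_i)
        alpha:  (b,) non-negative parameter vector
        """
        w_te = Phi_te @ alpha                    # w_hat at the test points
        J = np.mean(np.log(w_te))                # Eq. (5)
        norm = np.mean(Phi_tr @ alpha)           # should equal 1 by Eq. (7)
        return J, norm

    # Tiny example with Gaussian basis functions centered at (hypothetical) points.
    rng = np.random.default_rng(2)
    x_tr = rng.normal(size=(50, 1))
    x_te = rng.normal(loc=0.5, size=(200, 1))
    centers = x_te[:10]                          # basis centers (see Section 2.3)
    sigma = 0.5

    def gaussian_basis(x, centers, sigma):
        sq = ((x[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        return np.exp(-sq / (2 * sigma**2))

    Phi_te = gaussian_basis(x_te, centers, sigma)
    Phi_tr = gaussian_basis(x_tr, centers, sigma)
    alpha0 = np.ones(centers.shape[0])
    alpha0 /= np.mean(Phi_tr @ alpha0)           # rescale to satisfy Eq. (7)
    print(kliep_objective_and_constraint(alpha0, Phi_te, Phi_tr))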
Now our optimization criterion is summarized as follows:

  maximize over $\{\alpha_l\}_{l=1}^{b}$:  $\sum_{j=1}^{n_{te}} \log \Big( \sum_{l=1}^{b} \alpha_l \varphi_l(x^{te}_j) \Big)$
  subject to  $\sum_{i=1}^{n_{tr}} \sum_{l=1}^{b} \alpha_l \varphi_l(x^{tr}_i) = n_{tr}$  and  $\alpha_1, \alpha_2, \ldots, \alpha_b \ge 0$.    (8)

This is a convex optimization problem and the global solution can be obtained, e.g., by simply performing gradient ascent and feasibility satisfaction iteratively.⁵ A pseudo code is described in Figure 1(a). Note that the solution $\{\widehat{\alpha}_l\}_{l=1}^{b}$ tends to be sparse [2], which contributes to reducing the computational cost in the test phase. We refer to the above method as the Kullback-Leibler Importance Estimation Procedure (KLIEP).

Footnote 5: If necessary, we may regularize the solution, e.g., by adding a penalty term to the objective function or by imposing an upper bound on the solution; the normalization constraint (7) may also be weakened by allowing a small deviation. These modifications are possible without sacrificing the convexity, but we do not go into the detail any further since the experimental performance seems good enough without them.

Figure 1: KLIEP algorithm in pseudo code.

(a) KLIEP main code
  Input: model m = {phi_l(x)}_{l=1}^{b}, training samples {x_tr_i}_{i=1}^{n_tr}, test samples {x_te_j}_{j=1}^{n_te}
  Output: w_hat(x)
  A(j, l) <- phi_l(x_te_j);
  b(l) <- (1/n_tr) * sum_i phi_l(x_tr_i);
  Initialize alpha (> 0) and eps (0 < eps << 1);
  Repeat until convergence
    alpha <- alpha + eps * A' * (1 ./ (A * alpha));   % gradient ascent
    alpha <- alpha + (1 - b' * alpha) * b / (b' * b); % feasibility satisfaction
    alpha <- max(0, alpha);
    alpha <- alpha / (b' * alpha);
  end
  w_hat(x) <- sum_l alpha_l * phi_l(x);

(b) KLIEP with model selection
  Input: model candidates M = {m}, training samples {x_tr_i}_{i=1}^{n_tr}, test samples {x_te_j}_{j=1}^{n_te}
  Output: w_hat(x)
  Split {x_te_j}_{j=1}^{n_te} into R disjoint subsets {X_r}_{r=1}^{R};
  for each model m in M
    for each split r = 1, 2, ..., R
      w_hat_r(x) <- KLIEP(m, {x_tr_i}_{i=1}^{n_tr}, {X_j}_{j != r});
      J_hat_r(m) <- (1/|X_r|) * sum_{x in X_r} log w_hat_r(x);
    end
    J_hat(m) <- (1/R) * sum_{r=1}^{R} J_hat_r(m);
  end
  m_hat <- argmax_{m in M} J_hat(m);
  w_hat(x) <- KLIEP(m_hat, {x_tr_i}_{i=1}^{n_tr}, {x_te_j}_{j=1}^{n_te});

2.3 Model Selection by Likelihood Cross Validation

The performance of KLIEP depends on the choice of the basis functions $\{\varphi_l(x)\}_{l=1}^{b}$. Here we explain how they can be appropriately chosen from data samples.

Since KLIEP is based on the maximization of the score $J$ (see Eq. (5)), it would be natural to select the model such that $J$ is maximized. The expectation over $p_{te}(x)$ involved in $J$ can be numerically approximated by CV as follows: First, divide the test samples $\{x^{te}_j\}_{j=1}^{n_{te}}$ into $R$ disjoint subsets $\{\mathcal{X}_r\}_{r=1}^{R}$. Then obtain an importance estimate $\widehat{w}_r(x)$ from $\{\mathcal{X}_j\}_{j \ne r}$ and approximate the score $J$ using $\mathcal{X}_r$ as

  $\widehat{J}_r = \dfrac{1}{|\mathcal{X}_r|} \sum_{x \in \mathcal{X}_r} \log \widehat{w}_r(x)$.    (9)

This procedure is repeated for $r = 1, 2, \ldots, R$, and the model that maximizes the average score $\widehat{J} = \frac{1}{R} \sum_{r=1}^{R} \widehat{J}_r$ is chosen. A pseudo code of the CV procedure is summarized in Figure 1(b).

One of the potential limitations of CV in general is that it is not reliable when the number of samples is small, since data splitting by CV further reduces the sample size. On the other hand, in our CV procedure, the data splitting is performed over the test input samples, not over the training samples. Since we typically have a large number of test input samples, we do not suffer from the small sample problem. Therefore, our CV procedure is accurate and useful in model selection.

A good model may be chosen by the above CV procedure, given that a set of promising model candidates is prepared. As model candidates, we propose using a Gaussian kernel model centered at the test input points $\{x^{te}_j\}_{j=1}^{n_{te}}$, i.e.,

  $\widehat{w}(x) = \sum_{l=1}^{n_{te}} \alpha_l K_\sigma(x, x^{te}_l)$,    (10)

where $K_\sigma(x, x')$ is the Gaussian kernel with kernel width $\sigma$:

  $K_\sigma(x, x') = \exp\Big( -\dfrac{\|x - x'\|^2}{2\sigma^2} \Big)$.    (11)
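A compact numpy sketch of the iteration in Figure 1(a), using the Gaussian kernel model of Eqs. (10)-(11), is given below; the step size, the number of iterations, and the data in the usage example are our assumptions rather than the authors' implementation.

    import numpy as np

    def gaussian_kernel(x, centers, sigma):
        """K_sigma(x, c) = exp(-||x - c||^2 / (2 sigma^2)) for all pairs."""
        sq = ((x[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        return np.exp(-sq / (2 * sigma**2))

    def kliep(x_tr, x_te, centers, sigma, eps=1e-4, n_iter=5000):
        """KLIEP main iteration (Figure 1(a)); eps and n_iter are our assumptions."""
        A = gaussian_kernel(x_te, centers, sigma)               # A[j, l] = phi_l(x_te_j)
        b = gaussian_kernel(x_tr, centers, sigma).mean(axis=0)  # b_l as in Eq. (7)
        alpha = np.ones(centers.shape[0])
        alpha /= b @ alpha                                      # start feasible
        for _ in range(n_iter):
            alpha = alpha + eps * A.T @ (1.0 / (A @ alpha))     # gradient ascent on Eq. (8)
            alpha = alpha + (1.0 - b @ alpha) * b / (b @ b)     # feasibility satisfaction
            alpha = np.maximum(0.0, alpha)                      # enforce Eq. (6)
            alpha = alpha / (b @ alpha)                         # renormalize (Eq. (7))
        def w_hat(x):
            return gaussian_kernel(x, centers, sigma) @ alpha
        return w_hat, alpha

    # Usage on synthetic data (illustrative assumptions).
    rng = np.random.default_rng(3)
    x_tr = rng.normal(size=(100, 2))
    x_te = rng.normal(size=(1000, 2)) + np.array([1.0, 0.0])
    centers = x_te[rng.choice(len(x_te), size=100, replace=False)]
    w_hat, alpha = kliep(x_tr, x_te, centers, sigma=1.0)
    print(w_hat(x_tr)[:5])

In practice, the kernel width sigma (and the choice of centers) would be selected by the likelihood CV procedure of Figure 1(b) rather than fixed by hand as above.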
The reason why we chose the test input points $\{x^{te}_j\}_{j=1}^{n_{te}}$ as the Gaussian centers, not the training input points $\{x^{tr}_i\}_{i=1}^{n_{tr}}$, is as follows. By definition, the importance $w(x)$ tends to take large values if the training input density $p_{tr}(x)$ is small and the test input density $p_{te}(x)$ is large; conversely, $w(x)$ tends to be small if $p_{tr}(x)$ is large and $p_{te}(x)$ is small. When a function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large; on the other hand, only a small number of kernels would be enough in the region where the output of the target function is small. Following this heuristic, we decided to allocate many kernels at high test input density regions, which can be achieved by setting the Gaussian centers at the test input points $\{x^{te}_j\}_{j=1}^{n_{te}}$.

Alternatively, we may locate $(n_{tr} + n_{te})$ Gaussian kernels at both $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ and $\{x^{te}_j\}_{j=1}^{n_{te}}$. However, in our preliminary experiments, this did not further improve the performance but slightly increased the computational cost. Moreover, since $n_{te}$ is typically very large, just using all the test input points $\{x^{te}_j\}_{j=1}^{n_{te}}$ as Gaussian centers is already computationally rather demanding. To ease this problem, we practically propose using a subset of $\{x^{te}_j\}_{j=1}^{n_{te}}$ as Gaussian centers for computational efficiency, i.e.,

  $\widehat{w}(x) = \sum_{l=1}^{b} \alpha_l K_\sigma(x, c_l)$,    (12)

where $c_l$ is a template point randomly chosen from $\{x^{te}_j\}_{j=1}^{n_{te}}$ and $b$ ($\le n_{te}$) is a prefixed number. In the following, we set $b = \min(100, n_{te})$ and optimize the kernel width $\sigma$ by the above CV procedure.

3 Discussion

In this section, we discuss the relation between KLIEP and existing approaches.

3.1 Kernel Density Estimator and Likelihood Cross Validation

The kernel density estimator (KDE) is a non-parametric technique to estimate a density $p(x)$ from samples $\{x_i\}_{i=1}^{n}$ by

  $\widehat{p}(x) = \dfrac{1}{n\,(2\pi\sigma^2)^{d/2}} \sum_{i=1}^{n} K_\sigma(x, x_i)$,    (13)

where $K_\sigma(x, x')$ is a kernel function, e.g., the Gaussian kernel (11). The performance of KDE depends on the choice of the kernel width $\sigma$, which can be optimized by likelihood CV [3]: a subset of $\{x_i\}_{i=1}^{n}$ is used for density estimation and the rest is used for estimating the likelihood of the held-out samples. Note that likelihood CV corresponds to choosing $\sigma$ so that the Kullback-Leibler divergence from $p(x)$ to $\widehat{p}(x)$ is minimized.

KDE can be used for importance estimation by first estimating $p_{tr}(x)$ and $p_{te}(x)$ separately from $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ and $\{x^{te}_j\}_{j=1}^{n_{te}}$ and then estimating the importance by $\widehat{w}(x) = \widehat{p}_{te}(x)/\widehat{p}_{tr}(x)$. A potential limitation of this approach is that KDE is known to suffer from the curse of dimensionality [3], i.e., the number of samples needed to maintain the same approximation quality grows exponentially as the dimension of the input space increases. This is particularly critical when estimating $p_{tr}(x)$ since the number of training input samples is typically limited. Furthermore, model selection by likelihood CV is unreliable in such cases since data splitting in the CV procedure further reduces the sample size. Therefore, the KDE-based approach may not be reliable in high-dimensional cases.
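For comparison, the following sketch implements the KDE-based importance estimator described above, i.e., two separate Gaussian kernel density estimates whose ratio is taken at the query points; the kernel widths are fixed by hand here, whereas the text would tune them by likelihood CV.

    import numpy as np

    def kde_gaussian(x_query, x_samples, sigma):
        """Kernel density estimate (Eq. (13)) with a Gaussian kernel of width sigma."""
        n, d = x_samples.shape
        sq = ((x_query[:, None, :] - x_samples[None, :, :])**2).sum(axis=2)
        k = np.exp(-sq / (2 * sigma**2))
        return k.sum(axis=1) / (n * (2 * np.pi * sigma**2) ** (d / 2))

    def kde_importance(x, x_tr, x_te, sigma_tr, sigma_te):
        """KDE-based importance estimate w_hat(x) = p_hat_te(x) / p_hat_tr(x)."""
        return kde_gaussian(x, x_te, sigma_te) / kde_gaussian(x, x_tr, sigma_tr)

    # Usage with hand-picked widths (likelihood CV would choose them in practice).
    rng = np.random.default_rng(4)
    x_tr = rng.normal(size=(100, 2))
    x_te = rng.normal(size=(1000, 2)) + np.array([1.0, 0.0])
    print(kde_importance(x_tr, x_tr, x_te, sigma_tr=0.5, sigma_te=0.5)[:5])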
3.2 Kernel Mean Matching

The kernel mean matching (KMM) method avoids density estimation and directly gives an estimate of the importance at the training input points [5]. The basic idea of KMM is to find $w(x)$ such that the discrepancy between the means of nonlinearly transformed samples drawn from $p_{te}(x)$ and $p_{tr}(x)$ is minimized in some feature space $\mathcal{F}$:

  minimize over $w(x) \ge 0$:  $\Big\| \mathbb{E}_{x \sim p_{te}}[\Phi(x)] - \mathbb{E}_{x \sim p_{tr}}[w(x)\Phi(x)] \Big\|_{\mathcal{F}}$  subject to  $\mathbb{E}_{x \sim p_{tr}}[w(x)] = 1$.    (14)

It is shown that the solution of the above problem agrees with the true importance if $\mathcal{F}$ is a universal reproducing kernel Hilbert space. The Gaussian kernel (11) is known to induce a universal reproducing kernel Hilbert space, and an empirical version of the above problem is expressed by the following quadratic program:

  minimize over $\{w_i\}_{i=1}^{n_{tr}}$:  $\dfrac{1}{2} \sum_{i, i'=1}^{n_{tr}} w_i w_{i'} K_\sigma(x^{tr}_i, x^{tr}_{i'}) - \sum_{i=1}^{n_{tr}} w_i \kappa_i$
  subject to  $\Big| \sum_{i=1}^{n_{tr}} w_i - n_{tr} \Big| \le n_{tr}\epsilon$  and  $0 \le w_1, w_2, \ldots, w_{n_{tr}} \le B$,    (15)

where $\kappa_i = \frac{n_{tr}}{n_{te}} \sum_{j=1}^{n_{te}} K_\sigma(x^{tr}_i, x^{te}_j)$. The solution $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ is an estimate of the importance at the training input points $\{x^{tr}_i\}_{i=1}^{n_{tr}}$.

Since KMM does not require estimating the densities, it is expected to work well even in high-dimensional cases. However, the performance depends on the tuning parameters $B$, $\epsilon$, and $\sigma$, and they cannot be simply optimized, e.g., by CV, since estimates of the importance are available only at the training input points. Thus, an out-of-sample extension is needed to apply KMM in the CV framework, but this seems to be an open research issue currently.

4 Experiments

In this section, we compare the experimental performance of KLIEP and existing approaches.

4.1 Importance Estimation for Artificial Data Sets

Let $p_{tr}(x)$ be the $d$-dimensional Gaussian density with mean $(0, 0, \ldots, 0)^\top$ and identity covariance, and let $p_{te}(x)$ be the $d$-dimensional Gaussian density with mean $(1, 0, \ldots, 0)^\top$ and identity covariance. The task is to estimate the importance at the training input points:

  $w_i = w(x^{tr}_i) = \dfrac{p_{te}(x^{tr}_i)}{p_{tr}(x^{tr}_i)}$  for $i = 1, 2, \ldots, n_{tr}$.    (16)

We compare the following methods:

KLIEP($\sigma$): $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ are estimated by KLIEP with the model (12). The performance of KLIEP depends on the kernel width $\sigma$, so we test several different values of $\sigma$.

KLIEP(CV): The kernel width $\sigma$ in KLIEP is chosen by CV (see Section 2.3).

KDE(CV): $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ are estimated by KDE with the Gaussian kernel (11). The kernel widths are chosen by likelihood CV (see Section 3.1).

KMM($\sigma$): $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ are estimated by KMM (see Section 3.2). The performance of KMM depends on $B$, $\epsilon$, and $\sigma$. We set $B = 1000$ and $\epsilon = (\sqrt{n_{tr}} - 1)/\sqrt{n_{tr}}$ following [5], and test several different values of $\sigma$.

We fix the number of test input points to $n_{te} = 1000$ and consider the following two settings: (a) $n_{tr} = 100$ and $d = 1, 2, \ldots, 20$, and (b) $d = 10$ and $n_{tr} = 50, 60, \ldots, 150$. We run the experiments 100 times for each $d$, each $n_{tr}$, and each method, and evaluate the quality of the estimates $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ by the normalized mean squared error:

  $\mathrm{NMSE} = \dfrac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \Big( \dfrac{\widehat{w}_i}{\sum_{i'=1}^{n_{tr}} \widehat{w}_{i'}} - \dfrac{w_i}{\sum_{i'=1}^{n_{tr}} w_{i'}} \Big)^2$.    (17)

Figure 2: NMSEs averaged over 100 trials: (a) when the input dimension is changed and (b) when the training sample size is changed. The compared methods are KLIEP(2), KLIEP(4), KLIEP(7), KLIEP(20), KLIEP(CV), KDE(CV), KMM(0.01), KMM(1), KMM(10), and KMM(50). Error bars are omitted since they are reasonably small.

We set the number of folds in CV to $R = 5$ in all experiments. NMSEs averaged over 100 trials are plotted in Figure 2. Figure 2-(a) shows that the error of KDE(CV) sharply increases as the input dimension grows, while KLIEP and KMM tend to give much smaller errors than KDE. This would be the fruit of directly estimating the importance without estimating the densities. The performance of KLIEP and KMM depends on the kernel width $\sigma$, and KLIEP(CV) seems to work quite well. Figure 2-(b) shows that the errors of all methods tend to decrease as the number of training samples grows. Again, KLIEP and KMM give much smaller errors than KDE, and KLIEP(CV) is shown to work very well. Overall, KLIEP tends to outperform KDE and is more advantageous than KMM since it is equipped with an automatic model selection procedure.
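The sketch below reproduces one trial of this synthetic setting and evaluates an estimate with the NMSE of Eq. (17); it assumes the kliep() function from the sketch after Eq. (11) is in scope, and the kernel width passed to it is an arbitrary choice rather than the CV-selected one.

    import numpy as np

    def nmse(w_hat, w_true):
        """Normalized mean squared error of Eq. (17)."""
        a = w_hat / w_hat.sum()
        b = w_true / w_true.sum()
        return np.mean((a - b)**2)

    # One trial of the Section 4.1 setting (d = 10, n_tr = 100, n_te = 1000),
    # reusing kliep() from the earlier sketch (an assumed dependency).
    rng = np.random.default_rng(5)
    d, n_tr, n_te = 10, 100, 1000
    mu = np.zeros(d); mu[0] = 1.0
    x_tr = rng.normal(size=(n_tr, d))
    x_te = rng.normal(size=(n_te, d)) + mu

    # True importance at the training points (Eq. (16)); for these two Gaussians
    # the ratio simplifies to exp(x^T mu - ||mu||^2 / 2).
    w_true = np.exp(x_tr @ mu - 0.5 * mu @ mu)

    centers = x_te[rng.choice(n_te, size=100, replace=False)]
    w_hat_fn, _ = kliep(x_tr, x_te, centers, sigma=np.sqrt(d))  # sigma is an assumption
    print(nmse(w_hat_fn(x_tr), w_true))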
4.2 Covariate Shift Adaptation with Regression and Classification Benchmark Data Sets

Here we employ importance estimation methods for compensating for the influence of covariate shift in regression and classification benchmark problems (see Table 1). Each data set consists of input-output samples $\{(x_k, y_k)\}_k$. We normalize all the input samples $\{x_k\}_k$ into $[0, 1]^d$ and choose the test samples $\{(x^{te}_j, y^{te}_j)\}_{j=1}^{n_{te}}$ from the pool $\{(x_k, y_k)\}_k$ as follows: a sample $(x_k, y_k)$ is accepted as a test sample with probability $\min(1, 4(x^{(c)}_k)^2)$, where $x^{(c)}_k$ is the $c$-th element of $x_k$ and $c$ is randomly determined and fixed in each trial. We choose the training samples $\{(x^{tr}_i, y^{tr}_i)\}_{i=1}^{n_{tr}}$ uniformly from the rest. That is, in this experiment, the test input density tends to be lower than the training input density where $x^{(c)}$ is small. We set $n_{tr} = 100$ and $n_{te} = 500$ for all data sets.

We learn the target function by importance-weighted regularized least squares (IWRLS):

  $\widehat{f}(x) = \sum_{l=1}^{t} \widehat{\theta}_l K_\sigma(x, m_l)$  with  $\widehat{\theta} = (K^\top W K + \lambda I_t)^{-1} K^\top W y^{tr}$,    (18)

where $t = \min(100, n_{te})$, $K_\sigma(x, x')$ is the Gaussian kernel (11), $m_l$ is a template point randomly chosen from $\{x^{te}_j\}_{j=1}^{n_{te}}$, $K_{i,l} = K_\sigma(x^{tr}_i, m_l)$, $W = \mathrm{diag}(\widehat{w}_1, \widehat{w}_2, \ldots, \widehat{w}_{n_{tr}})$, $I_t$ is the $t$-dimensional identity matrix, and $y^{tr} = (y^{tr}_1, y^{tr}_2, \ldots, y^{tr}_{n_{tr}})^\top$. The solution $\widehat{\theta}$ minimizes the importance-weighted squared training error plus a quadratic regularizer. Note that plain RLS (i.e., uniform weighting) may not be consistent under covariate shift, while IWRLS with the true importance weights $\{w_i\}_{i=1}^{n_{tr}}$ is consistent [7].

The kernel width $\sigma$ and the regularization parameter $\lambda$ in IWRLS (18) are chosen by importance-weighted CV [8]: First, divide the training samples $\{(x^{tr}_i, y^{tr}_i)\}_{i=1}^{n_{tr}}$ into $R$ disjoint subsets $\{\mathcal{Z}_r\}_{r=1}^{R}$. Then learn a function $\widehat{f}_r(x)$ from $\{\mathcal{Z}_j\}_{j \ne r}$ and compute its importance-weighted mean test error for $\mathcal{Z}_r$:

  $\dfrac{1}{|\mathcal{Z}_r|} \sum_{(x, y) \in \mathcal{Z}_r} \widehat{w}(x)\, L\big(\widehat{f}_r(x), y\big)$,  where  $L(\widehat{y}, y) = (\widehat{y} - y)^2$ (Regression),  $L(\widehat{y}, y) = \tfrac{1}{2}\big(1 - \mathrm{sign}(\widehat{y}\, y)\big)$ (Classification).    (19)

This procedure is repeated for $r = 1, 2, \ldots, R$, and the values of $\sigma$ and $\lambda$ that minimize the average of the above mean test error over all $r$ are chosen. We set $R = 5$.

We run the experiments 100 times for each data set and evaluate the mean test error:

  $\mathrm{MTE} = \dfrac{1}{n_{te}} \sum_{j=1}^{n_{te}} L\big(\widehat{f}(x^{te}_j), y^{te}_j\big)$.    (20)

The results are summarized in Table 1, where 'NIW' denotes no importance weighting (i.e., uniform weighting).

Table 1: Mean test error averaged over 100 trials. The numbers in the brackets are the standard deviations. All the error values are normalized by that of 'NIW' (no importance weighting). For each data set, the best method and comparable ones based on the Wilcoxon signed rank test at the significance level 5% are indicated in bold face. The upper half are regression data sets taken from DELVE and the lower half are classification data sets taken from IDA. 'KMM($\sigma$)' denotes KMM with kernel width $\sigma$. Compared methods (columns): NIW, KLIEP(CV), KDE(CV), KMM(0.01), KMM(0.3), KMM(1). Data sets (rows): kin-8fh, kin-8fm, kin-8nh, kin-8nm, abalone (regression); image, ringnorm, twonorm, waveform (classification). Number of data sets on which each method is best or comparable to the best ('# of Bests'): NIW 2, KLIEP(CV) 7, KDE(CV) 1, KMM(0.01) 3, KMM(0.3) 6, KMM(1) 1.
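A minimal sketch of the IWRLS estimator of Eq. (18) is given below; the importance weights are left as placeholders (they would come from KLIEP or another estimator), and the kernel width, regularization parameter, and data are illustrative assumptions rather than the benchmark setup, since $\sigma$ and $\lambda$ would be tuned by the importance-weighted CV of Eq. (19).

    import numpy as np

    def iwrls(x_tr, y_tr, centers, w_hat, sigma, lam):
        """Importance-weighted regularized least squares (Eq. (18)).

        Returns f_hat(x) = sum_l theta_l K_sigma(x, m_l) with
        theta = (K^T W K + lam I)^(-1) K^T W y and W = diag(w_hat).
        """
        def K(a, b):
            sq = ((a[:, None, :] - b[None, :, :])**2).sum(axis=2)
            return np.exp(-sq / (2 * sigma**2))
        Kmat = K(x_tr, centers)
        W = np.diag(w_hat)
        theta = np.linalg.solve(Kmat.T @ W @ Kmat + lam * np.eye(centers.shape[0]),
                                Kmat.T @ W @ y_tr)
        return lambda x: K(x, centers) @ theta

    # Usage with placeholder weights; sigma and lam are assumptions.
    rng = np.random.default_rng(6)
    x_tr = rng.uniform(size=(100, 1))
    y_tr = np.sin(2 * np.pi * x_tr[:, 0]) + 0.1 * rng.normal(size=100)
    x_te = rng.uniform(size=(500, 1))**0.5        # a shifted test input distribution
    centers = x_te[:50]
    w_hat = np.ones(100)                          # replace with KLIEP estimates
    f_hat = iwrls(x_tr, y_tr, centers, w_hat, sigma=0.1, lam=1e-3)
    print(np.mean((f_hat(x_te) - np.sin(2 * np.pi * x_te[:, 0]))**2))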
The table shows that KLIEP(CV) compares favorably with NIW, implying that importance weighting combined with KLIEP(CV) is useful for improving the prediction performance under covariate shift. KLIEP(CV) works much better than KDE(CV); in fact, KDE(CV) tends to be worse than NIW, which may be due to the high dimensionality. We tested 10 different values of the kernel width $\sigma$ for KMM and reported three representative results in the table. KLIEP(CV) is comparable to or slightly better than KMM with the best kernel width. Given that KLIEP(CV) is equipped with an automatic model selection procedure, it is regarded as a promising method for covariate shift adaptation.

References

[1] P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, 1998.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[3] W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models. Springer Series in Statistics. Springer, Berlin, 2004.
[4] J. J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153-162, 1979.
[5] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
[6] C. R. Shelton. Importance Sampling for Reinforcement Learning with Multiple Objectives. PhD thesis, Massachusetts Institute of Technology, 2001.
[7] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
[8] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, May 2007.
[9] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan. Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767-791, 2002.
[10] B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, 2004. ACM Press.