Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

Masashi Sugiyama, Tokyo Institute of Technology, sugi@cs.titech.ac.jp
Hisashi Kashima, IBM Research, hkashima@jp.ibm.com
Shinichi Nakajima, Nikon Corporation, nakajima.s@nikon.co.jp
Motoaki Kawanabe, Fraunhofer FIRST, nabe@first.fhg.de
Paul von Bünau, Technical University Berlin, buenau@cs.tu-berlin.de

Abstract

When training and test samples follow different input distributions (i.e., the situation called covariate shift), the maximum likelihood estimator is known to lose its consistency. For regaining consistency, the log-likelihood terms need to be weighted according to the importance (i.e., the ratio of test and training input densities). Thus, accurately estimating the importance is one of the key tasks in covariate shift adaptation. A naive approach is to first estimate training and test input densities and then estimate the importance by the ratio of the density estimates. However, since density estimation is a hard problem, this approach tends to perform poorly, especially in high-dimensional cases. In this paper, we propose a direct importance estimation method that does not require the input density estimates. Our method is equipped with a natural model selection procedure, so tuning parameters such as the kernel width can be objectively optimized. This is an advantage over a recently developed method of direct importance estimation. Simulations illustrate the usefulness of our approach.

1 Introduction

A common assumption in supervised learning is that training and test samples follow the same distribution. However, this basic assumption is often violated in practice, and then standard machine learning methods do not work as desired. A situation where the input distribution $p(x)$ is different in the training and test phases but the conditional distribution of output values $p(y \mid x)$ remains unchanged is called covariate shift [7]. In many real-world applications such as robot control [6], bioinformatics [1], brain-computer interfacing [9], or econometrics [4], covariate shift is conceivable, and thus learning under covariate shift is gathering a lot of attention these days (e.g., the NIPS2006 workshop).

The influence of covariate shift could be alleviated by weighting the log-likelihood terms according to the importance $w(x) = p_{te}(x)/p_{tr}(x)$ [7, 10], where $p_{te}(x)$ and $p_{tr}(x)$ are the test and training input densities. Since the importance is usually unknown, the key issue of covariate shift adaptation is how to accurately estimate the importance from samples.¹

Footnote 1: Covariate shift matters in parameter learning only when the model used for function learning is misspecified (i.e., the model is so simple that the true learning target function cannot be expressed) [7]; when the model is correctly (or overly) specified, ordinary maximum likelihood estimation is still consistent. Following this fact, there is a criticism that importance weighting is not needed and that the use of a complex enough model can settle the problem. However, too complex models result in huge variance, and thus we practically need to choose a complex enough but not too complex model. For choosing such an appropriate model, we usually use a model selection technique such as cross validation (CV). However, the ordinary CV score is heavily biased due to covariate shift, and we need to also importance-weight the CV score (or any other model selection criterion) for unbiasedness [7, 10, 8]. For this reason, estimating the importance is indispensable when covariate shift occurs.
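To make the weighting idea above concrete, the following minimal numpy sketch contrasts an ordinary least-squares fit (maximum likelihood under Gaussian noise) with an importance-weighted fit for a deliberately misspecified constant model. The densities, the shift, and all numerical values are illustrative assumptions, not part of the paper's experiments.

    import numpy as np

    rng = np.random.default_rng(0)

    # Training and test input densities (assumed known here only for illustration).
    def p_tr(x):  # standard normal
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    def p_te(x):  # normal with mean 0.5
        return np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)

    def importance(x):
        # w(x) = p_te(x) / p_tr(x)
        return p_te(x) / p_tr(x)

    # Misspecified model: fit a constant to the non-constant target f(x) = x.
    x_tr = rng.normal(0.0, 1.0, 200)
    y_tr = x_tr + 0.1 * rng.normal(size=200)

    # Ordinary least squares vs. importance-weighted least squares: each
    # squared-error (log-likelihood) term is weighted by w(x_i).
    w = importance(x_tr)
    theta_ml = y_tr.mean()                    # unweighted fit
    theta_iw = np.sum(w * y_tr) / np.sum(w)   # importance-weighted fit
    print(theta_ml, theta_iw)

Because the constant model cannot represent the target, the unweighted fit concentrates on the training region, while the weighted fit is pulled toward the region emphasized by $p_{te}(x)$.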
A naive approach to importance estimation would be to first estimate the training and test densities separately from the training and test input samples, and then estimate the importance by taking the ratio of the estimated densities. However, density estimation is known to be a hard problem, particularly in high-dimensional cases. Therefore, this naive approach may not be effective; directly estimating the importance without going through density estimation would be more promising.

Following this spirit, the kernel mean matching (KMM) method has been proposed recently [5], which directly estimates the importance based on a special property of universal reproducing kernel Hilbert spaces. KMM is shown to work well if tuning parameters such as the kernel width are chosen appropriately. Intuitively, model selection of importance estimation algorithms seems possible by ordinary cross validation (CV) over the performance of subsequent learning algorithms. However, this is highly unreliable since CV is heavily biased under covariate shift; for an unbiased estimate of the performance of subsequent learning algorithms, the CV procedure itself needs to be importance-weighted [10, 8]. Since the importance weight has to have been fixed when model selection is carried out by importance-weighted CV, it cannot be used for model selection of importance estimation algorithms.²

Footnote 2: Once the importance weight has been fixed, importance-weighted CV can be used for model selection of subsequent learning algorithms.

The above fact implies that model selection of importance estimation algorithms should be performed within the importance estimation step in an unsupervised manner. However, since KMM can only estimate the values of the importance at the training input points, it cannot be directly applied in the CV framework; an out-of-sample extension is needed, but this seems to be an open research issue currently.

In this paper, we propose a new importance estimation method which can overcome the above problems, i.e., the proposed method directly estimates the importance without density estimation and is equipped with a natural model selection procedure. Our basic idea is to find $\widehat{w}(x)$ such that the Kullback-Leibler divergence from $p_{te}(x)$ to its estimate $\widehat{p}_{te}(x) = \widehat{w}(x)\,p_{tr}(x)$ is minimized. We propose an algorithm that can carry out this minimization without explicitly modeling $p_{tr}(x)$ and $p_{te}(x)$. We call the proposed method the Kullback-Leibler Importance Estimation Procedure (KLIEP). The optimization problem involved in KLIEP is convex, so the unique global solution can be obtained. Furthermore, the solution tends to be sparse, which contributes to reducing the computational cost in the test phase.

Since KLIEP is based on the minimization of the Kullback-Leibler divergence, its model selection can be naturally carried out through a variant of likelihood CV, which is a popular technique in density estimation [3]. A key advantage of our CV procedure is that not the training samples but the test input samples are cross-validated. This highly contributes to improving the accuracy since the number of training samples is typically limited while test input samples are abundantly available. The usefulness of KLIEP is demonstrated through simulations.

2 New Importance Estimation Method

In this section, we propose a new importance estimation method.

2.1 Formulation and Notation

Let $\mathcal{D}$ ($\subset \mathbb{R}^d$) be the input domain and suppose we are given i.i.d. training input samples $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ from a training input distribution with density $p_{tr}(x)$ and i.i.d. test input samples $\{x^{te}_j\}_{j=1}^{n_{te}}$ from a test input distribution with density $p_{te}(x)$. We assume that $p_{tr}(x) > 0$ for all $x \in \mathcal{D}$. Typically, the number $n_{tr}$ of training samples is rather small, while the number $n_{te}$ of test input samples is very large. The goal of this paper is to develop a method of estimating the importance $w(x)$ from $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ and $\{x^{te}_j\}_{j=1}^{n_{te}}$:³

  $w(x) = \dfrac{p_{te}(x)}{p_{tr}(x)}$.    (1)

Footnote 3: Importance estimation is a pre-processing step of supervised learning tasks where training output samples $\{y^{tr}_i\}_{i=1}^{n_{tr}}$ at the training input points $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ are also available [7, 5, 8]. However, we do not use $\{y^{tr}_i\}_{i=1}^{n_{tr}}$ here since they are irrelevant to the importance.
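As a concrete illustration of this setup and of Eq. (1), the following sketch draws synthetic training and test inputs from two Gaussians and evaluates the true importance at the training points; the particular densities are assumptions made only so that $w(x)$ is available in closed form.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n_tr, n_te = 2, 100, 1000

    # Synthetic setting in the spirit of Section 4.1 (an assumption for illustration):
    # p_tr = N(0, I), p_te = N((1, 0, ..., 0), I).
    mu = np.zeros(d); mu[0] = 1.0
    x_tr = rng.normal(size=(n_tr, d))
    x_te = rng.normal(size=(n_te, d)) + mu

    def log_gaussian(x, mean):
        """Log-density of N(mean, I) at the rows of x."""
        diff = x - mean
        return -0.5 * np.sum(diff**2, axis=1) - 0.5 * x.shape[1] * np.log(2 * np.pi)

    # True importance w(x) = p_te(x) / p_tr(x) at the training points (Eq. (1));
    # it is known here only because the densities are synthetic.
    w_true = np.exp(log_gaussian(x_tr, mu) - log_gaussian(x_tr, np.zeros(d)))
    print(w_true[:5])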
2.2 Kullback-Leibler Importance Estimation Procedure (KLIEP)

Let us model the importance $w(x)$ by the following linear model:

  $\widehat{w}(x) = \sum_{l=1}^{b} \alpha_l \varphi_l(x)$,    (2)

where $\{\alpha_l\}_{l=1}^{b}$ are parameters to be learned from data samples and $\{\varphi_l(x)\}_{l=1}^{b}$ are basis functions such that $\varphi_l(x) \ge 0$ for all $x \in \mathcal{D}$ and for $l = 1, 2, \ldots, b$. Note that $b$ and $\{\varphi_l(x)\}_{l=1}^{b}$ could be dependent on the samples $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ and $\{x^{te}_j\}_{j=1}^{n_{te}}$, i.e., kernel models are also allowed. We explain how the basis functions $\{\varphi_l(x)\}_{l=1}^{b}$ are chosen in Section 2.3.

Using the model $\widehat{w}(x)$, we can estimate the test input density $p_{te}(x)$ by

  $\widehat{p}_{te}(x) = \widehat{w}(x)\, p_{tr}(x)$.    (3)

We learn the parameters $\{\alpha_l\}_{l=1}^{b}$ in the model (2) so that the Kullback-Leibler divergence from $p_{te}(x)$ to $\widehat{p}_{te}(x)$ is minimized:⁴

  $KL[p_{te}(x)\,\|\,\widehat{p}_{te}(x)] = \int_{\mathcal{D}} p_{te}(x) \log \dfrac{p_{te}(x)}{\widehat{w}(x)\, p_{tr}(x)}\, dx = \int_{\mathcal{D}} p_{te}(x) \log \dfrac{p_{te}(x)}{p_{tr}(x)}\, dx - \int_{\mathcal{D}} p_{te}(x) \log \widehat{w}(x)\, dx$.    (4)

Since the first term in the last expression is independent of $\{\alpha_l\}_{l=1}^{b}$, we ignore it and focus on the second term, which we denote by $J$:

  $J = \int_{\mathcal{D}} p_{te}(x) \log \widehat{w}(x)\, dx \approx \dfrac{1}{n_{te}} \sum_{j=1}^{n_{te}} \log \widehat{w}(x^{te}_j) = \dfrac{1}{n_{te}} \sum_{j=1}^{n_{te}} \log \Big( \sum_{l=1}^{b} \alpha_l \varphi_l(x^{te}_j) \Big)$,    (5)

where the empirical approximation based on the test input samples $\{x^{te}_j\}_{j=1}^{n_{te}}$ is used. This is our objective function to be maximized with respect to the parameters $\{\alpha_l\}_{l=1}^{b}$, and it is convex [2]. Note that the above objective function only involves the test input samples $\{x^{te}_j\}_{j=1}^{n_{te}}$, i.e., we did not use the training input samples $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ yet. As shown below, $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ will be used in the constraint.

$\widehat{w}(x)$ is an estimate of the importance $w(x)$, which is non-negative by definition. Therefore, we need to guarantee that $\widehat{w}(x) \ge 0$ for all $x \in \mathcal{D}$; this can be achieved by restricting

  $\alpha_l \ge 0$ for $l = 1, 2, \ldots, b$.    (6)

In addition to the non-negativity, $\widehat{w}(x)$ should be normalized so that its integral with respect to $x$ is one, since $\widehat{p}_{te}(x)$ ($= \widehat{w}(x)\, p_{tr}(x)$) is a probability density function:

  $1 = \int_{\mathcal{D}} \widehat{w}(x)\, p_{tr}(x)\, dx \approx \dfrac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \widehat{w}(x^{tr}_i) = \dfrac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \sum_{l=1}^{b} \alpha_l \varphi_l(x^{tr}_i)$,    (7)

where the empirical approximation based on the training input samples $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ is used. Due to the approximation error, the normalization constraint (7) is not generally satisfied exactly. However, this may not be critical in practice since the scale of the importance is often irrelevant in subsequent learning procedures [7, 5, 8].

Footnote 4: One may also consider an alternative scenario where the inverse importance weight $w^{-1}(x)$ is parameterized and the parameters are learned so that the Kullback-Leibler divergence from $p_{tr}(x)$ to $\widehat{p}_{tr}(x)$ ($= \widehat{w}^{-1}(x)\, p_{te}(x)$) is minimized. However, this inverse approach (iKLIEP) turns out to be less attractive in the model selection stage (see Section 2.3 for detail).
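The following sketch shows how the empirical score of Eq. (5) and the normalization term of Eq. (7) can be computed for a candidate parameter vector, given matrices of basis-function values; the Gaussian basis, the centers, and the kernel width in the toy example are illustrative assumptions.

    import numpy as np

    def kliep_objective_and_constraint(alpha, Phi_te, Phi_tr):
        """Empirical KLIEP score (Eq. (5)) and normalization term (Eq. (7)).

        Phi_te: (n_te, b) matrix with entries phi_l(x_te_j)
        Phi_tr: (n_tr, b) matrix with entries phi_l(x_tr_i)
        alpha:  (b,) non-negative parameter vector
        """
        w_te = Phi_te @ alpha                    # w_hat at the test points
        J = np.mean(np.log(w_te))                # Eq. (5)
        norm = np.mean(Phi_tr @ alpha)           # should equal 1 by Eq. (7)
        return J, norm

    # Tiny example with Gaussian basis functions centered at (hypothetical) points.
    rng = np.random.default_rng(2)
    x_tr = rng.normal(size=(50, 1))
    x_te = rng.normal(loc=0.5, size=(200, 1))
    centers = x_te[:10]                          # basis centers (see Section 2.3)
    sigma = 0.5

    def gaussian_basis(x, centers, sigma):
        sq = ((x[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        return np.exp(-sq / (2 * sigma**2))

    Phi_te = gaussian_basis(x_te, centers, sigma)
    Phi_tr = gaussian_basis(x_tr, centers, sigma)
    alpha0 = np.ones(centers.shape[0])
    alpha0 /= np.mean(Phi_tr @ alpha0)           # rescale to satisfy Eq. (7)
    print(kliep_objective_and_constraint(alpha0, Phi_te, Phi_tr))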
Now our optimization criterion is summarized as follows:

  maximize over $\{\alpha_l\}_{l=1}^{b}$:  $\sum_{j=1}^{n_{te}} \log \Big( \sum_{l=1}^{b} \alpha_l \varphi_l(x^{te}_j) \Big)$
  subject to  $\sum_{i=1}^{n_{tr}} \sum_{l=1}^{b} \alpha_l \varphi_l(x^{tr}_i) = n_{tr}$  and  $\alpha_1, \alpha_2, \ldots, \alpha_b \ge 0$.    (8)

This is a convex optimization problem and the global solution can be obtained, e.g., by simply performing gradient ascent and feasibility satisfaction iteratively.⁵ A pseudo code is described in Figure 1(a). Note that the solution $\{\widehat{\alpha}_l\}_{l=1}^{b}$ tends to be sparse [2], which contributes to reducing the computational cost in the test phase. We refer to the above method as the Kullback-Leibler Importance Estimation Procedure (KLIEP).

Footnote 5: If necessary, we may regularize the solution, e.g., by adding a penalty term to the objective function or by imposing an upper bound on the solution; the normalization constraint (7) may also be weakened by allowing a small deviation. These modifications are possible without sacrificing the convexity, but we do not go into the detail any further since the experimental performance seems good enough without them.

Figure 1: KLIEP algorithm in pseudo code.

(a) KLIEP main code
  Input: model m = {phi_l(x)}_{l=1}^{b}, training samples {x_tr_i}_{i=1}^{n_tr}, test samples {x_te_j}_{j=1}^{n_te}
  Output: w_hat(x)
  A(j, l) <- phi_l(x_te_j);
  b(l) <- (1/n_tr) * sum_i phi_l(x_tr_i);
  Initialize alpha (> 0) and eps (0 < eps << 1);
  Repeat until convergence
    alpha <- alpha + eps * A' * (1 ./ (A * alpha));   % gradient ascent
    alpha <- alpha + (1 - b' * alpha) * b / (b' * b); % feasibility satisfaction
    alpha <- max(0, alpha);
    alpha <- alpha / (b' * alpha);
  end
  w_hat(x) <- sum_l alpha_l * phi_l(x);

(b) KLIEP with model selection
  Input: model candidates M = {m}, training samples {x_tr_i}_{i=1}^{n_tr}, test samples {x_te_j}_{j=1}^{n_te}
  Output: w_hat(x)
  Split {x_te_j}_{j=1}^{n_te} into R disjoint subsets {X_r}_{r=1}^{R};
  for each model m in M
    for each split r = 1, 2, ..., R
      w_hat_r(x) <- KLIEP(m, {x_tr_i}_{i=1}^{n_tr}, {X_j}_{j != r});
      J_hat_r(m) <- (1/|X_r|) * sum_{x in X_r} log w_hat_r(x);
    end
    J_hat(m) <- (1/R) * sum_{r=1}^{R} J_hat_r(m);
  end
  m_hat <- argmax_{m in M} J_hat(m);
  w_hat(x) <- KLIEP(m_hat, {x_tr_i}_{i=1}^{n_tr}, {x_te_j}_{j=1}^{n_te});

2.3 Model Selection by Likelihood Cross Validation

The performance of KLIEP depends on the choice of the basis functions $\{\varphi_l(x)\}_{l=1}^{b}$. Here we explain how they can be appropriately chosen from data samples.

Since KLIEP is based on the maximization of the score $J$ (see Eq. (5)), it would be natural to select the model such that $J$ is maximized. The expectation over $p_{te}(x)$ involved in $J$ can be numerically approximated by CV as follows: First, divide the test samples $\{x^{te}_j\}_{j=1}^{n_{te}}$ into $R$ disjoint subsets $\{\mathcal{X}_r\}_{r=1}^{R}$. Then obtain an importance estimate $\widehat{w}_r(x)$ from $\{\mathcal{X}_j\}_{j \ne r}$ and approximate the score $J$ using $\mathcal{X}_r$ as

  $\widehat{J}_r = \dfrac{1}{|\mathcal{X}_r|} \sum_{x \in \mathcal{X}_r} \log \widehat{w}_r(x)$.    (9)

This procedure is repeated for $r = 1, 2, \ldots, R$, and the model that maximizes the average score $\widehat{J} = \frac{1}{R} \sum_{r=1}^{R} \widehat{J}_r$ is chosen. A pseudo code of the CV procedure is summarized in Figure 1(b).

One of the potential limitations of CV in general is that it is not reliable when the number of samples is small, since data splitting by CV further reduces the sample size. On the other hand, in our CV procedure, the data splitting is performed over the test input samples, not over the training samples. Since we typically have a large number of test input samples, we do not suffer from the small sample problem. Therefore, our CV procedure is accurate and useful in model selection.

A good model may be chosen by the above CV procedure, given that a set of promising model candidates is prepared. As model candidates, we propose using a Gaussian kernel model centered at the test input points $\{x^{te}_j\}_{j=1}^{n_{te}}$, i.e.,

  $\widehat{w}(x) = \sum_{l=1}^{n_{te}} \alpha_l K_\sigma(x, x^{te}_l)$,    (10)

where $K_\sigma(x, x')$ is the Gaussian kernel with kernel width $\sigma$:

  $K_\sigma(x, x') = \exp\Big( -\dfrac{\|x - x'\|^2}{2\sigma^2} \Big)$.    (11)
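A compact numpy sketch of the iteration in Figure 1(a), using the Gaussian kernel model of Eqs. (10)-(11), is given below; the step size, the number of iterations, and the data in the usage example are our assumptions rather than the authors' implementation.

    import numpy as np

    def gaussian_kernel(x, centers, sigma):
        """K_sigma(x, c) = exp(-||x - c||^2 / (2 sigma^2)) for all pairs."""
        sq = ((x[:, None, :] - centers[None, :, :])**2).sum(axis=2)
        return np.exp(-sq / (2 * sigma**2))

    def kliep(x_tr, x_te, centers, sigma, eps=1e-4, n_iter=5000):
        """KLIEP main iteration (Figure 1(a)); eps and n_iter are our assumptions."""
        A = gaussian_kernel(x_te, centers, sigma)               # A[j, l] = phi_l(x_te_j)
        b = gaussian_kernel(x_tr, centers, sigma).mean(axis=0)  # b_l as in Eq. (7)
        alpha = np.ones(centers.shape[0])
        alpha /= b @ alpha                                      # start feasible
        for _ in range(n_iter):
            alpha = alpha + eps * A.T @ (1.0 / (A @ alpha))     # gradient ascent on Eq. (8)
            alpha = alpha + (1.0 - b @ alpha) * b / (b @ b)     # feasibility satisfaction
            alpha = np.maximum(0.0, alpha)                      # enforce Eq. (6)
            alpha = alpha / (b @ alpha)                         # renormalize (Eq. (7))
        def w_hat(x):
            return gaussian_kernel(x, centers, sigma) @ alpha
        return w_hat, alpha

    # Usage on synthetic data (illustrative assumptions).
    rng = np.random.default_rng(3)
    x_tr = rng.normal(size=(100, 2))
    x_te = rng.normal(size=(1000, 2)) + np.array([1.0, 0.0])
    centers = x_te[rng.choice(len(x_te), size=100, replace=False)]
    w_hat, alpha = kliep(x_tr, x_te, centers, sigma=1.0)
    print(w_hat(x_tr)[:5])

In practice, the kernel width sigma (and the choice of centers) would be selected by the likelihood CV procedure of Figure 1(b) rather than fixed by hand as above.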
The reason why we chose the test input points $\{x^{te}_j\}_{j=1}^{n_{te}}$ as the Gaussian centers, not the training input points $\{x^{tr}_i\}_{i=1}^{n_{tr}}$, is as follows. By definition, the importance $w(x)$ tends to take large values if the training input density $p_{tr}(x)$ is small and the test input density $p_{te}(x)$ is large; conversely, $w(x)$ tends to be small if $p_{tr}(x)$ is large and $p_{te}(x)$ is small. When a function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large; on the other hand, only a small number of kernels would be enough in the region where the output of the target function is small. Following this heuristic, we decided to allocate many kernels at high test input density regions, which can be achieved by setting the Gaussian centers at the test input points $\{x^{te}_j\}_{j=1}^{n_{te}}$.

Alternatively, we may locate $(n_{tr} + n_{te})$ Gaussian kernels at both $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ and $\{x^{te}_j\}_{j=1}^{n_{te}}$. However, in our preliminary experiments, this did not further improve the performance but slightly increased the computational cost. Moreover, since $n_{te}$ is typically very large, just using all the test input points $\{x^{te}_j\}_{j=1}^{n_{te}}$ as Gaussian centers is already computationally rather demanding. To ease this problem, we practically propose using a subset of $\{x^{te}_j\}_{j=1}^{n_{te}}$ as Gaussian centers for computational efficiency, i.e.,

  $\widehat{w}(x) = \sum_{l=1}^{b} \alpha_l K_\sigma(x, c_l)$,    (12)

where $c_l$ is a template point randomly chosen from $\{x^{te}_j\}_{j=1}^{n_{te}}$ and $b$ ($\le n_{te}$) is a prefixed number. In the following, we set $b = \min(100, n_{te})$ and optimize the kernel width $\sigma$ by the above CV procedure.

3 Discussion

In this section, we discuss the relation between KLIEP and existing approaches.

3.1 Kernel Density Estimator and Likelihood Cross Validation

The kernel density estimator (KDE) is a non-parametric technique to estimate a density $p(x)$ from samples $\{x_i\}_{i=1}^{n}$ by

  $\widehat{p}(x) = \dfrac{1}{n\,(2\pi\sigma^2)^{d/2}} \sum_{i=1}^{n} K_\sigma(x, x_i)$,    (13)

where $K_\sigma(x, x')$ is a kernel function, e.g., the Gaussian kernel (11). The performance of KDE depends on the choice of the kernel width $\sigma$, which can be optimized by likelihood CV [3]: a subset of $\{x_i\}_{i=1}^{n}$ is used for density estimation and the rest is used for estimating the likelihood of the held-out samples. Note that likelihood CV corresponds to choosing $\sigma$ so that the Kullback-Leibler divergence from $p(x)$ to $\widehat{p}(x)$ is minimized.

KDE can be used for importance estimation by first estimating $p_{tr}(x)$ and $p_{te}(x)$ separately from $\{x^{tr}_i\}_{i=1}^{n_{tr}}$ and $\{x^{te}_j\}_{j=1}^{n_{te}}$ and then estimating the importance by $\widehat{w}(x) = \widehat{p}_{te}(x)/\widehat{p}_{tr}(x)$. A potential limitation of this approach is that KDE is known to suffer from the curse of dimensionality [3], i.e., the number of samples needed to maintain the same approximation quality grows exponentially as the dimension of the input space increases. This is particularly critical when estimating $p_{tr}(x)$ since the number of training input samples is typically limited. Furthermore, model selection by likelihood CV is unreliable in such cases since data splitting in the CV procedure further reduces the sample size. Therefore, the KDE-based approach may not be reliable in high-dimensional cases.
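For comparison, the following sketch implements the KDE-based importance estimator described above, i.e., two separate Gaussian kernel density estimates whose ratio is taken at the query points; the kernel widths are fixed by hand here, whereas the text would tune them by likelihood CV.

    import numpy as np

    def kde_gaussian(x_query, x_samples, sigma):
        """Kernel density estimate (Eq. (13)) with a Gaussian kernel of width sigma."""
        n, d = x_samples.shape
        sq = ((x_query[:, None, :] - x_samples[None, :, :])**2).sum(axis=2)
        k = np.exp(-sq / (2 * sigma**2))
        return k.sum(axis=1) / (n * (2 * np.pi * sigma**2) ** (d / 2))

    def kde_importance(x, x_tr, x_te, sigma_tr, sigma_te):
        """KDE-based importance estimate w_hat(x) = p_hat_te(x) / p_hat_tr(x)."""
        return kde_gaussian(x, x_te, sigma_te) / kde_gaussian(x, x_tr, sigma_tr)

    # Usage with hand-picked widths (likelihood CV would choose them in practice).
    rng = np.random.default_rng(4)
    x_tr = rng.normal(size=(100, 2))
    x_te = rng.normal(size=(1000, 2)) + np.array([1.0, 0.0])
    print(kde_importance(x_tr, x_tr, x_te, sigma_tr=0.5, sigma_te=0.5)[:5])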
3.2 Kernel Mean Matching

The kernel mean matching (KMM) method avoids density estimation and directly gives an estimate of the importance at the training input points [5]. The basic idea of KMM is to find $w(x)$ such that the discrepancy between the means of nonlinearly transformed samples drawn from $p_{te}(x)$ and $p_{tr}(x)$ is minimized in some feature space $\mathcal{F}$:

  minimize over $w(x) \ge 0$:  $\Big\| \mathbb{E}_{x \sim p_{te}}[\Phi(x)] - \mathbb{E}_{x \sim p_{tr}}[w(x)\Phi(x)] \Big\|_{\mathcal{F}}$  subject to  $\mathbb{E}_{x \sim p_{tr}}[w(x)] = 1$.    (14)

It is shown that the solution of the above problem agrees with the true importance if $\mathcal{F}$ is a universal reproducing kernel Hilbert space. The Gaussian kernel (11) is known to induce a universal reproducing kernel Hilbert space, and an empirical version of the above problem is expressed by the following quadratic program:

  minimize over $\{w_i\}_{i=1}^{n_{tr}}$:  $\dfrac{1}{2} \sum_{i, i'=1}^{n_{tr}} w_i w_{i'} K_\sigma(x^{tr}_i, x^{tr}_{i'}) - \sum_{i=1}^{n_{tr}} w_i \kappa_i$
  subject to  $\Big| \sum_{i=1}^{n_{tr}} w_i - n_{tr} \Big| \le n_{tr}\epsilon$  and  $0 \le w_1, w_2, \ldots, w_{n_{tr}} \le B$,    (15)

where $\kappa_i = \frac{n_{tr}}{n_{te}} \sum_{j=1}^{n_{te}} K_\sigma(x^{tr}_i, x^{te}_j)$. The solution $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ is an estimate of the importance at the training input points $\{x^{tr}_i\}_{i=1}^{n_{tr}}$.

Since KMM does not require estimating the densities, it is expected to work well even in high-dimensional cases. However, the performance depends on the tuning parameters $B$, $\epsilon$, and $\sigma$, and they cannot be simply optimized, e.g., by CV, since estimates of the importance are available only at the training input points. Thus, an out-of-sample extension is needed to apply KMM in the CV framework, but this seems to be an open research issue currently.

4 Experiments

In this section, we compare the experimental performance of KLIEP and existing approaches.

4.1 Importance Estimation for Artificial Data Sets

Let $p_{tr}(x)$ be the $d$-dimensional Gaussian density with mean $(0, 0, \ldots, 0)^\top$ and identity covariance, and let $p_{te}(x)$ be the $d$-dimensional Gaussian density with mean $(1, 0, \ldots, 0)^\top$ and identity covariance. The task is to estimate the importance at the training input points:

  $w_i = w(x^{tr}_i) = \dfrac{p_{te}(x^{tr}_i)}{p_{tr}(x^{tr}_i)}$  for $i = 1, 2, \ldots, n_{tr}$.    (16)

We compare the following methods:

KLIEP($\sigma$): $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ are estimated by KLIEP with the model (12). The performance of KLIEP depends on the kernel width $\sigma$, so we test several different values of $\sigma$.

KLIEP(CV): The kernel width $\sigma$ in KLIEP is chosen by CV (see Section 2.3).

KDE(CV): $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ are estimated by KDE with the Gaussian kernel (11). The kernel widths are chosen by likelihood CV (see Section 3.1).

KMM($\sigma$): $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ are estimated by KMM (see Section 3.2). The performance of KMM depends on $B$, $\epsilon$, and $\sigma$. We set $B = 1000$ and $\epsilon = (\sqrt{n_{tr}} - 1)/\sqrt{n_{tr}}$ following [5], and test several different values of $\sigma$.

We fix the number of test input points to $n_{te} = 1000$ and consider the following two settings: (a) $n_{tr} = 100$ and $d = 1, 2, \ldots, 20$, and (b) $d = 10$ and $n_{tr} = 50, 60, \ldots, 150$. We run the experiments 100 times for each $d$, each $n_{tr}$, and each method, and evaluate the quality of the estimates $\{\widehat{w}_i\}_{i=1}^{n_{tr}}$ by the normalized mean squared error:

  $\mathrm{NMSE} = \dfrac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \Big( \dfrac{\widehat{w}_i}{\sum_{i'=1}^{n_{tr}} \widehat{w}_{i'}} - \dfrac{w_i}{\sum_{i'=1}^{n_{tr}} w_{i'}} \Big)^2$.    (17)

Figure 2: NMSEs averaged over 100 trials: (a) when the input dimension is changed and (b) when the training sample size is changed. The compared methods are KLIEP(2), KLIEP(4), KLIEP(7), KLIEP(20), KLIEP(CV), KDE(CV), KMM(0.01), KMM(1), KMM(10), and KMM(50). Error bars are omitted since they are reasonably small.

We set the number of folds in CV to $R = 5$ in all experiments. NMSEs averaged over 100 trials are plotted in Figure 2. Figure 2-(a) shows that the error of KDE(CV) sharply increases as the input dimension grows, while KLIEP and KMM tend to give much smaller errors than KDE. This would be the fruit of directly estimating the importance without estimating the densities. The performance of KLIEP and KMM depends on the kernel width $\sigma$, and KLIEP(CV) seems to work quite well. Figure 2-(b) shows that the errors of all methods tend to decrease as the number of training samples grows. Again, KLIEP and KMM give much smaller errors than KDE, and KLIEP(CV) is shown to work very well. Overall, KLIEP tends to outperform KDE and is more advantageous than KMM since it is equipped with an automatic model selection procedure.
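The sketch below reproduces one trial of this synthetic setting and evaluates an estimate with the NMSE of Eq. (17); it assumes the kliep() function from the sketch after Eq. (11) is in scope, and the kernel width passed to it is an arbitrary choice rather than the CV-selected one.

    import numpy as np

    def nmse(w_hat, w_true):
        """Normalized mean squared error of Eq. (17)."""
        a = w_hat / w_hat.sum()
        b = w_true / w_true.sum()
        return np.mean((a - b)**2)

    # One trial of the Section 4.1 setting (d = 10, n_tr = 100, n_te = 1000),
    # reusing kliep() from the earlier sketch (an assumed dependency).
    rng = np.random.default_rng(5)
    d, n_tr, n_te = 10, 100, 1000
    mu = np.zeros(d); mu[0] = 1.0
    x_tr = rng.normal(size=(n_tr, d))
    x_te = rng.normal(size=(n_te, d)) + mu

    # True importance at the training points (Eq. (16)); for these two Gaussians
    # the ratio simplifies to exp(x^T mu - ||mu||^2 / 2).
    w_true = np.exp(x_tr @ mu - 0.5 * mu @ mu)

    centers = x_te[rng.choice(n_te, size=100, replace=False)]
    w_hat_fn, _ = kliep(x_tr, x_te, centers, sigma=np.sqrt(d))  # sigma is an assumption
    print(nmse(w_hat_fn(x_tr), w_true))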
4.2 Covariate Shift Adaptation with Regression and Classification Benchmark Data Sets

Here we employ importance estimation methods for compensating for the influence of covariate shift in regression and classification benchmark problems (see Table 1). Each data set consists of input-output samples $\{(x_k, y_k)\}_k$. We normalize all the input samples $\{x_k\}_k$ into $[0, 1]^d$ and choose the test samples $\{(x^{te}_j, y^{te}_j)\}_{j=1}^{n_{te}}$ from the pool $\{(x_k, y_k)\}_k$ as follows: a sample $(x_k, y_k)$ is accepted as a test sample with probability $\min(1, 4(x^{(c)}_k)^2)$, where $x^{(c)}_k$ is the $c$-th element of $x_k$ and $c$ is randomly determined and fixed in each trial. We choose the training samples $\{(x^{tr}_i, y^{tr}_i)\}_{i=1}^{n_{tr}}$ uniformly from the rest. That is, in this experiment, the test input density tends to be lower than the training input density where $x^{(c)}$ is small. We set $n_{tr} = 100$ and $n_{te} = 500$ for all data sets.

We learn the target function by importance-weighted regularized least squares (IWRLS):

  $\widehat{f}(x) = \sum_{l=1}^{t} \widehat{\theta}_l K_\sigma(x, m_l)$  with  $\widehat{\theta} = (K^\top W K + \lambda I_t)^{-1} K^\top W y^{tr}$,    (18)

where $t = \min(100, n_{te})$, $K_\sigma(x, x')$ is the Gaussian kernel (11), $m_l$ is a template point randomly chosen from $\{x^{te}_j\}_{j=1}^{n_{te}}$, $K_{i,l} = K_\sigma(x^{tr}_i, m_l)$, $W = \mathrm{diag}(\widehat{w}_1, \widehat{w}_2, \ldots, \widehat{w}_{n_{tr}})$, $I_t$ is the $t$-dimensional identity matrix, and $y^{tr} = (y^{tr}_1, y^{tr}_2, \ldots, y^{tr}_{n_{tr}})^\top$. The solution $\widehat{\theta}$ minimizes the importance-weighted squared training error plus a quadratic regularizer. Note that plain RLS (i.e., uniform weighting) may not be consistent under covariate shift, while IWRLS with the true importance weights $\{w_i\}_{i=1}^{n_{tr}}$ is consistent [7].

The kernel width $\sigma$ and the regularization parameter $\lambda$ in IWRLS (18) are chosen by importance-weighted CV [8]: First, divide the training samples $\{(x^{tr}_i, y^{tr}_i)\}_{i=1}^{n_{tr}}$ into $R$ disjoint subsets $\{\mathcal{Z}_r\}_{r=1}^{R}$. Then learn a function $\widehat{f}_r(x)$ from $\{\mathcal{Z}_j\}_{j \ne r}$ and compute its importance-weighted mean test error for $\mathcal{Z}_r$:

  $\dfrac{1}{|\mathcal{Z}_r|} \sum_{(x, y) \in \mathcal{Z}_r} \widehat{w}(x)\, L\big(\widehat{f}_r(x), y\big)$,  where  $L(\widehat{y}, y) = (\widehat{y} - y)^2$ (Regression),  $L(\widehat{y}, y) = \tfrac{1}{2}\big(1 - \mathrm{sign}(\widehat{y}\, y)\big)$ (Classification).    (19)

This procedure is repeated for $r = 1, 2, \ldots, R$, and the values of $\sigma$ and $\lambda$ that minimize the average of the above mean test error over all $r$ are chosen. We set $R = 5$.

We run the experiments 100 times for each data set and evaluate the mean test error:

  $\mathrm{MTE} = \dfrac{1}{n_{te}} \sum_{j=1}^{n_{te}} L\big(\widehat{f}(x^{te}_j), y^{te}_j\big)$.    (20)

The results are summarized in Table 1, where 'NIW' denotes no importance weighting (i.e., uniform weighting).

Table 1: Mean test error averaged over 100 trials. The numbers in the brackets are the standard deviations. All the error values are normalized by that of 'NIW' (no importance weighting). For each data set, the best method and comparable ones based on the Wilcoxon signed rank test at the significance level 5% are indicated in bold face. The upper half are regression data sets taken from DELVE and the lower half are classification data sets taken from IDA. 'KMM($\sigma$)' denotes KMM with kernel width $\sigma$. Compared methods (columns): NIW, KLIEP(CV), KDE(CV), KMM(0.01), KMM(0.3), KMM(1). Data sets (rows): kin-8fh, kin-8fm, kin-8nh, kin-8nm, abalone (regression); image, ringnorm, twonorm, waveform (classification). Number of data sets on which each method is best or comparable to the best ('# of Bests'): NIW 2, KLIEP(CV) 7, KDE(CV) 1, KMM(0.01) 3, KMM(0.3) 6, KMM(1) 1.
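A minimal sketch of the IWRLS estimator of Eq. (18) is given below; the importance weights are left as placeholders (they would come from KLIEP or another estimator), and the kernel width, regularization parameter, and data are illustrative assumptions rather than the benchmark setup, since $\sigma$ and $\lambda$ would be tuned by the importance-weighted CV of Eq. (19).

    import numpy as np

    def iwrls(x_tr, y_tr, centers, w_hat, sigma, lam):
        """Importance-weighted regularized least squares (Eq. (18)).

        Returns f_hat(x) = sum_l theta_l K_sigma(x, m_l) with
        theta = (K^T W K + lam I)^(-1) K^T W y and W = diag(w_hat).
        """
        def K(a, b):
            sq = ((a[:, None, :] - b[None, :, :])**2).sum(axis=2)
            return np.exp(-sq / (2 * sigma**2))
        Kmat = K(x_tr, centers)
        W = np.diag(w_hat)
        theta = np.linalg.solve(Kmat.T @ W @ Kmat + lam * np.eye(centers.shape[0]),
                                Kmat.T @ W @ y_tr)
        return lambda x: K(x, centers) @ theta

    # Usage with placeholder weights; sigma and lam are assumptions.
    rng = np.random.default_rng(6)
    x_tr = rng.uniform(size=(100, 1))
    y_tr = np.sin(2 * np.pi * x_tr[:, 0]) + 0.1 * rng.normal(size=100)
    x_te = rng.uniform(size=(500, 1))**0.5        # a shifted test input distribution
    centers = x_te[:50]
    w_hat = np.ones(100)                          # replace with KLIEP estimates
    f_hat = iwrls(x_tr, y_tr, centers, w_hat, sigma=0.1, lam=1e-3)
    print(np.mean((f_hat(x_te) - np.sin(2 * np.pi * x_te[:, 0]))**2))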
The table shows that KLIEP(CV) compares favorably with NIW, implying that importance weighting combined with KLIEP(CV) is useful for improving the prediction performance under covariate shift. KLIEP(CV) works much better than KDE(CV); in fact, KDE(CV) tends to be worse than NIW, which may be due to the high dimensionality. We tested 10 different values of the kernel width $\sigma$ for KMM and reported three representative results in the table. KLIEP(CV) is comparable to or slightly better than KMM with the best kernel width. Given that KLIEP(CV) is equipped with an automatic model selection procedure, it is regarded as a promising method for covariate shift adaptation.

References

[1] P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, 1998.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[3] W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models. Springer Series in Statistics. Springer, Berlin, 2004.
[4] J. J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153-162, 1979.
[5] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
[6] C. R. Shelton. Importance Sampling for Reinforcement Learning with Multiple Objectives. PhD thesis, Massachusetts Institute of Technology, 2001.
[7] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
[8] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, May 2007.
[9] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan. Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767-791, 2002.
[10] B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, 2004. ACM Press.