OTL: A Framework of Online Transfer Learning

Peilin Zhao  zhao0106@ntu.edu.sg
Steven C.H. Hoi  chhoi@ntu.edu.sg
School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798

Abstract

In this paper, we investigate a new machine learning framework called Online Transfer Learning (OTL), which aims to transfer knowledge from some source domain to an online learning task on a target domain. We do not assume that the target data follow the same class or generative distribution as the source data; our key motivation is to improve a supervised online learning task in a target domain by exploiting the knowledge learned from a large amount of training data in source domains. OTL is challenging in general, since data in the two domains may differ not only in their class distributions but also in their feature representations. As a first attempt at this problem, we propose techniques for two kinds of OTL tasks: one performs OTL in a homogeneous domain, and the other performs OTL across heterogeneous domains. We derive mistake bounds for the proposed OTL algorithms and empirically examine their performance on several challenging OTL tasks. Encouraging results validate the efficacy of our techniques.

1. Introduction

Transfer learning (TL) has been actively studied recently (Pan & Yang, 2009). It mainly aims to build models in a new target domain by taking advantage of information from another existing source domain through knowledge transfer. Transfer learning is important for many applications in which training data in a new domain may be limited or too expensive to collect. Although transfer learning has been actively explored, most existing work studies it in an offline fashion, which assumes that training data in the new domain are given a priori. Such an assumption may not always hold for real applications in which training examples arrive in an online/sequential manner.

Unlike existing transfer learning studies, in this paper we propose a new framework of Online Transfer Learning (OTL), which addresses the transfer learning problem within an online learning framework. As a first attempt at this problem, we address OTL challenges in two different settings. In the first setting, we study homogeneous OTL, where the target domain shares the same feature space as the old/source one. In the second setting, we address heterogeneous OTL, where the feature space of the target domain differs from that of the source domain. We propose algorithms to solve both problems and theoretically analyze their mistake bounds. Finally, we empirically examine their performance on several challenging OTL tasks.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed framework. Section 4 and Section 5 address the homogeneous and heterogeneous OTL tasks, respectively. Section 6 gives our experimental results, and Section 7 concludes this work.

2. Related Work

Our work is generally related to two machine learning topics: online learning and transfer learning. Below we review some important related work in both areas.

Online learning has been extensively studied for years (Rosenblatt, 1958; Crammer et al., 2006; Zhao et al., 2009; Yang et al., 2010). Unlike typical machine learning methods that assume all training examples are available before the learning task, online learning is more appropriate for real-world problems where training data arrive sequentially. Due to their attractive efficiency and scalability, various online learning methods have been proposed. One well-known approach is the Perceptron algorithm (Rosenblatt, 1958; Freund & Schapire, 1999), which updates the model by adding a misclassified example with some constant weight into the current set of support vectors.
Recently, many online learning algorithms have been proposed based on the criterion of maximum margin (Crammer et al., 2006; Li & Long, 1999; Zhao et al., 2009). One example is the Passive-Aggressive (PA) method (Crammer et al., 2006), which updates the classification model when a new example is misclassified or its classification score does not exceed some predefined margin. More extensive surveys of online learning can be found in (Shalev-Shwartz, 2007).

Transfer learning (TL) has also been actively studied. The goal of TL is to extract knowledge from one or more source tasks and then apply it to a target task. Various TL methods have been proposed. According to the learning type, these methods can be roughly classified into three categories: inductive, transductive, and unsupervised approaches. Inductive TL (Daumé III & Marcu, 2006) aims to induce the model in the target domain with the aid of knowledge transferred from the source domains; transductive TL (Arnold et al., 2007) aims to extract knowledge from the source domain to improve prediction in the target domain without labeled target data; and unsupervised TL aims to solve unsupervised learning tasks in the target domain (Dai et al., 2008). Moreover, according to the feature representation, TL can be classified as homogeneous vs. heterogeneous TL (Argyriou et al., 2008), where the feature spaces of the source and target domains can differ. A comprehensive survey on transfer learning can be found in (Pan & Yang, 2009).

Although both online learning and transfer learning have been actively studied, to the best of our knowledge no existing work has formally addressed transfer learning within an online learning framework. Finally, we note that OTL also differs from online multi-task learning (Dekel et al., 2007), which aims to learn multiple tasks in parallel in an online learning framework.

3. Problem Formulation

Let us denote by X1 × Y1 the source/old data space, where X1 = R^m and Y1 = {−1, +1}. Since our task aims to learn a kernel classifier, we denote by κ_1(·,·): R^m × R^m → R the kernel function used by the source classifier. Assume that the source classifier h(x) can be represented as

    h(x) = Σ_{s=1}^{S} α_s y_{1s} κ_1(x_{1s}, x),

where {(x_{1s}, y_{1s}) ∈ X1 × Y1 | s = 1, ..., S} is the set of support vectors from the source training data and the α_s are the coefficients of the support vectors. Typically, the source classifier h(x) can be obtained by applying existing learning techniques, such as online learning via the Perceptron algorithm (Rosenblatt, 1958; Freund & Schapire, 1999) or regular learning by support vector machines (SVM).

For an online transfer learning (OTL) task, our goal is to learn online some prediction function f ∈ H on a target domain from a sequence of examples {(x_{2t}, y_{2t}) | t = 1, ..., T} in some data space X2 × Y2. Specifically, at the t-th trial of the online learning task, the learner receives an instance x_{2t}, and the goal is to find a good prediction function such that the predicted class label sign(f_t(x_{2t})) matches the true class label y_{2t}. The key challenge of OTL is how to effectively transfer knowledge from the old/source domain to the new/target domain so as to improve the online learning performance. Next, we study OTL in two different settings: homogeneous vs. heterogeneous OTL.
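As a concrete illustration of the formulation above, the following minimal Python sketch shows how a source classifier of the form h(x) = Σ_s α_s y_{1s} κ_1(x_{1s}, x) could be represented and evaluated with a Gaussian kernel. The class name, the kernel-width parameter `sigma`, and the helper methods are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=8.0):
    """Gaussian (RBF) kernel: exp(-||a - b||^2 / (2 * sigma^2))."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2.0 * sigma ** 2))

class KernelClassifier:
    """Kernel expansion h(x) = sum_s alpha_s * y_s * kappa(x_s, x)."""
    def __init__(self, kernel=gaussian_kernel):
        self.kernel = kernel
        self.sv_x, self.sv_y, self.alpha = [], [], []   # support vectors, labels, coefficients

    def add_support_vector(self, x, y, alpha):
        self.sv_x.append(np.asarray(x, dtype=float))
        self.sv_y.append(float(y))
        self.alpha.append(float(alpha))

    def decision(self, x):
        return sum(a * y * self.kernel(sv, x)
                   for a, y, sv in zip(self.alpha, self.sv_y, self.sv_x))

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1
```

Such a classifier could be produced, for instance, by running a kernel Perceptron or kernel PA over the source-domain data, as suggested above.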
4. Online Transfer Learning over Homogeneous Domains

We start by studying homogeneous OTL, in which we assume the source domain and the target domain share the same feature space, i.e., X2 = X1 and Y2 = Y1. One key challenge of this task is the concept drift that often occurs in this scenario: the target variable to be predicted changes over time during the learning process. This makes it nontrivial to transfer knowledge from the source domain to the target domain.

The basic idea of our OTL solution is based on ensemble learning. In particular, we first construct an entirely new prediction function f from the data in the target domain in an online fashion, and then learn an ensemble prediction function that mixes the old and the new prediction functions, h and f, thereby transferring knowledge from the source domain. The remaining issue is how to combine the two prediction functions effectively so as to handle concept drift.

To combine the two prediction functions h(x) and f_t(x) at the t-th trial of the online learning task, we introduce two weight parameters, w_{1,t} and w_{2,t}, one for each prediction function. At the t-th step, given an instance x_{2t}, we predict its class label by the following ensemble function:

    ŷ_{2t} = sign( w_{1,t} Π(h(x_{2t})) + w_{2,t} Π(f_t(x_{2t})) − 1/2 ),                    (1)

where Π(x) is a normalization function, i.e., Π(x) = max(0, min(1, (x+1)/2)). At the beginning of the OTL task, we simply set w_{1,1} = w_{2,1} = 1/2. In order to transfer effectively, for the subsequent trials of the OTL task, in addition to updating the function f_{t+1}(x) by some online learning method, e.g. the PA algorithm (Crammer et al., 2006), we expect the two weights of the prediction functions, w_{1,t} and w_{2,t}, to be adjusted dynamically. We suggest the following updating scheme:

    w_{1,t+1} = w_{1,t} s_t(h) / ( w_{1,t} s_t(h) + w_{2,t} s_t(f_t) ),                     (2)
    w_{2,t+1} = w_{2,t} s_t(f_t) / ( w_{1,t} s_t(h) + w_{2,t} s_t(f_t) ),                   (3)

where s_t(g) = exp{ −η ℓ*(Π(g(x_{2t})), Π(y_{2t})) } for g ∈ H, and ℓ*(z, y) is a loss function, set to the square loss ℓ*(z, y) = (z − y)² in our approach. Figure 1 summarizes the proposed OTL algorithm; the short sketch below illustrates a single prediction-and-update round.
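The following Python fragment is a minimal sketch of one OTL round as described by Eqs. (1)–(3) and the PA-style update of Figure 1: normalize both predictions, form the weighted ensemble prediction, update the two weights multiplicatively, and update the target-domain function f when the hinge loss is positive. It reuses the hypothetical `KernelClassifier` and `gaussian_kernel` helpers from the earlier sketch; the parameter names `eta` and `C` are ours.

```python
import numpy as np

def normalize(v):
    """Pi(v) = max(0, min(1, (v + 1) / 2)) from Eq. (1)."""
    return max(0.0, min(1.0, (v + 1.0) / 2.0))

def otl_step(h, f, w, x, y, eta=0.5, C=5.0, kernel=gaussian_kernel):
    """One OTL round: ensemble prediction, weight update, and PA-style update of f.

    h, f : KernelClassifier instances (source classifier and online target classifier)
    w    : [w1, w2], the current ensemble weights (modified in place)
    """
    hv, fv = h.decision(x), f.decision(x)
    y_hat = 1 if w[0] * normalize(hv) + w[1] * normalize(fv) - 0.5 >= 0 else -1

    # Exponentially weighted update of the ensemble weights, Eqs. (2)-(3)
    y_norm = normalize(float(y))                       # maps {-1, +1} to {0, 1}
    s_h = np.exp(-eta * (normalize(hv) - y_norm) ** 2)
    s_f = np.exp(-eta * (normalize(fv) - y_norm) ** 2)
    z = w[0] * s_h + w[1] * s_f
    w[0], w[1] = w[0] * s_h / z, w[1] * s_f / z

    # PA-style update of the target-domain function f (Figure 1, steps 7-10)
    loss = max(0.0, 1.0 - y * fv)
    if loss > 0:
        tau = min(C, loss / kernel(x, x))              # for a Gaussian kernel, kernel(x, x) = 1
        f.add_support_vector(x, y, tau)                # f <- f + tau * y * kappa(x, .)

    return y_hat, y_hat != y
```

Counting mistakes over the stream then amounts to summing the second return value over t.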
Algorithm 1: Online Transfer Learning (OTL)
Input: the old classifier h(x) = Σ_{s=1}^S α_s y_{1s} κ_1(x_{1s}, x), the trade-off parameter C, and initial weights w_{1,1} = w_{2,1} = 1/2
1: Initialize f_1 = 0
2: for t = 1, 2, ..., T do
3:   receive instance x_{2t} ∈ X2
4:   predict ŷ_{2t} by Eq. (1)
5:   receive correct label y_{2t} ∈ {−1, +1}
6:   compute w_{1,t+1} and w_{2,t+1} by Eqs. (2) and (3)
7:   suffer loss ℓ_t = [1 − y_{2t} f_t(x_{2t})]_+
8:   if ℓ_t > 0 then
9:     τ_t = min{ C, ℓ_t / κ(x_{2t}, x_{2t}) }
10:    f_{t+1} = f_t + τ_t y_{2t} κ(x_{2t}, ·)
11:  end if
12: end for
Figure 1. The Online Transfer Learning (OTL) algorithm.

Next we analyze the mistake bound of the algorithm. We first introduce the following proposition.

Proposition 1. When using the square loss ℓ*(z, y) = (z − y)² for z ∈ [0, 1] and y ∈ {0, 1} together with the above exponentially weighted update and setting η = 1/2, the ensemble algorithm satisfies

    Σ_{t=1}^T ℓ*( w_{1,t} Π(h(x_{2t})) + w_{2,t} Π(f_t(x_{2t})), Π(y_{2t}) )
      ≤ 2 ln 2 + min{ Σ_{t=1}^T ℓ*(Π(h(x_{2t})), Π(y_{2t})), Σ_{t=1}^T ℓ*(Π(f_t(x_{2t})), Π(y_{2t})) }.

The proposition can be proved by following a technique similar to that described in Section 3.3 of (Cesa-Bianchi & Lugosi, 2006). By Proposition 1, we derive the mistake bound of the OTL algorithm as follows.

Theorem 1. Let M denote the number of mistakes made by the OTL algorithm; then M is bounded from above by

    M ≤ 4 min{ Δ_h, Δ_f } + 8 ln 2,                                                        (4)

where Δ_h = Σ_{t=1}^T ℓ*(Π(h(x_{2t})), Π(y_{2t})) and Δ_f = Σ_{t=1}^T ℓ*(Π(f_t(x_{2t})), Π(y_{2t})).

The proof of Theorem 1 is given in the appendix.

Remark. To better understand the mistake bound, denote by M_h and M_f the number of mistakes made by h and by f_t, respectively. First, note that Δ_h is an upper bound of (1/4) M_h rather than of M_h (because ℓ* is a square loss and both Π(h(x_{2t})) and Π(y_{2t}) are normalized to [0, 1]); similarly, Δ_f is an upper bound of (1/4) M_f. Further, if we assume Δ_h ≈ (1/4) M_h and Δ_f ≈ (1/4) M_f, we obtain M ≤ min{M_h, M_f} + 8 ln 2. This gives strong theoretical support for the OTL algorithm.

5. Online Transfer Learning over Heterogeneous Domains

In this section, we study the OTL problem across heterogeneous domains, where the source and target domains have different feature spaces. Heterogeneous OTL is very challenging in general. To simplify the problem, we assume the feature space of the source domain is a subset of that of the target domain. Because of the difference between the two feature spaces, we cannot directly apply the algorithm of the previous section. Below, we propose a multi-view approach to solve the challenge in this setting.

Formally, we denote the data on the target domain by {(x_{2t}, y_{2t}) | t = 1, ..., T}, where x_{2t} ∈ X2 = R^n ⊇ R^m and y_{2t} ∈ {−1, +1}. Without loss of generality, we assume the first m dimensions of X2 represent the old feature space X1. In the multi-view setting, we split each data instance x_{2t} into two instances, x_{2t}^{(1)} ∈ X1 and x_{2t}^{(2)} ∈ R^{n−m}. For the second view, we introduce a new kernel function κ_2(·,·): R^{n−m} × R^{n−m} → R.

The key idea of our heterogeneous OTL method is to adopt a co-regularization principle of online learning: we learn two classifiers f_t^{(1)} and f_t^{(2)} simultaneously from the two views, and predict an unseen example on the target domain by

    ŷ_t = sign( (1/2)( f_t^{(1)}(x_{2t}^{(1)}) + f_t^{(2)}(x_{2t}^{(2)}) ) ).

For the specific algorithm, we initialize the classifier for the first view by setting f_1^{(1)} = h, and set f_1^{(2)} = 0 for the second view. For a new example in the online learning task, we update the two functions f_{t+1}^{(1)} and f_{t+1}^{(2)} by the following co-regularization optimization:

    (f_{t+1}^{(1)}, f_{t+1}^{(2)}) = argmin_{f^{(1)} ∈ H1, f^{(2)} ∈ H2}
        (γ_1/2) ||f^{(1)} − f_t^{(1)}||²_{H1} + (γ_2/2) ||f^{(2)} − f_t^{(2)}||²_{H2} + C ℓ_t,   (5)

where γ_1, γ_2 and C are positive parameters, and the loss term ℓ_t is defined as

    ℓ_t = [ 1 − y_{2t} (1/2)( f^{(1)}(x_{2t}^{(1)}) + f^{(2)}(x_{2t}^{(2)}) ) ]_+.              (6)

Intuitively, the above updating method aims to make the updated ensemble classifier classify the newly observed example (x_{2t}, y_{2t}) correctly, while forcing the two-view classifiers not to deviate too much from the previous classifiers (f_t^{(1)}, f_t^{(2)}) via the first two regularization terms.

The above optimization enjoys a closed-form solution, as shown in Proposition 2. To simplify the discussion, we introduce the notation k_t^1 = κ_1(x_{2t}^{(1)}, x_{2t}^{(1)}) and k_t^2 = κ_2(x_{2t}^{(2)}, x_{2t}^{(2)}).

Proposition 2. The solution of the optimization problem (5) can be expressed as

    f_{t+1}^{(i)} = f_t^{(i)} + (τ_t / (2γ_i)) y_{2t} κ_i(x_{2t}^{(i)}, ·),  i = 1, 2,          (7)

where τ_t = min{ C, 4γ_1γ_2 ℓ_t / (γ_2 k_t^1 + γ_1 k_t^2) }.

The proof of the proposition is given in the appendix. By this proposition, we summarize the proposed "Co-regularized Online Transfer Learning" (COTL) algorithm in Figure 2; the short sketch below illustrates a single round of this procedure.
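The per-round COTL update of Proposition 2 is straightforward to implement on top of the kernel-expansion classifier sketched earlier. The following Python fragment is a minimal illustration only; it reuses the hypothetical `KernelClassifier` and `gaussian_kernel` helpers, and assumes the first-view dimensionality `m` is known.

```python
import numpy as np

def cotl_step(f1, f2, x, y, m, gamma1=1.0, gamma2=1.0, C=5.0,
              kappa1=gaussian_kernel, kappa2=gaussian_kernel):
    """One COTL round: two-view prediction and the closed-form update of Eq. (7).

    f1, f2 : KernelClassifier for the first (old) and second (new) view
    x      : full target-domain instance; x[:m] is the old view, x[m:] the new view
    """
    x = np.asarray(x, dtype=float)
    x1, x2 = x[:m], x[m:]

    margin = 0.5 * (f1.decision(x1) + f2.decision(x2))
    y_hat = 1 if margin >= 0 else -1

    loss = max(0.0, 1.0 - y * margin)                  # Eq. (6)
    if loss > 0:
        k1, k2 = kappa1(x1, x1), kappa2(x2, x2)
        tau = min(C, 4.0 * gamma1 * gamma2 * loss / (gamma2 * k1 + gamma1 * k2))
        f1.add_support_vector(x1, y, tau / (2.0 * gamma1))   # Eq. (7), i = 1
        f2.add_support_vector(x2, y, tau / (2.0 * gamma2))   # Eq. (7), i = 2

    return y_hat, y_hat != y
```

Here f1 would be initialized with the support-vector expansion of h on the first view and f2 with an empty expansion, matching f_1^{(1)} = h and f_1^{(2)} = 0.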
Algorithm 2: The Co-Regularized Online Transfer Learning algorithm (COTL)
Input: the old classifier h(x) = Σ_{s=1}^S α_s y_{1s} κ_1(x_{1s}, x) and parameters γ_1, γ_2 and C
1: Initialize f_1^{(1)} = h and f_1^{(2)} = 0
2: for t = 1, 2, ..., T do
3:   receive instance x_{2t} ∈ X2
4:   predict ŷ_t = sign( (1/2)( f_t^{(1)}(x_{2t}^{(1)}) + f_t^{(2)}(x_{2t}^{(2)}) ) )
5:   receive correct label y_{2t} ∈ {−1, +1}
6:   suffer loss ℓ_t = [ 1 − y_{2t} (1/2)( f_t^{(1)}(x_{2t}^{(1)}) + f_t^{(2)}(x_{2t}^{(2)}) ) ]_+
7:   if ℓ_t > 0 then
8:     τ_t = min{ C, 4γ_1γ_2 ℓ_t / (γ_2 k_t^1 + γ_1 k_t^2) }
9:     f_{t+1}^{(1)} = f_t^{(1)} + (τ_t / (2γ_1)) y_{2t} κ_1(x_{2t}^{(1)}, ·)
10:    f_{t+1}^{(2)} = f_t^{(2)} + (τ_t / (2γ_2)) y_{2t} κ_2(x_{2t}^{(2)}, ·)
11:  end if
12: end for
Figure 2. The Co-regularized Online Transfer Learning (COTL) algorithm.

Before proving the mistake bound of the COTL algorithm, we first introduce a lemma.

Lemma 1. Let (x_{2t}, y_{2t}), t = 1, ..., T, be a sequence of examples, where x_{2t} ∈ R^n and y_{2t} ∈ {−1, +1} for all t. After splitting each instance x_{2t} into two views (x_{2t}^{(1)}, x_{2t}^{(2)}), for any g^{(1)} ∈ H1 and g^{(2)} ∈ H2 we have

    Σ_{t=1}^T τ_t ( ℓ_t − ℓ(g^{(1)}, g^{(2)}; t) − ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) τ_t )
      ≤ (γ_1/2) ||h − g^{(1)}||² + (γ_2/2) ||g^{(2)}||²,                                      (8)

where ℓ_t is given in Eq. (6) and ℓ is defined as ℓ(g^{(1)}, g^{(2)}; t) = [ 1 − y_{2t} (1/2)( g^{(1)}(x_{2t}^{(1)}) + g^{(2)}(x_{2t}^{(2)}) ) ]_+.

The proof of the lemma is given in the appendix. Using Lemma 1, we can show the following theorem on the mistake bound of the proposed COTL algorithm.

Theorem 2. Let (x_{2t}, y_{2t}), t = 1, ..., T, be a sequence of examples, where x_{2t} ∈ R^n and y_{2t} ∈ {−1, +1} for all t, and assume k_t^1 ≤ R_1 and k_t^2 ≤ R_2 for t = 1, ..., T. Split each instance x_{2t} into two views (x_{2t}^{(1)}, x_{2t}^{(2)}). Then for any g^{(1)} ∈ H1 and g^{(2)} ∈ H2, the number of mistakes M made by the proposed COTL algorithm is bounded from above by

    M ≤ (1/ρ) ( γ_1 ||h − g^{(1)}||² + γ_2 ||g^{(2)}||² + 2C Σ_{t=1}^T ℓ(g^{(1)}, g^{(2)}; t) ),

where ρ = min{ C, 4γ_1γ_2 / (R_1γ_2 + R_2γ_1) }.

Proof. Since τ_t = min{ C, 4γ_1γ_2 ℓ_t / (k_t^1γ_2 + k_t^2γ_1) } ≤ C, we have τ_t ℓ(g^{(1)}, g^{(2)}; t) ≤ C ℓ(g^{(1)}, g^{(2)}; t). In addition, τ_t ≤ 4γ_1γ_2 ℓ_t / (k_t^1γ_2 + k_t^2γ_1), and hence ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) τ_t ≤ ℓ_t / 2. We thus have

    Σ_{t=1}^T τ_t ( ℓ_t − ℓ(g^{(1)}, g^{(2)}; t) − ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) τ_t )
      = Σ_t τ_t ℓ_t − Σ_t τ_t ℓ(g^{(1)}, g^{(2)}; t) − Σ_t ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) τ_t²
      ≥ Σ_t τ_t ℓ_t − C Σ_t ℓ(g^{(1)}, g^{(2)}; t) − Σ_t (ℓ_t/2) τ_t
      = (1/2) Σ_t τ_t ℓ_t − C Σ_t ℓ(g^{(1)}, g^{(2)}; t).

Combining the above inequality with the conclusion of Lemma 1, we have

    (1/2) Σ_{t=1}^T τ_t ℓ_t ≤ (γ_1/2) ||h − g^{(1)}||² + (γ_2/2) ||g^{(2)}||² + C Σ_{t=1}^T ℓ(g^{(1)}, g^{(2)}; t).

Furthermore, when a mistake occurs we have ℓ_t ≥ 1; thus

    τ_t ℓ_t = min{ C, 4γ_1γ_2 ℓ_t / (k_t^1γ_2 + k_t^2γ_1) } ℓ_t
            ≥ min{ C, 4γ_1γ_2 / (k_t^1γ_2 + k_t^2γ_1) }
            ≥ min{ C, 4γ_1γ_2 / (R_1γ_2 + R_2γ_1) } = ρ.

Combining this observation with the inequality above yields

    (ρ/2) M ≤ (γ_1/2) ||h − g^{(1)}||² + (γ_2/2) ||g^{(2)}||² + C Σ_{t=1}^T ℓ(g^{(1)}, g^{(2)}; t).

The theorem follows directly by multiplying both sides by 2/ρ.
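As a worked instantiation of Theorem 2 (added here for concreteness, under the assumption that Gaussian kernels are used as in our experiments), we have κ_1(x, x) = κ_2(x, x) = 1, so one can take R_1 = R_2 = 1 and ρ = min{ C, 4γ_1γ_2/(γ_1 + γ_2) }. With the settings γ_1 = γ_2 = 1 and C = 5 used later in Section 6.3, ρ = min{5, 2} = 2, and the bound becomes

    M ≤ (1/2) ||h − g^{(1)}||² + (1/2) ||g^{(2)}||² + 5 Σ_{t=1}^T ℓ(g^{(1)}, g^{(2)}; t).

In particular, if the source classifier h is already close to a good first-view classifier g^{(1)}, the first term is small and the bound is dominated by the cumulative hinge loss of the best two-view comparator.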
6. Experimental Results

In this section, we evaluate the empirical performance of the two proposed kinds of OTL algorithms.

6.1. Experimental Testbed and Setup for Homogeneous OTL

Our first experiment evaluates the performance of OTL on homogeneous data. We compare our OTL technique against other popular online learning techniques, including the Passive-Aggressive algorithm ("PA") (Crammer et al., 2006), which does not exploit any knowledge from the source domain, and a variant of it, the PA method Initialized with the Old classifier h, denoted "PAIO" for short. For our OTL technique, in addition to Algorithm 1 we also implement another variant obtained by fixing the ensemble weights of the OTL algorithm to 1/2, denoted "OTL(fixed)"; this helps us examine whether the proposed weighting strategy is effective. The original PA algorithm was proposed for learning linear models (Crammer et al., 2006); in our experiments, we adapt all the algorithms to the kernel setting.

To examine the performance extensively, we test all the algorithms on several benchmark machine learning datasets, including "w7a", a dataset without concept drift, and "usenet2", a dataset with concept drift, which can be downloaded from http://mlkd.csd.auth.gr/concept_drift.html. Besides, we also create another concept-drifting dataset, named "newsgroup4", based on the "newsgroup20" dataset downloaded from the LIBSVM web site (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools). The details of "newsgroup4" are shown in Table 1.

Table 1. The class distribution of dataset newsgroup4. The 1,600 examples are divided into four consecutive segments (example ids 0-400, 401-800, 801-1200, 1201-1600); within each segment, two of the four newsgroups (comp.windows.x, rec.sport.hockey, sci.space, talk.politics.mideast) are labeled positive and the other two negative, and the positive pair changes across segments, which creates concept drift.

For an OTL experiment, we must split each dataset into two parts: (1) training data for the source domain, and (2) test data for online learning in the target domain. For the two concept-drifting datasets, we split each of them into two parts according to their sequential order: "usenet2" (300+1200) and "newsgroup4" (400+1200); for "w7a", which has no concept drift, we randomly split it into two parts (10000+14692) and repeat the split 20 times. Finally, we adopt the (kernel) PA algorithm to build the baseline classifier in the source domain.

All the algorithms employ a Gaussian kernel. For fair comparison and simplicity, we set the two kernel parameters to σ_1 = 4 and σ_2 = 8 for all the datasets and algorithms, and the parameter C is set to 5 for all algorithms on every dataset. We evaluate the performance of the online learning methods by measuring the standard mistake rate. We also report the total number of support vectors, to examine the sparsity of the resulting classifiers, and the average running time, to compare the efficiency of the algorithms.

6.2. Evaluation of Homogeneous OTL Tasks

Table 2 summarizes the performance of the compared algorithms on the three datasets.

Table 2. Results on the datasets of the homogeneous domain.

w7a (n=24692, d=300)
Algorithm     Mistake (%)      Support Vectors (#)    Time (s)
PA            2.86 ± 0.05      1639.95 ± 23.96        0.96
PAIO          2.34 ± 0.06      2556.20 ± 30.95        1.82
OTL(fixed)    2.22 ± 0.05      3045.75 ± 33.99        2.22
OTL           1.87 ± 0.01      3045.75 ± 33.99        2.37

usenet2 (n=1500, d=99)
Algorithm     Mistake (%)      Support Vectors (#)    Time (s)
PA            49.33 ± 0        949 ± 0                0.04
PAIO          47.92 ± 0        1116 ± 0               0.04
OTL(fixed)    42.67 ± 0        1203 ± 0               0.05
OTL           34.42 ± 0        1203 ± 0               0.07

newsgroup4 (n=1600, d=62062)
Algorithm     Mistake (%)      Support Vectors (#)    Time (s)
PA            42.50 ± 0        1188 ± 0               0.04
PAIO          51.75 ± 0        1585 ± 0               0.05
OTL(fixed)    38.83 ± 0        1536 ± 0               0.06
OTL           37.58 ± 0        1536 ± 0               0.08
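The evaluation protocol above (online mistake rate, number of support vectors, and running time over a single pass through the target stream) can be expressed as a simple driver loop. The sketch below is illustrative only, reusing the hypothetical `KernelClassifier` and `otl_step` helpers from the earlier sketches and assuming the target data are given as numpy arrays.

```python
import time
import numpy as np

def run_online_evaluation(h, X_target, y_target, step_fn=otl_step, **step_kwargs):
    """Stream the target-domain data once and report the metrics used in Table 2."""
    f = KernelClassifier()            # target-domain model, f_1 = 0
    w = [0.5, 0.5]                    # ensemble weights, w_{1,1} = w_{2,1} = 1/2
    mistakes, start = 0, time.time()
    for x, y in zip(X_target, y_target):
        _, made_mistake = step_fn(h, f, w, x, int(y), **step_kwargs)
        mistakes += int(made_mistake)
    return {
        "mistake_rate": mistakes / len(y_target),
        "support_vectors": len(f.alpha),
        "time_seconds": time.time() - start,
    }
```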
Several observations can be drawn from the experimental results. First of all, on the "w7a" dataset, which has no concept drift, all three transfer algorithms outperform the baseline (PA), and the proposed OTL algorithm achieves the best performance among them. Further, on the two concept-drifting datasets, the behaviors of the compared algorithms differ considerably. PAIO improves the mistake rate only slightly on "usenet2" and fails to improve over the baseline on "newsgroup4". The two OTL algorithms improve the mistake rates on both datasets, and OTL is more effective than OTL(fixed), which uses fixed combination weights. These results show that, without careful design, online transfer may suffer from negative transfer when facing a serious concept-drifting problem.

Finally, Figure 3 shows the average mistake rates over the course of the OTL process on the three datasets. Consistent with the previous results, the proposed OTL algorithm achieves the best results on all datasets. In particular, on the newsgroup4 dataset all three algorithms suffer from the concept-drifting event at the very beginning of the OTL process; however, the proposed OTL algorithm rapidly improves its performance as it receives more examples, while PAIO fails to improve because it depends too heavily on the knowledge inherited from the source domain. This again verifies the efficacy of the proposed method.

[Figure 3. Experimental results of average mistake rates on the homogeneous OTL tasks: average mistake rate versus number of samples for PA, PAIO, OTL(fixed), and OTL on (a) w7a, (b) usenet2, and (c) newsgroup4.]

6.3. Experimental Testbed and Setup for Heterogeneous OTL

In this section, we evaluate the empirical performance of the proposed Co-regularized Online Transfer Learning (COTL) algorithm on heterogeneous OTL tasks. We compare our COTL technique with the PA algorithm, which does not exploit knowledge from the source domain. Similarly to before, we implement a variant of the PA algorithm that uses only the first view of the data and is initialized with h from the source domain, denoted "PAIO". We also implement a variant of COTL whose first-view classifier is initialized with the zero function, denoted "COTL0"; this method lets us examine the importance of engaging the function h learned from the source domain. Finally, we implement another baseline that simply combines the above PAIO classifier on the first view with a PA classifier learned from the second view on the target domain, denoted "SCT" for short.

To examine the performance extensively, we test all the algorithms on several benchmark datasets from machine learning repositories, including "a7a", "german", "mushrooms", "spambase" and "w7a", which can be downloaded from the LIBSVM website.

For each dataset, we randomly split it into two parts, source versus target, as shown in Table 3. In the partition, to meet the setup of the heterogeneous OTL task, the source-domain data are associated with only the first half of the feature space, while the target-domain data include the whole feature space.

Table 3. Summary of the data sets used for the heterogeneous OTL tasks.
              source/old domain          target/new domain
Datasets      Number      Dimension      Number      Dimension
a7a           6100        61             10000       123
mushrooms     3000        56             5124        112
spambase      2000        28             2601        57
w7a           10000       150            14692       300
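A data-preparation sketch for this setup might look as follows; the helper name and the random-split interface are assumptions for illustration, with the source portion keeping only the first half of the features and the target portion keeping them all, as in Table 3.

```python
import numpy as np

def make_heterogeneous_split(X, y, n_source, rng=None):
    """Randomly split (X, y) into a source part (first half of the features only)
    and a target part (all features), following the heterogeneous OTL setup."""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(y))
    src_idx, tgt_idx = order[:n_source], order[n_source:]
    m = X.shape[1] // 2                      # old feature space = first half of dimensions
    X_source, y_source = X[src_idx][:, :m], y[src_idx]
    X_target, y_target = X[tgt_idx], y[tgt_idx]
    return (X_source, y_source), (X_target, y_target), m
```

For example, on a7a (d = 123) this gives a 61-dimensional source view, matching the dimensions reported in Table 3.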
All the algorithms in comparison employ a Gaussian kernel. For fair comparison and simplicity, for all the datasets and algorithms we set the regularization parameters to γ_1 = γ_2 = 1, the kernel parameters of the two views to σ_1 = σ_2 = 4, and the kernel parameter on the whole feature space to σ = 8. In addition, the parameter C is set to 5 for all algorithms on every dataset. We conduct 20 different random permutations and report the averaged results. We evaluate the performance of the online learning methods by their mistake rates, and we also report the total number of support vectors, to examine the sparsity of the resulting classifiers, as well as the running time of the compared algorithms.

6.4. Evaluation of Heterogeneous OTL Tasks

Table 4 summarizes the performance of all the algorithms for heterogeneous OTL on the four datasets.

Table 4. Results on the datasets of the heterogeneous domain.

a7a
Algorithm   Mistake (%)      Support Vectors (#)     Time (s)
PA          22.08 ± 0.34     4266.05 ± 47.10         1.49
PAIO        22.60 ± 0.31     7115.00 ± 33.15         3.55
COTL0       21.62 ± 0.29     8403.20 ± 86.34         3.25
SCT         26.06 ± 0.36     11842.20 ± 56.15        6.16
COTL        21.32 ± 0.27     11063.30 ± 87.58        5.71

mushrooms
Algorithm   Mistake (%)      Support Vectors (#)     Time (s)
PA          2.56 ± 0.11      944.30 ± 17.93          0.23
PAIO        0.66 ± 0.07      865.75 ± 10.94          0.24
COTL0       1.10 ± 0.09      1450.30 ± 36.68         0.36
SCT         1.10 ± 0.10      2198.35 ± 26.22         0.53
COTL        0.38 ± 0.04      1779.90 ± 34.63         0.47

spambase
Algorithm   Mistake (%)      Support Vectors (#)     Time (s)
PA          25.04 ± 0.66     1761.30 ± 16.40         0.17
PAIO        14.41 ± 0.57     1742.65 ± 13.46         0.21
COTL0       12.78 ± 0.34     2399.30 ± 24.97         0.25
SCT         12.71 ± 0.40     3667.30 ± 23.89         0.40
COTL        11.17 ± 0.43     2993.10 ± 27.49         0.34

w7a
Algorithm   Mistake (%)      Support Vectors (#)     Time (s)
PA          3.85 ± 0.08      1780.60 ± 23.28         0.96
PAIO        3.53 ± 0.05      2774.45 ± 32.96         1.91
COTL0       3.34 ± 0.07      3432.00 ± 43.78         1.89
SCT         3.22 ± 0.06      4427.85 ± 39.33         3.03
COTL        3.04 ± 0.06      4480.50 ± 57.85         3.08

Several observations can be drawn from the results. First of all, among all the algorithms, the PA algorithm, which does not exploit knowledge from the source domain, achieves the highest mistake rate in most cases; this shows the importance of knowledge transfer in an OTL task. Second, on all the datasets the COTL algorithm attains the smallest mistake rate, which validates that the proposed technique is effective for knowledge transfer in online learning tasks. Of course, this gain comes at some cost: by examining the number of support vectors and the running time, we find that the COTL technique usually produces denser classifiers and takes more time. This is unavoidable, as the COTL algorithm makes use of the old classifier from the source domain. Finally, Figure 4 shows the details of the COTL processes on the four datasets; similar observations can be made from these results, which again verify that the proposed OTL method is effective and promising.

[Figure 4. Experimental results of average mistake rates on the heterogeneous OTL tasks: average mistake rate versus number of samples for PA, PAIO, COTL0, SCT, and COTL on (a) a7a, (b) mushrooms, (c) spambase, and (d) w7a.]
7. Conclusion

In this paper, we studied the new problem of Online Transfer Learning (OTL), which aims to transfer knowledge from a source domain to an online learning task on a target domain. We addressed two OTL tasks in different settings and presented two novel OTL algorithms. We offered theoretical analysis of the mistake bounds of the proposed OTL algorithms and extensively examined their empirical performance. Encouraging results show that the proposed algorithms are effective. Through this work, we hope to encourage further investigation of OTL on harder problems, e.g., how to perform heterogeneous OTL from complex data with completely different feature representations.

Appendix

Proof of Theorem 1

Proof. First notice that whenever there is a mistake at some step t, we have |w_{1,t} Π(h(x_{2t})) + w_{2,t} Π(f_t(x_{2t})) − Π(y_{2t})| ≥ 1/2. Thus,

    Σ_{t=1}^T ℓ*( w_{1,t} Π(h(x_{2t})) + w_{2,t} Π(f_t(x_{2t})), Π(y_{2t}) )
      = Σ_{t=1}^T ( w_{1,t} Π(h(x_{2t})) + w_{2,t} Π(f_t(x_{2t})) − Π(y_{2t}) )² ≥ (1/4) M.

Combining the above fact with Proposition 1, we have

    (1/4) M ≤ min{ Δ_h, Δ_f } + 2 ln 2,

where Δ_h = Σ_{t=1}^T ℓ*(Π(h(x_{2t})), Π(y_{2t})) and Δ_f = Σ_{t=1}^T ℓ*(Π(f_t(x_{2t})), Π(y_{2t})). The theorem follows directly by multiplying both sides by 4.

Proof of Proposition 2

Proof. It is easy to see that the optimization problem (5) is equivalent to

    min_{f^{(1)} ∈ H1, f^{(2)} ∈ H2, ξ}  (γ_1/2) ||f^{(1)} − f_t^{(1)}||²_{H1} + (γ_2/2) ||f^{(2)} − f_t^{(2)}||²_{H2} + C ξ
    s.t.  ξ ≥ 1 − y_{2t} (1/2)( f^{(1)}(x_{2t}^{(1)}) + f^{(2)}(x_{2t}^{(2)}) )  and  ξ ≥ 0.

The Lagrangian of the above optimization is

    L(f^{(1)}, f^{(2)}, ξ, τ_t, λ)
      = (γ_1/2) ||f^{(1)} − f_t^{(1)}||²_{H1} + (γ_2/2) ||f^{(2)} − f_t^{(2)}||²_{H2} + C ξ
        + τ_t ( 1 − y_{2t} (1/2)( f^{(1)}(x_{2t}^{(1)}) + f^{(2)}(x_{2t}^{(2)}) ) − ξ ) − λ ξ
      = (γ_1/2) ||f^{(1)} − f_t^{(1)}||²_{H1} + (γ_2/2) ||f^{(2)} − f_t^{(2)}||²_{H2} + (C − τ_t − λ) ξ
        + τ_t ( 1 − y_{2t} (1/2)( f^{(1)}(x_{2t}^{(1)}) + f^{(2)}(x_{2t}^{(2)}) ) ),                  (9)

where τ_t ≥ 0 and λ ≥ 0 are Lagrange multipliers. We now find the minimum of the Lagrangian with respect to f^{(1)}, f^{(2)} and ξ by setting their partial derivatives to zero, which gives

    f^{(i)} = f_t^{(i)} + (τ_t / (2γ_i)) y_{2t} κ_i(x_{2t}^{(i)}, ·)  for i = 1, 2,   and   C − τ_t − λ = 0.

Since λ ≥ 0, we conclude τ_t ≤ C, so τ_t ∈ [0, C]. Plugging these three equations back into Eq. (9), we obtain

    L(τ_t) = −τ_t² ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) + τ_t ℓ_t.

Setting the derivative of the above expression to zero yields

    τ_t = ℓ_t / ( k_t^1/(4γ_1) + k_t^2/(4γ_2) ) = 4γ_1γ_2 ℓ_t / ( k_t^1 γ_2 + k_t^2 γ_1 ).

Finally, combining this with τ_t ∈ [0, C], we obtain the solution τ_t = min{ C, 4γ_1γ_2 ℓ_t / ( k_t^1 γ_2 + k_t^2 γ_1 ) }.
Proof of Lemma 1

Proof. Let

    Δ_t = (γ_1/2)( ||f_t^{(1)} − g^{(1)}||² − ||f_{t+1}^{(1)} − g^{(1)}||² ) + (γ_2/2)( ||f_t^{(2)} − g^{(2)}||² − ||f_{t+1}^{(2)} − g^{(2)}||² ).

Summing over t, the terms telescope, so that

    Σ_{t=1}^T Δ_t = (γ_1/2)( ||f_1^{(1)} − g^{(1)}||² − ||f_{T+1}^{(1)} − g^{(1)}||² ) + (γ_2/2)( ||f_1^{(2)} − g^{(2)}||² − ||f_{T+1}^{(2)} − g^{(2)}||² )
                  ≤ (γ_1/2) ||h − g^{(1)}||² + (γ_2/2) ||g^{(2)}||²,

where we used f_1^{(1)} = h and f_1^{(2)} = 0.

Second, when τ_t = 0 we have f_{t+1}^{(i)} = f_t^{(i)} for i = 1, 2, so clearly Δ_t = 0. When τ_t > 0, f_{t+1}^{(i)} = f_t^{(i)} + (τ_t/(2γ_i)) y_{2t} κ_i(x_{2t}^{(i)}, ·), and a direct computation gives

    Δ_t = −(τ_t y_{2t}/2)( f_t^{(1)}(x_{2t}^{(1)}) + f_t^{(2)}(x_{2t}^{(2)}) )
          + (τ_t y_{2t}/2)( g^{(1)}(x_{2t}^{(1)}) + g^{(2)}(x_{2t}^{(2)}) )
          − ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) τ_t².                                            (10)

Since τ_t > 0, we also have ℓ_t = 1 − y_{2t} (1/2)( f_t^{(1)}(x_{2t}^{(1)}) + f_t^{(2)}(x_{2t}^{(2)}) ), which is equivalent to

    (y_{2t}/2)( f_t^{(1)}(x_{2t}^{(1)}) + f_t^{(2)}(x_{2t}^{(2)}) ) = 1 − ℓ_t.

In addition,

    ℓ(g^{(1)}, g^{(2)}; t) = [ 1 − y_{2t} (1/2)( g^{(1)}(x_{2t}^{(1)}) + g^{(2)}(x_{2t}^{(2)}) ) ]_+
                           ≥ 1 − y_{2t} (1/2)( g^{(1)}(x_{2t}^{(1)}) + g^{(2)}(x_{2t}^{(2)}) ),

so that (y_{2t}/2)( g^{(1)}(x_{2t}^{(1)}) + g^{(2)}(x_{2t}^{(2)}) ) ≥ 1 − ℓ(g^{(1)}, g^{(2)}; t).

Combining these two facts with inequality (10), we obtain

    Δ_t ≥ τ_t ( −(1 − ℓ_t) + 1 − ℓ(g^{(1)}, g^{(2)}; t) ) − ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) τ_t²
        = τ_t ( ℓ_t − ℓ(g^{(1)}, g^{(2)}; t) − ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) τ_t ).

Hence, we have the conclusion

    Σ_{t=1}^T τ_t ( ℓ_t − ℓ(g^{(1)}, g^{(2)}; t) − ( k_t^1/(8γ_1) + k_t^2/(8γ_2) ) τ_t ) ≤ Σ_{t=1}^T Δ_t ≤ (γ_1/2) ||h − g^{(1)}||² + (γ_2/2) ||g^{(2)}||²,

which completes the proof.

References

Argyriou, A., Maurer, A., and Pontil, M. An algorithm for transfer learning in a heterogeneous environment. In European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pp. 71-85, Antwerp, Belgium, 2008.

Arnold, A., Nallapati, R., and Cohen, W. W. A comparative study of methods for transductive transfer learning. In Proceedings of the 7th IEEE International Conference on Data Mining Workshops, pp. 77-82, Washington, DC, USA, 2007.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006. ISBN 0521841089.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.

Dai, W., Yang, Q., Xue, G.-R., and Yu, Y. Self-taught clustering. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pp. 200-207, Helsinki, Finland, 2008.

Daumé III, H. and Marcu, D. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101-126, 2006.

Dekel, O., Long, P. M., and Singer, Y. Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8:2233-2264, 2007.

Freund, Y. and Schapire, R. E. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.

Li, Y. and Long, P. M. The relaxed online maximum margin algorithm. In Advances in Neural Information Processing Systems (NIPS), pp. 498-504, 1999.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2009.

Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-407, 1958.

Shalev-Shwartz, S. Online Learning: Theory, Algorithms, and Applications. PhD thesis, 2007.

Yang, H., Xu, Z., King, I., and Lyu, M. Online learning for group lasso. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 2010.

Zhao, P., Hoi, S. C. H., and Jin, R. DUOL: A double updating approach for online learning. In Advances in Neural Information Processing Systems (NIPS), pp. 2259-2267, 2009.