Linear Algorithms for Online Multitask Classification

Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile

Abstract

We design and analyze interacting online algorithms for multitask classification that perform better than independent learners whenever the tasks are related in a certain sense. We formalize task relatedness in different ways, and derive formal guarantees on the performance advantage provided by interaction. Our online analysis gives new stimulating insights into previously known co-regularization techniques, such as the multitask kernels and the margin correlation analysis for multiview learning. In the last part we apply our approach to spectral co-regularization: we introduce a natural matrix extension of the quasi-additive algorithm for classification and prove bounds depending on certain unitarily invariant norms of the matrix of task coefficients.

1 Introduction

A fundamental and fascinating problem in learning theory is the study of learning algorithms that influence each other. Although much is known about the behavior of individual strategies that learn a classification or regression task from examples, our understanding of interacting learning systems is still fairly limited. In this paper, we investigate this problem from the specific viewpoint of multitask learning, where each one of K > 1 learners has to solve a different task (typically, K classification or K regression tasks). In particular, we focus on multitask binary classification, where learners are online linear classifiers (such as the Perceptron algorithm). Our goal is to design online interacting algorithms that perform better than independent learners whenever the tasks are related in a certain sense. We formalize task relatedness in different ways, and derive formal guarantees on the performance advantage provided by interaction. Our analysis builds on ideas that have been developed in the context of statistical learning.
In the statistical analysis of multitask learning (e.g., [2, 3, 4, 11, 24, 26]) the starting point is a regularized empirical loss functional or Tikhonov functional --see, e.g., [10].* In the presence of several tasks, this functional is extended to allow for co-regularization among tasks. Roughly speaking, the co-regularization term forces the set of predictive functions for the K tasks to lie "close" to each other. This co-regularization term is typically a squared norm in some Hilbert space of functions. We follow the approach pioneered by [11], where the K estimated solutions are linear functions parametrized by u = (u_1, ..., u_K) ∈ R^{Kd} and the co-regularization is u⊤Au, where A is a positive definite matrix enforcing certain relations among tasks. The key observation in [11] is the following. Assume the instances of the multitask problem are of the form (x_t, i_t), where x_t ∈ R^d is an attribute vector and i_t ∈ {1, ..., K} indicates the task x_t refers to. Then one can reduce the K learning problems in R^d to a single problem in R^{Kd} by choosing a suitable embedding of the pairs (x_t, i_t) into a common RKHS space R^{Kd} with inner product ⟨u, v⟩ = u⊤Av. This reduction allows us to solve a multitask learning problem by running any kernel-based single-task learning algorithm with the "multitask kernel" defined above. We build on this result by considering a natural online protocol for multitask linear classification. Within this protocol we analyze the performance of the Perceptron algorithm and some of its variants when run with a multitask kernel. Because such kernels are linear, we are not restricted to using kernel-based algorithms for efficiency reasons.

* Giovanni Cavallanti and Nicolò Cesa-Bianchi are with DSI, Università di Milano, Italy. Claudio Gentile is with DICOM, Università dell'Insubria, Italy; he is the corresponding author. Email: claudio.gentile@uninsubria.it
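A minimal numpy sketch of this reduction, under the assumption that the induced linear multitask kernel is K((x, i), (x', j)) = (A^{-1})_{ij} ⟨x, x'⟩, a form consistent with the compound-space update w ← w + y (A ⊗ I_d)^{-1} φ used later in Section 3; the particular interaction matrix A below is an arbitrary positive definite choice for illustration, not one the paper recommends:

```python
import numpy as np

def multitask_kernel(x, i, xp, j, A_inv):
    """Linear multitask kernel: K((x, i), (x', j)) = (A^{-1})_{ij} <x, x'>."""
    return A_inv[i, j] * (x @ xp)

rng = np.random.default_rng(0)
K, d = 3, 5
A = 2.0 * np.eye(K) + np.ones((K, K))   # hypothetical positive definite interaction matrix
A_inv = np.linalg.inv(A)
Akron_inv = np.kron(A_inv, np.eye(d))   # (A (x) I_d)^{-1} = A^{-1} (x) I_d

w = np.zeros(K * d)                      # compound weight vector in R^{Kd}
support = []                             # (y_s, x_s, i_s) of past mistaken trials
for _ in range(20):
    i = int(rng.integers(K))
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)               # instances normalized, as in the paper
    y = float(rng.choice([-1.0, 1.0]))
    phi = np.zeros(K * d)
    phi[i * d:(i + 1) * d] = x           # multitask instance: x in the i-th block
    # The primal margin w . phi equals the kernel expansion over past updates.
    margin = w @ phi
    kernel_margin = sum(ys * multitask_kernel(xs, js, x, i, A_inv)
                        for ys, xs, js in support)
    assert abs(margin - kernel_margin) < 1e-9
    if np.sign(margin) != y:             # mistake: shared update on all tasks
        w += y * (Akron_inv @ phi)
        support.append((y, x, i))
```

The assertion inside the loop checks the claimed equivalence: running the Perceptron in the compound space with the (A ⊗ I_d)^{-1} update produces exactly the margins of a kernel Perceptron run with the multitask kernel.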
In Section 3 we consider the kernel Perceptron algorithm, and derive mistake bounds for the multitask kernels proposed in [11]. This reveals new insights into the role played by the regularizer matrix A. First, we see that the update in the kernel space defined by A factorizes into a "shared update" of K interacting Perceptrons, each running in R^d, thus providing a basic example of interactive online learning. Second, we exploit the simplicity of the mistake bound analysis to precisely quantify the performance advantage brought by the multitask approach over K independent online algorithms. In particular, in Subsections 3.2 and 3.3 we give examples where the mistake bound is used to guide the design of A. The first part of the paper is concluded with Sections 4 and 5, where we show multitask versions and mistake bound analyses for the second-order Perceptron algorithm of [7] and for the p-norm algorithm of [13, 14]. In the remaining sections of the paper, we depart from the approach of [11] to investigate the power of online learning when other forms of co-regularization are used. In Section 6 we consider the case when instances belong to a space that is different for each task, and the similarity among tasks is measured by comparing their margin sequences (see, e.g., [6, 27]). We introduce and analyze a new multitask variant of the second-order Perceptron algorithm. The mistake bound that we prove is a margin-based version of the bound shown in Subsection 3.2 for the multitask Perceptron. Finally, in Section 7 we consider spectral co-regularization [4] for online multiview learning. Here diversity is penalized using a norm function defined on the d × K matrix U = [u_1, ..., u_K] of view vectors. In the spirit of [19], we interpret this penalization function as a potential defined over arbitrary matrices.
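To make the "shared update" factorization concrete, here is a minimal numpy sketch of K interacting Perceptrons in which a mistake on task i_t triggers a full-rate update on that task's weight vector and a half-rate update on all the others, as described in Section 3; the learning rate, the zero-margin-counts-as-mistake convention, and the data format are our own assumptions:

```python
import numpy as np

def halfupdate_perceptrons(examples, K, d, eta=1.0):
    """K interacting Perceptrons: a mistake on task i_t updates w_{i_t} at the
    full rate eta and every other task vector at the half rate eta / 2."""
    W = np.zeros((K, d))                   # one weight vector per task
    mistakes = 0
    for i_t, x, y in examples:             # y in {-1, +1}
        if np.sign(W[i_t] @ x) != y:       # zero margin counts as a mistake
            mistakes += 1
            W += (eta / 2) * y * x         # half-update on every task ...
            W[i_t] += (eta / 2) * y * x    # ... plus the other half on task i_t
    return W, mistakes
```

Because every mistake moves all K vectors in the same direction, the task vectors stay close to one another, which is precisely the effect the co-regularization analysis exploits.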
We then define a natural extension of the quasi-additive algorithm of [14, 20] to a certain class of matrix norms, and provide a mistake bound analysis depending on the singular values of U. The results we obtain are similar to those in [28, 29, 30], though we are able to overcome some of the difficulties encountered therein via a careful study of matrix differentials.

In the next section, we introduce the basic online multitask protocol and define the multitask Perceptron algorithm. In order to keep the presentation as simple as possible, and to elucidate the interactive character of the updates, we delay the introduction of kernels until the proof of the mistake bound. In our initial online protocol, at each time step the multitask learner receives a pair (x_t, i_t), where i_t is the task index for time t and x_t is the instance for task i_t. Note that we view multitask learning as a sequential problem where at each time step the learner works on a single adversarially chosen task, rather than working simultaneously on all tasks (a similar protocol was investigated in [1] in the context of prediction with expert advice). One of the advantages of this approach is that, in most cases, the cost of running our multitask algorithms has a mild dependence on the number K of tasks. We also remark that linear algorithms for online multitask learning have been studied in [9]. However, those results are sharply different from ours, as they do not depend on task relatedness.

2 Learning protocol and notation

There are K binary classification tasks indexed by 1, ..., K. At each time step t = 1, 2, ... the learner receives a task index i_t ∈ {1, ..., K} and the corresponding instance vector x_t ∈ R^d (which we henceforth assume to be normalized, ‖x_t‖ = 1); throughout this paper all vectors are assumed to be column vectors. Based on this information, the learner outputs a binary prediction ŷ_t ∈ {-1, 1} and then observes the correct label y_t ∈ {-1, 1} for task i_t. As in the standard worst-case online learning model, no assumptions are made on the mechanism generating the sequence (x_t, y_t), t ≥ 1. Moreover, similarly to [1], the sequence of tasks i_t is also generated in an adversarial manner. We compare the learner's performance to that of a reference predictor that is allowed to use a different linear classifier for each of the K tasks. In particular, we compare the learner's mistake count to

    inf over u_1, ..., u_K ∈ R^d of  Σ_t ℓ_t(u_{i_t})    (1)

where ℓ_t(u_{i_t}) = [1 - y_t u_{i_t}⊤ x_t]_+ is the hinge loss of the reference linear classifier (or task vector) u_{i_t} at time t. Our goal is to design algorithms that make fewer mistakes than K independent learners when the tasks are related, and do not perform much worse than that when the tasks are completely unrelated. In the first part of the paper we use Euclidean distance to measure task relatedness. We say that the K tasks are related if there exist reference task vectors u_1, ..., u_K ∈ R^d having small pairwise distances ‖u_i - u_j‖, and achieving a small cumulative hinge loss in the sense of (1). More general notions of relatedness are investigated in later sections.

3 The multitask Perceptron algorithm

We first introduce a simple multitask version of the Perceptron algorithm. This algorithm keeps a weight vector for each task and updates all weight vectors at each mistake using the Perceptron rule with different learning rates. More precisely, let w_{i,t-1} be the weight vector associated with task i at the beginning of time step t. If we are forced (by the adversary) to predict on task i_t, and our prediction happens to be wrong, we update w_{i_t,t-1} through the standard additive rule w_{i_t,t} = w_{i_t,t-1} + η y_t x_t (where η > 0 is a constant learning rate) but, at the same time, we perform a "half-update" on the remaining K - 1 Perceptrons, i.e., we set w_{j,t} = w_{j,t-1} + (η/2) y_t x_t for each j ≠ i_t. This rule is based on the simple observation that, in the presence of related tasks, any update step that is good for one Perceptron should also be good for the others. Clearly, this rule keeps the weight vectors w_{j,t}, j = 1, ..., K, always close to each other.

The above algorithm is a special case of the multitask Perceptron algorithm described below. This more general algorithm updates each weight vector w_{j,t} through a learning rate which is an arbitrary positive definite function of the pair (j, i_t). These learning rates are defined by a K × K interaction matrix A. The pseudocode for the multitask Perceptron algorithm using a generic interaction matrix A is given in Figure 1. At the beginning of each time step, the counter s stores the number of mistakes made so far (plus one). The (column) vector φ_t ∈ R^{Kd} denotes the multitask instance defined by

    φ_t = ( 0, ..., 0, x_t⊤, 0, ..., 0 )⊤    (2)

with (i_t - 1)·d zero components before x_t and (K - i_t)·d zero components after it, where x_t ∈ R^d is the instance vector for the current task i_t. (Note that ‖φ_t‖ = 1 since the instances x_t are normalized.) The weights of the K Perceptrons are maintained in a compound vector w_s = (w_{1,s}⊤, ..., w_{K,s}⊤)⊤, with w_{j,s} ∈ R^d for all j. The algorithm predicts y_t through the sign ŷ_t of the i_t-th Perceptron's margin w_{s-1}⊤ φ_t = w_{i_t,s-1}⊤ x_t. Then, if prediction and true label disagree, the update rule becomes w_s = w_{s-1} + y_t (A ⊗ I_d)^{-1} φ_t, where ⊗ denotes the Kronecker product.

Parameters: positive definite K × K interaction matrix A.
Initialization: w_0 = 0 ∈ R^{Kd}, s = 1.
At each time t = 1, 2, ... do the following:
1. Observe task number i_t ∈ {1, ..., K} and the corresponding instance vector x_t ∈ R^d;
2. Build the associated multitask instance φ_t ∈ R^{Kd};
3. Predict ŷ_t = SGN(w_{s-1}⊤ φ_t) ∈ {-1, +1};
4. Get label y_t ∈ {-1, +1};
5. If ŷ_t ≠ y_t then update: w_s = w_{s-1} + y_t (A ⊗ I_d)^{-1} φ_t,  s ← s + 1.

Theorem 1 The number of mistakes m made by the multitask Perceptron algorithm in Figure 1, run with interaction matrix (3) on any finite multitask sequence (φ_1, y_1), (φ_2, y_2), ...
∈ R^{Kd} × {-1, 1}, satisfies

    m ≤ inf over u ∈ R^{Kd} of [ Σ_{t∈M} ℓ_t(u) + (2/(K+1)) u⊤Au + sqrt( (2/(K+1)) u⊤Au · Σ_{t∈M} ℓ_t(u) ) ]

where M is the set of mistaken trial indices, and

    u⊤Au = Σ_{i=1}^K ‖u_i‖² + Σ_{1≤i<j≤K} ‖u_i - u_j‖².

Footnote 1: If K d this bound is equivalent (apart from constant factors) to the mistake bound for the single-task zero-threshold Winnow algorithm of [22].

6 Learning tasks in heterogeneous spaces

In this section we slightly deviate from the approach followed so far. We consider the case when the K task vectors u_i may live in different spaces: u_i ∈ R^{d_i}, i = 1, ..., K. This is a plausible assumption when attributes associated with different tasks have a completely different meaning. In such a case, the correlation among tasks is naturally measured through the task margins u_i⊤x (or views) --see, e.g., the previous work of [6, 27] for a similar approach in the context of semi-supervised learning. In order to allow for the views to interact in a meaningful way, we slightly modify the learning protocol of Section 2. We now assume that, at each time t, we receive the adversarial choice of task i_t together with all instance vectors x_{i,t} ∈ R^{d_i} for i = 1, ..., K.

The co-regularization terms motivating our algorithm are proportional to the squared distance between the margin u_{i_t}⊤x_{i_t,t} of task i_t and the average margin (1/K) Σ_{j=1}^K u_j⊤x_{j,t} of all tasks:

    bK ( u_{i_t}⊤ x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤ x_{j,t} )².    (13)

The resulting algorithm, the multiview-based multitask Perceptron (MMPERC, Figure 3), takes an input parameter b > 0 (playing the same role as the one in Section 3.2), and maintains in its internal state a multitask matrix A (initialized to the identity matrix I) and a Perceptron multitask weight vector v (initialized to the zero vector). The subscript s plays the same role as in the previous algorithms. Unlike the algorithms in previous sections, MMPERC observes the task number i_t and the multitask instance x_t = (x_{1,t}⊤, x_{2,t}⊤, ..., x_{K,t}⊤)⊤ made up of the instance vectors of all tasks. Then MMPERC computes a tentative matrix Ã_t to be used for prediction. Matrix Ã_t is obtained by adding the rank-one positive semidefinite matrix M_t (defined in Figure 3) to the previous matrix A_{s-1}. Here φ_{j,t} is the (d_1 + ··· + d_K)-dimensional vector

    φ_{j,t} = ( 0, ..., 0, x_{j,t}⊤, 0, ..., 0 )⊤

with d_1 + ··· + d_{j-1} zero components before x_{j,t} and d_{j+1} + ··· + d_K zero components after it, for j = 1, ..., K. Observe that M_t has been set so as to make the quadratic form u⊤M_t u coincide with the regularization term (13). Similarly to the algorithms of previous sections, the tentative matrix Ã_t and the current Perceptron vector v_{s-1} are used for predicting the true label y_t. If prediction ŷ_t and label y_t disagree, both v and A get updated. In particular, A_s is set to the tentative matrix Ã_t. In this protocol we call example the triple (i_t, x_t, y_t).

Like the results contained in the previous sections, our analysis provides a multitask bound on the number of prediction mistakes which is comparable to the one obtained on a single task, plus a penalization term due to task relatedness. However, though this algorithm is a second-order prediction method, we only give a first-order analysis that disregards the eigenstructure of the data. This is due to the technical difficulty of handling a time-varying matrix A that in trial t includes all instance vectors x_{1,t}, ..., x_{K,t}.

Parameters: b > 0.
Initialization: A_0 = I, v_0 = 0 ∈ R^{d_1+···+d_K}, s = 1.
At each time t = 1, 2, ... do the following:
1. Observe task number i_t ∈ {1, ..., K};
2. Observe the multitask instance vector x_t = (x_{1,t}⊤, ..., x_{K,t}⊤)⊤ ∈ R^{d_1+···+d_K};
3. Build the associated multitask instance φ_{i_t,t};
4. Set Ã_t = A_{s-1} + M_t, where M_t = bK ( φ_{i_t,t} - φ̄_t/K )( φ_{i_t,t} - φ̄_t/K )⊤ and φ̄_t = Σ_{j=1}^K φ_{j,t};
5. Predict ŷ_t = SGN( v_{s-1}⊤ (Ã_t)^{-1} φ_{i_t,t} ) ∈ {-1, +1};
6. Get label y_t ∈ {-1, +1};
7. If ŷ_t ≠ y_t then update: v_s = v_{s-1} + y_t φ_{i_t,t},  A_s = Ã_t,  s ← s + 1.

Figure 3: The multiview-based multitask Perceptron algorithm (MMPERC).

Theorem 6 The number of mistakes m made by the algorithm in Figure 3, run on any multitask sequence (i_1, x_1, y_1), (i_2, x_2, y_2), ... satisfies, for all u = (u_1⊤, ..., u_K⊤)⊤,

    m ≤ Σ_{t∈M} ℓ_t(u) + (K(b+1) - b)/(bK(K-1) + K) u⊤A_m u + sqrt( (K(b+1) - b)/(bK(K-1) + K) u⊤A_m u · Σ_{t∈M} ℓ_t(u) )

where ℓ_t(u) = [1 - y_t u_{i_t}⊤x_{i_t,t}]_+ and

    u⊤A_m u = Σ_{i=1}^K ‖u_i‖² + bK Σ_{t∈M} ( u_{i_t}⊤x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤x_{j,t} )²,

M being the set of mistaken trials.

Remark It is the factor

    (K(b+1) - b)/(bK(K-1) + K) u⊤A_m u    (14)

that quantifies the relatedness among tasks (the leading constant bK in the second term of u⊤A_m u is needed for scaling purposes). Note that the notion of relatedness provided by u⊤A_m u is analogous to the one used in Section 3.2. As suggested in Section 3.3, other measures of similarity are possible. The parameter b allows for a limited trade-off between Σ_{i=1}^K ‖u_i‖² and the cumulative "margin deviation"

    Σ_{t∈M} ( u_{i_t}⊤x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤x_{j,t} )².    (15)

In particular, setting b = 0 corresponds to running K independent (first-order) Perceptron algorithms (no task relatedness), while letting b go to infinity is optimal only when each one of the margin deviation terms in (15) is zero (maximal task relatedness). Notice that setting b = 1 gives

    (14) = (2K - 1)/K² [ Σ_{i=1}^K ‖u_i‖² + K Σ_{t∈M} ( u_{i_t}⊤x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤x_{j,t} )² ]

yielding a gain of a factor K whenever the K tasks are significantly related, as measured by (15).

Proof of Theorem 6: Although the general structure of the analysis is based on the second-order Perceptron proof, we use the special properties of I + M_t in order to compute the contribution of K and b to the final bound. Let t = t_s be the time step when the s-th mistake occurs. We write

    v_s⊤A_s^{-1}v_s = (v_{s-1} + y_t φ_{i_t,t})⊤ A_s^{-1} (v_{s-1} + y_t φ_{i_t,t})    (from the update rule in Figure 3)
                    = v_{s-1}⊤A_s^{-1}v_{s-1} + 2 y_t v_{s-1}⊤A_s^{-1}φ_{i_t,t} + φ_{i_t,t}⊤A_s^{-1}φ_{i_t,t}
                    ≤ v_{s-1}⊤A_s^{-1}v_{s-1} + φ_{i_t,t}⊤A_s^{-1}φ_{i_t,t}
                    ≤ v_{s-1}⊤A_{s-1}^{-1}v_{s-1} + φ_{i_t,t}⊤(I + M_t)^{-1}φ_{i_t,t}.    (16)

In order to prove the first inequality, note that on the s-th mistaken trial A_s = Ã_t and y_t v_{s-1}⊤(Ã_t)^{-1}φ_{i_t,t} ≤ 0. In order to prove the second inequality note that A_s - A_{s-1} and A_s - (I + M_t) are both positive semidefinite.

We now focus on computing the quadratic form φ_{i_t,t}⊤(I + M_t)^{-1}φ_{i_t,t}. Recall that φ̄_t = Σ_{j=1}^K φ_{j,t} is the sum of the orthonormal vectors φ_{1,t}, ..., φ_{K,t}. Thus, from the very definition of M_t in Figure 3 it is easy to verify that

    (I + M_t) φ_{i_t,t} = (b(K-1) + 1) φ_{i_t,t} - (b(K-1)/K) φ̄_t.    (17)

Also, since M_t φ̄_t = 0, we have (I + M_t)φ̄_t = φ̄_t, and thus (I + M_t)^{-1}φ̄_t = φ̄_t. Hence (17) allows us to write

    φ_{i_t,t} = (b(K-1) + 1) (I + M_t)^{-1} φ_{i_t,t} - (b(K-1)/K) φ̄_t.

Taking the inner product of both sides with φ_{i_t,t}, and solving for φ_{i_t,t}⊤(I + M_t)^{-1}φ_{i_t,t} yields

    φ_{i_t,t}⊤(I + M_t)^{-1}φ_{i_t,t} = (K(b+1) - b)/(bK(K-1) + K).

Substituting this into (16), recalling that v_0 = 0, and summing over s = 1, ..., m we obtain

    v_m⊤A_m^{-1}v_m ≤ m (K(b+1) - b)/(bK(K-1) + K).

A lower bound on the left-hand side can be obtained in the standard way (details omitted),

    m - Σ_{t∈M} ℓ_t(u) ≤ ( v_m⊤A_m^{-1}v_m )^{1/2} ( u⊤A_m u )^{1/2},

so that

    v_m⊤A_m^{-1}v_m ≥ ( m - Σ_{t∈M} ℓ_t(u) )² / (u⊤A_m u).

Finally, recall that by construction

    u⊤A_m u = Σ_{i=1}^K ‖u_i‖² + bK Σ_{t∈M} ( u_{i_t}⊤x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤x_{j,t} )².

Putting everything together and solving for m gives the desired bound.

6.1 Implementation in dual variables

Like the second-order multitask Perceptron algorithm, MMPERC too can be formulated in dual variables. Due to the need to handle K instance vectors at a time, the implementation we sketch below has an extra linear dependence on K, as compared to the one in Section 4.1. Let t = t_s be the trial when the s-th mistake occurs, z_r be the vector z_r = √(bK) (φ_{i_r,r} - φ̄_r/K), S_s be the matrix whose columns are the vectors φ_{i_r,r} corresponding to mistaken time steps r up to time t, and Z_s be the matrix whose columns are the corresponding vectors z_r. It is easy to verify that A_s = I + Z_s Z_s⊤ and v_{s-1} = S_s y_s, where y_s is as in Section 4.1. From the inversion formula (e.g., [17, Ch. 0])

    (I + Z_s Z_s⊤)^{-1} = I - Z_s (I + Z_s⊤Z_s)^{-1} Z_s⊤

we see that the margin v_{s-1}⊤(Ã_t)^{-1}φ_{i_t,t} in Figure 3 can be computed as

    v_{s-1}⊤(Ã_t)^{-1}φ_{i_t,t} = y_s⊤ S_s⊤ φ_{i_t,t} - y_s⊤ S_s⊤ Z_s (I + Z_s⊤Z_s)^{-1} Z_s⊤ φ_{i_t,t}.

Calculating the vectors S_s⊤φ_{i_t,t} and Z_s⊤φ_{i_t,t} in the above expression takes O(s) inner products, while other O(s) inner products are required to incrementally compute S_s⊤Z_s from S_{s-1}⊤Z_{s-1}. Finally, when calculating the inverse (I_s + Z_s⊤Z_s)^{-1} we exploit the same updating scheme as in Section 4.1,

    I_s + Z_s⊤Z_s = [ I_{s-1} + Z_{s-1}⊤Z_{s-1}    Z_{s-1}⊤z_s ]
                    [ z_s⊤Z_{s-1}                  1 + z_s⊤z_s ]

where Z_{s-1}⊤z_s and z_s⊤z_s require O(Ks) and O(K) inner products, respectively. Hence (I_s + Z_s⊤Z_s)^{-1} can be computed from (I_{s-1} + Z_{s-1}⊤Z_{s-1})^{-1} with O(Ks) extra inner products and O(s²) additional scalar multiplications.

7 Spectral co-regularization

An extreme case of multitask learning is the multiview setting, where all tasks share the same label. In the multiview protocol, at each time step t the learner receives K instances x_{1,t}, ..., x_{K,t} ∈ R^d, predicts with ŷ_t ∈ {-1, 1}, and then receives the correct binary label y_t, which --unlike the general multitask case-- is the same for all instances. What distinguishes multiview learning from a standard online binary classification task, defined on instances of the form x_t = (x_{1,t}⊤, ..., x_{K,t}⊤)⊤, is that in multiview one postulates the existence of K vectors u_1, ..., u_K such that each u_i is a good linear classifier for the corresponding sequence (x_{i,1}, y_1), (x_{i,2}, y_2), ... of examples. In this respect a natural baseline for online multiview learning is the algorithm that chooses a random index i ∈ {1, ..., K} and then runs a Perceptron algorithm on the sequence of examples (x_{i,t}, y_t) for t ≥ 1. Equivalently, we may think of running K Perceptrons in parallel and then averaging their mistakes (see the remark after Theorem 10).
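The inversion-formula shortcut used in the dual-variable implementation of Section 6.1 can be sanity-checked numerically; the dimensions below and the random matrices standing in for S_s, Z_s, y_s, and φ_{i_t,t} are arbitrary illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, s = 8, 4                          # ambient dimension d_1 + ... + d_K, mistakes so far
S = rng.normal(size=(D, s))          # stand-in for S_s (columns phi_{i_r, r})
Z = rng.normal(size=(D, s))          # stand-in for Z_s (columns z_r)
y = rng.choice([-1.0, 1.0], size=s)  # stand-in for y_s
phi = rng.normal(size=D)             # stand-in for phi_{i_t, t}

A = np.eye(D) + Z @ Z.T              # A_s = I + Z_s Z_s^T
v = S @ y                            # v_{s-1} = S_s y_s

# Direct margin v^T A^{-1} phi requires a D x D solve ...
direct = v @ np.linalg.solve(A, phi)
# ... while (I + Z Z^T)^{-1} = I - Z (I + Z^T Z)^{-1} Z^T needs only an s x s solve.
small = np.eye(s) + Z.T @ Z
dual = y @ (S.T @ phi) - y @ (S.T @ Z) @ np.linalg.solve(small, Z.T @ phi)
assert abs(direct - dual) < 1e-8
```

The point of the identity is exactly this dimension shift: all quantities in the dual form are built from inner products of instance vectors, so the computation scales with the number of mistakes rather than with the ambient dimension.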
In this section we design multiview learning algorithms that in certain cases are able to perform significantly better than the above baseline. In order to do so, we arrange the K d-dimensional instances x_{1,t}, ..., x_{K,t} in a d × K instance matrix X_t and penalize diversity among the reference vectors u_1, ..., u_K using a matrix norm of the d × K matrix U = [u_1, ..., u_K]. We focus our attention on matrix norms that are unitarily invariant. Such norms are of the form ‖U‖_f = f(σ(U)), where σ(U) = (σ_1, ..., σ_r) is the vector of the ordered singular values σ_1 ≥ ··· ≥ σ_r ≥ 0 of U and f : R^r → R, with r = min{K, d}, is an absolutely symmetric function --that is, f is invariant under coordinate permutations and sign changes. Matrix norms of this form control the distribution of the singular values of U, thus acting as spectral co-regularizers for the reference vectors (see, e.g., [4] for very recent developments on this subject). Known examples are the Schatten p-norms, defined by ‖U‖_{sp} = ‖σ(U)‖_p. For instance, the Schatten 2-norm is the Frobenius norm. For p = 1 the Schatten p-norm becomes the trace norm, a good proxy for the rank of U, since ‖U‖_{s1} = ‖σ(U)‖_1 while ‖σ(U)‖_0 = rank of U. In order to obtain a multiview bound that depends on ‖σ(U)‖_p, we extend the dual norm analysis of Section 5 to matrices. We start by defining the matrix version of the quasi-additive algorithm of [14, 20]. We remark that matrix versions of the EG algorithm and the Winnow algorithm (related to specific instances of the quasi-additive algorithm) have been proposed and analyzed in [28, 29, 30]. When dealing with the trace norm regularizer, their algorithms could be specialized to our multiview framework to obtain mistake bounds comparable to ours. See the brief discussion at the end of this section. The quasi-additive matrix algorithm maintains a d × K matrix W. Initially, W_0 is the zero matrix.
If s - 1 mistakes have been made in the first t - 1 time steps, then the prediction at time t is SGN(⟨W_{s-1}, X_t⟩), where X_t is the d × K matrix [x_{1,t}, ..., x_{K,t}] in which x_{i,t} is the instance vector associated with the i-th view at time t, and ⟨W_{s-1}, X_t⟩ = TR(W_{s-1}⊤X_t) is the standard matrix inner product. If a mistake occurs at time t, then W_{s-1} is updated with W_s = ∇ ½‖V_s‖_f², where, in turn, the d × K matrix V_s is updated using a matrix Perceptron rule, V_s = V_{s-1} + y_t X_t.

A useful property of norms ‖U‖_f = f(σ(U)) is that their duals are easily computed.

Theorem 7 [21, Theorem 2.4] If f is absolutely symmetric and ‖U‖_f = f(σ(U)), then ‖U‖_{f*} = f*(σ(U)), where f* is the convex dual of f.

In the case of Schatten p-norms, we have that the dual of the vector norm ‖·‖_p is the vector norm ‖·‖_q, where q = p/(p-1) is the dual coefficient of p. An important feature of the quasi-additive algorithms for vectors is that the mapping µ : v → µ(v) = ∇½‖v‖² is invertible whenever the vector norm ‖·‖ satisfies certain regularity properties (see, e.g., [8, page 294]). We call such norms Legendre. Hence, we "do not lose information" when the primal weight vector v is mapped to the weight vector w = µ(v) used for prediction. In particular, we always have that µ^{-1}(v) = ∇½‖v‖*², where ‖·‖* is the dual norm of a Legendre norm ‖·‖ (see, e.g., [8, Lemma 11.5]). This property is preserved when the algorithm is applied to matrices. This is shown by the following result where, without loss of generality, we prove the property for f(σ(U)) rather than ½f(σ(U))². (In fact, if f is Legendre, then ½f² is also Legendre.)

Theorem 8 Let f be a Legendre function and ‖U‖_f = f(σ(U)). Then (∇‖·‖_f)^{-1} = ∇‖·‖_{f*}.

The following result will be useful.

Theorem 9 [21, Theorem 3.1] Let U DIAG[σ(A)] V⊤ be an SVD decomposition of a matrix A. If ‖·‖_f is a matrix norm such that ‖A‖_f = f(σ(A)) for f Legendre, then ∇‖A‖_f = U DIAG[∇f(σ(A))] V⊤.

Proof of Theorem 8: If A = U DIAG[σ(A)] V⊤, then by Theorem 9

    ∇‖A‖_f = U DIAG[∇f(σ(A))] V⊤.

Therefore, using Theorem 9 again,

    ∇‖ ∇‖A‖_f ‖_{f*} = U DIAG[∇f*(∇f(σ(A)))] V⊤ = U DIAG[σ(A)] V⊤ = A    (since f is Legendre),

concluding the proof.

We now develop a general analysis of the quasi-additive matrix algorithm, and then specialize it (in Theorem 10 below) to a multiview algorithm operating with a Schatten p-norm regularizer. We start by adapting the dual norm proof of Section 5 to an arbitrary matrix norm ‖A‖_f = f(σ(A)), where f is Legendre. Let V_m be the primal weight matrix after any number m ≥ 1 of mistakes. By Taylor expanding ½‖V_s‖_f² around V_{s-1} for each s = 1, ..., m, and using y_t ⟨W_{s-1}, X_t⟩ ≤ 0, we get

    ½‖V_m‖_f² ≤ Σ_{s=1}^m D(V_s, V_{s-1})

where

    D(V_s, V_{s-1}) = ½‖V_s‖_f² - ½‖V_{s-1}‖_f² - y_t ⟨W_{s-1}, X_t⟩

is the matrix Bregman divergence. Fix any d × K matrix U. First, we derive a matrix version of the convex inequality for vector norms. We use a classical result by von Neumann (see, e.g., [18, p. 182]) stating that ⟨V, U⟩ ≤ σ(V)⊤σ(U) for any two d × K matrices U and V. We have

    ⟨V_m, U⟩ ≤ σ(V_m)⊤σ(U)    (by von Neumann's inequality)
            ≤ ‖V_m‖_f ‖U‖_{f*}    (by the convex inequality for norms and Theorem 7).

In addition, we have ⟨U, V_s⟩ = ⟨U, V_{s-1}⟩ + y_t ⟨U, X_t⟩. Thus we obtain

    Km - Σ_{t∈M} ℓ_t(U) ≤ ⟨U, V_m⟩ ≤ ‖V_m‖_f ‖U‖_{f*},

where ℓ_t(U) = [K - y_t Σ_{i=1}^K u_i⊤x_{i,t}]_+ ≤ Σ_{i=1}^K ℓ_t(u_i). Solving for m gives

    m ≤ (1/K) Σ_{t∈M} ℓ_t(U) + (‖U‖_{f*}/K) sqrt( 2 Σ_{s=1}^m D(V_s, V_{s-1}) ).    (18)

Equation (18) is our general starting point for analyzing multiview algorithms working under spectral co-regularization. The analysis reduces to bounding from above the second-order term D(·, ·) for the specific matrix norm ‖·‖_f under consideration. For the rest of this section we focus on the Schatten 2p-norm ‖V‖_{s2p} = ‖σ(V)‖_{2p}, where V is a generic d × K matrix, and p is a positive integer (thus 2p is an even number ≥ 2). Note that, in general, ‖V‖²_{s2p} = ( TR(V⊤V)^p )^{1/p}. In order to prove our main result, stated below, we use some facts from differential matrix calculus. A standard reference on this subject is [23], to which the reader is referred.

Theorem 10 The number of mistakes m made by the 2p-norm matrix Perceptron, run on any sequence (X_1, y_1), (X_2, y_2), ..., satisfies, for any d × K matrix U,

    m ≤ (1/K) Σ_{t∈M} ℓ_t(U) + (2p - 1) ( ‖X‖_{s2p} ‖U‖_{s2q} / K )² + ( sqrt(2p - 1) ‖X‖_{s2p} ‖U‖_{s2q} / K ) sqrt( (1/K) Σ_{t∈M} ℓ_t(U) )

where ‖X‖_{s2p} = max_{t∈M} ‖X_t‖_{s2p}, ‖U‖_{s2q} is the Schatten 2q-norm of U, with 2q = 2p/(2p-1), and M is the set of mistaken trial indices.

Remark Similarly to the vector case, when the parameter p is chosen to be logarithmic in r = min{d, K}, the p-norm matrix Perceptron penalizes diversity using the trace norm of U. If the vectors u_i span a subspace of size K, and instances tend to have K nonzero singular values of roughly the same magnitude, then ‖U‖_{s2q} ≈ ‖U‖_{s2} while ‖X‖²_{s2p} ≈ ‖X‖²_{s∞} ≈ ‖X‖²_{s2}/K. Hence this choice of p leads (at least in the linearly separable case) to a factor K improvement over the bound achieved by the matrix algorithm based on the Frobenius norm (p = 1 in Theorem 10), which amounts to running K independent Perceptrons in parallel and then averaging their mistakes.

The following trace inequality is our main technical lemma.

Lemma 11 Let A, B be positive semidefinite matrices, of size d × d and K × K respectively, with the same nonzero eigenvalues. Let X be an arbitrary real matrix of size d × K. Then, for any pair of nonnegative exponents l, g ≥ 0, we have

    TR(X⊤ A^l X B^g) ≤ ( TR(X⊤X)^p )^{1/p} ( TR A^{(l+g)q} )^{1/q}

where 1/p + 1/q = 1, p ≥ 1.

Proof of Lemma 11: We first consider the case l ≤ g. By the Cauchy-Schwarz and Hölder inequalities applied to traces [23, Ch. 11] we have

    TR(X⊤ A^l X B^g) = TR( B^{(g-l)/2} X⊤ A^l X B^{(g+l)/2} )    (19)
    ≤ ( TR(X⊤ A^{2l} X B^{g-l}) )^{1/2} ( TR(X⊤ X B^{g+l}) )^{1/2}
    ≤ ( TR(X⊤ A^{2l} X B^{g-l}) )^{1/2} ( T_p(X⊤X) T_q(B^{g+l}) )^{1/2},

where we used the shorthand T_r(Z) = (TR Z^r)^{1/r}. In the case when l > g we can simply swap the roles of X⊤A^l and X B^g and reduce to the previous case. We now recursively apply the above argument to the left-hand side of (19). Recalling that T_q(A) = T_q(B) and T_p(X⊤X) = T_p(XX⊤), after n steps we obtain

    TR(X⊤ A^l X B^g) ≤ ( TR(X⊤ A^{l'} X B^{g'}) )^{1/2^n} ( T_p(X⊤X) T_q(A^{l+g}) )^{Σ_{i=1}^n (1/2)^i}

for some pair of exponents l', g' ≥ 0 such that l' + g' = l + g. Since for any such pair l', g' we have TR(X⊤ A^{l'} X B^{g'}) < ∞, we can take the limit as n → ∞. Recalling that Σ_{i=1}^∞ (1/2)^i = 1 completes the proof.

Proof of Theorem 10: We set for brevity G : R^{d×K} → R,

    G(V) = ½ ( TR(V⊤V)^p )^{1/p} = ½ ‖V‖²_{s2p}.

Thus in our case D(V_s, V_{s-1}) = G(V_{s-1} + y_t X_t) - G(V_{s-1}) - y_t ⟨W_{s-1}, X_t⟩. Since G(V) is twice continuously differentiable,⁴ by the mean-value theorem we can write

    D(V_s, V_{s-1}) = ½ VEC(X_t)⊤ H_G(ξ) VEC(X_t),    (20)

where VEC(X) is the standard columnwise vectorization of a matrix X, H_G denotes the Hessian matrix of the (matrix) function G, and ξ is some matrix on the line connecting V_{s-1} to V_s.

⁴ In fact G is C^∞ everywhere but (possibly) in zero, since TR(V⊤V)^p is just a polynomial function of the entries of V. Moreover, TR(V⊤V)^p = 0 if and only if V is the zero matrix.

Using the rules of matrix differentiation, the gradient ∇G of G is ∇G(V) = c(V) VEC(D), where we set for brevity D = V B^{p-1} and c(V) = TR(B^p)^{1/p - 1}, with B = V⊤V. Taking the second derivative H_G = ∇∇G gives

    H_G(V) = VEC(D) (∇c(V))⊤ + c(V) ∇(D).

Now, recalling the definition of c(V), it is not hard to show that VEC(D)(∇c(V))⊤ is the Kd × Kd matrix

    2(1 - p) TR(B^p)^{1/p - 2} VEC(D) VEC(D)⊤.

Since p ≥ 1 this matrix is negative semidefinite, and we can disregard it when bounding from above the quadratic form (20). Thus we continue by considering only the second term c(V)∇(D) of the Hessian matrix. We have

    ∇(D) = B^{p-1} ⊗ I_d + (I_K ⊗ V) ∇(B^{p-1}),

where

    ∇(B^{p-1}) = [ Σ_{ν=0}^{p-2} B^ν ⊗ B^{p-2-ν} ] (I_{K²} + T_K) (I_K ⊗ V⊤),

and T_K is the K² × K² commutation matrix such that T_K VEC(M) = VEC(M⊤) for any K × K matrix M. Putting together,

    (20) ≤ (c(V)/2) VEC(X_t)⊤ (B^{p-1} ⊗ I_d) VEC(X_t)
        + (c(V)/2) VEC(X_t)⊤ (I_K ⊗ V) Θ (I_{K²} + T_K) (I_K ⊗ V⊤) VEC(X_t),    (21)

where we used the shorthand Θ = Σ_{ν=0}^{p-2} B^ν ⊗ B^{p-2-ν}. We now bound the two terms on the right-hand side of (21). By well-known relationships between Kronecker products and the VEC operator (see [23, Ch. 3]) we can write

    (c(V)/2) VEC(X_t)⊤ (B^{p-1} ⊗ I_d) VEC(X_t) = (c(V)/2) TR(X_t⊤ X_t B^{p-1}) ≤ ½ ( TR(X_t⊤X_t)^p )^{1/p},

independent of V. The majorization follows from Hölder's inequality applied to the positive semidefinite matrices X_t⊤X_t and B^{p-1}. Moreover, it is easy to see that the symmetric matrices Θ and T_K commute, thereby sharing the same eigenspace. Hence we can bound from above the second term in (21) by

    c(V) VEC(X_t)⊤ [ Σ_{ν=0}^{p-2} B^ν ⊗ A^{p-1-ν} ] VEC(X_t),

where we set A = VV⊤. Again, [23, Ch. 3] allows us to rewrite this quadratic form as the sum of traces

    c(V) Σ_{ν=0}^{p-2} TR( X_t⊤ A^{p-1-ν} X_t B^ν ).

Since A and B have the same nonzero eigenvalues, we can apply Lemma 11 to each term, and put together as in (21). After simplifying we get

    (20) ≤ ½ (2p - 1) ( TR(X_t⊤X_t)^p )^{1/p} = ½ (2p - 1) ‖X_t‖²_{s2p}.

The desired bound is then obtained by plugging back into (18), solving the resulting inequality for m, and overapproximating.

The result of Theorem 10 is similar to those obtained in [28, 29, 30]. However, unlike those previous results, our matrix algorithm has no learning rate to tune (a property inherited from the vector p-norm Perceptron of [13]) and works for arbitrary nonsquare matrices U. We also observe that the prediction ŷ_t = SGN( TR(W_{s-1}⊤ X_t) ) of the p-norm matrix Perceptron reduces to computing the sign of TR( (V_{s-1}⊤V_{s-1})^{p-1} V_{s-1}⊤ X_t ) (recall the expression for ∇G calculated in the proof of Theorem 10). Since matrix V_s is updated additively, it is clear that both V_{s-1}⊤V_{s-1} and V_{s-1}⊤X_t depend on the instance vectors x_{i,t} only through inner products. This allows us to turn our p-norm matrix Perceptron into a kernel-based algorithm, and repeat the analysis given here using a standard RKHS formalism.
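The 2p-norm matrix Perceptron just discussed admits a short sketch. The margin below implements the sign of TR((V⊤V)^{p-1} V⊤X), which (up to the positive factor c(V)) is the gradient expression from the proof of Theorem 10; the Schatten-norm helper follows the definition ‖U‖_{sp} = ‖σ(U)‖_p, while the mistake-driven loop and its conventions are our own assumptions:

```python
import numpy as np

def schatten(U, p):
    """Schatten p-norm of U: the vector p-norm of its singular values."""
    return np.linalg.norm(np.linalg.svd(U, compute_uv=False), p)

def matrix_margin(V, X, p):
    """Margin <W, X> with W = grad (1/2)||V||_{s2p}^2
    = tr((V^T V)^p)^(1/p - 1) * V (V^T V)^(p-1)."""
    B = V.T @ V
    # positive scale factor; irrelevant for the sign, guarded at V = 0
    c = np.trace(np.linalg.matrix_power(B, p)) ** (1.0 / p - 1.0) if np.any(V) else 1.0
    return c * np.sum((V @ np.linalg.matrix_power(B, p - 1)) * X)

def run_2p_perceptron(stream, d, K, p=1):
    """Mistake-driven matrix Perceptron: the primal matrix V is updated additively."""
    V = np.zeros((d, K))
    mistakes = 0
    for X, y in stream:                            # X is d x K, y in {-1, +1}
        if np.sign(matrix_margin(V, X, p)) != y:   # zero margin counts as a mistake
            mistakes += 1
            V += y * X                             # matrix Perceptron rule V <- V + y X
    return V, mistakes
```

For p = 1 this reduces to the Frobenius-norm case (K independent Perceptrons run in parallel), while choosing p logarithmic in min{d, K} approximates the trace-norm regularizer discussed in the Remark after Theorem 10.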
8 Conclusions and ongoing research

In this work we have studied the problem of learning multiple tasks online, using various approaches to formalize the notion of task relatedness. Our results can be extended in different directions. First, in Sections 3.2 and 6 it might be interesting to devise methods for dynamically adapting the b parameter as new data are revealed. Second, the mistakes of the second-order algorithm MMPERC have been bounded using a first-order analysis. A more refined analysis should reveal in the bound an explicit dependence on the spectral properties of the data. It is also worth noting that the significance of the mistake bound obtained for MMPERC relies on the fact that the algorithm assumes the tasks to be different, although somewhat related. In the case when the K observed instances share the same label at each time step (as in multiview learning), we could not devise an algorithm with a significant advantage over the following trivial baseline: run K Perceptrons in parallel and use the sum of margins to predict. Third, it would be interesting to study the problem of learning multiple tasks when K predictions have to be output at each step. In this case the main difficulty appears to be the control of the interaction among the instances observed at each time step. Fourth, it would also be interesting to prove lower bounds on the number of mistakes as a function of task relatedness. Finally, since multitask learning problems arise naturally in a variety of settings, ranging from biology to news processing, we plan to complement the theoretical analysis presented in this paper with experimental results, so as to evaluate the empirical performance of our algorithms in real-world scenarios.

Acknowledgments. Thanks to Sham Kakade, Massi Pontil, and Francis Bach for useful discussions. We also thank the COLT 2008 reviewers for their comments. This work was supported in part by the PASCAL2 Network of Excellence under EC grant no. 216886.
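The trivial multiview baseline mentioned in the conclusions admits a very short implementation. The sketch below (our own illustration with an invented toy dataset; not code or experiments from the paper) runs K Perceptrons in parallel and predicts with the sum of their margins:

```python
# Sketch (ours): the trivial multiview baseline of K parallel Perceptrons
# predicting with the sum of their margins. All K views share one label.

def multiview_perceptron(examples, K, d, epochs=10):
    """examples: list of (views, y) with views a list of K vectors in R^d
    and y in {-1, +1}. Returns the weight vectors and the mistake count."""
    w = [[0.0] * d for _ in range(K)]
    mistakes = 0
    for _ in range(epochs):
        for views, y in examples:
            # Shared prediction: sign of the sum of the K margins.
            total = sum(w[k][i] * views[k][i]
                        for k in range(K) for i in range(d))
            yhat = 1 if total >= 0 else -1
            if yhat != y:
                mistakes += 1
                # On a mistake, every Perceptron updates with its own view.
                for k in range(K):
                    for i in range(d):
                        w[k][i] += y * views[k][i]
    return w, mistakes

# Invented 2-view toy data: both views carry the label in coordinate 0.
data = [([[1.0, 0.2], [0.9, -0.1]], 1),
        ([[-1.0, 0.4], [-1.1, 0.3]], -1),
        ([[0.8, -0.5], [1.2, 0.0]], 1),
        ([[-0.9, -0.3], [-0.7, 0.6]], -1)]

w, m = multiview_perceptron(data, K=2, d=2)
final_margin = sum(w[k][i] * data[0][0][k][i]
                   for k in range(2) for i in range(2))
print(m, final_margin > 0)
```

On separable data like this the baseline converges quickly; the open question raised above is whether interaction between the K learners can provably beat it.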
This publication only reflects the authors' views.

References

[1] J. Abernethy, P.L. Bartlett & A. Rakhlin, Multitask learning with expert advice, Proc. 20th COLT, pp. 484-498, Springer, 2007.
[2] R.K. Ando & T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data, JMLR, 6, pp. 1817-1853, MIT Press, 2005.
[3] A. Argyriou, T. Evgeniou & M. Pontil, Multi-task feature learning, NIPS 19, pp. 41-48, MIT Press, 2007.
[4] A. Argyriou, C.A. Micchelli, M. Pontil & Y. Ying, A spectral regularization framework for multi-task structure learning, NIPS 20, MIT Press, 2008.
[5] K. Azoury & M. Warmuth, Relative loss bounds for on-line density estimation with the exponential family of distributions, Machine Learning, 43, pp. 211-246, 2001.
[6] U. Brefeld, T. Gaertner, T. Scheffer & S. Wrobel, Efficient co-regularised least squares regression, Proc. 23rd ICML, 2006.
[7] N. Cesa-Bianchi, A. Conconi & C. Gentile, A second-order Perceptron algorithm, SIAM Journal on Computing, 34/3, pp. 640-668, 2005.
[8] N. Cesa-Bianchi & G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006.
[9] O. Dekel, P.M. Long & Y. Singer, Online learning of multiple tasks with a shared loss, JMLR, 8, pp. 2233-2264, 2007.
[10] T. Evgeniou, M. Pontil & T. Poggio, Regularization networks and Support Vector Machines, Advances in Computational Mathematics, 13/1, pp. 1-50, Springer, 2000.
[11] T. Evgeniou, C. Micchelli & M. Pontil, Learning multiple tasks with kernel methods, JMLR, 6, pp. 615-637, MIT Press, 2005.
[12] Y. Freund & R.E. Schapire, Large margin classification using the Perceptron algorithm, Machine Learning, 37/3, pp. 277-296, 1999.
[13] C. Gentile, The robustness of the p-norm algorithms, Machine Learning, 53, pp. 265-299, 2003.
[14] A. Grove, N. Littlestone & D. Schuurmans, General convergence results for linear discriminant updates, Machine Learning, 43, pp. 173-210, 2001.
[15] M. Herbster, M. Pontil & L. Wainer, Online learning over graphs, Proc. 22nd ICML, pp. 305-312, ACM Press, 2005.
[16] L. Hogben, Handbook of Linear Algebra, Discrete Mathematics and Its Applications, 39, CRC Press, 2006.
[17] R.A. Horn & C.R. Johnson, Matrix Analysis, Cambridge University Press, 1985.
[18] R.A. Horn & C.R. Johnson, Topics in Matrix Analysis, Cambridge University Press, 1991.
[19] A. Jagota & M.K. Warmuth, Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence, Electr. Proc. 5th International Symposium on Artificial Intelligence and Mathematics, 1998. Electronic, http://rutcor.rutgers.edu/amai.
[20] J. Kivinen & M. Warmuth, Relative loss bounds for multidimensional regression problems, Machine Learning, 45, pp. 301-329, 2001.
[21] A.S. Lewis, The convex analysis of unitarily invariant matrix functions, Journal of Convex Analysis, 2, pp. 173-183, 1995.
[22] N. Littlestone, Mistake bounds and logarithmic linear-threshold learning algorithms, Ph.D. Thesis, University of California at Santa Cruz, 1989.
[23] J.R. Magnus & H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, revised edition, John Wiley, 1999.
[24] A. Maurer, Bounds for linear multi-task learning, JMLR, 7, pp. 117-139, MIT Press, 2006.
[25] R.T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
[26] D. Rosenberg & P. Bartlett, Rademacher complexity of co-regularized kernel classes, Proc. Artificial Intelligence and Statistics, 2007.
[27] V. Sindhwani, P. Niyogi & M. Belkin, A co-regularized approach to semi-supervised learning, Proc. ICML Workshop on Learning with Multiple Views, 2005.
[28] K. Tsuda, G. Raetsch & M.K. Warmuth, Matrix exponentiated gradient updates for on-line learning and Bregman projection, JMLR, 6, pp. 995-1018, 2005.
[29] M.K. Warmuth & D. Kuzmin, Online variance minimization, Proc. 19th COLT, Springer, 2006.
[30] M.K. Warmuth, Winnowing subspaces, Proc. 24th ICML, pp. 999-1006, ACM Press, 2007.