Linear Algorithms for Online Multitask Classification

Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile

Abstract

We design and analyze interacting online algorithms for multitask classification that perform better than independent learners whenever the tasks are related in a certain sense. We formalize task relatedness in different ways, and derive formal guarantees on the performance advantage provided by interaction. Our online analysis gives new stimulating insights into previously known co-regularization techniques, such as the multitask kernels and the margin correlation analysis for multiview learning. In the last part we apply our approach to spectral co-regularization: we introduce a natural matrix extension of the quasi-additive algorithm for classification and prove bounds depending on certain unitarily invariant norms of the matrix of task coefficients.

1 Introduction

A fundamental and fascinating problem in learning theory is the study of learning algorithms that influence each other. Although much is known about the behavior of individual strategies that learn a classification or regression task from examples, our understanding of interacting learning systems is still fairly limited. In this paper, we investigate this problem from the specific viewpoint of multitask learning, where each one of K > 1 learners has to solve a different task (typically, K classification or K regression tasks). In particular, we focus on multitask binary classification, where learners are online linear classifiers (such as the Perceptron algorithm). Our goal is to design online interacting algorithms that perform better than independent learners whenever the tasks are related in a certain sense. We formalize task relatedness in different ways, and derive formal guarantees on the performance advantage provided by interaction. Our analysis builds on ideas that have been developed in the context of statistical learning.
In the statistical analysis of multitask learning (e.g., [2, 3, 4, 11, 24, 26]) the starting point is a regularized empirical loss functional or Tikhonov functional --see, e.g., [10].* In the presence of several tasks, this functional is extended to allow for co-regularization among tasks. Roughly speaking, the co-regularization term forces the set of predictive functions for the K tasks to lie "close" to each other. This co-regularization term is typically a squared norm in some Hilbert space of functions. We follow the approach pioneered by [11], where the K estimated solutions are linear functions parametrized by u = (u_1, ..., u_K) ∈ R^{Kd} and the co-regularization is u⊤Au, where A is a positive definite matrix enforcing certain relations among tasks. The key observation in [11] is the following. Assume the instances of the multitask problem are of the form (x_t, i_t), where x_t ∈ R^d is an attribute vector and i_t ∈ {1, ..., K} indicates the task x_t refers to. Then one can reduce the K learning problems in R^d to a single problem in R^{Kd} by choosing a suitable embedding of the pairs (x_t, i_t) into a common RKHS space R^{Kd} with inner product ⟨u, v⟩ = u⊤Av. This reduction allows us to solve a multitask learning problem by running any kernel-based single-task learning algorithm with the "multitask kernel" defined above. We build on this result by considering a natural online protocol for multitask linear classification. Within this protocol we analyze the performance of the Perceptron algorithm and some of its variants when run with a multitask kernel. Because such kernels are linear, we are not restricted to using kernel-based algorithms for efficiency reasons.

* Giovanni Cavallanti and Nicolò Cesa-Bianchi are with DSI, Università di Milano, Italy. Claudio Gentile is with DICOM, Università dell'Insubria, Italy; he is the corresponding author. Email: claudio.gentile@uninsubria.it
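A minimal numpy sketch of this reduction, under the assumption that the induced linear multitask kernel is K((x, i), (x', j)) = (A^{-1})_{ij} ⟨x, x'⟩, a form consistent with the compound-space update w ← w + y (A ⊗ I_d)^{-1} φ used later in Section 3; the particular interaction matrix A below is an arbitrary positive definite choice for illustration, not one the paper recommends:

```python
import numpy as np

def multitask_kernel(x, i, xp, j, A_inv):
    """Linear multitask kernel: K((x, i), (x', j)) = (A^{-1})_{ij} <x, x'>."""
    return A_inv[i, j] * (x @ xp)

rng = np.random.default_rng(0)
K, d = 3, 5
A = 2.0 * np.eye(K) + np.ones((K, K))   # hypothetical positive definite interaction matrix
A_inv = np.linalg.inv(A)
Akron_inv = np.kron(A_inv, np.eye(d))   # (A (x) I_d)^{-1} = A^{-1} (x) I_d

w = np.zeros(K * d)                      # compound weight vector in R^{Kd}
support = []                             # (y_s, x_s, i_s) of past mistaken trials
for _ in range(20):
    i = int(rng.integers(K))
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)               # instances normalized, as in the paper
    y = float(rng.choice([-1.0, 1.0]))
    phi = np.zeros(K * d)
    phi[i * d:(i + 1) * d] = x           # multitask instance: x in the i-th block
    # The primal margin w . phi equals the kernel expansion over past updates.
    margin = w @ phi
    kernel_margin = sum(ys * multitask_kernel(xs, js, x, i, A_inv)
                        for ys, xs, js in support)
    assert abs(margin - kernel_margin) < 1e-9
    if np.sign(margin) != y:             # mistake: shared update on all tasks
        w += y * (Akron_inv @ phi)
        support.append((y, x, i))
```

The assertion inside the loop checks the claimed equivalence: running the Perceptron in the compound space with the (A ⊗ I_d)^{-1} update produces exactly the margins of a kernel Perceptron run with the multitask kernel.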
In Section 3 we consider the kernel Perceptron algorithm, and derive mistake bounds for the multitask kernels proposed in [11]. This reveals new insights into the role played by the regularizer matrix A. First, we see that the update in the kernel space defined by A factorizes into a "shared update" of K interacting Perceptrons, each running in R^d, thus providing a basic example of interactive online learning. Second, we exploit the simplicity of the mistake bound analysis to precisely quantify the performance advantage brought by the multitask approach over K independent online algorithms. In particular, in Subsections 3.2 and 3.3 we give examples where the mistake bound is used to guide the design of A. The first part of the paper is concluded with Sections 4 and 5, where we show multitask versions and mistake bound analyses for the second-order Perceptron algorithm of [7] and for the p-norm algorithm of [13, 14]. In the remaining sections of the paper, we depart from the approach of [11] to investigate the power of online learning when other forms of co-regularization are used. In Section 6 we consider the case when instances belong to a space that is different for each task, and the similarity among tasks is measured by comparing their margin sequences (see, e.g., [6, 27]). We introduce and analyze a new multitask variant of the second-order Perceptron algorithm. The mistake bound that we prove is a margin-based version of the bound shown in Subsection 3.2 for the multitask Perceptron. Finally, in Section 7 we consider spectral co-regularization [4] for online multiview learning. Here diversity is penalized using a norm function defined on the d × K matrix U = [u_1, ..., u_K] of view vectors. In the spirit of [19], we interpret this penalization function as a potential defined over arbitrary matrices.
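To make the "shared update" factorization concrete, here is a minimal numpy sketch of K interacting Perceptrons in which a mistake on task i_t triggers a full-rate update on that task's weight vector and a half-rate update on all the others, as described in Section 3; the learning rate, the zero-margin-counts-as-mistake convention, and the data format are our own assumptions:

```python
import numpy as np

def halfupdate_perceptrons(examples, K, d, eta=1.0):
    """K interacting Perceptrons: a mistake on task i_t updates w_{i_t} at the
    full rate eta and every other task vector at the half rate eta / 2."""
    W = np.zeros((K, d))                   # one weight vector per task
    mistakes = 0
    for i_t, x, y in examples:             # y in {-1, +1}
        if np.sign(W[i_t] @ x) != y:       # zero margin counts as a mistake
            mistakes += 1
            W += (eta / 2) * y * x         # half-update on every task ...
            W[i_t] += (eta / 2) * y * x    # ... plus the other half on task i_t
    return W, mistakes
```

Because every mistake moves all K vectors in the same direction, the task vectors stay close to one another, which is precisely the effect the co-regularization analysis exploits.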
We then define a natural extension of the quasi-additive algorithm of [14, 20] to a certain class of matrix norms, and provide a mistake bound analysis depending on the singular values of U. The results we obtain are similar to those in [28, 29, 30], though we are able to overcome some of the difficulties encountered therein via a careful study of matrix differentials.

In the next section, we introduce the basic online multitask protocol and define the multitask Perceptron algorithm. In order to keep the presentation as simple as possible, and to elucidate the interactive character of the updates, we delay the introduction of kernels until the proof of the mistake bound. In our initial online protocol, at each time step the multitask learner receives a pair (x_t, i_t), where i_t is the task index for time t and x_t is the instance for task i_t. Note that we view multitask learning as a sequential problem where at each time step the learner works on a single adversarially chosen task, rather than working simultaneously on all tasks (a similar protocol was investigated in [1] in the context of prediction with expert advice). One of the advantages of this approach is that, in most cases, the cost of running our multitask algorithms has a mild dependence on the number K of tasks. We also remark that linear algorithms for online multitask learning have been studied in [9]. However, those results are sharply different from ours, as they do not depend on task relatedness.

2 Learning protocol and notation

There are K binary classification tasks indexed by 1, ..., K. At each time step t = 1, 2, ... the learner receives a task index i_t ∈ {1, ..., K} and the corresponding instance vector x_t ∈ R^d (which we henceforth assume to be normalized, ‖x_t‖ = 1); throughout this paper all vectors are assumed to be column vectors. Based on this information, the learner outputs a binary prediction ŷ_t ∈ {-1, 1} and then observes the correct label y_t ∈ {-1, 1} for task i_t. As in the standard worst-case online learning model, no assumptions are made on the mechanism generating the sequence (x_t, y_t), t ≥ 1. Moreover, similarly to [1], the sequence of tasks i_t is also generated in an adversarial manner. We compare the learner's performance to that of a reference predictor that is allowed to use a different linear classifier for each of the K tasks. In particular, we compare the learner's mistake count to

    inf over u_1, ..., u_K ∈ R^d of  Σ_t ℓ_t(u_{i_t})    (1)

where ℓ_t(u_{i_t}) = [1 - y_t u_{i_t}⊤ x_t]_+ is the hinge loss of the reference linear classifier (or task vector) u_{i_t} at time t. Our goal is to design algorithms that make fewer mistakes than K independent learners when the tasks are related, and do not perform much worse than that when the tasks are completely unrelated. In the first part of the paper we use Euclidean distance to measure task relatedness. We say that the K tasks are related if there exist reference task vectors u_1, ..., u_K ∈ R^d having small pairwise distances ‖u_i - u_j‖, and achieving a small cumulative hinge loss in the sense of (1). More general notions of relatedness are investigated in later sections.

3 The multitask Perceptron algorithm

We first introduce a simple multitask version of the Perceptron algorithm. This algorithm keeps a weight vector for each task and updates all weight vectors at each mistake using the Perceptron rule with different learning rates. More precisely, let w_{i,t-1} be the weight vector associated with task i at the beginning of time step t. If we are forced (by the adversary) to predict on task i_t, and our prediction happens to be wrong, we update w_{i_t,t-1} through the standard additive rule w_{i_t,t} = w_{i_t,t-1} + η y_t x_t (where η > 0 is a constant learning rate) but, at the same time, we perform a "half-update" on the remaining K - 1 Perceptrons, i.e., we set w_{j,t} = w_{j,t-1} + (η/2) y_t x_t for each j ≠ i_t. This rule is based on the simple observation that, in the presence of related tasks, any update step that is good for one Perceptron should also be good for the others. Clearly, this rule keeps the weight vectors w_{j,t}, j = 1, ..., K, always close to each other.

The above algorithm is a special case of the multitask Perceptron algorithm described below. This more general algorithm updates each weight vector w_{j,t} through a learning rate which is an arbitrary positive definite function of the pair (j, i_t). These learning rates are defined by a K × K interaction matrix A. The pseudocode for the multitask Perceptron algorithm using a generic interaction matrix A is given in Figure 1. At the beginning of each time step, the counter s stores the number of mistakes made so far (plus one). The (column) vector φ_t ∈ R^{Kd} denotes the multitask instance defined by

    φ_t = ( 0, ..., 0, x_t⊤, 0, ..., 0 )⊤    (2)

with (i_t - 1)·d zero components before x_t and (K - i_t)·d zero components after it, where x_t ∈ R^d is the instance vector for the current task i_t. (Note that ‖φ_t‖ = 1 since the instances x_t are normalized.) The weights of the K Perceptrons are maintained in a compound vector w_s = (w_{1,s}⊤, ..., w_{K,s}⊤)⊤, with w_{j,s} ∈ R^d for all j. The algorithm predicts y_t through the sign ŷ_t of the i_t-th Perceptron's margin w_{s-1}⊤ φ_t = w_{i_t,s-1}⊤ x_t. Then, if prediction and true label disagree, the update rule becomes w_s = w_{s-1} + y_t (A ⊗ I_d)^{-1} φ_t, where ⊗ denotes the Kronecker product.

Parameters: positive definite K × K interaction matrix A.
Initialization: w_0 = 0 ∈ R^{Kd}, s = 1.
At each time t = 1, 2, ... do the following:
1. Observe task number i_t ∈ {1, ..., K} and the corresponding instance vector x_t ∈ R^d;
2. Build the associated multitask instance φ_t ∈ R^{Kd};
3. Predict ŷ_t = SGN(w_{s-1}⊤ φ_t) ∈ {-1, +1};
4. Get label y_t ∈ {-1, +1};
5. If ŷ_t ≠ y_t then update: w_s = w_{s-1} + y_t (A ⊗ I_d)^{-1} φ_t,  s ← s + 1.

Theorem 1 The number of mistakes m made by the multitask Perceptron algorithm in Figure 1, run with interaction matrix (3) on any finite multitask sequence (φ_1, y_1), (φ_2, y_2), ...
∈ R^{Kd} × {-1, 1}, satisfies

    m ≤ inf over u ∈ R^{Kd} of [ Σ_{t∈M} ℓ_t(u) + (2/(K+1)) u⊤Au + sqrt( (2/(K+1)) u⊤Au · Σ_{t∈M} ℓ_t(u) ) ]

where M is the set of mistaken trial indices, and

    u⊤Au = Σ_{i=1}^K ‖u_i‖² + Σ_{1≤i<j≤K} ‖u_i - u_j‖².

Footnote 1: If K d this bound is equivalent (apart from constant factors) to the mistake bound for the single-task zero-threshold Winnow algorithm of [22].

6 Learning tasks in heterogeneous spaces

In this section we slightly deviate from the approach followed so far. We consider the case when the K task vectors u_i may live in different spaces: u_i ∈ R^{d_i}, i = 1, ..., K. This is a plausible assumption when attributes associated with different tasks have a completely different meaning. In such a case, the correlation among tasks is naturally measured through the task margins u_i⊤x (or views) --see, e.g., the previous work of [6, 27] for a similar approach in the context of semi-supervised learning. In order to allow for the views to interact in a meaningful way, we slightly modify the learning protocol of Section 2. We now assume that, at each time t, we receive the adversarial choice of task i_t together with all instance vectors x_{i,t} ∈ R^{d_i} for i = 1, ..., K.

The co-regularization terms motivating our algorithm are proportional to the squared distance between the margin u_{i_t}⊤x_{i_t,t} of task i_t and the average margin (1/K) Σ_{j=1}^K u_j⊤x_{j,t} of all tasks:

    bK ( u_{i_t}⊤ x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤ x_{j,t} )².    (13)

The resulting algorithm, the multiview-based multitask Perceptron (MMPERC, Figure 3), takes an input parameter b > 0 (playing the same role as the one in Section 3.2), and maintains in its internal state a multitask matrix A (initialized to the identity matrix I) and a Perceptron multitask weight vector v (initialized to the zero vector). The subscript s plays the same role as in the previous algorithms. Unlike the algorithms in previous sections, MMPERC observes the task number i_t and the multitask instance x_t = (x_{1,t}⊤, x_{2,t}⊤, ..., x_{K,t}⊤)⊤ made up of the instance vectors of all tasks. Then MMPERC computes a tentative matrix Ã_t to be used for prediction. Matrix Ã_t is obtained by adding the rank-one positive semidefinite matrix M_t (defined in Figure 3) to the previous matrix A_{s-1}. Here φ_{j,t} is the (d_1 + ··· + d_K)-dimensional vector

    φ_{j,t} = ( 0, ..., 0, x_{j,t}⊤, 0, ..., 0 )⊤

with d_1 + ··· + d_{j-1} zero components before x_{j,t} and d_{j+1} + ··· + d_K zero components after it, for j = 1, ..., K. Observe that M_t has been set so as to make the quadratic form u⊤M_t u coincide with the regularization term (13). Similarly to the algorithms of previous sections, the tentative matrix Ã_t and the current Perceptron vector v_{s-1} are used for predicting the true label y_t. If prediction ŷ_t and label y_t disagree, both v and A get updated. In particular, A_s is set to the tentative matrix Ã_t. In this protocol we call example the triple (i_t, x_t, y_t).

Like the results contained in the previous sections, our analysis provides a multitask bound on the number of prediction mistakes which is comparable to the one obtained on a single task, plus a penalization term due to task relatedness. However, though this algorithm is a second-order prediction method, we only give a first-order analysis that disregards the eigenstructure of the data. This is due to the technical difficulty of handling a time-varying matrix A that in trial t includes all instance vectors x_{1,t}, ..., x_{K,t}.

Parameters: b > 0.
Initialization: A_0 = I, v_0 = 0 ∈ R^{d_1+···+d_K}, s = 1.
At each time t = 1, 2, ... do the following:
1. Observe task number i_t ∈ {1, ..., K};
2. Observe the multitask instance vector x_t = (x_{1,t}⊤, ..., x_{K,t}⊤)⊤ ∈ R^{d_1+···+d_K};
3. Build the associated multitask instance φ_{i_t,t};
4. Set Ã_t = A_{s-1} + M_t, where M_t = bK ( φ_{i_t,t} - φ̄_t/K )( φ_{i_t,t} - φ̄_t/K )⊤ and φ̄_t = Σ_{j=1}^K φ_{j,t};
5. Predict ŷ_t = SGN( v_{s-1}⊤ (Ã_t)^{-1} φ_{i_t,t} ) ∈ {-1, +1};
6. Get label y_t ∈ {-1, +1};
7. If ŷ_t ≠ y_t then update: v_s = v_{s-1} + y_t φ_{i_t,t},  A_s = Ã_t,  s ← s + 1.

Figure 3: The multiview-based multitask Perceptron algorithm (MMPERC).

Theorem 6 The number of mistakes m made by the algorithm in Figure 3, run on any multitask sequence (i_1, x_1, y_1), (i_2, x_2, y_2), ... satisfies, for all u = (u_1⊤, ..., u_K⊤)⊤,

    m ≤ Σ_{t∈M} ℓ_t(u) + (K(b+1) - b)/(bK(K-1) + K) u⊤A_m u + sqrt( (K(b+1) - b)/(bK(K-1) + K) u⊤A_m u · Σ_{t∈M} ℓ_t(u) )

where ℓ_t(u) = [1 - y_t u_{i_t}⊤x_{i_t,t}]_+ and

    u⊤A_m u = Σ_{i=1}^K ‖u_i‖² + bK Σ_{t∈M} ( u_{i_t}⊤x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤x_{j,t} )²,

M being the set of mistaken trials.

Remark It is the factor

    (K(b+1) - b)/(bK(K-1) + K) u⊤A_m u    (14)

that quantifies the relatedness among tasks (the leading constant bK in the second term of u⊤A_m u is needed for scaling purposes). Note that the notion of relatedness provided by u⊤A_m u is analogous to the one used in Section 3.2. As suggested in Section 3.3, other measures of similarity are possible. The parameter b allows for a limited trade-off between Σ_{i=1}^K ‖u_i‖² and the cumulative "margin deviation"

    Σ_{t∈M} ( u_{i_t}⊤x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤x_{j,t} )².    (15)

In particular, setting b = 0 corresponds to running K independent (first-order) Perceptron algorithms (no task relatedness), while letting b go to infinity is optimal only when each one of the margin deviation terms in (15) is zero (maximal task relatedness). Notice that setting b = 1 gives

    (14) = (2K - 1)/K² [ Σ_{i=1}^K ‖u_i‖² + K Σ_{t∈M} ( u_{i_t}⊤x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤x_{j,t} )² ]

yielding a gain of a factor K whenever the K tasks are significantly related, as measured by (15).

Proof of Theorem 6: Although the general structure of the analysis is based on the second-order Perceptron proof, we use the special properties of I + M_t in order to compute the contribution of K and b to the final bound. Let t = t_s be the time step when the s-th mistake occurs. We write

    v_s⊤A_s^{-1}v_s = (v_{s-1} + y_t φ_{i_t,t})⊤ A_s^{-1} (v_{s-1} + y_t φ_{i_t,t})    (from the update rule in Figure 3)
                    = v_{s-1}⊤A_s^{-1}v_{s-1} + 2 y_t v_{s-1}⊤A_s^{-1}φ_{i_t,t} + φ_{i_t,t}⊤A_s^{-1}φ_{i_t,t}
                    ≤ v_{s-1}⊤A_s^{-1}v_{s-1} + φ_{i_t,t}⊤A_s^{-1}φ_{i_t,t}
                    ≤ v_{s-1}⊤A_{s-1}^{-1}v_{s-1} + φ_{i_t,t}⊤(I + M_t)^{-1}φ_{i_t,t}.    (16)

In order to prove the first inequality, note that on the s-th mistaken trial A_s = Ã_t and y_t v_{s-1}⊤(Ã_t)^{-1}φ_{i_t,t} ≤ 0. In order to prove the second inequality note that A_s - A_{s-1} and A_s - (I + M_t) are both positive semidefinite.

We now focus on computing the quadratic form φ_{i_t,t}⊤(I + M_t)^{-1}φ_{i_t,t}. Recall that φ̄_t = Σ_{j=1}^K φ_{j,t} is the sum of the orthonormal vectors φ_{1,t}, ..., φ_{K,t}. Thus, from the very definition of M_t in Figure 3 it is easy to verify that

    (I + M_t) φ_{i_t,t} = (b(K-1) + 1) φ_{i_t,t} - (b(K-1)/K) φ̄_t.    (17)

Also, since M_t φ̄_t = 0, we have (I + M_t)φ̄_t = φ̄_t, and thus (I + M_t)^{-1}φ̄_t = φ̄_t. Hence (17) allows us to write

    φ_{i_t,t} = (b(K-1) + 1) (I + M_t)^{-1} φ_{i_t,t} - (b(K-1)/K) φ̄_t.

Taking the inner product of both sides with φ_{i_t,t}, and solving for φ_{i_t,t}⊤(I + M_t)^{-1}φ_{i_t,t} yields

    φ_{i_t,t}⊤(I + M_t)^{-1}φ_{i_t,t} = (K(b+1) - b)/(bK(K-1) + K).

Substituting this into (16), recalling that v_0 = 0, and summing over s = 1, ..., m we obtain

    v_m⊤A_m^{-1}v_m ≤ m (K(b+1) - b)/(bK(K-1) + K).

A lower bound on the left-hand side can be obtained in the standard way (details omitted),

    m - Σ_{t∈M} ℓ_t(u) ≤ ( v_m⊤A_m^{-1}v_m )^{1/2} ( u⊤A_m u )^{1/2},

so that

    v_m⊤A_m^{-1}v_m ≥ ( m - Σ_{t∈M} ℓ_t(u) )² / (u⊤A_m u).

Finally, recall that by construction

    u⊤A_m u = Σ_{i=1}^K ‖u_i‖² + bK Σ_{t∈M} ( u_{i_t}⊤x_{i_t,t} - (1/K) Σ_{j=1}^K u_j⊤x_{j,t} )².

Putting everything together and solving for m gives the desired bound.

6.1 Implementation in dual variables

Like the second-order multitask Perceptron algorithm, MMPERC too can be formulated in dual variables. Due to the need to handle K instance vectors at a time, the implementation we sketch below has an extra linear dependence on K, as compared to the one in Section 4.1. Let t = t_s be the trial when the s-th mistake occurs, z_r be the vector z_r = √(bK) (φ_{i_r,r} - φ̄_r/K), S_s be the matrix whose columns are the vectors φ_{i_r,r} corresponding to mistaken time steps r up to time t, and Z_s be the matrix whose columns are the corresponding vectors z_r. It is easy to verify that A_s = I + Z_s Z_s⊤ and v_{s-1} = S_s y_s, where y_s is as in Section 4.1. From the inversion formula (e.g., [17, Ch. 0])

    (I + Z_s Z_s⊤)^{-1} = I - Z_s (I + Z_s⊤Z_s)^{-1} Z_s⊤

we see that the margin v_{s-1}⊤(Ã_t)^{-1}φ_{i_t,t} in Figure 3 can be computed as

    v_{s-1}⊤(Ã_t)^{-1}φ_{i_t,t} = y_s⊤ S_s⊤ φ_{i_t,t} - y_s⊤ S_s⊤ Z_s (I + Z_s⊤Z_s)^{-1} Z_s⊤ φ_{i_t,t}.

Calculating the vectors S_s⊤φ_{i_t,t} and Z_s⊤φ_{i_t,t} in the above expression takes O(s) inner products, while other O(s) inner products are required to incrementally compute S_s⊤Z_s from S_{s-1}⊤Z_{s-1}. Finally, when calculating the inverse (I_s + Z_s⊤Z_s)^{-1} we exploit the same updating scheme as in Section 4.1,

    I_s + Z_s⊤Z_s = [ I_{s-1} + Z_{s-1}⊤Z_{s-1}    Z_{s-1}⊤z_s ]
                    [ z_s⊤Z_{s-1}                  1 + z_s⊤z_s ]

where Z_{s-1}⊤z_s and z_s⊤z_s require O(Ks) and O(K) inner products, respectively. Hence (I_s + Z_s⊤Z_s)^{-1} can be computed from (I_{s-1} + Z_{s-1}⊤Z_{s-1})^{-1} with O(Ks) extra inner products and O(s²) additional scalar multiplications.

7 Spectral co-regularization

An extreme case of multitask learning is the multiview setting, where all tasks share the same label. In the multiview protocol, at each time step t the learner receives K instances x_{1,t}, ..., x_{K,t} ∈ R^d, predicts with ŷ_t ∈ {-1, 1}, and then receives the correct binary label y_t, which --unlike the general multitask case-- is the same for all instances. What distinguishes multiview learning from a standard online binary classification task, defined on instances of the form x_t = (x_{1,t}⊤, ..., x_{K,t}⊤)⊤, is that in multiview one postulates the existence of K vectors u_1, ..., u_K such that each u_i is a good linear classifier for the corresponding sequence (x_{i,1}, y_1), (x_{i,2}, y_2), ... of examples. In this respect a natural baseline for online multiview learning is the algorithm that chooses a random index i ∈ {1, ..., K} and then runs a Perceptron algorithm on the sequence of examples (x_{i,t}, y_t) for t ≥ 1. Equivalently, we may think of running K Perceptrons in parallel and then averaging their mistakes (see the remark after Theorem 10).
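The inversion-formula shortcut used in the dual-variable implementation of Section 6.1 can be sanity-checked numerically; the dimensions below and the random matrices standing in for S_s, Z_s, y_s, and φ_{i_t,t} are arbitrary illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, s = 8, 4                          # ambient dimension d_1 + ... + d_K, mistakes so far
S = rng.normal(size=(D, s))          # stand-in for S_s (columns phi_{i_r, r})
Z = rng.normal(size=(D, s))          # stand-in for Z_s (columns z_r)
y = rng.choice([-1.0, 1.0], size=s)  # stand-in for y_s
phi = rng.normal(size=D)             # stand-in for phi_{i_t, t}

A = np.eye(D) + Z @ Z.T              # A_s = I + Z_s Z_s^T
v = S @ y                            # v_{s-1} = S_s y_s

# Direct margin v^T A^{-1} phi requires a D x D solve ...
direct = v @ np.linalg.solve(A, phi)
# ... while (I + Z Z^T)^{-1} = I - Z (I + Z^T Z)^{-1} Z^T needs only an s x s solve.
small = np.eye(s) + Z.T @ Z
dual = y @ (S.T @ phi) - y @ (S.T @ Z) @ np.linalg.solve(small, Z.T @ phi)
assert abs(direct - dual) < 1e-8
```

The point of the identity is exactly this dimension shift: all quantities in the dual form are built from inner products of instance vectors, so the computation scales with the number of mistakes rather than with the ambient dimension.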
In this section we design multiview learning algorithms that in certain cases are able to perform significantly better than the above baseline. In order to do so, we arrange the K d-dimensional instances x_{1,t}, ..., x_{K,t} in a d × K instance matrix X_t and penalize diversity among the reference vectors u_1, ..., u_K using a matrix norm of the d × K matrix U = [u_1, ..., u_K]. We focus our attention on matrix norms that are unitarily invariant. Such norms are of the form ‖U‖_f = f(σ(U)), where σ(U) = (σ_1, ..., σ_r) is the vector of the ordered singular values σ_1 ≥ ··· ≥ σ_r ≥ 0 of U and f : R^r → R, with r = min{K, d}, is an absolutely symmetric function --that is, f is invariant under coordinate permutations and sign changes. Matrix norms of this form control the distribution of the singular values of U, thus acting as spectral co-regularizers for the reference vectors (see, e.g., [4] for very recent developments on this subject). Known examples are the Schatten p-norms, defined by ‖U‖_{sp} = ‖σ(U)‖_p. For instance, the Schatten 2-norm is the Frobenius norm. For p = 1 the Schatten p-norm becomes the trace norm, a good proxy for the rank of U, since ‖U‖_{s1} = ‖σ(U)‖_1 while ‖σ(U)‖_0 = rank of U. In order to obtain a multiview bound that depends on ‖σ(U)‖_p, we extend the dual norm analysis of Section 5 to matrices. We start by defining the matrix version of the quasi-additive algorithm of [14, 20]. We remark that matrix versions of the EG algorithm and the Winnow algorithm (related to specific instances of the quasi-additive algorithm) have been proposed and analyzed in [28, 29, 30]. When dealing with the trace norm regularizer, their algorithms could be specialized to our multiview framework to obtain mistake bounds comparable to ours. See the brief discussion at the end of this section. The quasi-additive matrix algorithm maintains a d × K matrix W. Initially, W_0 is the zero matrix.
If s - 1 mistakes have been made in the first t - 1 time steps, then the prediction at time t is SGN(⟨W_{s-1}, X_t⟩), where X_t is the d × K matrix [x_{1,t}, ..., x_{K,t}] in which x_{i,t} is the instance vector associated with the i-th view at time t, and ⟨W_{s-1}, X_t⟩ = TR(W_{s-1}⊤X_t) is the standard matrix inner product. If a mistake occurs at time t, then W_{s-1} is updated with W_s = ∇ ½‖V_s‖_f², where, in turn, the d × K matrix V_s is updated using a matrix Perceptron rule, V_s = V_{s-1} + y_t X_t.

A useful property of norms ‖U‖_f = f(σ(U)) is that their duals are easily computed.

Theorem 7 [21, Theorem 2.4] If f is absolutely symmetric and ‖U‖_f = f(σ(U)), then ‖U‖_{f*} = f*(σ(U)), where f* is the convex dual of f.

In the case of Schatten p-norms, we have that the dual of the vector norm ‖·‖_p is the vector norm ‖·‖_q, where q = p/(p-1) is the dual coefficient of p. An important feature of the quasi-additive algorithms for vectors is that the mapping µ : v → µ(v) = ∇½‖v‖² is invertible whenever the vector norm ‖·‖ satisfies certain regularity properties (see, e.g., [8, page 294]). We call such norms Legendre. Hence, we "do not lose information" when the primal weight vector v is mapped to the weight vector w = µ(v) used for prediction. In particular, we always have that µ^{-1}(v) = ∇½‖v‖*², where ‖·‖* is the dual norm of a Legendre norm ‖·‖ (see, e.g., [8, Lemma 11.5]). This property is preserved when the algorithm is applied to matrices. This is shown by the following result where, without loss of generality, we prove the property for f(σ(U)) rather than ½f(σ(U))². (In fact, if f is Legendre, then ½f² is also Legendre.)

Theorem 8 Let f be a Legendre function and ‖U‖_f = f(σ(U)). Then (∇‖·‖_f)^{-1} = ∇‖·‖_{f*}.

The following result will be useful.

Theorem 9 [21, Theorem 3.1] Let U DIAG[σ(A)] V⊤ be an SVD decomposition of a matrix A. If ‖·‖_f is a matrix norm such that ‖A‖_f = f(σ(A)) for f Legendre, then ∇‖A‖_f = U DIAG[∇f(σ(A))] V⊤.

Proof of Theorem 8: If A = U DIAG[σ(A)] V⊤, then by Theorem 9

    ∇‖A‖_f = U DIAG[∇f(σ(A))] V⊤.

Therefore, using Theorem 9 again,

    ∇‖ ∇‖A‖_f ‖_{f*} = U DIAG[∇f*(∇f(σ(A)))] V⊤ = U DIAG[σ(A)] V⊤ = A    (since f is Legendre),

concluding the proof.

We now develop a general analysis of the quasi-additive matrix algorithm, and then specialize it (in Theorem 10 below) to a multiview algorithm operating with a Schatten p-norm regularizer. We start by adapting the dual norm proof of Section 5 to an arbitrary matrix norm ‖A‖_f = f(σ(A)), where f is Legendre. Let V_m be the primal weight matrix after any number m ≥ 1 of mistakes. By Taylor expanding ½‖V_s‖_f² around V_{s-1} for each s = 1, ..., m, and using y_t ⟨W_{s-1}, X_t⟩ ≤ 0, we get

    ½‖V_m‖_f² ≤ Σ_{s=1}^m D(V_s, V_{s-1})

where

    D(V_s, V_{s-1}) = ½‖V_s‖_f² - ½‖V_{s-1}‖_f² - y_t ⟨W_{s-1}, X_t⟩

is the matrix Bregman divergence. Fix any d × K matrix U. First, we derive a matrix version of the convex inequality for vector norms. We use a classical result by von Neumann (see, e.g., [18, p. 182]) stating that ⟨V, U⟩ ≤ σ(V)⊤σ(U) for any two d × K matrices U and V. We have

    ⟨V_m, U⟩ ≤ σ(V_m)⊤σ(U)    (by von Neumann's inequality)
            ≤ ‖V_m‖_f ‖U‖_{f*}    (by the convex inequality for norms and Theorem 7).

In addition, we have ⟨U, V_s⟩ = ⟨U, V_{s-1}⟩ + y_t ⟨U, X_t⟩. Thus we obtain

    Km - Σ_{t∈M} ℓ_t(U) ≤ ⟨U, V_m⟩ ≤ ‖V_m‖_f ‖U‖_{f*},

where ℓ_t(U) = [K - y_t Σ_{i=1}^K u_i⊤x_{i,t}]_+ ≤ Σ_{i=1}^K ℓ_t(u_i). Solving for m gives

    m ≤ (1/K) Σ_{t∈M} ℓ_t(U) + (‖U‖_{f*}/K) sqrt( 2 Σ_{s=1}^m D(V_s, V_{s-1}) ).    (18)

Equation (18) is our general starting point for analyzing multiview algorithms working under spectral co-regularization. The analysis reduces to bounding from above the second-order term D(·, ·) for the specific matrix norm ‖·‖_f under consideration. For the rest of this section we focus on the Schatten 2p-norm ‖V‖_{s2p} = ‖σ(V)‖_{2p}, where V is a generic d × K matrix, and p is a positive integer (thus 2p is an even number ≥ 2). Note that, in general, ‖V‖²_{s2p} = ( TR(V⊤V)^p )^{1/p}. In order to prove our main result, stated below, we use some facts from differential matrix calculus. A standard reference on this subject is [23], to which the reader is referred.

Theorem 10 The number of mistakes m made by the 2p-norm matrix Perceptron, run on any sequence (X_1, y_1), (X_2, y_2), ..., satisfies, for any d × K matrix U,

    m ≤ (1/K) Σ_{t∈M} ℓ_t(U) + (2p - 1) ( ‖X‖_{s2p} ‖U‖_{s2q} / K )² + ( sqrt(2p - 1) ‖X‖_{s2p} ‖U‖_{s2q} / K ) sqrt( (1/K) Σ_{t∈M} ℓ_t(U) )

where ‖X‖_{s2p} = max_{t∈M} ‖X_t‖_{s2p}, ‖U‖_{s2q} is the Schatten 2q-norm of U, with 2q = 2p/(2p-1), and M is the set of mistaken trial indices.

Remark Similarly to the vector case, when the parameter p is chosen to be logarithmic in r = min{d, K}, the p-norm matrix Perceptron penalizes diversity using the trace norm of U. If the vectors u_i span a subspace of size K, and instances tend to have K nonzero singular values of roughly the same magnitude, then ‖U‖_{s2q} ≈ ‖U‖_{s2} while ‖X‖²_{s2p} ≈ ‖X‖²_{s∞} ≈ ‖X‖²_{s2}/K. Hence this choice of p leads (at least in the linearly separable case) to a factor K improvement over the bound achieved by the matrix algorithm based on the Frobenius norm (p = 1 in Theorem 10), which amounts to running K independent Perceptrons in parallel and then averaging their mistakes.

The following trace inequality is our main technical lemma.

Lemma 11 Let A, B be positive semidefinite matrices, of size d × d and K × K respectively, with the same nonzero eigenvalues. Let X be an arbitrary real matrix of size d × K. Then, for any pair of nonnegative exponents l, g ≥ 0, we have

    TR(X⊤ A^l X B^g) ≤ ( TR(X⊤X)^p )^{1/p} ( TR A^{(l+g)q} )^{1/q}

where 1/p + 1/q = 1, p ≥ 1.

Proof of Lemma 11: We first consider the case l ≤ g. By the Cauchy-Schwarz and Hölder inequalities applied to traces [23, Ch. 11] we have

    TR(X⊤ A^l X B^g) = TR( B^{(g-l)/2} X⊤ A^l X B^{(g+l)/2} )    (19)
    ≤ ( TR(X⊤ A^{2l} X B^{g-l}) )^{1/2} ( TR(X⊤ X B^{g+l}) )^{1/2}
    ≤ ( TR(X⊤ A^{2l} X B^{g-l}) )^{1/2} ( T_p(X⊤X) T_q(B^{g+l}) )^{1/2},

where we used the shorthand T_r(Z) = (TR Z^r)^{1/r}. In the case when l > g we can simply swap the roles of X⊤A^l and X B^g and reduce to the previous case. We now recursively apply the above argument to the left-hand side of (19). Recalling that T_q(A) = T_q(B) and T_p(X⊤X) = T_p(XX⊤), after n steps we obtain

    TR(X⊤ A^l X B^g) ≤ ( TR(X⊤ A^{l'} X B^{g'}) )^{1/2^n} ( T_p(X⊤X) T_q(A^{l+g}) )^{Σ_{i=1}^n (1/2)^i}

for some pair of exponents l', g' ≥ 0 such that l' + g' = l + g. Since for any such pair l', g' we have TR(X⊤ A^{l'} X B^{g'}) < ∞, we can take the limit as n → ∞. Recalling that Σ_{i=1}^∞ (1/2)^i = 1 completes the proof.

Proof of Theorem 10: We set for brevity G : R^{d×K} → R,

    G(V) = ½ ( TR(V⊤V)^p )^{1/p} = ½ ‖V‖²_{s2p}.

Thus in our case D(V_s, V_{s-1}) = G(V_{s-1} + y_t X_t) - G(V_{s-1}) - y_t ⟨W_{s-1}, X_t⟩. Since G(V) is twice continuously differentiable,⁴ by the mean-value theorem we can write

    D(V_s, V_{s-1}) = ½ VEC(X_t)⊤ H_G(ξ) VEC(X_t),    (20)

where VEC(X) is the standard columnwise vectorization of a matrix X, H_G denotes the Hessian matrix of the (matrix) function G, and ξ is some matrix on the line connecting V_{s-1} to V_s.

⁴ In fact G is C^∞ everywhere but (possibly) in zero, since TR(V⊤V)^p is just a polynomial function of the entries of V. Moreover, TR(V⊤V)^p = 0 if and only if V is the zero matrix.

Using the rules of matrix differentiation, the gradient ∇G of G is ∇G(V) = c(V) VEC(D), where we set for brevity D = V B^{p-1} and c(V) = TR(B^p)^{1/p - 1}, with B = V⊤V. Taking the second derivative H_G = ∇∇G gives

    H_G(V) = VEC(D) (∇c(V))⊤ + c(V) ∇(D).

Now, recalling the definition of c(V), it is not hard to show that VEC(D)(∇c(V))⊤ is the Kd × Kd matrix

    2(1 - p) TR(B^p)^{1/p - 2} VEC(D) VEC(D)⊤.

Since p ≥ 1 this matrix is negative semidefinite, and we can disregard it when bounding from above the quadratic form (20). Thus we continue by considering only the second term c(V)∇(D) of the Hessian matrix. We have

    ∇(D) = B^{p-1} ⊗ I_d + (I_K ⊗ V) ∇(B^{p-1}),

where

    ∇(B^{p-1}) = [ Σ_{ν=0}^{p-2} B^ν ⊗ B^{p-2-ν} ] (I_{K²} + T_K) (I_K ⊗ V⊤),

and T_K is the K² × K² commutation matrix such that T_K VEC(M) = VEC(M⊤) for any K × K matrix M. Putting together,

    (20) ≤ (c(V)/2) VEC(X_t)⊤ (B^{p-1} ⊗ I_d) VEC(X_t)
        + (c(V)/2) VEC(X_t)⊤ (I_K ⊗ V) Θ (I_{K²} + T_K) (I_K ⊗ V⊤) VEC(X_t),    (21)

where we used the shorthand Θ = Σ_{ν=0}^{p-2} B^ν ⊗ B^{p-2-ν}. We now bound the two terms on the right-hand side of (21). By well-known relationships between Kronecker products and the VEC operator (see [23, Ch. 3]) we can write

    (c(V)/2) VEC(X_t)⊤ (B^{p-1} ⊗ I_d) VEC(X_t) = (c(V)/2) TR(X_t⊤ X_t B^{p-1}) ≤ ½ ( TR(X_t⊤X_t)^p )^{1/p},

independent of V. The majorization follows from Hölder's inequality applied to the positive semidefinite matrices X_t⊤X_t and B^{p-1}. Moreover, it is easy to see that the symmetric matrices Θ and T_K commute, thereby sharing the same eigenspace. Hence we can bound from above the second term in (21) by

    c(V) VEC(X_t)⊤ [ Σ_{ν=0}^{p-2} B^ν ⊗ A^{p-1-ν} ] VEC(X_t),

where we set A = VV⊤. Again, [23, Ch. 3] allows us to rewrite this quadratic form as the sum of traces

    c(V) Σ_{ν=0}^{p-2} TR( X_t⊤ A^{p-1-ν} X_t B^ν ).

Since A and B have the same nonzero eigenvalues, we can apply Lemma 11 to each term, and put together as in (21). After simplifying we get

    (20) ≤ ½ (2p - 1) ( TR(X_t⊤X_t)^p )^{1/p} = ½ (2p - 1) ‖X_t‖²_{s2p}.

The desired bound is then obtained by plugging back into (18), solving the resulting inequality for m, and overapproximating.

The result of Theorem 10 is similar to those obtained in [28, 29, 30]. However, unlike those previous results, our matrix algorithm has no learning rate to tune (a property inherited from the vector p-norm Perceptron of [13]) and works for arbitrary nonsquare matrices U. We also observe that the prediction ŷ_t = SGN( TR(W_{s-1}⊤ X_t) ) of the p-norm matrix Perceptron reduces to computing the sign of TR( (V_{s-1}⊤V_{s-1})^{p-1} V_{s-1}⊤ X_t ) (recall the expression for ∇G calculated in the proof of Theorem 10). Since matrix V_s is updated additively, it is clear that both V_{s-1}⊤V_{s-1} and V_{s-1}⊤X_t depend on the instance vectors x_{i,t} only through inner products. This allows us to turn our p-norm matrix Perceptron into a kernel-based algorithm, and repeat the analysis given here using a standard RKHS formalism.
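The 2p-norm matrix Perceptron just discussed admits a short sketch. The margin below implements the sign of TR((V⊤V)^{p-1} V⊤X), which (up to the positive factor c(V)) is the gradient expression from the proof of Theorem 10; the Schatten-norm helper follows the definition ‖U‖_{sp} = ‖σ(U)‖_p, while the mistake-driven loop and its conventions are our own assumptions:

```python
import numpy as np

def schatten(U, p):
    """Schatten p-norm of U: the vector p-norm of its singular values."""
    return np.linalg.norm(np.linalg.svd(U, compute_uv=False), p)

def matrix_margin(V, X, p):
    """Margin <W, X> with W = grad (1/2)||V||_{s2p}^2
    = tr((V^T V)^p)^(1/p - 1) * V (V^T V)^(p-1)."""
    B = V.T @ V
    # positive scale factor; irrelevant for the sign, guarded at V = 0
    c = np.trace(np.linalg.matrix_power(B, p)) ** (1.0 / p - 1.0) if np.any(V) else 1.0
    return c * np.sum((V @ np.linalg.matrix_power(B, p - 1)) * X)

def run_2p_perceptron(stream, d, K, p=1):
    """Mistake-driven matrix Perceptron: the primal matrix V is updated additively."""
    V = np.zeros((d, K))
    mistakes = 0
    for X, y in stream:                            # X is d x K, y in {-1, +1}
        if np.sign(matrix_margin(V, X, p)) != y:   # zero margin counts as a mistake
            mistakes += 1
            V += y * X                             # matrix Perceptron rule V <- V + y X
    return V, mistakes
```

For p = 1 this reduces to the Frobenius-norm case (K independent Perceptrons run in parallel), while choosing p logarithmic in min{d, K} approximates the trace-norm regularizer discussed in the Remark after Theorem 10.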
8 Conclusions and ongoing research

In this work we have studied the problem of learning multiple tasks online, using various approaches to formalize the notion of task relatedness. Our results can be extended in different directions. First, in Sections 3.2 and 6 it might be interesting to devise methods for dynamically adapting the b parameter as new data are revealed. Second, the mistakes of the second-order algorithm MMPERC have been bounded using a first-order analysis. A more refined analysis should reveal in the bound an explicit dependence on the spectral properties of the data. It is also worth noting that the significance of the mistake bound obtained for MMPERC relies on the fact that the algorithm assumes the tasks to be different, although somewhat related. In the case when the K observed instances share the same label at each time step (as in multiview learning), we could not devise an algorithm with a significant advantage over the following trivial baseline: run K Perceptrons in parallel and use the sum of margins to predict. Third, it would be interesting to study the problem of learning multiple tasks when K predictions have to be output at each step. In this case the main difficulty appears to be the control of the interaction among the instances observed at each time step. Fourth, it would also be interesting to prove lower bounds on the number of mistakes as a function of task relatedness. Finally, since multitask learning problems arise naturally in a variety of settings, ranging from biology to news processing, we plan to complement the theoretical analysis presented in this paper with experimental results, so as to evaluate the empirical performance of our algorithms in real-world scenarios.

Acknowledgments. Thanks to Sham Kakade, Massi Pontil, and Francis Bach for useful discussions. We also thank the COLT 2008 reviewers for their comments. This work was supported in part by the PASCAL2 Network of Excellence under EC grant no. 216886.
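The trivial multiview baseline mentioned in the conclusions admits a very short implementation. The sketch below (our own illustration with an invented toy dataset; not code or experiments from the paper) runs K Perceptrons in parallel and predicts with the sum of their margins:

```python
# Sketch (ours): the trivial multiview baseline of K parallel Perceptrons
# predicting with the sum of their margins. All K views share one label.

def multiview_perceptron(examples, K, d, epochs=10):
    """examples: list of (views, y) with views a list of K vectors in R^d
    and y in {-1, +1}. Returns the weight vectors and the mistake count."""
    w = [[0.0] * d for _ in range(K)]
    mistakes = 0
    for _ in range(epochs):
        for views, y in examples:
            # Shared prediction: sign of the sum of the K margins.
            total = sum(w[k][i] * views[k][i]
                        for k in range(K) for i in range(d))
            yhat = 1 if total >= 0 else -1
            if yhat != y:
                mistakes += 1
                # On a mistake, every Perceptron updates with its own view.
                for k in range(K):
                    for i in range(d):
                        w[k][i] += y * views[k][i]
    return w, mistakes

# Invented 2-view toy data: both views carry the label in coordinate 0.
data = [([[1.0, 0.2], [0.9, -0.1]], 1),
        ([[-1.0, 0.4], [-1.1, 0.3]], -1),
        ([[0.8, -0.5], [1.2, 0.0]], 1),
        ([[-0.9, -0.3], [-0.7, 0.6]], -1)]

w, m = multiview_perceptron(data, K=2, d=2)
final_margin = sum(w[k][i] * data[0][0][k][i]
                   for k in range(2) for i in range(2))
print(m, final_margin > 0)
```

On separable data like this the baseline converges quickly; the open question raised above is whether interaction between the K learners can provably beat it.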
This publication only reflects the authors' views.

References

[1] J. Abernethy, P.L. Bartlett & A. Rakhlin, Multitask learning with expert advice, Proc. 20th COLT, pp. 484-498, Springer, 2007.
[2] R.K. Ando & T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data, JMLR, 6, pp. 1817-1853, MIT Press, 2005.
[3] A. Argyriou, T. Evgeniou & M. Pontil, Multi-task feature learning, NIPS 19, pp. 41-48, MIT Press, 2007.
[4] A. Argyriou, C.A. Micchelli, M. Pontil & Y. Ying, A spectral regularization framework for multi-task structure learning, NIPS 20, MIT Press, 2008.
[5] K. Azoury & M. Warmuth, Relative loss bounds for on-line density estimation with the exponential family of distributions, Machine Learning, 43, pp. 211-246, 2001.
[6] U. Brefeld, T. Gaertner, T. Scheffer & S. Wrobel, Efficient co-regularised least squares regression, Proc. 23rd ICML, 2006.
[7] N. Cesa-Bianchi, A. Conconi & C. Gentile, A second-order Perceptron algorithm, SIAM Journal on Computing, 34/3, pp. 640-668, 2005.
[8] N. Cesa-Bianchi & G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006.
[9] O. Dekel, P.M. Long & Y. Singer, Online learning of multiple tasks with a shared loss, JMLR, 8, pp. 2233-2264, 2007.
[10] T. Evgeniou, M. Pontil & T. Poggio, Regularization networks and Support Vector Machines, Advances in Computational Mathematics, 13/1, pp. 1-50, Springer, 2000.
[11] T. Evgeniou, C. Micchelli & M. Pontil, Learning multiple tasks with kernel methods, JMLR, 6, pp. 615-637, MIT Press, 2005.
[12] Y. Freund & R.E. Schapire, Large margin classification using the Perceptron algorithm, Machine Learning, 37/3, pp. 277-296, 1999.
[13] C. Gentile, The robustness of the p-norm algorithms, Machine Learning, 53, pp. 265-299, 2003.
[14] A. Grove, N. Littlestone & D. Schuurmans, General convergence results for linear discriminant updates, Machine Learning, 43, pp. 173-210, 2001.
[15] M. Herbster, M. Pontil & L. Wainer, Online learning over graphs, Proc. 22nd ICML, pp. 305-312, ACM Press, 2005.
[16] L. Hogben, Handbook of Linear Algebra, Discrete Mathematics and Its Applications, 39, CRC Press, 2006.
[17] R.A. Horn & C.R. Johnson, Matrix Analysis, Cambridge University Press, 1985.
[18] R.A. Horn & C.R. Johnson, Topics in Matrix Analysis, Cambridge University Press, 1991.
[19] A. Jagota & M.K. Warmuth, Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence, Electr. Proc. 5th International Symposium on Artificial Intelligence and Mathematics, 1998. Electronic, http://rutcor.rutgers.edu/amai.
[20] J. Kivinen & M. Warmuth, Relative loss bounds for multidimensional regression problems, Machine Learning, 45, pp. 301-329, 2001.
[21] A.S. Lewis, The convex analysis of unitarily invariant matrix functions, Journal of Convex Analysis, 2, pp. 173-183, 1995.
[22] N. Littlestone, Mistake bounds and logarithmic linear-threshold learning algorithms, Ph.D. Thesis, University of California at Santa Cruz, 1989.
[23] J.R. Magnus & H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, revised edition, John Wiley, 1999.
[24] A. Maurer, Bounds for linear multi-task learning, JMLR, 7, pp. 117-139, MIT Press, 2006.
[25] R.T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
[26] D. Rosenberg & P. Bartlett, Rademacher complexity of co-regularized kernel classes, Proc. Artificial Intelligence and Statistics, 2007.
[27] V. Sindhwani, P. Niyogi & M. Belkin, A co-regularized approach to semi-supervised learning, Proc. ICML Workshop on Learning with Multiple Views, 2005.
[28] K. Tsuda, G. Raetsch & M.K. Warmuth, Matrix exponentiated gradient updates for on-line learning and Bregman projection, JMLR, 6, pp. 995-1018, 2005.
[29] M.K. Warmuth & D. Kuzmin, Online variance minimization, Proc. 19th COLT, Springer, 2006.
[30] M.K. Warmuth, Winnowing subspaces, Proc. 24th ICML, pp. 999-1006, ACM Press, 2007.