Active Learning for Multi-Task Adaptive Filtering

Abhay Harpale (aharpale@cs.cmu.edu)
Yiming Yang (yiming@cs.cmu.edu)
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA

Abstract

In this paper, we propose an Active Learning (AL) framework for the Multi-Task Adaptive Filtering (MTAF) problem. Specifically, we explore AL approaches to rapidly improve an MTAF system, based on Dirichlet Process priors, with minimal user/task-level feedback. The proposed AL approaches select instances for delivery with a two-fold objective: 1) improve future task-specific system performance based on the feedback received on instances delivered for that task, and 2) improve future overall system performance, thereby benefiting other tasks in the system, based on the same feedback. Current AL approaches focus only on the first objective. To satisfy both goals, we define a new scoring function called Utility Gain to estimate the perceived improvements in the task-specific and global models. In our experiments on standard benchmark datasets, we observed that global AL approaches, which additionally take into account the potential benefit of feedback to other tasks in the system, performed better than the task-specific approach that focused only on the benefit to the current task.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

1. Introduction

Adaptive Filtering (AF) (Robertson et al., 2001) systems monitor a stream of documents to filter out the documents that are relevant to a particular task or user. For example, a stock analyst would like to filter out all the news about stocks in his portfolio, while a technology entrepreneur might be interested in news about the latest technology startups. Each of these users can be considered a task from the AF system's perspective. AF systems model user interests based on the initial information request (e.g. a query) presented by the user and the subsequent relevance feedback provided by the user on the results presented. For example, one popular AF approach uses a Logistic Regression classifier (Jaakkola & Jordan, 1996) to classify documents into relevant and non-relevant categories for a particular user. The relevance feedback received from the user is used to supplement the training data and retrain the classifier for future predictions.

Most research in AF systems has focused on learning each task independently of other tasks. In this paper, we refer to such approaches as Single-Task AF (STAF) approaches. In the initial stages of learning, when the feedback from each user is limited, these approaches suffer from data sparsity, leading to weaker models and consequently poorer performance. Multi-task learning methods have shown significant success in mitigating this per-task data sparsity problem by sharing information across multiple tasks. For example, to learn the interest model of a particular stock analyst, it could be useful to identify common important features for portfolio tracking based on the feedback received from other stock analysts. Irrespective of their portfolios, all stock analysts are usually interested in common news regarding corporate announcements, balance sheets, government regulations, and day-to-day stock market indicators pertinent to their stocks. In this paper, we refer to such approaches, which leverage information from multiple tasks, as Multi-Task Adaptive Filtering (MTAF) approaches.

The future performance of an AF system relies on the feedback received on delivered items. If the system myopically focuses on delivering only (perceived) relevant documents (better immediate performance), the feedback received on these documents may not lead to the best learnt task models, thereby limiting the usefulness of such feedback for future predictions. In this paper, we present an Active Learning (AL) framework for MTAF that additionally takes into account the perceived benefit of feedback on items before making a delivery.
In the MTAF setting, the AL system has a three-fold objective: 1) deliver relevant documents; 2) the feedback provided on a delivered document should maximally improve the task-specific performance in the future; and 3) the feedback provided on a delivered document should maximally improve the overall system (multiple tasks) in the future. Current AL approaches focus only on the second objective, thereby narrowly targeting task-specific performance improvements. In this paper, to satisfy goal 1, we choose Multi-Task Logistic Regression with Dirichlet Process priors (Blei & Jordan, 2006; Xue et al., 2007), which facilitates information sharing across tasks. For goals 2 and 3, we propose a novel AL framework based on a new scoring function called Utility Gain. This scoring function is inspired by the popular empirical risk minimization approaches in Active Learning (Melville et al., 2005). However, current empirical risk minimization approaches focus only on minimizing the risk of one task, while we develop variants that selectively focus on one task, the global model, or all tasks, as required during the various phases of learning. Consequently, our framework chooses instances that might lead to the maximal (expected) gain in performance of the system (task-specific or global) for future predictions.

2. Adaptive Filtering Preliminaries

2.1. Single-Task Adaptive Filtering

We briefly describe a popular representative STAF approach based on the Logistic Regression (LR) classifier. The LR classifier estimates the relevance of a document using the sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ as follows:

$$P(y=1|x, w) = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}} \quad (1)$$

In the above equation, $x$ is a feature vector representing a document, and $w$ is the weight vector of regression coefficients. $w$ is usually fit by maximum likelihood estimation on the available relevant and non-relevant training documents $D$ for the task. In this paper, for consistency among approaches, we focus on the Bayesian Logistic Regression variant. In the Bayesian setting, the parameters $w$ are usually drawn from a diffuse Gaussian prior distribution $w \sim N(\mu, \Sigma)$ to ensure regularization. For a fully Bayesian treatment, instead of using a point estimate of the parameters $w$, one may integrate over the posterior distribution of $w$ to obtain a Bayesian estimate of $P(y=1|x, D)$:

$$P(y=1|x, D) = \int_{w} P(y=1|x, w) \, P(w|D) \, dw \quad (2)$$

2.2. Utility of Adaptive Filtering

In the Adaptive Filtering setting, the performance of a system is usually measured in terms of the utility of the documents delivered by that system. For example, in the TREC Filtering track, one popular utility metric is $T9U = \Lambda_1 R + \Lambda_0 N$, where $R$ and $N$ are the number of delivered results the user considered relevant and non-relevant, respectively, and $\Lambda_1$ and $\Lambda_0$ are the benefit achieved and the loss incurred by the user from reading relevant and non-relevant documents, respectively. For TREC-9, $\Lambda_1 = 2$ and $\Lambda_0 = -1$. From the system perspective, the expected utility of delivering a document can be computed as:

$$U(x|D) = \sum_{y \in \{0,1\}} \Lambda_y \, P(y|x, D) \quad (3)$$

Instances with $U(x) > t$ are delivered, where $t$ is the dissemination threshold, learnt via cross-validation or set to 0 (Zhang et al., 2003). In this paper, we focus on the T9U utility metric to gauge system performance, but the ideas are general and may be extended to other evaluation metrics as well.
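As an illustration of Equations 2 and 3, the following Python sketch (the paper's own implementation was in MATLAB) approximates the Bayesian prediction by averaging over posterior weight samples and applies the T9U-style delivery rule. The Gaussian "posterior samples" in the toy usage are stand-ins, not the output of an actual inference step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_relevant(x, w_samples):
    """Monte Carlo estimate of P(y=1|x, D) (Eq. 2): average the sigmoid
    over S samples drawn from the posterior P(w|D)."""
    return sigmoid(w_samples @ x).mean()

def expected_utility(x, w_samples, gain_rel=2.0, gain_nonrel=-1.0):
    """Expected T9U-style utility of delivering x (Eq. 3)."""
    p_rel = prob_relevant(x, w_samples)
    return gain_rel * p_rel + gain_nonrel * (1.0 - p_rel)

def deliver(x, w_samples, threshold=0.0):
    """Passive delivery rule: deliver iff U(x|D) exceeds the threshold t."""
    return expected_utility(x, w_samples) > threshold

# Toy usage with a stand-in Gaussian "posterior" (illustrative only).
rng = np.random.default_rng(0)
d, S = 5, 200
w_samples = rng.normal(loc=0.5, scale=0.3, size=(S, d))  # pretend posterior samples
x = rng.normal(size=d)
print(deliver(x, w_samples), expected_utility(x, w_samples))
```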
3. Our Approach

3.1. Multi-Task Adaptive Filtering

Owing to the success of Logistic Regression (LR) classifiers in Adaptive Filtering, we choose a multi-task approach based on LR classifiers. Our chosen approach is depicted graphically in Figure 1, and Table 1 lists the notation used in this paper. This approach is based on the multi-task classification approach developed by (Xue et al., 2007). A similar approach, using Support Vector Machines for individual tasks, has been used for content-enhanced collaborative filtering by (Yu et al., 2004). We summarize the generative process of the model in Algorithm 1.

Figure 1. Graphical model representation of multi-task adaptive filtering based on Dirichlet Processes.

Table 1. Important notation used in this paper.

Symbol                            Description
$M$                               Total number of tasks
$m \in \{1, \ldots, M\}$          A task
$N_m$                             Total number of instances for task $m$
$d$                               Dimensionality of the feature space
$x_{m,n} \in \mathbb{R}^d$        The $n$'th data instance for task $m$
$y_{m,n} \in \{0,1\}$             The label of instance $x_{m,n}$
$w_m \in \mathbb{R}^d$            LR parameters for task $m$
$k \in \{1, \ldots, \infty\}$     A group or cluster
$w_k \in \mathbb{R}^d$            LR parameters of the $k$'th group
$\pi_{m,k} \in [0,1]$             Group $k$ mixing proportion for task $m$

Algorithm 1 MTAF Generative Process
 1: Fixed diffuse hyperpriors: $\tau_{10}, \tau_{20}, \gamma_0, \alpha_{10}, \alpha_{20}$
 2: Draw $\phi_j \sim \mathrm{Gamma}(\tau_{10}, \tau_{20})$, $j = 1, \ldots, d$
 3: Let $\Phi$ be a diagonal matrix with elements $\phi_j$, $\forall j$
 4: Draw $\mu \sim N(0, (\gamma_0 \Phi)^{-1})$
 5: Let $\Sigma = \Phi^{-1}$
 6: Draw $\alpha \sim \mathrm{Gamma}(\alpha_{10}, \alpha_{20})$
 7: Draw $v_k \sim \mathrm{Beta}(1, \alpha)$, $k = 1, \ldots, \infty$
 8: Draw $w_k \sim N(\mu, \Sigma)$, $k = 1, \ldots, \infty$
 9: Let $\pi_k = v_k \prod_{i=1}^{k-1} (1 - v_i)$, $\forall k$
10: for $m = 1$ to $M$ do
11:   Draw $z_m \sim \mathrm{Multinomial}(1; \pi_1, \ldots, \pi_\infty)$
12:   Let $w_m = \sum_{k=1}^{\infty} z_{m,k} \, w_k$
13:   Draw $y_{m,n} \sim \mathrm{Binomial}(1, \sigma(w_m^T x_{m,n}))$, $n = 1, \ldots, N_m$
14: end for

The approach clusters/groups related tasks by drawing the parameters for related tasks from a mixture of Gaussians (as is evident from line 12 of Algorithm 1). Intuitively, related tasks share information by getting grouped into the same cluster(s). For example, features that are indicators of the interests of a football fanatic (e.g. teams, scoreboards, win/loss decisions) may be inferred from the relevance judgments available from other sports enthusiasts. The grouping of related tasks ensures that unrelated tasks (e.g. stock-portfolio tracking vs. sports-news filtering) do not contaminate each other, as such contamination might lead to a poorer understanding of the individual tasks. As the optimal number of groups is unknown a priori, a Dirichlet Process model is inferred over the tasks to discover the number of groups pertinent to the tasks. The Dirichlet Process is essentially a mixture model with potentially infinitely many components. The actual number of components participating in the formation of the tasks is governed by the parameter $\alpha$, also known as the innovation parameter. Larger values of $\alpha$ lead to more participating components, and vice versa. The optimal value of $\alpha$ (and the other parameters) can be inferred from the data using variational inference (Xue et al., 2007).
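To illustrate Algorithm 1, here is a minimal Python simulation of the generative process using a finite truncation K of the stick-breaking construction. The hyperparameter settings and the simplified prior on the shared mean are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def mtaf_generate(M=5, N_m=20, d=10, K=15, alpha=1.0, seed=0):
    """Simulate the MTAF generative process (Algorithm 1) with a
    truncated (K-component) stick-breaking approximation to the DP."""
    rng = np.random.default_rng(seed)
    phi = rng.gamma(shape=1.0, scale=1.0, size=d)          # per-feature precisions
    mu = rng.normal(0.0, 1.0 / np.sqrt(phi))                # shared mean (simplified prior)
    Sigma = np.diag(1.0 / phi)                              # shared covariance
    v = rng.beta(1.0, alpha, size=K)                        # stick-breaking fractions
    pi = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))  # mixing proportions
    w_groups = rng.multivariate_normal(mu, Sigma, size=K)   # group-level LR weights
    tasks = []
    for m in range(M):
        k = rng.choice(K, p=pi / pi.sum())                  # assign task m to a group
        w_m = w_groups[k]                                    # task weights = group weights
        X = rng.normal(size=(N_m, d))                        # toy document features
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_m)))  # relevance labels
        tasks.append((X, y, k))
    return tasks, w_groups, pi
```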
With the inferred parameters for a task $m$, the decision function for a new instance $x_{m,\cdot}$ follows the distribution:

$$P(y_{m,\cdot}=1 \mid x_{m,\cdot}, w_m) = \sum_{k=1}^{K} \pi_{m,k} \, \sigma(w_k^T x_{m,\cdot}) \quad (4)$$

Again, as in Equation 2, for a fully Bayesian treatment we integrate over the posterior distribution of $w_k$ instead of using a point estimate:

$$P(y_{m,\cdot}=1 \mid x_{m,\cdot}, w_m) = \sum_{k=1}^{K} \pi_{m,k} \int \sigma(w_k^T x_{m,\cdot}) \, P(w_k \mid \mu, \Sigma, D) \, dw_k \quad (5)$$

The above integral does not have an analytical solution; (Xue et al., 2007) suggest an approximate form of the integral based on (MacKay, 1992). Combining Equation 5 and Equation 3, one can compute the expected utility $U(x_{m,\cdot})$ of delivering the document $x_{m,\cdot}$. A passive MTAF approach will deliver the document if $U(x_{m,\cdot}) > t$, where $t$ is a dissemination threshold, usually set to zero or learned via cross-validation. We call this approach passive because, in choosing to deliver $x_{m,\cdot}$, the system does not foresee the benefit of getting feedback on that document. Consequently, the user's effort in providing feedback on this item may go to waste, as it may not result in better results (for the user) in the future. We remedy this situation in the next section by proposing our Active Learning framework for the MTAF approach.
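Before moving to the active variants, here is a minimal sketch of the passive MTAF delivery rule just described (Equations 4, 5 and 3), assuming inference has already produced the task's mixing proportions and posterior samples of the group weights; plain Monte Carlo averaging stands in for the MacKay-style approximation used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mtaf_prob_relevant(x, pi_m, w_group_samples):
    """P(y=1|x) for task m (Eq. 5): a pi_m-weighted mixture of sigmoids,
    each averaged over posterior samples of the group weights w_k.
    pi_m: (K,) mixing proportions; w_group_samples: (K, S, d) samples."""
    per_group = sigmoid(np.einsum('ksd,d->ks', w_group_samples, x)).mean(axis=1)  # (K,)
    return float(pi_m @ per_group)

def mtaf_deliver(x, pi_m, w_group_samples, gains=(2.0, -1.0), threshold=0.0):
    """Passive MTAF delivery: deliver iff expected utility (Eq. 3) > threshold."""
    p = mtaf_prob_relevant(x, pi_m, w_group_samples)
    return gains[0] * p + gains[1] * (1.0 - p) > threshold
```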
3.2. Active Learning for Adaptive Filtering

In a classification setting, an AL approach typically selects instances that (if labeled) are expected to improve the classification performance the most. Along similar lines, in the MTAF setting, an AL approach could estimate the perceived benefit of delivering an item in three different ways. Firstly, does the delivered item $x_{m,\cdot}$ lead to improved performance on task $m$ in the future (based on the feedback received on $x_{m,\cdot}$)? Secondly, will the feedback on the delivered item $x_{m,\cdot}$ lead to improvements in the global model (e.g. $\pi, \mu, \Sigma$)? Finally, does the feedback on the delivered item $x_{m,\cdot}$ lead to improvements in the other tasks in the system? We propose an Active Learning solution for each of these objectives in the following sections.

3.2.1. Local Active Learning

In this section, we discuss an AL approach that scores instances based on the perceived future benefit of delivering them to the current task $m$. We first define our notion of the future benefit $LUG_m(x_{m,\cdot})$ of delivering an instance to task $m$:

$$LUG_m(x_{m,\cdot}|D) = LSP_m(D \cup (x_{m,\cdot}, y_{m,\cdot})) - LSP_m(D) \quad (6)$$

In the above equation, $LSP_m(D')$ denotes the expected system performance on task $m$ when trained on data $D'$. Intuitively, the benefit of delivering an instance $x_{m,\cdot}$ is estimated as the improvement in the task-specific performance (for task $m$) if the feedback on that instance, $(x_{m,\cdot}, y_{m,\cdot})$, were added to the training set $D$. In Equation 6, the system does not know the true label $y_{m,\cdot}$ for the instance $x_{m,\cdot}$. As a result, we rephrase the equation as an expectation over the possible labels:

$$LUG_m(x_{m,\cdot}) = \sum_{y \in \{0,1\}} P(y|x_{m,\cdot}, D) \, LSP_m(D \cup (x_{m,\cdot}, y)) - LSP_m(D) \quad (7)$$

For the case of Adaptive Filtering, we define the system performance in terms of the expected utility of all potential instances the system will monitor:

$$LSP_m(D') = \int_{LU_m^{+}} p(x_{m,*}) \, LU_m(x_{m,*}|D') \, dx_{m,*} \quad (8)$$

In Equation 8, there are two important points to note. Firstly, the integral is over $p(x_{m,*})$, i.e. the task-specific population. It should be noted that each task in the system may have its own sample population from which documents are being monitored/filtered. For example, a stock analyst may subscribe to news feeds from business magazines, while a sports enthusiast may subscribe to news feeds from sports magazines, leading to different $p(x_{m,*})$ for these two tasks. (Of course, the sports enthusiast is not interested in all the sports news being served, but only in the news that matches her interests. It is the goal of the AF system to identify those interests and filter the relevant news from the subscribed feed. Hence, system performance is evaluated over the sample population of the subscribed feed.) Moreover, the domain of the integral is restricted to $LU_m^{+} = \{x_{m,*} : LU_m(x_{m,*}|D') > 0\}$. This means that the future system performance is calculated only over the instances that will be delivered; undelivered instances are ignored in the system performance calculation, owing to the nature of the T9U utility function.

The second subtle point in Equation 8 is the localized nature of the utility function $LU_m(x_{m,*}|D')$. For computing the expected future system performance $LSP_m(D \cup (x_{m,\cdot}, y_{m,\cdot}))$, we compute the utility $LU_m(x_{m,*}|D \cup (x_{m,\cdot}, y_{m,\cdot}))$ based only on the updated local model $w_m$ of task $m$. That is, we do not consider the effect of updating the global parameters $\pi, \mu, \Sigma$ after observing the new training instance $(x_{m,\cdot}, y_{m,\cdot})$. If $w_m$ is the current local model (trained on $D$), we can use Bayes' rule to obtain the posterior distribution $P(w_m|x_{m,\cdot}, y_{m,\cdot})$; we use the variational approximation method suggested by (Jaakkola & Jordan, 1996) to obtain this posterior. With this posterior distribution, the localized expected utility can then be defined as:

$$LU_m(x_{m,*}|D \cup (x_{m,\cdot}, y_{m,\cdot})) = \sum_{y \in \{0,1\}} \Lambda_y \int_{w_m} P(y|x_{m,*}, w_m) \, P(w_m|x_{m,\cdot}, y_{m,\cdot}) \, dw_m \quad (9)$$

Consequently, $LUG_m$ is the localized utility gain of task $m$. In the Adaptive Filtering setting, the system then chooses to deliver items that provide immediate utility $U_m(x_{m,\cdot}|D)$, or that are potentially beneficial for improving the system in future iterations, as estimated by the expected future utility gain $LUG_m(x_{m,\cdot}|D)$. We express the joint objective as a linear combination:

$$LAL_m(x_{m,\cdot}|D) = U_m(x_{m,\cdot}|D) + \eta_L \, LUG_m(x_{m,\cdot}|D) \quad (10)$$

Here, $\eta_L$ is the weight given to the AL component of the system; we discuss the nature of this weight in Section 3.2.5. Owing to the localized focus of this approach on the perceived improvement of the current task, we call it Local Active Learning (LAL). We expect LAL to perform well if the global model, consisting of $\pi, \mu, \Sigma$, is already strong and thus does not need to be updated. Thus, LAL is expected to be effective on new tasks that are added to an already well-performing MTAF system. We deal with the case of improving the global model in the next section.
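The following sketch shows one way the LAL score of Equation 10 could be estimated, with the task population $p(x_{m,*})$ approximated by a pool of recently observed documents and a hypothetical helper `retrain_local(x, y)` standing in for the task-level posterior update of Equation 9. It is an illustrative Monte Carlo approximation, not the paper's implementation.

```python
import numpy as np

def expected_utility(x, w_samples, gains=(2.0, -1.0)):
    p = (1.0 / (1.0 + np.exp(-(w_samples @ x)))).mean()
    return gains[0] * p + gains[1] * (1.0 - p)

def local_system_performance(pool, w_samples):
    """LSP_m (Eq. 8): average utility over pool documents that would be
    delivered (positive expected utility); undelivered items contribute 0."""
    utils = np.array([expected_utility(x, w_samples) for x in pool])
    return utils[utils > 0].sum() / max(len(pool), 1)

def local_utility_gain(x, w_samples, pool, retrain_local):
    """LUG_m (Eq. 7): expectation over the unknown label of the improvement
    in LSP_m after adding (x, y) to the training data.
    retrain_local(x, y) -> new posterior weight samples (hypothetical helper)."""
    p_rel = (1.0 / (1.0 + np.exp(-(w_samples @ x)))).mean()
    lsp_now = local_system_performance(pool, w_samples)
    gain = 0.0
    for y, p_y in ((1, p_rel), (0, 1.0 - p_rel)):
        gain += p_y * local_system_performance(pool, retrain_local(x, y))
    return gain - lsp_now

def lal_score(x, w_samples, pool, retrain_local, eta_L=1.0):
    """LAL_m (Eq. 10): immediate utility plus weighted local utility gain."""
    return expected_utility(x, w_samples) + eta_L * local_utility_gain(
        x, w_samples, pool, retrain_local)
```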
3.2.2. Global Active Learning

As mentioned in the previous section, LAL does not foresee the updates/improvements in the global model, which consists of $\Theta = \{\pi, \mu, \Sigma\}$. Global AL (GAL) fixes this by updating the global model when computing the expected utility gain in Equation 7. It should be noted that, just like LAL, the goal of GAL is still to improve the future performance of task $m$ (the task to which the system plans to deliver instance $x_{m,\cdot}$), albeit based on improvements in the global model. Consequently, we define the global utility function $GU_m(x_{m,*}|D \cup (x_{m,\cdot}, y_{m,\cdot}))$, a modification of the localized utility $LU_m(x_{m,*}|D \cup (x_{m,\cdot}, y_{m,\cdot}))$ in Equation 9:

$$GU_m(x_{m,*}|D \cup (x_{m,\cdot}, y_{m,\cdot})) = \sum_{y \in \{0,1\}} \Lambda_y \int_{\Theta} \int_{w_m} P(y|x_{m,*}, w_m) \, P(w_m|\Theta) \, P(\Theta|x_{m,\cdot}, y_{m,\cdot}) \, dw_m \, d\Theta \quad (11)$$

It is important to note that the posterior global model $P(\Theta|x_{m,\cdot}, y_{m,\cdot})$ has been trained on the additional available feedback from task $m$ only. This is because we are computing the expected global future utility if the instance is delivered to task $m$, that is, if the feedback is received from task $m$. Replacing the local utility $LU$ in Equation 8 with the expected future global utility $GU$, we get the globalized system performance $GSP_m$ of task $m$:

$$GSP_m(D') = \int_{GU_m^{+}} p(x_{m,*}) \, GU_m(x_{m,*}|D') \, dx_{m,*} \quad (12)$$

Substituting the globalized system performance $GSP_m$ from Equation 12 into Equation 7, we get the globalized utility gain:

$$GUG_m(x_{m,\cdot}) = \sum_{y \in \{0,1\}} P(y|x_{m,\cdot}, D) \, GSP_m(D \cup (x_{m,\cdot}, y)) - GSP_m(D) \quad (13)$$

Finally, substituting the globalized utility gain $GUG_m$ into Equation 10, we get the scoring function for the Global Active Learning (GAL) approach:

$$GAL_m(x_{m,\cdot}|D) = U_m(x_{m,\cdot}|D) + \eta_G \, GUG_m(x_{m,\cdot}|D) \quad (14)$$

3.2.3. Benevolent Active Learning

So far, LAL and GAL have focused on improving the performance of the task $m$ to which the system plans to deliver the instance $x_{m,\cdot}$. But it is not clear whether the feedback on that instance will lead to improvements in other tasks. To achieve this goal, we devise another AL score, called Benevolent AL (BAL), in which the system tries to deliver to task $m$ instances that might lead to improvements in other tasks. In this context, we revise the definition of $GSP_m$ to $BSP_m$, which sums over the performance of each task in the system, based on the feedback received on the instance $x_{m,\cdot}$ delivered to task $m$:

$$BSP_m(D') = \sum_{m'=1}^{M} \int_{GU_{m'}^{+}} p(x_{m',*}) \, GU_{m'}(x_{m',*}|D') \, dx_{m',*} \quad (15)$$

It is important to note that, inside the summation, for each task $m'$ the instances $x_{m',*}$ are sampled from the distribution $p(x_{m',*})$ of incoming documents for that particular task $m'$. This stems from the fact that the distribution of incoming instances may be different for different tasks, as discussed earlier. Based on $BSP_m$, for clarity and completeness, we provide the equations for the corresponding utility gain $BUG_m$ and AL score $BAL_m$:

$$BUG_m(x_{m,\cdot}) = \sum_{y \in \{0,1\}} P(y|x_{m,\cdot}, D) \, BSP_m(D \cup (x_{m,\cdot}, y)) - BSP_m(D) \quad (16)$$

$$BAL_m(x_{m,\cdot}|D) = U_m(x_{m,\cdot}|D) + \eta_B \, BUG_m(x_{m,\cdot}|D) \quad (17)$$
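Analogously, here is a sketch of the benevolent score of Equations 15-17, summing the expected delivered utility over every task's own document pool. The helper `retrain_global(m, x, y)` is hypothetical: it is assumed to return updated per-task weight samples for all tasks after hypothetically adding (x, y) to task m's feedback.

```python
import numpy as np

def expected_utility(x, w_samples, gains=(2.0, -1.0)):
    p = (1.0 / (1.0 + np.exp(-(w_samples @ x)))).mean()
    return gains[0] * p + gains[1] * (1.0 - p)

def benevolent_system_performance(task_pools, task_w_samples):
    """BSP_m (Eq. 15): sum over all tasks m' of their expected delivered
    utility, each task evaluated on its own document pool."""
    total = 0.0
    for pool, w_s in zip(task_pools, task_w_samples):
        utils = np.array([expected_utility(x, w_s) for x in pool])
        total += utils[utils > 0].sum() / max(len(pool), 1)
    return total

def bal_score(m, x, task_pools, task_w_samples, retrain_global, eta_B=1.0):
    """BAL_m (Eq. 17) = immediate utility for task m + eta_B * BUG_m (Eq. 16)."""
    p_rel = (1.0 / (1.0 + np.exp(-(task_w_samples[m] @ x)))).mean()
    bsp_now = benevolent_system_performance(task_pools, task_w_samples)
    bug = 0.0
    for y, p_y in ((1, p_rel), (0, 1.0 - p_rel)):
        bug += p_y * benevolent_system_performance(task_pools, retrain_global(m, x, y))
    bug -= bsp_now
    return expected_utility(x, task_w_samples[m]) + eta_B * bug
```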
3.2.4. Analysis of LAL, GAL, and BAL

In Table 2, we summarize the major ideas behind the proposed AL approaches. Each approach focuses on improvements in different parameters of the model, leading to different consequences. The LAL approach focuses on improving its own task. Such an approach can be expected to perform well when the global model is already strong and less susceptible to drastic change based on the feedback on a new instance. The GAL approach re-estimates the global model, but in the process computes the future utility based only on the current task. This selfish strategy of modifying the global model without foreseeing the effect on other tasks in the system can be detrimental to the overall performance of the model, making this approach undesirable. The BAL approach updates the global model, but at the same time studies the effect of the update on the utility gain of the other tasks in the system, thereby ensuring that the global model does not become biased towards a particular task. Consequently, we expect BAL to perform better than the other methods in the initial stages of AF, when the global model is weak.

Table 2. A comparative summary of the AL approaches. The first column lists the AL score used to decide the delivery of instance $x_{m,\cdot}$ to task $m$. The second column lists the parameters that will (potentially) be improved (for better future utility) if the system retrains on the feedback received on the instance delivered based on the corresponding AL score.

Delivery criterion    Potentially improved parameters based on feedback
$LAL_m$               $w_m$
$GAL_m$               $w_m, \pi, \mu, \Sigma$
$BAL_m$               $\pi, \mu, \Sigma, w_{m'} \;\forall m' \in \{1, \ldots, M\}$

3.2.5. Combined AL for MTAF

Based on the discussion in the previous section, the above methods can be combined into a meta-AL system for AF that tries to satisfy multiple objectives at different phases of learning. We use a weighted linear combination of the individual utility gain scores to obtain the Meta AL (MAL) score:

$$MAL_m(x_{m,\cdot}|D) = U_m(x_{m,\cdot}|D) + \eta_L \, LUG_m(x_{m,\cdot}|D) + \eta_G \, GUG_m(x_{m,\cdot}|D) + \eta_B \, BUG_m(x_{m,\cdot}|D) \quad (18)$$

The weight parameters $\eta_L$, $\eta_G$ and $\eta_B$ should vary with the quality of the model, local as well as global. A stronger global model means a lower $\eta_G$. Similarly, $\eta_B$ should decrease so that the tasks can then focus on improving locally, thereby maintaining a higher value of $\eta_L$. We test these hypotheses empirically in Section 4.3.
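A compact sketch of the meta score of Equation 18 follows. The decay of the global and benevolent weights as feedback accumulates is one illustrative way to realize the qualitative behaviour described above; it is not a formula given in the paper.

```python
def mal_score(immediate_utility, lug, gug, bug, n_feedback,
              eta0=(1.0, 1.0, 1.0), decay=0.01):
    """MAL_m (Eq. 18): U + eta_L*LUG + eta_G*GUG + eta_B*BUG.
    Illustrative schedule: eta_G and eta_B shrink, while eta_L stays high,
    as the global model sees more feedback (n_feedback)."""
    eta_L = eta0[0]
    eta_G = eta0[1] / (1.0 + decay * n_feedback)
    eta_B = eta0[2] / (1.0 + decay * n_feedback)
    return immediate_utility + eta_L * lug + eta_G * gug + eta_B * bug
```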
4. Experiments

4.1. Datasets

We use two datasets that are popular in the TREC filtering community, namely RCV1 (84 categories, 810,000 documents) (Lewis et al., 2004) and 20 Newsgroups (20 categories, 18,846 documents) (Joachims, 1997). For both datasets, we model each category as a task, and we start AF with only one known relevant document per task. It should be noted that RCV1 is a multi-labeled dataset, meaning each document may belong to multiple classes/tasks, thereby indicating some level of overlap/relatedness among tasks (e.g. tasks sharing the same documents). 20 Newsgroups is not multi-labeled: each document belongs to only one class. However, it is known that its categories can be grouped into 6 groups based on subject matter, namely comp.*, rec.*, sci.*, talk.*, misc.* and others. Thus, the relatedness of tasks lies in the feature space, as documents are not shared among tasks. For example, the feature "computer" is a strong indicator of the tasks in the comp.* group. We chose this dataset to see whether the MTAF approaches are able to discover these hidden groups of related categories even when documents are not shared among tasks.

4.2. Methods

Primarily, we are interested in comparing the four Active Learning approaches LAL, GAL, BAL and MAL in the MTAF setting, to see which approach most rapidly improves utility and how they perform in the various phases of filtering. We also compare these four approaches to the passive version, which decides to deliver an instance without taking any future utility gain into account (i.e. the delivery criterion is $U_m(x_{m,\cdot}|D) > 0$).

4.2.1. STAF baselines

(Zhang et al., 2003) devised a strategy called exploration and exploitation to perform active Adaptive Filtering, and our AL framework is inspired by their work. The exploitation component is a passive AF strategy that delivers an item if $U_m(x_{m,\cdot}|D) > 0$. The exploration component is an Active Learning component based on a metric called Utility Divergence, similar to the Utility Gain score developed in this paper. Utility Divergence, however, does not compute the actual gain in the utility of the system based on the feedback on the instance. It instead computes the difference between the expected utility of a hypothetical true model and that of the model trained on the expected feedback $(x_{m,\cdot}, y_{m,\cdot})$, and instances are scored by their potential to reduce the gap between the utility of the true model and the utility of the current model. In our results, we refer to these approaches as STAF-passive (only exploitation) and STAF-active-UD (exploitation and exploration using the Utility Divergence score).

4.2.2. A note on implementation

The various methods were implemented in MATLAB. For integrating over the posterior distribution of a variable (e.g. $w_k$), we used the Metropolis-Hastings algorithm. Specifically, we draw several ($S$) samples from the posterior distribution of the corresponding variable and average over the resulting outputs to obtain a probabilistic estimate of the integral, similar to the approach described in (Zhang et al., 2003). For example, to implement Equation 3, we draw $S$ samples from the posterior distribution $P(w|D)$ and the integration is implemented as:

$$U(x|D) = \frac{1}{S} \sum_{s=1}^{S} \sum_{y \in \{0,1\}} \Lambda_y \, P(y|x, w^{(s)}) \quad (19)$$

In this specific case, the posterior distribution was derived using the Laplace approximation method (Xue et al., 2007). Other posterior distributions for the MTAF case were based on the derivations available in (Xue et al., 2007). For integrating over $p(x)$, the samples were drawn from the data stream observed so far, consisting of delivered and undelivered items, to obtain a non-parametric estimate of $p(x)$.
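For concreteness, here is a generic random-walk Metropolis-Hastings sketch of the kind of sampler that could produce the S posterior samples $w^{(s)}$ used in Equation 19; the isotropic Gaussian prior and the proposal scale are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def log_posterior(w, X, y, prior_var=10.0):
    """Unnormalized log P(w|D) for Bayesian logistic regression:
    Bernoulli likelihood plus an isotropic Gaussian prior (assumed)."""
    z = X @ w
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * np.dot(w, w) / prior_var
    return log_lik + log_prior

def metropolis_hastings(X, y, S=200, step=0.1, burn_in=500, seed=0):
    """Random-walk Metropolis-Hastings: returns S samples from P(w|D)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    lp = log_posterior(w, X, y)
    samples = []
    for it in range(burn_in + S):
        w_prop = w + step * rng.normal(size=w.shape)   # propose a local move
        lp_prop = log_posterior(w_prop, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:        # accept/reject step
            w, lp = w_prop, lp_prop
        if it >= burn_in:
            samples.append(w.copy())
    return np.array(samples)   # shape (S, d); plug into the Eq. 19 average
```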
4.3. Experimental Results and Discussion

First, we compare the MTAF approaches (MTAF-Passive, LAL, GAL, BAL, and MAL) to test the hypotheses we made in Section 3.2.4. Figure 2 shows the trends (in terms of T9U utility) on the RCV1 dataset. AL approaches are typically most effective in the initial phases of filtering, so Figure 2 shows performance up to the filtering of the first 5000 instances on the RCV1 dataset. The results validate our hypothesis that the LAL approach performs quite poorly in the initial stages of filtering, when the global model itself is quite weak. It can also be observed that the BAL approach, which tries to improve the overall utility of all the tasks in the system, performs best initially. This is because the improved utility of all tasks is reflected in a stronger global model. LAL improves at a faster pace once the learnt global model is stronger (due to learning from more feedback). The GAL approach, which tries to improve the global model without foreseeing improvements (or degradation) in other tasks, performs worse than the LAL approach. We believe that the GAL approach sometimes makes wrong decisions about instance selection, selecting instances that will lead to an improvement in one task but mostly degrade the other tasks (because the utility gain in the other tasks is not taken into account). We found that setting $\eta_G = 0$ in Equation 18, i.e. ignoring the $GUG_m$ component corresponding to GAL, led to better performance of the MAL approach. Consequently, the MAL approach, which combines the strengths of the LAL and BAL approaches, outperforms all the other approaches.

Figure 2. Comparison of the MTAF AL approaches on the RCV1 dataset (similar trends for 20 Newsgroups).

Next, we compare representative MTAF approaches (MTAF-passive and MTAF-MAL) to the STAF approaches to study the benefit of using a multi-task setup. We also compare the AL scoring functions: our Utility Gain (UG) criterion and the Utility Divergence (UD) criterion proposed in (Zhang et al., 2003). We call these variants of the STAF-active approach STAF-active-UG and STAF-active-UD, respectively. In Figure 3, we observe that the MTAF approaches, active as well as passive, outperform their STAF counterparts. Our best performing approach, MAL (T9U = 423), achieves a performance improvement of more than 20 percent over STAF-active-UD (T9U = 348). A paired t-test shows strong statistically significant evidence (p-value = 0.0002) for the superiority of our approach over the current state-of-the-art STAF-active-UD. We also observe that the starting utility of the MTAF approaches is higher, due to the information sharing between tasks, which helps overcome the per-task data sparsity problem of the STAF approaches. Regarding the AL score, we observe that the performance of UG and UD is quite similar for STAF-active. Empirically, it seems that the goal of UD (to reduce the gap between the utility of a hypothetical true model and the learnt model) is quite similar to the goal of UG (to increase the future utility of the learnt model the most).

Figure 3. Comparison of MTAF and STAF approaches on the 20 Newsgroups dataset (similar trends for RCV1).

5. Conclusion and Future Work

In this paper, we have explored various Active Learning approaches to Multi-Task Adaptive Filtering. To score the benefit of delivering an instance, we developed a new metric called Utility Gain that estimates the improvement in system performance (in terms of utility) if the system is re-trained on the feedback received for the delivered instance. In the MTAF setting, we compared the effect of selfish (local) AL approaches, which focus on improvements in one task, to a benevolent AL approach, which evaluates the benefit of labeling an instance across many tasks. Our empirical analysis demonstrates the superior performance of the benevolent approach in the initial phase of filtering, when the global model of the MTAF system is weak, and a more rapid improvement of the selfish approaches in later stages, when the global model is strong (i.e. the global model may not benefit much from AL). We also demonstrate that a combined approach, called meta-AL, which combines the strengths of the local and benevolent AL approaches, is superior to the individual approaches.

There are several areas for future exploration. The sampling approaches presented in this work can be infeasible in a large-scale adaptive filtering system with millions of tasks. In such an environment, it is necessary to first segregate the tasks so that each of several smaller AF systems can handle its respective tasks. It is also necessary to derive analytic solutions to the sampling strategies described here, to arrive at closed-form or quicker expected future utility evaluation schemes. Another challenge is the problem of spam: in a practical MTAF system, how does the system protect its genuine users/tasks from malicious users? In a benevolent AL approach, the feedback from malicious users is potentially harmful to the system, and consequently to the genuine users. So how does a Benevolent AL approach safeguard against malicious use?
References

Blei, David M. and Jordan, Michael I. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121-144, 2006.

Jaakkola, Tommi S. and Jordan, Michael I. A variational approach to Bayesian logistic regression models and their extensions, 1996.

Joachims, Thorsten. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML), pp. 143-151, 1997.

Lewis, David D., Yang, Yiming, Rose, Tony G., and Li, Fan. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

MacKay, David J. C. The evidence framework applied to classification networks. Neural Computation, 4:720-736, 1992.

Melville, P., Saar-Tsechansky, M., Provost, F., and Mooney, R. An expected utility approach to active feature-value acquisition. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM), 2005.

Robertson, Stephen and Soboroff, Ian. The TREC 2002 filtering track report. In Proceedings of the Eleventh Text REtrieval Conference (TREC-11), 2002.

Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35-63, 2007.

Yu, Kai, Tresp, Volker, and Yu, Shipeng. A nonparametric hierarchical Bayesian framework for information filtering. In Proceedings of the 27th Annual International ACM SIGIR Conference, 2004.

Zhang, Yi, Xu, Wei, and Callan, Jamie. Exploration and exploitation in adaptive filtering based on Bayesian active learning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 896-903, 2003.