Local Likelihood Modeling of Temporal Text Streams

Guy Lebanon (lebanon@stat.purdue.edu)    Yang Zhao (zhao18@stat.purdue.edu)
Department of Statistics, Purdue University, West Lafayette, IN 47907

Abstract

Temporal text data is often generated by a time-changing process or distribution. Such a drift in the underlying distribution cannot be captured by stationary likelihood techniques. We consider the application of local likelihood methods to generative and conditional modeling of temporal document sequences. We examine the asymptotic bias and variance and present an experimental study using the RCV1 dataset containing a temporal sequence of Reuters news stories.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

Time stamped documents such as news stories often cannot be accurately modeled by a single time-invariant distribution. An alternative is to assume that the concepts underlying the distribution generating the data drift with time. In other words, the data is generated by a time-dependent process $z^{(t)} \sim p_t(z)$, $t \in I \subseteq \mathbb{R}$, whose approximation $\{\hat p_t : t \in I\}$ becomes the main objective of the learning task. We assume that the time $t$ is a continuous quantity, even in cases where the realized time points form a discrete sample. For example, assuming that the time stamps represent the days of the year when the documents were authored, we treat the set $\{1, \ldots, 365\}$ as a discrete sample from an underlying continuous interval $[1, 365]$.

We further assume that the data samples $z^{(t)}$, sampled from $p_t$, correspond to pairs $z^{(t)} = (x, y)$ constituting a document $x$ and a categorical-valued label $y$. Such pairs $(x, y)$ appear often in practice, for example with $y$ corresponding to the document topic (Lewis et al., 2004), sentiment (Pang & Lee, 2005), author (Mosteller & Wallace, 1964), or Email spam/no-spam (Mulligan, 1999).

Assuming that our data is a set of time stamped documents and labels $(t, (x, y))$, the drift can be characterized by considering the temporal transition of the joint distribution $p_t(x, y)$, the conditionals $p_t(y|x)$, $p_t(x|y)$, or the marginals $p_t(x)$, $p_t(y)$. The choice of which of the distributions above to model depends on the application at hand. For example, modeling $p_t(y|x)$ is usually sufficient for document classification purposes, while modeling $p_t(x|y)$ is necessary for language modeling, which is an important component in speech recognition, machine translation, and IR.

We demonstrate the presence of concept drift in practice by considering the Reuters RCV1 dataset (Lewis et al., 2004), which contains over 800,000 news stories gathered over a period spanning 365 consecutive days and categorized according to topic. Figure 1 displays the temporal change in the relative frequency (number of appearances in a document divided by document length) of three words: million, common, and Handelsgesellschaft (German trade unions), for documents in the most popular RCV1 category, titled CCAT. It is obvious from these plots that the relative frequency of these words varies substantially in time. For example, the word Handelsgesellschaft appears in 8 distinct time regions, representing time points in which German trade unions were featured in the Reuters news archive.

The temporal variation in relative frequencies illustrated by Figure 1 corresponds to a drift in the distribution generating the data. Since the drift is rather pronounced, standard estimation methods based on maximum likelihood are not likely to accurately model the data. In this paper, we consider instead estimating $\{p_t(x, y) : t \in I\}$ based on the local likelihood principle. Local likelihood is a locally weighted version of the loglikelihood, with the weights determined by the difference between the time points associated with the sampled data and the time at which the inference takes place.

After presenting a more formal discussion of concept drift in Section 3 and the definition of local likelihood in Section 4, we turn to examine in detail the case of modeling $p_t(x|y)$ with local likelihood for n-grams and modeling $p_t(y|x)$ with local likelihood for logistic regression. In the case of 1-grams, or the naive Bayes model, we provide a precise as well as asymptotic description of the bias and variance, which illuminates certain facts concerning the selection of weights and the difference between the online and offline scenarios. Experiments conducted on the RCV1 dataset demonstrate the local likelihood estimation in practice and contrast it with more standard non-local alternatives.

Figure 1. Estimated relative frequency (number of appearances in a document divided by document length) of three words from the most popular category in RCV1 as a function of time. The three panels correspond to the words million, common, and Handelsgesellschaft (German trade unions). The displayed curves were smoothed to remove sampling noise.

2. Related Work

Concept drift, or similar phenomena under different names, has been studied in a number of communities. It has recently gained interest primarily due to an increase in the need to model large scale temporal data streams. Early machine learning literature on the concept drift problem involved mostly computational learning theory tools (Helmbold & Long, 1994; Kuh et al., 1990). Hulten et al. (2001) studied the problem in the context of datamining large scale streams whose distribution changes in time. More recently, Forman (2006) studied the concept drift phenomenon in the context of information retrieval in large textual databases. Sharan and Neville (2007) consider the modeling of temporal changes in relational databases and its application to text classification. Overall, the prevailing technique has been to train standard methods on examples obtained by filtering the data through a sliding window.
Tibshirani and Hastie (1987) developed the local likelihood idea in the statistics community within the context of nonparametric smoothing and regression. More details on local likelihood can be found in (Loader, 1999).

3. The Concept Drift Phenomenon and its Estimation

Formally, the concept drift phenomenon may be thought of as a smooth flow or transition of the joint distribution of a random vector. We will focus on the case of a joint distribution of a random vector $X$ and a random variable $Y$ representing predictor and response variables. We will also restrict our attention to temporal or one-dimensional drifts.

Definition 1. Let $X$ and $Y$ be two discrete random vectors taking values in $\mathcal{X}$ and $\mathcal{Y}$. A smooth temporal drift of $X, Y$ is a smooth mapping from $\mathbb{R}$ to a family of joint distributions $t \mapsto p_t(x, y) = p_t(X = x, Y = y)$.

By restricting ourselves to discrete random variables we can obtain a simple geometric interpretation of concept drift. Denoting the simplex of all distributions over the set $S$ by
$$\mathbb{P}_S \stackrel{\text{def}}{=} \Big\{ r \in \mathbb{R}^{|S|} : r_i \geq 0, \; \sum_{i=1}^{|S|} r_i = 1 \Big\} \tag{1}$$
we have that Definition 1 is equivalent to a smooth parameterized curve in the simplex $\mathbb{P}_{\mathcal{X} \times \mathcal{Y}}$.

The drift in the joint distribution can be decomposed in several ways. The first decomposition, $p_t(x, y) = p_t(x|y)\, p_t(y)$, is useful for generative modeling, and the second decomposition, $p_t(x, y) = p_t(y|x)\, p_t(x)$, is useful for conditional modeling. In the generative case we will focus on modeling $p_t(x|y)$ since modeling $p_t(y)$ is typically an easier problem due to its lower dimensionality (in most cases involving text documents $|\mathcal{Y}| \ll |\mathcal{X}|$). In the case of conditional modeling, we focus on modeling $p_t(y|x)$ and we ignore the drift in the marginal $p_t(x)$ since it is irrelevant for discriminative tasks.

In both cases we assume that our data is a set of time-stamped labeled documents sampled from $p_t(x, y)$, where the time points $t$ are sampled from a distribution $g(t)$. If $g$ is a continuous density, the number of samples at time $t$, denoted by $N_t$, is no greater than 1 with probability 1. In practice, however, we allow $N_t$ to be larger than 1 in order to account for the discretization of time. We thus have the data
$$D = \{(x_{tj}, y_{tj}) : t \in I \subseteq \mathbb{R}, \; j = 1, \ldots, N_t\} \tag{2}$$
where the time points are sampled from $g(t)$ and $(x_{tj}, y_{tj}) \sim p_t(x, y)$.

To illustrate these concepts in the context of the RCV1 dataset, we display in Figure 2 the total number of words per day (left) and the total number of documents per day (right) corresponding to the most popular category in RCV1. As is evident from the right panel, $g(t)$ is a highly non-uniform density corresponding to varying amounts of news content on different dates.

Figure 2. Total number of words per day (left) and documents per day (right) for the most popular category in RCV1. The displayed curves were smoothed to remove sampling noise.

It is easy to come up with two simple solutions to the problem of concept drift modeling. The first solution, called the extreme global model, is to simply ignore the temporal drift and use all of the samples in $D$ regardless of their time stamp. This approach results in a single global model $\hat p$ which serves as an estimate for the entire flow $\{p_t : t \in I\}$, effectively modeling the concept drift as a degenerate curve equivalent to a stationary point in the simplex. The second simple alternative, called the extreme local model, is to model $p_t$ using only data sampled from time $t$, i.e., $\{(x_{tj}, y_{tj}) : j = 1, \ldots, N_t\}$. This alternative decomposes the concept drift estimation into a sequence of disconnected estimation problems. The extreme local model has the benefit that if the individual estimation problems are unbiased, the estimation of the concept drift is unbiased as well. The main drawback of this method is the high estimation variance resulting from the relatively small number of daily samples $N_t$ used to estimate the individual models. Furthermore, assuming $D$ is finite, we can only estimate the drift at the finite number of time points appearing in the dataset $D$ (since we have no training data for the remaining time points). On the other hand, the extreme global model enjoys low variance since it uses all data points to estimate $p_t$. Its main drawback is that it is almost always heavily biased due to the fact that samples from one distribution $p_{t_1}$ are used to estimate a different distribution $p_{t_2}$.
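To make the two extremes concrete, the following minimal sketch (our own illustration rather than code from the experiments; the array `daily_counts` and all other names are hypothetical) computes the extreme global and extreme local multinomial estimates from day-indexed word counts.

```python
import numpy as np

def extreme_global(daily_counts):
    """Single multinomial MLE pooled over all days: low variance, but ignores the drift."""
    total = daily_counts.sum(axis=0)
    return total / total.sum()

def extreme_local(daily_counts):
    """One multinomial MLE per day, using only that day's samples: unbiased per day, high variance."""
    row_totals = daily_counts.sum(axis=1, keepdims=True)
    return daily_counts / np.maximum(row_totals, 1)   # guard against empty days

# daily_counts[t, w] = number of occurrences of word w in documents authored on day t
rng = np.random.default_rng(0)
daily_counts = rng.poisson(5.0, size=(365, 50))
theta_global = extreme_global(daily_counts)   # one distribution for the whole year
theta_daily = extreme_local(daily_counts)     # one distribution per day
```

The global estimate pools a full year of counts into one distribution, while each local estimate is fit only from the (typically few) documents available on that day.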
It is a well-known fact that the optimal solution in terms of minimizing the mean squared estimation error usually lies between the extreme local and extreme global models. An intermediate solution can trade off increased bias for reduced variance and can significantly improve the estimation accuracy. Motivated by this principle, we employ local smoothing in forming a local version of the maximum likelihood principle which includes as special cases the two extreme models mentioned above. The intuition behind local smoothing in the present context is that due to the similarity between $p_t$ and $p_{t+\epsilon}$, it makes sense to estimate $p_t$ using samples from neighboring time points $t + \epsilon$. However, in contrast to the global model, the contribution of points sampled from $p_{t+\epsilon}$ towards estimating $p_t$ should decrease as $\epsilon$ increases.

4. Local Likelihood and Concept Drift

The local likelihood principle extends the ideas of nonparametric regression smoothing and density estimation to likelihood-based inference. We concentrate on using the local likelihood principle for estimating $p_t(x|y)$ and $p_t(y|x)$, which are described next.

4.1. Local Likelihood for n-Gram Estimation

We apply local likelihood to the problem of estimating $p_t(x|y)$ by making the naive Bayes assumption, i.e., that $x|y$ is generated by a multinomial distribution or its n-gram extensions. Assuming documents contain words belonging to a finite dictionary of size $V$, the naive Bayes assumption may be stated as
$$p_t(x|y) \propto \prod_{w \in V} \theta_w^{c(w, x)}, \qquad \theta \in \mathbb{P}_V \tag{3}$$
where $c(w, x)$ represents the number of times word $w$ appears in document $x$. Similarly, the n-gram model extends naive Bayes (3) by considering n-order Markov dependency. The naive Bayes and n-gram models are a mainstay of statistical text processing (Manning & Schutze, 1999) and usually lead to accurate language modeling, especially when appropriate smoothing is used (Chen & Goodman, 1998). For notational simplicity we consider the problem of estimating $p_t(x)$ rather than the equivalent $p_t(x|y)$, and we concentrate on naive Bayes, i.e., 1-grams. Extending the discussion to n-grams with $n > 1$ is relatively straightforward and is omitted due to lack of space.

Applied to the concept drift problem, the local loglikelihood at time $t$ is a smoothed or weighted version of the loglikelihood of the data $D$ in (2), with the amount of smoothing determined by a non-negative smoothing kernel $K_h : \mathbb{R} \to \mathbb{R}$:
$$\ell_t(\theta|D) \stackrel{\text{def}}{=} \sum_{\tau \in I} K_h(t - \tau) \sum_{j=1}^{N_\tau} \log p(x_{\tau j} ; \theta). \tag{4}$$

We assume that the kernel function is a normalized density concentrated around 0 and parameterized by a scale parameter $h > 0$ reflecting its spread and satisfying the relation $K_h(r) = h^{-1} K(r/h)$ for some $K : \mathbb{R} \to \mathbb{R}$ referred to as the base kernel form. We further assume that $K$ has bounded support and $\int u^r K(u)\, du < \infty$ for $r \leq 2$. Wand and Jones (1995) provide more details on the formal requirements of a smoothing kernel. Three popular kernel choices are the tricube, triangular, and uniform kernels, defined as $K_h(r) = h^{-1} K(r/h)$ where the $K(\cdot)$ functions are respectively
$$K(r) = (1 - |r|^3)^3 \cdot 1_{\{|r| < 1\}} \tag{5}$$
$$K(r) = (1 - |r|) \cdot 1_{\{|r| < 1\}} \tag{6}$$
$$K(r) = 2^{-1} \cdot 1_{\{|r| < 1\}}. \tag{7}$$
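The following minimal sketch (an illustration under the definitions above, not code from the paper) implements the three base kernels and the rescaling $K_h(r) = h^{-1} K(r/h)$.

```python
import numpy as np

def tricube(r):
    return (1.0 - np.abs(r) ** 3) ** 3 * (np.abs(r) < 1)

def triangular(r):
    return (1.0 - np.abs(r)) * (np.abs(r) < 1)

def uniform(r):
    return 0.5 * (np.abs(r) < 1)

def K_h(r, h, base=triangular):
    """Scaled kernel K_h(r) = h^{-1} K(r / h); the bandwidth h controls the amount of smoothing."""
    r = np.asarray(r, dtype=float)
    return base(r / h) / h

# weights assigned to days tau = 1..365 when estimating the model at day t = 100
tau = np.arange(1, 366)
weights = K_h(100 - tau, h=25.0)   # triangular weights supported on |t - tau| < 25
```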
The uniform kernel is the simplest choice and leads to a local likelihood (4) equivalent to filtering the data by a sliding window, i.e., $\hat\theta_t$ is computed based on data from adjacent time points with uniform weights. Unfortunately, it can be shown that the uniform kernel is suboptimal in terms of its statistical efficiency or rate of convergence to the underlying distribution (Wand & Jones, 1995). Surprisingly, the triangular kernel has a higher statistical efficiency than the Gaussian kernel and is the focus of our experiments in this subsection. We use the tricube kernel in the next subsection.

The scale parameter $h$ is central to the bias-variance tradeoff. Large $h$ corresponds to more uniform kernels, achieving higher bias and lower variance. Small $h$ corresponds to a higher degree of locality, or lower bias but higher variance. Since $\lim_{h\to 0} K_h$ approaches Dirac's delta function and $\lim_{h\to\infty} K_h$ approaches a constant function, the local log-likelihood (4) interpolates between the loglikelihoods of the extreme local model and the extreme global model mentioned in Section 3 as $h$ ranges from 0 to $+\infty$.

Solving the maximum local likelihood problem for each $t$ provides an estimate of the entire drift $\{\hat\theta_t : t \in \mathbb{R}\}$ with $\hat\theta_t = \arg\max_\theta \ell_t(\theta|D)$. In the case of the naive Bayes or n-gram model we obtain a closed form expression for the local likelihood maximizer $\hat\theta_t$ as well as convenient expressions for its bias and variance. In general, however, there is no closed form maximizer, and iterative optimization algorithms are needed in order to obtain $\hat\theta_t = \arg\max_\theta \ell_t(\theta|D)$ for all $t$.

We denote the length of a document in (2) by $|x_{tj}| \stackrel{\text{def}}{=} \sum_{v=1}^V c(v, x_{tj})$ and the total number of words in day $t$ in (2) by $|x_t| \stackrel{\text{def}}{=} \sum_{j=1}^{N_t} |x_{tj}| = \sum_{j=1}^{N_t} \sum_{v=1}^V c(v, x_{tj})$. We assume that the length of the documents $x_{tj}$ is independent of $t$ and is drawn from a distribution with expectation $\xi$.

Under the above assumptions, the local likelihood (4) of the naive Bayes model becomes
$$\ell_t(\theta|D) = \sum_{\tau \in I} K_h(t - \tau) \sum_{j=1}^{N_\tau} \sum_{w \in V} c(w, x_{\tau j}) \log \theta_w$$
where $\theta \in \mathbb{P}_V$. The local likelihood has a single global maximum whose closed form is obtained by setting to 0 the gradient of the Lagrangian
$$0 = \sum_{\tau \in I} K_h(t - \tau) \sum_{j=1}^{N_\tau} c(w, x_{\tau j}) \frac{1}{[\hat\theta_t]_w} + \lambda$$
to obtain
$$[\hat\theta_t]_w = \frac{\sum_{\tau \in I} K_h(t - \tau) \sum_{j=1}^{N_\tau} c(w, x_{\tau j})}{\sum_{\tau \in I} K_h(t - \tau)\, |x_\tau|}. \tag{8}$$

The estimator $\hat\theta_t$ is a normalized linear combination of word counts, where the combination coefficients are determined by the kernel function and normalized by the number of words in different days. We note that $\hat\theta_t$ in (8) is different from a weighted averaging of the relative frequencies $c(w, x_{\tau j}) / \sum_{w'} c(w', x_{\tau j})$.
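Since (8) is simply a ratio of kernel-weighted word counts to kernel-weighted word totals, it can be computed directly. The sketch below is one possible implementation (hypothetical variable names; a triangular kernel and day-valued time stamps are assumed), not the authors' code.

```python
import numpy as np

def local_unigram(daily_counts, t, h):
    """Closed-form local likelihood estimate (8) with a triangular kernel:
    [theta_t]_w = sum_tau K_h(t - tau) c(w, x_tau) / sum_tau K_h(t - tau) |x_tau|."""
    days = np.arange(daily_counts.shape[0])
    r = (t - days) / h
    w = np.maximum(1.0 - np.abs(r), 0.0) / h        # kernel weights K_h(t - tau)
    weighted_counts = w @ daily_counts              # kernel-weighted word counts (numerator)
    weighted_total = w @ daily_counts.sum(axis=1)   # kernel-weighted word totals (denominator)
    return weighted_counts / weighted_total

# daily_counts[tau, w] = count of word w on day tau; theta_hat_t sums to one
rng = np.random.default_rng(0)
daily_counts = rng.poisson(5.0, size=(365, 50))
theta_hat_t = local_unigram(daily_counts, t=100, h=25.0)
```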
We distinguish between two fundamental scenarios for predicting the drift $\theta_t$.

Offline scenario: The goal is to estimate the drift $\{\theta_t : t \in \mathbb{R}\}$ given the entire dataset $D$. In this case we will consider symmetric kernels $K(r) = K(-r)$, which achieve an increased convergence rate of $\hat\theta_t \to \theta_t$, as indicated by Proposition 2.

Online scenario: The goal is to estimate a model for the present distribution $\theta_t$ using training data from the past, i.e., a dataset whose time stamps are strictly smaller than $t$. This corresponds to situations where the data arrives sequentially as a temporal stream and at each time point a model for the present is estimated using the available stream at that time. We realize this restriction by constraining $K$ to satisfy $K(r) = 0$ for $r \leq 0$.

As with other statistical estimators, the accuracy of $\hat\theta_t$ may be measured in terms of its mean squared error $\mathbb{E}\,\|\hat\theta_t - \theta_t\|^2$, which decomposes as the sum of the squared bias and the variance of $\hat\theta_t$. Examining these quantities allows us to study the convergence rate of $\hat\theta_t$ and its leading coefficient.

Proposition 1. The bias vector $\mathrm{bias}(\hat\theta_t) \stackrel{\text{def}}{=} \mathbb{E}\,\hat\theta_t - \theta_t$ and variance matrix of $\hat\theta_t$ in (8) are
$$\mathrm{bias}(\hat\theta_t) = \frac{\sum_{\tau \in I} K_h(t - \tau)\, |x_\tau|\, (\theta_\tau - \theta_t)}{\sum_{\tau \in I} K_h(t - \tau)\, |x_\tau|} \tag{9}$$
$$\mathrm{Var}(\hat\theta_t) = \frac{\sum_{\tau \in I} K_h^2(t - \tau)\, |x_\tau|\, \big(\mathrm{diag}(\theta_\tau) - \theta_\tau \theta_\tau^\top\big)}{\Big(\sum_{\tau \in I} K_h(t - \tau)\, |x_\tau|\Big)^2} \tag{10}$$
where $\mathrm{diag}(z)$ is the diagonal matrix $[\mathrm{diag}(z)]_{ij} = \delta_{ij} z_i$.

Proof. The random variable (RV) $c(w, x_{\tau j})$ is distributed as a sum of multivariate Bernoulli RVs, or single draws from a multinomial distribution. The expectation and variance of the estimator are those of a linear combination of iid multinomial RVs. To conclude the proof we note that for $Y \sim \mathrm{Mult}(1, \theta)$, $\mathbb{E}\,Y = \theta$ and $\mathrm{Var}(Y) = \mathrm{diag}(\theta) - \theta\theta^\top$.

Examining Equations (9)-(10) reveals the expected dependency of the bias on $h$ and $\theta_t$. The contribution to the bias of the terms $(\theta_\tau - \theta_t)$, for large $|\tau - t|$, will decrease as $h$ decreases, since the kernel becomes more localized, and will reduce to 0 as $h \to 0$. Similarly, for slower drifts the differences $\theta_\tau - \theta_t$ will decrease and reduce the bias.

Despite the relative simplicity of Equations (9)-(10), it is difficult to quantitatively capture the relationship between the bias and variance, the sample size, $h$, $\xi$, and the smoothness of $\theta_t$, $g$. Towards this goal we derive the following asymptotic expansions.

Proposition 2. Assuming (i) $\theta$, $g$ are smooth in $t$, (ii) $h \to 0$, $hn \to \infty$, (iii) $g > 0$ in a neighborhood of $t$, and (iv) document lengths do not depend on $t$ and have expectation $\xi$, we have in the offline case
$$\mathrm{bias}(\hat\theta_t | I) = h^2 \mu_{21}(K) \Big( \dot\theta_t \frac{\dot g(t)}{g(t)} + \frac{1}{2} \ddot\theta_t \Big) + o_P(h^2) \tag{11}$$
$$\mathrm{Var}(\hat\theta_t | I) = \frac{\mu_{02}(K)}{nh\, g(t)\, \xi} \big(\mathrm{diag}(\theta_t) - \theta_t\theta_t^\top\big) + o_P\big((nh)^{-1}\big)$$
and in the online case
$$\mathrm{bias}(\hat\theta_t | I) = h\, \mu_{11}(K)\, \dot\theta_t + o_P(h) \tag{12}$$
$$\mathrm{Var}(\hat\theta_t | I) = \Big( \frac{\mu_{02}(K)}{nh\, g(t)\, \xi} + \frac{\mu_{12}(K)\, \dot g(t)}{n\, \xi\, g^2(t)} \Big) \big(\mathrm{diag}(\theta_t) - \theta_t\theta_t^\top\big) + \frac{\mu_{12}(K)}{n\, \xi\, g(t)} \big(\mathrm{diag}(\dot\theta_t) - \dot\theta_t\theta_t^\top - \theta_t\dot\theta_t^\top\big) + o_P\big((nh)^{-1}\big)$$
where $\mu_{kl}(K) \stackrel{\text{def}}{=} \int u^k K^l(u)\, du$ is assumed to be finite and $\dot\theta_t$ is the vector $[\dot\theta_t]_i = \frac{d}{dt}[\theta_t]_i$.

The proof is somewhat similar to the derivation of the asymptotic bias and variance of the Nadaraya-Watson local regression (Wand & Jones, 1995) and is omitted due to space limitations. The notation $g_n \stackrel{p}{\to} f$ represents convergence in probability of $g_n$ to $f$, i.e., $\forall \epsilon > 0$, $P(|g_n - f| > \epsilon) \to 0$, and $g_n = o_P(f_n)$ represents $g_n / f_n \stackrel{p}{\to} 0$.

Corollary 1. Under the assumptions in Proposition 2, and in particular $h \to 0$, $nh \to \infty$, the estimator $\hat\theta_t$ is consistent, i.e., $\hat\theta_t \stackrel{p}{\to} \theta_t$, in both the offline and online settings.

Proposition 2 specifies the conditions for consistency as well as the rate of convergence. In particular, the bias of online kernels converges at a linear rather than quadratic rate. In either case, the estimator is biased and inconsistent unless $h \to 0$, $n \to \infty$, and $nh \to \infty$. Expressions (11)-(12) reveal the performance gain associated with a slower drift $\dot\theta_t$ and sampling density $g(t)$, and with more (represented by $n$) and longer (represented by $\xi$) documents.

Figure 3 displays the RCV1 per-word test set loglikelihood for the online and offline scenarios as a function of the (triangular) kernel's bandwidth. As expected, offline kernels perform better than online kernels, with both achieving the best performance for a bandwidth of approximately 25, which corresponds to a support of 25 days in the online scenario and 50 days in the offline scenario.

Figure 3. Per-word log-likelihood of held-out test set as a function of the triangular kernel's bandwidth for the two largest RCV1 categories (CCAT (left) and GCAT (right)). In all four cases, the optimal bandwidth seems to be approximately 25, which indicates a support of 25 days for the online kernels and 50 days for the offline kernels.
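The offline/online distinction amounts to using a symmetric kernel versus a one-sided kernel satisfying $K(r) = 0$ for $r \leq 0$. A minimal sketch of the two weighting schemes (assuming a triangular base kernel and day-valued time stamps; not the paper's code):

```python
import numpy as np

def offline_weights(t, days, h):
    """Symmetric triangular kernel: days before and after t both receive weight."""
    r = (t - days) / h
    return np.maximum(1.0 - np.abs(r), 0.0) / h

def online_weights(t, days, h):
    """One-sided kernel with K(r) = 0 for r <= 0: only days strictly before t contribute."""
    r = (t - days) / h
    w = np.maximum(1.0 - np.abs(r), 0.0) / h
    return np.where(r > 0, w, 0.0)

days = np.arange(1, 366)
w_off = offline_weights(200.0, days, h=25.0)   # support of about 50 days centered at t
w_on = online_weights(200.0, days, h=25.0)     # support of about 25 days ending just before t
```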
Note that in addition to obtaining higher accuracy than the global model corresponding to $h \to \infty$, the local model enjoys computational efficiency, as it ignores a large portion of the training data.

A central issue in local likelihood modeling is selecting the appropriate bandwidth $h$. A practical solution is to use cross validation or some other automatic bandwidth selection mechanism (a small cross-validation sketch appears at the end of this subsection). On RCV1 data, the performance of such cross validation schemes is very good, and the estimated bandwidth possesses test set loglikelihood that is almost identical to that of the optimal bandwidth (see Figure 4, left). Allowing the kernel scale to vary over time results in a higher modeling accuracy than using a fixed bandwidth for all dates (see Figure 4, right). A time-dependent cross validation procedure may be used to approximate the time-dependent optimal bandwidth, which performs slightly better than the fixed-date cross validation estimator. Note that the accuracy with which the cross validation estimator approximates the optimal bandwidth is lower in the time-dependent or varying bandwidth situation, due to the fact that much less data is available in each of the daily cross validation problems.

Figure 4. Per-word log-likelihood over held-out test set for various bandwidths as a function of the daily training set size. Left: The extreme global model corresponding to $h \to \infty$ performs worst. Selecting the bandwidth by cross validation results in an accurate estimate and test-set loglikelihood almost identical to that of the optimal bandwidth. Right: Allowing the kernel scale to vary over time results in a higher modeling accuracy than using a fixed bandwidth for all dates.

From a theoretical perspective, the asymptotic bias and variance can be used to characterize the optimal bandwidth and study its properties. Minimizing the (offline) leading term of the sum of component-wise MSE with respect to $h$, we obtain the bandwidth estimator
$$\hat h_t^{\,5} = \frac{\mu_{02}(K)\, \mathrm{tr}\big(\mathrm{diag}(\theta_t) - \theta_t\theta_t^\top\big)}{4 n\, \xi\, \mu_{21}^2(K)\, g(t) \sum_{j=1}^V \big( [\dot\theta_t]_j\, \dot g(t)/g(t) + [\ddot\theta_t]_j / 2 \big)^2}. \tag{13}$$
As expected, the optimal bandwidth decreases as $n$, $\xi$, $\dot\theta_t$, $\ddot\theta_t$ increase. Intuitively this makes sense since in these cases the variance decreases and the bias either increases or stays constant. In practice, $\dot\theta_t$, $\ddot\theta_t$ may vary significantly with time, which leads to the conclusion that a single bandwidth selection for all $t$ may not perform adequately. These changes are illustrated in Figure 5 (left), which demonstrates the temporal change in the gradient norm.

Perhaps more interesting than the dependency of the optimal bandwidth on $n$, $\xi$, $\dot\theta_t$, $\ddot\theta_t$ is its dependency on the time sampling distribution $g(t)$. Equation (13) reveals an unexpected non-monotonic dependency of the optimal bandwidth on $g(t)$. The dependency, expressed by $\hat h_t \propto \big( \sum_{j=1}^V (c_{1j}/\sqrt{g(t)} + c_{2j}\sqrt{g(t)})^2 \big)^{-1/5}$, is illustrated in Figure 6 (left), where we assume for simplicity that $c_{1j}$ and $c_{2j}$ do not change with $j$, resulting in $(\hat h_t)^{-1} \propto (c_1/\sqrt{g(t)} + c_2\sqrt{g(t)})^{2/5}$. The key to understanding this relationship is the increased asymptotic bias due to the presence of the term $\dot g(t)/g(t)$ in Equation (11). Intuitively, the variations in $g(t)$ expressed by $\dot g(t)$ introduce a bias component which alters the otherwise monotonic role of the optimal bandwidth and bias-variance tradeoff. Since $g(t)$ is highly nonuniform (as illustrated in Figure 2), this dependency of $\hat h_t$ on $g(t)$ is likely to play a significant role.

Figure 6. Inverse of the optimal bandwidth derived from Equation (13) as a function of $g(t)$: $(\hat h_t)^{-1} \propto (c_1/\sqrt{g(t)} + c_2\sqrt{g(t)})^{2/5}$ (we take $c_1 = c_2 = 1$). The graph shows the non-monotonic dependency between $\hat h_t$ and $g(t)$.

We finally point out that different words $w$ have different parameters $[\theta_t]_w$ and parameter derivatives $[\dot\theta_t]_w$, which indicates that it is unlikely that a single bandwidth will work best for all words. Frequent words are likely to benefit more from narrow kernel smoothing than rare words which almost never appear. As a result, a lower bandwidth should be used for frequent words while a higher bandwidth should be used for rare words. A systematic investigation of these topics is beyond the scope of this paper.
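As a rough illustration of the cross-validation scheme mentioned above (the candidate grid, the additive smoothing constant, and the names `train_counts`/`heldout_counts` are arbitrary assumptions, and the `local_unigram` helper from the earlier sketch is assumed to be in scope), candidate bandwidths can be scored by the held-out per-word log-likelihood of the local estimator (8):

```python
import numpy as np

def per_word_loglik(theta, counts, eps=1e-10):
    """Held-out log-likelihood per word under an estimated word distribution."""
    return float((counts * np.log(theta + eps)).sum() / max(counts.sum(), 1))

def select_bandwidth(train_counts, heldout_counts, candidate_h):
    """Return the bandwidth whose local estimates (8) best explain held-out counts, averaged
    over days; train and held-out counts come from disjoint document splits within each day."""
    days = range(train_counts.shape[0])
    scores = [np.mean([per_word_loglik(local_unigram(train_counts, t, h), heldout_counts[t])
                       for t in days])
              for h in candidate_h]
    return candidate_h[int(np.argmax(scores))]

# hypothetical per-day train / held-out count matrices of identical shape
best_h = select_bandwidth(train_counts, heldout_counts, [5.0, 10.0, 25.0, 50.0, 100.0])
```

A per-day variant of the same loop (scoring each day separately) gives the time-dependent bandwidth discussed above.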
4.2. Local Likelihood for Logistic Regression

Often, the primary goal behind modeling the drift is conditional modeling, i.e., predicting the value of $y$ given $x$. In this case, drift modeling should focus on estimating the conditional $p_t(y|x)$, since modeling the marginal $p_t(x)$ becomes irrelevant. In contrast to the modeling of the conditional by Bayes rule, $p_t(y|x) \propto p_t(x|y)\, p_t(y)$, described in the previous section, we explore here direct modeling of $\{p_t(y|x) : t \in I\}$ using local likelihood for logistic regression.

By direct analogy to Equation (4), the conditional local likelihood estimator of $p_t(y|x)$ is the maximizer of the locally weighted conditional loglikelihood
$$\ell_t(\theta|D) = \sum_{\tau \in I} K_h(t - \tau) \sum_{j=1}^{N_\tau} \log p(y_{\tau j} | x_{\tau j} ; \theta).$$
As in the generative case, the kernel parameter $h$ balances the degree of the kernel's locality and controls the bias-variance tradeoff.

Denoting by $f(x)$ the vector of relative frequencies in the document $x$, the logistic regression model
$$\log \frac{p(1|x; \theta_t)}{1 - p(1|x; \theta_t)} = \langle \theta_t, f(x) \rangle, \qquad \theta_t \in \mathbb{R}^V,$$
leads to the following local conditional likelihood
$$\ell_t(\theta|D) = -\sum_{\tau \in I} K_h(t - \tau) \sum_{j=1}^{N_\tau} \log\big(1 + e^{-y_{\tau j} \langle x_{\tau j}, \theta \rangle}\big).$$

In contrast to the naive Bayes model in the previous section, the local likelihood does not have a closed form maximizer. However, it can be shown that under mild conditions it is a concave problem exhibiting a single global maximum (for each $t$) (Loader, 1999). Most of the standard iterative algorithms for training logistic regression can be modified to account for the local weighting introduced by the smoothing kernel.
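Because the locally weighted loss above is an ordinary regularized logistic regression objective with per-example weights $K_h(t - \tau)$, any weighted logistic solver can be reused. One possible sketch using scikit-learn's `sample_weight` argument (a tooling assumption on our part, not the authors' setup; `X`, `y`, and `days` are hypothetical arrays):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_local_logistic(X, y, days, t, h, C=1.0):
    """Fit p_t(y|x) by weighting each example with a triangular kernel weight K_h(t - day);
    the L2 penalty (strength 1/C) plays the role of the regularizer discussed in the text."""
    r = (t - np.asarray(days, dtype=float)) / h
    w = np.maximum(1.0 - np.abs(r), 0.0) / h     # K_h(t - tau); zero outside the kernel support
    keep = w > 0                                 # examples receiving no weight are dropped
    clf = LogisticRegression(C=C)
    clf.fit(X[keep], y[keep], sample_weight=w[keep])
    return clf

# X[i] = relative word frequencies f(x_i), days[i] = time stamp, y[i] = binary label
# clf_t = fit_local_logistic(X, y, days, t=200, h=25.0)
```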
Moreover, recently popularized regularization techniques, such as the penalty $c\,\|\theta\|_q$, $q = 1, 2$, may be added to the local likelihood to obtain a local regularized version equivalent to maximum posterior estimation.

Figure 5 (right) displays the classification error rate over a held-out test set for the local logistic regression model as a function of the train set size. The classification task was predicting the most popular class vs. the second most popular class in RCV1. The plots in the figure contrast the performance of the online and offline tricube kernels with optimal and infinite bandwidths, using L2 regularization. The local model achieved a relative reduction in error rate over the global model of about 8%. As expected, the online kernels generally achieve worse error rates than the offline kernels. In all the experiments mentioned above we averaged over multiple random samplings of the training set to remove sampling noise.

Figure 5. Left: Estimated gradient norm for the most popular category in RCV1 as a function of $t$. The derivatives were estimated using local smoothing. To avoid running into boundary effects we ignore the first and last 50 days. Right: Classification error rate over a held-out test set for the local logistic regression model as a function of the train set size.

5. Discussion

A large number of textual datasets, such as emails, webpages, and news stories, contain time stamped documents. For such datasets, considering a drifting rather than a stationary distribution is often appropriate. The local likelihood framework provides a natural extension of many standard likelihood models to the concept drift scenario. As the drift becomes more noticeable and the data size increases, the potential benefits of local likelihood methods over their extreme global or local counterparts increase. In this paper we illustrate the drift phenomenon and examine the properties of the local likelihood estimator, including the asymptotic bias and variance tradeoff and the optimal bandwidth.
Experiments conducted on the RCV1 dataset demonstrate the validity of the local likelihood estimators in practice and contrast them with more standard non-local alternatives.

References

Chen, S., & Goodman, J. (1998). An empirical study of smoothing techniques for language modeling (Technical Report). Harvard University.

Forman, G. (2006). Tackling concept drift by temporal inductive transfer. Proc. of the ACM SIGIR Conference.

Helmbold, D. P., & Long, P. M. (1994). Tracking drifting concepts by minimizing disagreements. Machine Learning, 14.

Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. Proc. of the ACM SIGKDD Conference.

Kuh, A., Petsche, T., & Rivest, R. L. (1990). Learning time-varying concepts. Advances in Neural Information Processing Systems, 3.

Lewis, D., Yang, Y., Rose, T., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5.

Loader, C. (1999). Local regression and likelihood. Springer.

Manning, C. D., & Schutze, H. (1999). Foundations of statistical natural language processing. MIT Press.

Mosteller, F., & Wallace, D. (1964). Inference and disputed authorship: The Federalist. Addison Wesley.

Mulligan, G. (1999). Removing the spam: Email processing and filtering. Addison Wesley.

Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. Proc. of the ACL Conference.

Sharan, U., & Neville, J. (2007). Exploiting time-varying relationships in statistical relational models. Proc. of the joint WebKDD and SNA-KDD Workshop.

Tibshirani, R., & Hastie, T. (1987). Local likelihood estimation. Journal of the American Statistical Association, 82.

Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. Chapman and Hall/CRC.