Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition
Andrew Arnold, Ramesh Nallapati and William W. Cohen
Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA
{aarnold, nmramesh, wcohen}@cs.cmu.edu

Abstract
We present a novel hierarchical prior structure for supervised transfer learning in named entity recognition, motivated by the common structure of feature spaces for this task across natural language data sets. The problem of transfer learning, where information gained in one learning task is used to improve performance in another related task, is an important new area of research. In the subproblem of domain adaptation, a model trained over a source domain is generalized to perform well on a related target domain, where the two domains' data are distributed similarly, but not identically. We introduce the concept of groups of closely-related domains, called genres, and show how inter-genre adaptation is related to domain adaptation. We also examine multitask learning, where two domains may be related, but where the concept to be learned in each case is distinct. We show that our prior conveys useful information across domains, genres and tasks, while remaining robust to spurious signals not related to the target domain and concept. We further show that our model generalizes a class of similar hierarchical priors, smoothed to varying degrees, and lay the groundwork for future exploration in this area.

Proceedings of ACL-08: HLT, pages 245–253, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

1 Introduction

1.1 Problem definition

Consider the task of named entity recognition (NER). Specifically, you are given a corpus of news articles in which all tokens have been labeled as either belonging to personal name mentions or not. The standard supervised machine learning problem is to learn a classifier over this training data that will successfully label unseen test data drawn from the same distribution as the training data, where "same distribution" could mean anything from having the train and test articles written by the same author to having them written in the same language. Having successfully trained a named entity classifier on this news data, now consider the problem of learning to classify tokens as names in e-mail data. An intuitive solution might be to simply retrain the classifier, de novo, on the e-mail data. Practically, however, large, labeled datasets are often expensive to build and this solution would not scale across a large number of different datasets.

Clearly the problems of identifying names in news articles and e-mails are closely related, and learning to do well on one should help your performance on the other. At the same time, however, there are serious differences between the two problems that need to be addressed. For instance, capitalization, which will certainly be a useful feature in the news problem, may prove less informative in the e-mail data since the rules of capitalization are followed less strictly in that domain.

These are the problems we address in this paper. In particular, we develop a novel prior for named entity recognition that exploits the hierarchical feature space often found in natural language domains (§1.2) and allows for the transfer of information from labeled datasets in other domains (§1.3). §2 introduces the maximum entropy (maxent) and conditional random field (CRF) learning techniques employed, along with specifications for the design and training of our hierarchical prior. Finally, in §3 we present an empirical investigation of our prior's performance against a number of baselines, demonstrating both its effectiveness and robustness.

1.2 Hierarchical feature trees

In many NER problems, features are often constructed as a series of transformations of the input training data, performed in sequence. Thus, if our task is to identify tokens as either being (O)utside or (I)nside person names, and we are given the labeled sample training sentence:

O    O   O    O  O         I
Give the book to Professor Caldwell        (1)

one such useful feature might be: Is the token one slot to the left of the current token Professor? We can represent this symbolically as L.1.Professor where we describe the whole space of useful features of this form as: {direction = (L)eft, (C)urrent, (R)ight}.{distance = 1, 2, 3, ...}.{value = Professor, book, ...}. We can conceptualize this structure as a tree, where each slot in the symbolic name of a feature is a branch and each period between slots represents another level, going from root to leaf as read left to right. Thus a subsection of the entire feature tree for the token Caldwell could be drawn as in Figure 1 (zoomed in on the section of the tree where the L.1.Professor feature resides).

[Figure 1 diagram: a tree whose levels are direction (L, C, R), distance (1, 2, ...) and value (Professor, book, ...), ending in true/false leaves.]

Figure 1: Graphical representation of a hierarchical feature tree for token Caldwell in example Sentence 1.

Representing feature spaces with this kind of tree, besides often coinciding with the explicit language used by common natural language toolkits (Cohen, 2004), has the added benefit of allowing a model to easily back off, or smooth, to decreasing levels of specificity. For example, the leaf level of the feature tree for our sample Sentence 1 tells us that the word Professor is important, with respect to labeling person names, when located one slot to the left of the current word being classified. This may be useful in the context of an academic corpus, but might be less useful in a medical domain where the word Professor occurs less often. Instead, we might want to learn the related feature L.1.Dr. In fact, it might be useful to generalize across multiple domains the fact that the word immediately preceding the current word is often important with respect to the named entity status of the current word. This is easily accomplished by backing up one level from a leaf in the tree structure to its parent, to represent a class of features such as L.1.*. It has been shown empirically that, while the significance of particular features might vary between domains and tasks, certain generalized classes of features retain their importance across domains (Minkov et al., 2005). By backing off in this way, we can use the feature hierarchy as a prior for transferring beliefs about the significance of entire classes of features across domains and tasks. Some examples illustrating this idea are shown in Table 1.

LeftToken.*
LeftToken.IsWord.*
LeftToken.IsWord.IsTitle.*
LeftToken.IsWord.IsTitle.equals.*
LeftToken.IsWord.IsTitle.equals.mr

Table 1: A few examples of the feature hierarchy

1.3 Transfer learning

When only the type of data being examined is allowed to vary (from news articles to e-mails, for example), the problem is called domain adaptation (Daumé III and Marcu, 2006). When the task being learned varies (say, from identifying person names to identifying protein names), the problem is called multi-task learning (Caruana, 1997). Both of these are considered specific types of the overarching transfer learning problem, and both seem to require a way of altering the classifier learned on the first problem (called the source domain, or source task) to fit the specifics of the second problem (called the target domain, or target task).

More formally, given an example x and a class label y, the standard statistical classification task is to assign a probability, p(y|x), to x of belonging to class y. In the binary classification case the labels are Y ∈ {0, 1}. In the case we examine, each example $x_i$ is represented as a vector of binary features $(f_1(x_i), \ldots, f_F(x_i))$ where F is the number of features. The data consists of two disjoint subsets: the training set $(X_{train}, Y_{train}) = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, available to the model for its training, and the test set $X_{test} = (x_1, \ldots, x_M)$, upon which we want to use our trained classifier to make predictions.


In the paradigm of inductive learning, $(X_{train}, Y_{train})$ are known, while both $X_{test}$ and $Y_{test}$ are completely hidden during training time. In this case $X_{test}$ and $X_{train}$ are both assumed to have been drawn from the same distribution, D. In the setting of transfer learning, however, we would like to apply our trained classifier to examples drawn from a distribution different from the one upon which it was trained. We therefore assume there are two different distributions, $D_{source}$ and $D_{target}$, from which data may be drawn. Given this notation we can then precisely state the transfer learning problem as trying to assign labels $Y^{target}_{test}$ to test data $X^{target}_{test}$ drawn from $D_{target}$, given training data $(X^{source}_{train}, Y^{source}_{train})$ drawn from $D_{source}$.

In this paper we focus on two subproblems of transfer learning:

· domain adaptation, where we assume Y (the set of possible labels) is the same for both $D_{source}$ and $D_{target}$, while $D_{source}$ and $D_{target}$ themselves are allowed to vary between domains.
· multi-task learning (Ando and Zhang, 2005; Caruana, 1997; Sutton and McCallum, 2005; Zhang et al., 2005), in which the task (and label set) is allowed to vary from source to target.

Domain adaptation can be further distinguished by the degree of relatedness between the source and target domains. For example, in this work we group data collected in the same medium (e.g., all annotated e-mails or all annotated news articles) as belonging to the same genre. Although the specific boundary between domain and genre for a particular set of data is often subjective, it is nevertheless a useful distinction to draw.

One common way of addressing the transfer learning problem is to use a prior which, in conjunction with a probabilistic model, allows one to specify a priori beliefs about a distribution, thus biasing the results a learning algorithm would have produced had it only been allowed to see the training data (Raina et al., 2006). In the example from §1.1, our belief that capitalization is less strict in e-mails than in news articles could be encoded in a prior that biased the importance of the capitalization feature to be lower for e-mails than news articles. In the next section we address the problem of how to come up with a suitable prior for transfer learning across named entity recognition problems.

2 Models considered

2.1 Basic Conditional Random Fields

In this work, we base our models on Conditional Random Fields (CRFs) (Lafferty et al., 2001), which are now one of the most preferred sequential models for many natural language processing tasks. The parametric form of the CRF for a sentence of length n is given as follows:

$$p_\Lambda(Y = y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j=1}^{F} f_j(x, y_i)\, \lambda_j \right) \qquad (2)$$

where Z(x) is the normalization term. A CRF learns a model consisting of a set of weights $\Lambda = \{\lambda_1, \ldots, \lambda_F\}$ over the features so as to maximize the conditional likelihood of the training data, $p(Y_{train} \mid X_{train})$, given the model $p_\Lambda$.

2.2 CRF with Gaussian priors

To avoid overfitting the training data, these λ's are often further constrained by the use of a Gaussian prior (Chen and Rosenfeld, 1999) with diagonal covariance, $N(\mu, \sigma^2)$, which leads to the objective:

$$\arg\max_{\Lambda} \; \sum_{k=1}^{N} \log p_\Lambda(y_k \mid x_k) \; - \; \beta \sum_{j=1}^{F} \frac{(\lambda_j - \mu_j)^2}{2\sigma_j^2}$$

where β > 0 is a parameter controlling the amount of regularization, and N is the number of sentences in the training set.

2.3 Source trained priors

One recently proposed method (Chelba and Acero, 2004) for transfer learning in Maximum Entropy models¹ involves modifying the µ's of this Gaussian prior. First a model of the source domain, $\Lambda^{source}$, is learned by training on $\{X^{source}_{train}, Y^{source}_{train}\}$. Then a model of the target domain is trained over a limited set of labeled target data $\{X^{target}_{train}, Y^{target}_{train}\}$, but instead of regularizing this $\Lambda^{target}$ to be near zero (i.e. setting µ = 0), $\Lambda^{target}$ is instead regularized towards the previously learned source values $\Lambda^{source}$ (by setting $\mu = \Lambda^{source}$, while $\sigma^2$ remains 1), thus minimizing $(\Lambda^{target} - \Lambda^{source})^2$.

¹ Maximum Entropy models are special cases of CRFs that use the I.I.D. assumption. The method under discussion can also be extended to CRFs directly.

Note that, since this model requires $Y^{target}_{train}$ in order to learn $\Lambda^{target}$, it, in effect, requires two distinct labeled training datasets: one on which to train the prior, and another on which to learn the model's final weights (which we call tuning), using the previously trained prior for regularization. If we are unable to find a match between features in the training and tuning datasets (for instance, if a word appears in the tuning corpus but not the training), we back off to a standard N(0, 1) prior for that feature.
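As a rough illustration of the source-trained prior just described, the following Python sketch computes the Gaussian penalty term of §2.2 and selects prior means in the way §2.3 describes. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary representation of feature weights, and the example feature names and values are all hypothetical, and the CRF likelihood term itself is omitted.

```python
def gaussian_prior_penalty(weights, means, variances, beta=1.0):
    """Penalty term beta * sum_j (lambda_j - mu_j)^2 / (2 * sigma_j^2)."""
    return beta * sum(
        (w - m) ** 2 / (2.0 * v)
        for w, m, v in zip(weights, means, variances)
    )


def source_trained_means(target_features, source_weights):
    """Prior mean per target feature, in the spirit of the source-trained prior.

    A feature seen in the source model is regularized towards its source
    weight; an unmatched feature falls back to mean 0, i.e. the standard
    N(0, 1) prior. `source_weights` is a hypothetical dict mapping feature
    names to weights learned on the source domain.
    """
    return [source_weights.get(f, 0.0) for f in target_features]


# Hypothetical weights learned on a source domain.
source_weights = {"L.1.Professor": 1.5, "C.1.Caldwell": 0.5}
target_features = ["L.1.Professor", "L.1.Dr"]

means = source_trained_means(target_features, source_weights)
variances = [1.0] * len(target_features)  # sigma^2 remains 1

# Penalty for a candidate target weight vector (one weight per target feature).
penalty = gaussian_prior_penalty([1.0, 0.5], means, variances)
```

In this sketch the unmatched feature L.1.Dr is simply pulled towards zero, which is the behavior the hierarchical prior of the following sections is designed to improve on.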
2.4 New model: Hierarchical prior model

In this section, we will present a new model that learns simultaneously from multiple domains, by taking advantage of our feature hierarchy. We will assume that there are D domains on which we are learning simultaneously. Let there be $M_d$ training data in each domain d. For our experiments with non-identically distributed, independent data, we use conditional random fields (cf. §2.1). However, this model can be extended to any discriminative probabilistic model such as the MaxEnt model. Let $\Lambda^{(d)} = (\lambda_1^{(d)}, \ldots, \lambda_{F_d}^{(d)})$ be the parameters of the discriminative model in the domain d where $F_d$ represents the number of features in the domain d. Further, we will also assume that the features of different domains share a common hierarchy represented by a tree T, whose leaf nodes are the features themselves (cf. Figure 1). The model parameters $\Lambda^{(d)}$, then, form the parameters of the leaves of this hierarchy. Each non-leaf node $n \in \text{non-leaf}(T)$ of the tree is also associated with a hyper-parameter $z_n$. Note that since the hierarchy is a tree, each node n has only one parent, represented by pa(n). Similarly, we represent the set of children nodes of a node n as ch(n). The entire graphical model for an example consisting of three domains is shown in Figure 2.

[Figure 2 diagram: hyper-parameter nodes z1, z2, z3 above per-domain weight nodes w1(1) to w4(1), w1(2) to w3(2), and w1(3), w2(3), with nodes x_i^(d) and y_i^(d) in plates of size M^(d) for d = 1, 2, 3.]

Figure 2: Graphical representation of the hierarchical transfer model.

The conditional likelihood of the entire training data $(y, x) = \{(y_1^{(d)}, x_1^{(d)}), \ldots, (y_{M_d}^{(d)}, x_{M_d}^{(d)})\}_{d=1}^{D}$ is given by:

$$
\begin{aligned}
P(y \mid x, w, z) = {} & \prod_{d=1}^{D} \prod_{k=1}^{M_d} P\left(y_k^{(d)} \mid x_k^{(d)}, \Lambda^{(d)}\right) \\
& \times \prod_{d=1}^{D} \prod_{f=1}^{F_d} N\left(\lambda_f^{(d)} \mid z_{pa(f^{(d)})}, 1\right) \\
& \times \prod_{n \in T_{nonleaf}} N\left(z_n \mid z_{pa(n)}, 1\right)
\end{aligned}
\qquad (3)
$$

where the terms in the first line of eq. (3) represent the likelihood of data in each domain given their corresponding model parameters, the second line represents the likelihood of each model parameter in each domain given the hyper-parameter of its parent in the tree hierarchy of features and the last term goes over the entire tree T except the leaf nodes. Note that in the last term, the hyper-parameters are shared across the domains, so there is no product over d. We perform a MAP estimation for each model parameter as well as the hyper-parameters. Accordingly, the estimates are given as follows:
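$$\lambda_f^{(d)} = z_{pa(f^{(d)})} + \sum_{i=1}^{M_d} \frac{\partial}{\partial \lambda_f^{(d)}} \log P\left(y_i^{(d)} \mid x_i^{(d)}, \Lambda^{(d)}\right)$$

$$z_n = \frac{z_{pa(n)} + \sum_{i \in ch(n)} (\lambda|z)_i}{1 + |ch(n)|} \qquad (4)$$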

where we used the notation $(\lambda|z)_i$ because node i, the child node of n, could be a parameter node or a hyper-parameter node depending on the position of the node n in the hierarchy. Essentially, in this model, the weights of the leaf nodes (model parameters) depend on the log-likelihood as well as the prior weight of their parent. Additionally, the weight of each hyper-parameter node in the tree is computed as the average of all its children nodes and its parent, resulting in a smoothing effect, both up and down the tree.

2.5 An approximate Hierarchical prior model

The Hierarchical prior model is a theoretically well founded model for transfer learning through feature hierarchy. However, our preliminary experiments indicated that its performance on real-life data sets is not as good as expected. Although a more thorough investigation needs to be carried out, our analysis indicates that the main reason for this phenomenon is over-smoothing. In other words, by letting the information propagate from the leaf nodes in the hierarchy all the way to the root node, the model loses its ability to discriminate between its features.

As a solution to this problem, we propose an approximate version of this model that weds ideas from the exact hierarchical prior model and the Chelba model. As with the Chelba prior method in §2.3, this approximate hierarchical method also requires two distinct data sets, one for training the prior and another for tuning the final weights. Unlike Chelba, we smooth the weights of the priors using the feature-tree hierarchy presented in §1.2, like the hierarchical prior model.

For smoothing of each feature weight, we chose to back off in the tree as little as possible until we had a large enough sample of prior data (measured as M, the number of subtrees below the current node) on which to form a reliable estimate of the mean and variance of each feature or class of features. For example, if the tuning data set is as in Sentence 1, but the prior contains no instances of the word Professor, then we would back off and compute the prior mean and variance on the next higher level in the tree. Thus the prior for L.1.Professor would be N(mean(L.1.*), variance(L.1.*)), where mean() and variance() of L.1.* are the sample mean and variance of all the features in the prior dataset that match the pattern L.1.*, or, put another way, all the siblings of L.1.Professor in the feature tree. If fewer than M such siblings exist, we continue backing off, up the tree, until an ancestor with sufficient descendants is found. A detailed description of the approximate hierarchical algorithm is shown in Table 2.

Input: D_source = (X_train^source, Y_train^source)
       D_target = (X_train^target, Y_train^target);
       Feature sets F^source, F^target;
       Feature hierarchies H_source, H_target;
       Minimum membership size M
Train CRF using D_source to obtain feature weights Λ^source
For each feature f ∈ F^target
    Initialize: node n = f
    While (n ∉ H_source or |Leaves(H_source(n))| < M) and n ≠ root(H_target)
        n ← Pa(H_target(n))
    Compute µ_f and σ_f using the sample {λ_i^source | i ∈ Leaves(H_source(n))}
Train Gaussian prior CRF using D_target as data and {µ_f} and {σ_f} as Gaussian prior parameters.
Output: Parameters of the new CRF Λ^target.

Table 2: Algorithm for approximate hierarchical prior: Pa(H_source(n)) is the parent of node n in feature hierarchy H_source; |Leaves(H_source(n))| indicates the number of leaf nodes (basic features) under a node n in the hierarchy H_source.
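The back-off loop of Table 2 can be sketched in Python as follows. This is an illustrative sketch under several assumptions that are not in the paper: features are dotted strings such as L.1.Professor, both hierarchies are implied by those dotted names (so a node's parent is obtained by dropping its last slot), the root is represented by the empty string, and a unit variance is used when a sample has fewer than two weights. The helper names and example weights are hypothetical, and the CRF training steps are omitted.

```python
from statistics import fmean, variance

ROOT = ""  # hypothetical representation of the root of the feature tree


def parent(node):
    """Drop the last dotted slot: 'L.1.Professor' -> 'L.1', 'L' -> ROOT."""
    return node.rpartition(".")[0]


def leaves_under(node, source_weights):
    """Source weights of all leaf features at or below `node`."""
    if node == ROOT:
        return list(source_weights.values())
    prefix = node + "."
    return [w for f, w in source_weights.items()
            if f == node or f.startswith(prefix)]


def backoff_prior(feature, source_weights, min_size):
    """Return (mean, variance) of the Gaussian prior for one target feature.

    Walks up the tree from the feature until the current node covers at
    least `min_size` source leaves, or the root is reached.
    """
    node = feature
    sample = leaves_under(node, source_weights)
    while len(sample) < min_size and node != ROOT:
        node = parent(node)
        sample = leaves_under(node, source_weights)
    if not sample:
        return 0.0, 1.0  # assumption: empty source model falls back to N(0, 1)
    mean = fmean(sample)
    # Assumption: unit variance when the sample variance is undefined.
    var = variance(sample) if len(sample) > 1 else 1.0
    return mean, var


# Hypothetical source-domain weights.
source_weights = {"L.1.Dr": 2.0, "L.1.Mr": 1.0, "L.2.the": -0.5, "C.1.Caldwell": 0.5}

# 'L.1.Professor' is unseen in the source, so the prior backs off to L.1.*.
prior_professor = backoff_prior("L.1.Professor", source_weights, min_size=2)
```

A practical implementation would probably also place a floor on the variance, since a class whose source weights are all identical would otherwise yield a degenerate prior; that choice is left open here.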

It is important to note that this smoothed tree is an approximation of the exact model presented in §2.4 and thus an important parameter of this method in practice is the degree to which one chooses to smooth up or down the tree. One of the benefits of this model is that the semantics of the hierarchy (how to define a feature, a parent, how and when to back off and up the tree, etc.) can be specified by the user, in reference to the specific datasets and tasks under consideration. For our experiments, the semantics of the tree are as presented in §1.2.

The Chelba method can be thought of as a hierarchical prior in which no smoothing is performed on the tree at all. Only the leaf nodes of the prior's feature tree are considered, and, if no match can be found between the tuning and prior's training datasets' features, an N(0, 1) prior is used instead. However, in the new approximate hierarchical model, even if a certain feature in the tuning dataset does not have an analog in the training dataset, we can always back off until an appropriate match is found, even to the level of the root. Henceforth, we will use only the approximate hierarchical model in our experiments and discussion.


3 Investigation

3.1 Data, domains and tasks

For our experiments, we have chosen five different corpora (summarized in Table 3). Although each corpus can be considered its own domain (due to variations in annotation standards, specific task, date of collection, etc.), they can also be roughly grouped into three different genres. These are: abstracts from biological journals [UT (Bunescu et al., 2004), Yapex (Franzén et al., 2002)]; news articles [MUC6 (Fisher et al., 1995), MUC7 (Borthwick et al., 1998)]; and personal e-mails [CSPACE (Kraut et al., 2004)]. Each corpus, depending on its genre, is labeled with one of two name-finding tasks:

· protein names in biological abstracts
· person names in news articles and e-mails

Corpus    Genre     Task
UTexas    Bio       Protein
Yapex     Bio       Protein
MUC6      News      Person
MUC7      News      Person
CSPACE    E-mail    Person

Table 3: Summary of data used in experiments

We chose this array of corpora so that we could evaluate our hierarchical prior's ability to generalize across and incorporate information from a variety of domains, genres and tasks. In each case, each item (abstract, article or e-mail) was tokenized and each token was hand-labeled as either being part of a name (protein or person) or not, respectively. We used a standard natural language toolkit (Cohen, 2004) to compute tens of thousands of binary features on each of these tokens, encoding such information as capitalization patterns and contextual information from surrounding words. This toolkit produces features of the type described in §1.2 and thus was amenable to our hierarchical prior model. In particular, we chose to use the simplest default, out-of-the-box feature generator and purposefully did not use specifically engineered features, dictionaries, or other techniques commonly employed to boost performance on such tasks. The goal of our experiments was to see to what degree named entity recognition problems naturally conformed to hierarchical methods, and not just to achieve the highest performance possible.

[Figure 3 plot: "Intra-genre transfer performance evaluated on MUC6". y-axis: F1 (0 to 0.7); x-axis: percent of target-domain data used for tuning (0 to 100). Lines: (a) GAUSS: tuned on MUC6; (b) CAT: tuned on MUC6+7; (c) HIER: MUC6+7 prior, tuned on MUC6; (d) CHELBA: MUC6+7 prior, tuned on MUC6.]

Figure 3: Adding a relevant HIER prior helps compared to the GAUSS baseline ((c) > (a)), while simply CAT'ing or using CHELBA can hurt ((d) ≈ (b) < (a), except with very little data), and never beats HIER ((c) > (b) ≈ (d)).

3.2 Experiments & results

We evaluated the performance of various transfer learning methods on the data and tasks described in §3.1. Specifically, we compared our approximate hierarchical prior model (HIER), implemented as a CRF, against three baselines:

· GAUSS: CRF model tuned on a single domain's data, using a standard N(0, 1) prior
· CAT: CRF model tuned on a concatenation of multiple domains' data, using an N(0, 1) prior
· CHELBA: CRF model tuned on one domain's data, using a prior trained on a different, related domain's data (cf. §2.3)

We use token-level F1 as our main evaluation measure, combining precision and recall into one metric.

3.2.1 Intra-genre, same-task transfer learning

Figure 3 shows the results of an experiment in learning to recognize person names in MUC6 news articles. In this experiment we examined the effect of adding extra data from a different, but related domain from the same genre, namely, MUC7. Line a shows the F1 performance of a CRF model tuned only on the target MUC6 domain (GAUSS) across a range of tuning data sizes. Line b shows the same experiment, but this time the CRF model has been tuned on a dataset comprised of a simple concatenation of the training MUC6 data from (a), along with a different training set from MUC7 (CAT). We can see that adding extra data in this way, though


[Figure 4 plot: "Inter-genre transfer performance evaluated on MUC6". y-axis: F1 (0 to 1); x-axis: percent of target-domain data used for tuning (0 to 100). Lines: (e) HIER: MUC6+7 prior, tuned on MUC6; (f) CAT: tuned on all domains; (g) HIER: all domains prior, tuned on MUC6; (h) CHELBA: all domains prior, tuned on MUC6.]

Figure 4: Transfer aware priors CHELBA and HIER effectively filter irrelevant data. Adding more irrelevant data to the priors doesn't hurt ((e) ≈ (g) ≈ (h)), while simply CAT'ing it, in this case, is disastrous ((f) << (e)).

the data is closely related both in domain and task, has actually hurt the performance of our recognizer for moderate to large training sizes. This is most likely because, although the MUC6 and MUC7 datasets are closely related, they are still drawn from different distributions and thus cannot be intermingled indiscriminately. Line c shows the same combination of MUC6 and MUC7, only this time the datasets have been combined using the HIER prior. In this case, the performance actually does improve, both with respect to the single-dataset trained baseline (a) and the naively trained double-dataset (b). Finally, line d shows the results of the CHELBA prior. Curiously, though the domains are closely related, it does more poorly than even the non-transfer GAUSS. One possible explanation is that, although much of the vocabulary is shared across domains, the interpretation of the features of these words may differ. Since CHELBA doesn't model the hierarchy among features like HIER, it is unable to smooth away these discrepancies. In contrast, we see that our HIER prior is able to successfully combine the relevant parts of data across domains while filtering the irrelevant, and possibly detrimental, ones. This experiment was repeated for other sets of intra-genre tasks, and the results are summarized in §3.2.3.

3.2.2 Inter-genre, multi-task transfer learning

In Figure 4 we see that the properties of the hierarchical prior hold even when transferring across

tasks. Here again we are trying to learn to recognize person names in MUC6 news articles, but this time, instead of adding only other datasets similarly labeled with person names, we are additionally adding biological corpora (UT & YAPEX), labeled not with person names but with protein names instead, along with the CSPACE e-mail and MUC7 news article corpora. The robustness of our prior prevents a model trained on all five domains (g) from degrading away from the intra-genre, same-task baseline (e), unlike the model trained on concatenated data (f). CHELBA (h) performs similarly well in this case, perhaps because the domains are so different that almost none of the features match between prior and tuning data, and thus CHELBA backs off to a standard N(0, 1) prior.

This robustness in the face of less similarly related data is very important since these types of transfer methods are most useful when one possesses only very little target domain data. In this situation, it is often difficult to accurately estimate performance and so one would like assurance that any transfer method being applied will not have negative effects.

3.2.3 Comparison of HIER prior to baselines

Each scatter plot in Figure 5 shows the relative performance of a baseline method against HIER. Each point represents the results of two experiments: the y-coordinate is the F1 score of the baseline method (shown on the y-axis), while the x-coordinate represents the score of the HIER method in the same experiment. Thus, points lying below the y = x line represent experiments for which HIER received a higher F1 value than did the baseline.

While all three plots show HIER outperforming each of the three baselines, not surprisingly, the non-transfer GAUSS method suffers the worst, followed by the naive concatenation (CAT) baseline. Both methods fail to make any explicit distinction between the source and target domains and thus suffer when the domains differ even slightly from each other. Although the differences are more subtle, the right-most plot of Figure 5 suggests HIER is likewise able to outperform the non-hierarchical CHELBA prior in certain transfer scenarios. CHELBA is able to avoid suffering as much as the other baselines when faced with large differences between domains, but is still unable to capture

[Figure 5 plots: three scatter plots of baseline F1 on the y-axis (GAUSS, CAT and CHELBA respectively) against HIER F1 on the x-axis, each with the y = x line. Legend: MUC6@3%, MUC6@6%, MUC6@13%, MUC6@25%, MUC6@50%, MUC6@100%, CSPACE@3%, CSPACE@6%, CSPACE@13%, CSPACE@25%, CSPACE@50%, CSPACE@100%.]

Figure 5: Comparative performance of baseline methods (GAUSS, CAT, CHELBA) vs. HIER prior, as trained on nine prior datasets (both pure and concatenated) of various sample sizes, evaluated on MUC6 and CSPACE datasets. Points below the y = x line indicate HIER outperforming baselines.

as many dependencies between domains as HIER.
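For reference, the token-level F1 measure used throughout §3.2 can be sketched as below. This is a minimal sketch assuming the (O)utside/(I)nside label scheme of Sentence 1 in §1.2; the function name and the example prediction are hypothetical and are not taken from the experiments.

```python
def token_f1(gold, predicted, positive="I"):
    """Token-level F1: harmonic mean of precision and recall on `positive` tokens."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Gold labels of Sentence 1, and a hypothetical prediction that also
# marks "Professor" as inside a name.
gold = ["O", "O", "O", "O", "O", "I"]
predicted = ["O", "O", "O", "O", "I", "I"]
score = token_f1(gold, predicted)  # precision 1/2, recall 1
```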

4 Conclusions, related & future work

In this work we have introduced hierarchical feature tree priors for use in transfer learning on named entity extraction tasks. We have provided evidence that motivates these models on intuitive, theoretical and empirical grounds, and have gone on to demonstrate their effectiveness in relation to other, competitive transfer methods. Specifically, we have shown that hierarchical priors allow the user enough flexibility to customize their semantics to a specific problem, while providing enough structure to resist unintended negative effects when used inappropriately. Thus hierarchical priors seem a natural, effective and robust choice for transferring learning across NER datasets and tasks.

Some of the first formulations of the transfer learning problem were presented over 10 years ago (Thrun, 1996; Baxter, 1997). Other techniques have tried to quantify the generalizability of certain features across domains (Daumé III and Marcu, 2006; Jiang and Zhai, 2006), or tried to exploit the common structure of related problems (Ben-David et al., 2007; Schölkopf et al., 2005). Most of this prior work deals with supervised transfer learning, and thus requires labeled source domain data, though there are examples of unsupervised (Arnold et al., 2007), semi-supervised (Grandvalet and Bengio, 2005; Blitzer et al., 2006), and transductive approaches (Taskar et al., 2003).

Recent work using so-called meta-level priors to transfer information across tasks (Lee et al., 2007), while related, does not take into explicit account the hierarchical structure of these meta-level features often found in NLP tasks. Daumé allows an extra degree of freedom among the features of his domains, implicitly creating a two-level feature hierarchy with one branch for general features, and another for domain specific ones, but does not extend his hierarchy further (Daumé III, 2007). Similarly, work on hierarchical penalization (Szafranski et al., 2007) in two-level trees tries to produce models that rely only on a relatively small number of groups of variables, as structured by the tree, as opposed to transferring knowledge between branches themselves.

Our future work is focused on designing an algorithm to optimally choose a smoothing regime for the learned feature trees so as to better exploit the similarities between domains while neutralizing their differences. Along these lines, we are working on methods to reduce the amount of labeled target domain data needed to tune the prior-based models, looking forward to semi-supervised and unsupervised transfer methods.


References
Rie K. Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. In JMLR 6, pages 1817–1853.

Andrew Arnold, Ramesh Nallapati, and William W. Cohen. 2007. A comparative study of methods for transductive transfer learning. In Proceedings of the IEEE International Conference on Data Mining (ICDM) 2007 Workshop on Mining and Management of Biological Data.

Jonathan Baxter. 1997. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. 2007. Analysis of representations for domain adaptation. In NIPS 20, Cambridge, MA. MIT Press.

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In EMNLP, Sydney, Australia.

A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. 1998. NYU: Description of the MENE named entity system as used in MUC-7.

R. Bunescu, R. Ge, R. Kate, E. Marcotte, R. Mooney, A. Ramani, and Y. Wong. 2004. Comparative experiments on learning information extractors for proteins and their interactions. In Journal of AI in Medicine. Data from ftp://ftp.cs.utexas.edu/pub/mooney/biodata/proteins.tar.gz.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Ciprian Chelba and Alex Acero. 2004. Adaptation of maximum entropy capitalizer: Little data can help a lot. In Dekang Lin and Dekai Wu, editors, EMNLP 2004, pages 285–292. ACL.

S. Chen and R. Rosenfeld. 1999. A gaussian prior for smoothing maximum entropy models.

William W. Cohen. 2004. Minorthird: Methods for identifying names and ontological relations in text using heuristics for inducing regularities from data. http://minorthird.sourceforge.net.

Hal Daumé III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. In Journal of Artificial Intelligence Research 26, pages 101–126.

Hal Daumé III. 2007. Frustratingly easy domain adaptation. In ACL.

David Fisher, Stephen Soderland, Joseph McCarthy, Fangfang Feng, and Wendy Lehnert. 1995. Description of the UMass system as used for MUC-6.

Kristofer Franzén, Gunnar Eriksson, Fredrik Olsson, Lars Asker, Per Lidén, and Joakim Cöster. 2002. Protein names and how to find them. In International Journal of Medical Informatics.

Yves Grandvalet and Yoshua Bengio. 2005. Semi-supervised learning by entropy minimization. In CAP, Nice, France.

Jing Jiang and ChengXiang Zhai. 2006. Exploiting domain structure for named entity recognition. In Human Language Technology Conference, pages 74–81.

R. Kraut, S. Fussell, F. Lerch, and J. Espinosa. 2004. Coordination in teams: evidence from a simulated management game.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA.

S.-I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller. 2007. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of International Conference on Machine Learning (ICML).

Einat Minkov, Richard C. Wang, and William W. Cohen. 2005. Extracting personal names from email: Applying named entity recognition to informal text. In HLT/EMNLP.

Rajat Raina, Andrew Y. Ng, and Daphne Koller. 2006. Transfer learning by constructing informative priors. In ICML 22.

Bernhard Schölkopf, Florian Steinke, and Volker Blanz. 2005. Object correspondence as a machine learning problem. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 776–783, New York, NY, USA. ACM.

Charles Sutton and Andrew McCallum. 2005. Composition of conditional random fields for transfer learning. In HLT/EMNLP.

M. Szafranski, Y. Grandvalet, and P. Morizet-Mahoudeaux. 2007. Hierarchical penalization. In Advances in Neural Information Processing Systems 20. MIT Press.

B. Taskar, M.-F. Wong, and D. Koller. 2003. Learning on the test data: Leveraging `unseen' features. In Proc. Twentieth International Conference on Machine Learning (ICML).

Sebastian Thrun. 1996. Is learning the n-th thing any easier than learning the first? In NIPS, volume 8, pages 640–646. MIT.

J. Zhang, Z. Ghahramani, and Y. Yang. 2005. Learning multiple related tasks using latent independent component analysis.