Estimating Class Priors in Domain Adaptation for Word Sense Disambiguation

Yee Seng Chan and Hwee Tou Ng
Department of Computer Science, National University of Singapore
3 Science Drive 2, Singapore 117543
{chanys, nght}@comp.nus.edu.sg

Abstract

Instances of a word drawn from different domains may have different sense priors (the proportions of the different senses of a word). This in turn affects the accuracy of word sense disambiguation (WSD) systems trained and applied on different domains. This paper presents a method to estimate the sense priors of words drawn from a new domain, and highlights the importance of using well calibrated probabilities when performing these estimations. By using well calibrated probabilities, we are able to estimate the sense priors effectively and achieve significant improvements in WSD accuracy.

1 Introduction

Many words have multiple meanings, and the process of identifying the correct meaning, or sense, of a word in context is known as word sense disambiguation (WSD). Among the various approaches to WSD, corpus-based supervised machine learning methods have been the most successful to date. With this approach, one needs to obtain a corpus in which each ambiguous word has been manually annotated with the correct sense, to serve as training data.

However, supervised WSD systems face an important issue of domain dependence when using such a corpus-based approach. To investigate this, Escudero et al. (2000) conducted experiments using the DSO corpus, which contains sentences drawn from two different corpora, namely the Brown Corpus (BC) and the Wall Street Journal (WSJ). They found that training a WSD system on one part (BC or WSJ) of the DSO corpus and applying it to the other part can result in an accuracy drop of 12% to 19%. One reason for this is the difference in sense priors (i.e., the proportions of the different senses of a word) between BC and WSJ. For instance, the noun interest has these 6 senses in the DSO corpus: senses 1, 2, 3, 4, 5, and 8. In the BC part of the DSO corpus, these senses occur with the proportions 34%, 9%, 16%, 14%, 12%, and 15%. In the WSJ part of the DSO corpus, however, the proportions are different: 13%, 4%, 3%, 56%, 22%, and 2%. When the authors assumed they knew the sense priors of each word in BC and WSJ, and adjusted these two datasets such that the proportions of the different senses of each word were the same between BC and WSJ, accuracy improved by 9%. In another work, Agirre and Martinez (2004) trained a WSD system on data which was automatically gathered from the Internet. The authors reported a 14% improvement in accuracy when they had an accurate estimate of the sense priors in the evaluation data and sampled their training data according to these sense priors. The work of these researchers showed that when the domain of the training data differs from the domain of the data on which the system is applied, there will be a decrease in WSD accuracy.

To build WSD systems that are portable across different domains, estimating the sense priors (i.e., determining the proportions of the different senses of a word) occurring in a text corpus drawn from a domain is important. McCarthy et al. (2004) provided a partial solution by describing a method to predict the predominant sense, or the most frequent sense, of a word in a corpus.
Using the noun interest as an example, their method will try to predict that sense 1 is the predominant sense in the BC part of the DSO corpus, while sense 4 is the predominant sense in the WSJ part of the corpus.

In our recent work (Chan and Ng, 2005b), we directly addressed the problem by applying machine learning methods to automatically estimate the sense priors in the target domain. For instance, given the noun interest and the WSJ part of the DSO corpus, we attempt to estimate the proportion of each sense of interest occurring in WSJ, and we showed that these estimates help to improve WSD accuracy. In that work, we used naive Bayes as the training algorithm to provide posterior probabilities, or class membership estimates, for the instances in the target domain. These probabilities were then used by the machine learning methods to estimate the sense priors of each word in the target domain.

However, it is known that the posterior probabilities assigned by naive Bayes are not reliable, or not well calibrated (Domingos and Pazzani, 1996). These probabilities are typically too extreme, often being very near 0 or 1. Since these probabilities are used in estimating the sense priors, it is important that they are well calibrated.

In this paper, we explore the estimation of sense priors by first calibrating the probabilities from naive Bayes. We also propose using probabilities from another algorithm (logistic regression, which already gives well calibrated probabilities) to estimate the sense priors. We show that by using well calibrated probabilities, we can estimate the sense priors more effectively. Using these estimates improves WSD accuracy, and we achieve results that are significantly better than those of our earlier approach described in (Chan and Ng, 2005b).

In the following section, we describe the algorithm to estimate the sense priors.
Then, we describe the notion of being well calibrated and discuss why using well calibrated probabilities helps in estimating the sense priors. Next, we describe an algorithm to calibrate the probability estimates from naive Bayes. Then, we discuss the corpora and the set of words we use for our experiments, before presenting our experimental results. Next, we propose using the well calibrated probabilities of logistic regression to estimate the sense priors, and perform significance tests to compare our various results before concluding.

2 Estimation of Priors

To estimate the sense priors, or a priori probabilities, of the different senses in a new dataset, we used a confusion matrix algorithm (Vucetic and Obradovic, 2001) and an EM based algorithm (Saerens et al., 2002) in (Chan and Ng, 2005b). Our results in (Chan and Ng, 2005b) indicate that the EM based algorithm is effective in estimating the sense priors and achieves greater improvements in WSD accuracy compared to the confusion matrix algorithm. Hence, to estimate the sense priors in our current work, we use the EM based algorithm, which we describe in this section.

2.1 EM Based Algorithm

Most of this section is based on (Saerens et al., 2002). Assume we have a set of labeled data D_L with n classes and a set of N independent instances (x_1, \ldots, x_N) from a new data set D_N. The likelihood of these N instances can be defined as:

  L(x_1, \ldots, x_N) = \prod_{k=1}^{N} p(x_k) = \prod_{k=1}^{N} \Big[ \sum_{i=1}^{n} p(x_k \mid \omega_i) \, p(\omega_i) \Big]   (1)

Assuming the within-class densities p(x_k \mid \omega_i), i.e., the probabilities of observing x_k given the class \omega_i, do not change from the training set D_L to the new data set D_N, we can define p(x_k \mid \omega_i) = p_L(x_k \mid \omega_i). To determine the a priori probability estimates \hat{p}(\omega_i) of the new data set that will maximize the likelihood of (1), we can apply the iterative procedure of the EM algorithm. In effect, through maximizing the likelihood of (1), we obtain the a priori probability estimates as a by-product.

Let us now define some notation. When we apply a classifier trained on D_L to an instance x_k drawn from the new data set D_N, we get \hat{p}_L(\omega_i \mid x_k), which we define as the probability of instance x_k being classified as class \omega_i by the classifier trained on D_L. Further, let us define \hat{p}_L(\omega_i) as the a priori probability of class \omega_i in D_L; this can be estimated by the class frequency of \omega_i in D_L. We also define \hat{p}^{(s)}(\omega_i) and \hat{p}^{(s)}(\omega_i \mid x_k) as estimates of the new a priori and a posteriori probabilities at step s of the iterative EM procedure. Assuming we initialize \hat{p}^{(0)}(\omega_i) = \hat{p}_L(\omega_i), then for each instance x_k in D_N and each class \omega_i, the EM algorithm provides the following iterative steps:

  \hat{p}^{(s)}(\omega_i \mid x_k) = \frac{ \hat{p}_L(\omega_i \mid x_k) \, \hat{p}^{(s)}(\omega_i) / \hat{p}_L(\omega_i) }{ \sum_{j=1}^{n} \hat{p}_L(\omega_j \mid x_k) \, \hat{p}^{(s)}(\omega_j) / \hat{p}_L(\omega_j) }   (2)

  \hat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \sum_{k=1}^{N} \hat{p}^{(s)}(\omega_i \mid x_k)   (3)

where Equation (2) represents the expectation E-step, Equation (3) represents the maximization M-step, and N represents the number of instances in D_N. Note that the probabilities \hat{p}_L(\omega_i \mid x_k) and \hat{p}_L(\omega_i) in Equation (2) will stay the same throughout the iterations for each particular instance x_k and class \omega_i. The new a posteriori probabilities \hat{p}^{(s)}(\omega_i \mid x_k) at step s in Equation (2) are simply the a posteriori probabilities obtained under the conditions of the labeled data, weighted by the ratio of the new priors \hat{p}^{(s)}(\omega_i) to the old priors \hat{p}_L(\omega_i). The denominator in Equation (2) is simply a normalizing factor. The a posteriori probabilities \hat{p}^{(s)}(\omega_i \mid x_k) and a priori probabilities \hat{p}^{(s)}(\omega_i) are re-estimated sequentially during each iteration s for each new instance x_k and each class \omega_i, until the convergence of the estimated probabilities \hat{p}(\omega_i). This iterative procedure will increase the likelihood of (1) at each step.
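To make the procedure concrete, the following is a minimal NumPy sketch of the E-step and M-step above. It assumes the classifier's predictions on the new dataset are available as an N x n matrix; the function name and the convergence tolerance and iteration cap are our own illustrative choices, not prescribed by (Saerens et al., 2002).

    import numpy as np

    def em_estimate_priors(posteriors, train_priors, tol=1e-6, max_iter=1000):
        # posteriors:   (N, n) matrix of p_L(w_i | x_k), the predictions on D_N
        #               of the classifier trained on D_L
        # train_priors: (n,) vector of p_L(w_i), the class frequencies in D_L
        posteriors = np.asarray(posteriors, dtype=float)
        priors = np.asarray(train_priors, dtype=float).copy()  # p^(0)(w_i) = p_L(w_i)
        for _ in range(max_iter):
            # E-step (Eq. 2): weight each posterior by the ratio of the new
            # priors to the old priors, then renormalize over the n classes
            weighted = posteriors * (priors / train_priors)
            weighted /= weighted.sum(axis=1, keepdims=True)
            # M-step (Eq. 3): average the re-estimated a posteriori
            # probabilities over the N new instances
            new_priors = weighted.mean(axis=0)
            if np.abs(new_priors - priors).max() < tol:
                return new_priors
            priors = new_priors
        return priors

The returned vector is the estimate \hat{p}(\omega_i) of the sense priors in the new domain; as noted above, each pass increases the likelihood in Equation (1).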
2.2 Using A Priori Estimates

If a classifier estimates posterior class probabilities \hat{p}_L(\omega_i \mid x_k) when presented with a new instance x_k from D_N, its predictions can be directly adjusted according to the estimated a priori probabilities \hat{p}(\omega_i) on D_N:

  \hat{p}_{adjust}(\omega_i \mid x_k) = \frac{ \hat{p}_L(\omega_i \mid x_k) \, \hat{p}(\omega_i) / \hat{p}_L(\omega_i) }{ \sum_{j=1}^{n} \hat{p}_L(\omega_j \mid x_k) \, \hat{p}(\omega_j) / \hat{p}_L(\omega_j) }   (4)

where \hat{p}_L(\omega_i) denotes the a priori probability of class \omega_i from D_L and \hat{p}_{adjust}(\omega_i \mid x_k) denotes the adjusted predictions.

3 Calibration of Probabilities

In our earlier work (Chan and Ng, 2005b), the posterior probabilities assigned by a naive Bayes classifier are used by the EM procedure described in the previous section to estimate the sense priors \hat{p}(\omega_i) in a new dataset. However, it is known that the posterior probabilities assigned by naive Bayes are not well calibrated (Domingos and Pazzani, 1996).

It is important to use an algorithm which gives well calibrated probabilities if we are to use the probabilities in estimating the sense priors. In this section, we will first describe the notion of being well calibrated before discussing why having well calibrated probabilities helps in estimating the sense priors. Finally, we will introduce a method used to calibrate the probabilities from naive Bayes.

3.1 Well Calibrated Probabilities

Assume that for each instance x, a classifier outputs a probability s(x) between 0 and 1 of x belonging to class \omega_i. The classifier is well calibrated if the empirical class membership probability p(\omega_i \mid s(x) = s) converges to the probability value s as the number of examples classified goes to infinity (Zadrozny and Elkan, 2002). Intuitively, if we consider all the instances to which the classifier assigns a probability s(x) of, say, 0.6, then 60% of these instances should be members of class \omega_i.

3.2 Being Well Calibrated Helps Estimation

To see why using an algorithm which gives well calibrated probabilities helps in estimating the sense priors, let us rewrite Equation (3), the M-step of the EM procedure, as the following:

  \hat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \big( s_1 |B_{s_1}| + \cdots + s_M |B_{s_M}| \big)   (5)

where S = \{s_1, \ldots, s_M\} denotes the set of posterior probability values for class \omega_i. Based on S, we can imagine that we have M bins, where each bin is associated with a specific s_j value. Now, distribute all the instances in the new dataset D_N into the bins according to their posterior probabilities \hat{p}(\omega_i \mid x_k). Let B_{s_j}, for j = 1, \ldots, M, denote the set of instances in bin s_j. Note that |B_{s_1}| + \cdots + |B_{s_M}| = N. Now, let p_{s_j}(\omega_i) denote the proportion of instances with true class label \omega_i in B_{s_j}. Given a well calibrated algorithm, p_{s_j}(\omega_i) = s_j by definition, and Equation (5) can be rewritten as:

  \hat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \big( p_{s_1}(\omega_i) |B_{s_1}| + \cdots + p_{s_M}(\omega_i) |B_{s_M}| \big) = \frac{N_{\omega_i}}{N}   (6)

where N_{\omega_i} denotes the number of instances in D_N with true class label \omega_i. Therefore, \hat{p}^{(s+1)}(\omega_i) reflects the proportion of instances in D_N with true class label \omega_i. Hence, using an algorithm which gives well calibrated probabilities helps in the estimation of sense priors.

3.3 Isotonic Regression

Zadrozny and Elkan (2002) successfully used a method based on isotonic regression (Robertson et al., 1988) to calibrate the probability estimates from naive Bayes. To compute the isotonic regression, they used the pair-adjacent violators (PAV) algorithm (Ayer et al., 1955), which we show in Figure 1. Briefly, what PAV does is to initially view each data value as a level set. While there are two adjacent sets that are out of order (i.e., the left level set is above the right one), the sets are combined and the mean of the data values becomes the value of the new level set.

  Figure 1: PAV algorithm.
  Input: training set (f(x_1), y_1), ..., (f(x_N), y_N), sorted in ascending order of f(x_i)
  Initialize: view each y_i as a separate level set with value m = y_i
  While there are two adjacent level sets that are out of order:
      combine the two sets into one, whose value m is the mean of their data values
  Output: a series of increasing step-values m, each associated with a lowest boundary
      value f(x_i) and a highest boundary value f(x_j)

PAV works on binary class problems. In a binary class problem, we have a positive class and a negative class. Now, let (f(x_i), y_i), for i = 1, \ldots, N, represent N examples, where f(x_i) is the probability of x_i belonging to the positive class, as predicted by a classifier, and y_i represents the true label of x_i. For a binary class problem, we let y_i = 1 if x_i is a positive example and y_i = 0 if x_i is a negative example. The PAV algorithm takes in the set of (f(x_i), y_i), sorted in ascending order of f(x_i), and returns a series of increasing step-values, where each step-value (denoted by m in Figure 1) is associated with a lowest boundary value f(x_i) and a highest boundary value f(x_j). We performed 10-fold cross-validation on the training data to assign values to f(x_i). We then applied the PAV algorithm to obtain values for m. To obtain the calibrated probability estimate for a test instance x, we find the boundary values f(x_i) and f(x_j) where f(x_i) \leq f(x) \leq f(x_j) and assign m as the calibrated probability estimate.

To apply PAV on a multiclass problem, we first reduce the problem into a number of binary class problems. For reducing a multiclass problem into a set of binary class problems, experiments in (Zadrozny and Elkan, 2002) suggest that the one-against-all approach works well. In one-against-all, a separate classifier is trained for each class \omega_i, where examples belonging to class \omega_i are treated as positive examples and all other examples are treated as negative examples. A separate classifier is then learnt for each binary class problem, and the probability estimates from each classifier are calibrated. Finally, the calibrated binary-class probability estimates are combined to obtain multiclass probabilities, computed by a simple normalization of the calibrated estimates from each binary classifier, as suggested by Zadrozny and Elkan (2002).
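As an illustration of Section 3.3, here is a compact sketch of PAV (in its standard stack-based formulation), the boundary-value lookup for a test instance, and the one-against-all normalization. The function and variable names are ours, and this is a sketch of the technique rather than the exact implementation used in our experiments.

    import numpy as np

    def pav(scores, labels):
        # scores: f(x_i), predicted positive-class probabilities
        # labels: y_i, 1 for positive examples and 0 for negative examples
        order = np.argsort(scores)
        s, y = np.asarray(scores)[order], np.asarray(labels)[order].astype(float)
        vals, wts = [], []            # level-set values and level-set sizes
        for v in y:                   # each data value starts as its own level set
            vals.append(v)
            wts.append(1)
            # while two adjacent level sets are out of order, combine them and
            # take the mean of their data values as the new level-set value
            while len(vals) > 1 and vals[-2] > vals[-1]:
                w = wts[-2] + wts[-1]
                m = (vals[-2] * wts[-2] + vals[-1] * wts[-1]) / w
                vals[-2:], wts[-2:] = [m], [w]
        return s, np.repeat(vals, wts)   # an increasing step-value per example

    def calibrated_estimate(score, sorted_scores, steps):
        # assign the step-value m whose boundary values bracket the test score f(x)
        idx = np.searchsorted(sorted_scores, score, side="right") - 1
        return steps[np.clip(idx, 0, len(steps) - 1)]

    def one_against_all(calibrated):
        # combine per-class calibrated estimates into multiclass probabilities
        # by simple normalization, as suggested by Zadrozny and Elkan (2002)
        return calibrated / calibrated.sum(axis=1, keepdims=True)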
4 Selection of Dataset

In this section, we discuss the motivations in choosing the particular corpora and the set of words used in our experiments.

4.1 DSO Corpus

The DSO corpus (Ng and Lee, 1996) contains 192,800 annotated examples for 121 nouns and 70 verbs, drawn from BC and WSJ. BC was built as a balanced corpus and contains texts in various categories such as religion, fiction, etc. In contrast, the focus of the WSJ corpus is on financial and business news. Escudero et al. (2000) exploited the difference in coverage between these two corpora to separate the DSO corpus into its BC and WSJ parts for investigating the domain dependence of several WSD algorithms. Following their setup, we also use the DSO corpus in our experiments.

The widely used SEMCOR (SC) corpus (Miller et al., 1994) is one of the few currently available manually sense-annotated corpora for WSD. SEMCOR is a subset of BC. Since BC is a balanced corpus, and since training a classifier on a general corpus before applying it to a more specific corpus is a natural scenario, we use examples from BC as training data, and examples from WSJ as evaluation data, or the target dataset.

4.2 Parallel Texts

Scalability is a problem faced by current supervised WSD systems, as they usually rely on manually annotated data for training. To tackle this problem, in recent work (Ng et al., 2003), we gathered training data from parallel texts and obtained encouraging results in our evaluation on the nouns of the SENSEVAL-2 English lexical sample task (Kilgarriff, 2001). In another recent evaluation on the nouns of the SENSEVAL-2 English all-words task (Chan and Ng, 2005a), promising results were also achieved using examples gathered from parallel texts. Due to the potential of parallel texts in addressing the issue of scalability, we also drew training data for our earlier sense priors estimation experiments (Chan and Ng, 2005b) from parallel texts. In addition, our parallel texts training data represents a natural domain difference with the test data of the SENSEVAL-2 English lexical sample task, of which 91% is drawn from the British National Corpus (BNC).
As part of our experiments, we followed the experimental setup of our earlier work (Chan and Ng, 2005b), using the same 6 English-Chinese parallel corpora (Hong Kong Hansards, Hong Kong News, Hong Kong Laws, Sinorama, Xinhua News, and English translation of Chinese Treebank), available from the Linguistic Data Consortium. To gather training examples from these parallel texts, we used the approach we described in (Ng et al., 2003) and (Chan and Ng, 2005b). We then evaluated our estimation of sense priors on the nouns of the SENSEVAL-2 English lexical sample task, similar to the evaluation we conducted in (Chan and Ng, 2005b). Since the test data for the nouns of the SENSEVAL-3 English lexical sample task (Mihalcea et al., 2004) were also drawn from the BNC and represent a difference in domain from the parallel texts we used, we also expanded our evaluation to these SENSEVAL-3 nouns.

4.3 Choice of Words

Research by McCarthy et al. (2004) highlighted that the sense priors of a word in a corpus depend on the domain from which the corpus is drawn. A change of predominant sense is often indicative of a change in domain, as different corpora drawn from different domains usually give different predominant senses. For example, the predominant sense of the noun interest in the BC part of the DSO corpus has the meaning "a sense of concern with and curiosity about someone or something". In the WSJ part of the DSO corpus, the noun interest has a different predominant sense, with the meaning "a fixed charge for borrowing money", reflecting the business and finance focus of the WSJ corpus.

Estimation of sense priors is important when there is a significant change in sense priors between the training and target datasets, such as when there is a change in domain between the datasets. Hence, in our experiments involving the DSO corpus, we focused on the set of nouns and verbs which had different predominant senses between the BC and WSJ parts of the corpus. This gave us a set of 37 nouns and 28 verbs. For experiments involving the nouns of the SENSEVAL-2 and SENSEVAL-3 English lexical sample tasks, we used the approach we described in (Chan and Ng, 2005b) of sampling training examples from the parallel texts using the natural (empirical) distribution of examples in the parallel texts. Then, we focused on the set of nouns having different predominant senses between the examples gathered from parallel texts and the evaluation data for the two SENSEVAL tasks. This gave a set of 6 nouns for SENSEVAL-2 and 9 nouns for SENSEVAL-3. For each noun, we gathered a maximum of 500 parallel text examples as training data, similar to what we had done in (Chan and Ng, 2005b).

5 Experimental Results

Similar to our previous work (Chan and Ng, 2005b), we used the supervised WSD approach described in (Lee and Ng, 2002) for our experiments, with the naive Bayes algorithm as our classifier. Knowledge sources used include parts-of-speech, surrounding words, and local collocations. This approach achieves state-of-the-art accuracy. All accuracies reported in our experiments are micro-averages over all test examples.

In (Chan and Ng, 2005b), we used a multiclass naive Bayes classifier (denoted by NB) for each word. Following this approach, we note the WSD accuracies achieved without any adjustment in the column L under NB in Table 1. The predictions \hat{p}_L(\omega_i \mid x_k) of these naive Bayes classifiers are then used in Equations (2) and (3) to estimate the sense priors \hat{p}(\omega_i), before being adjusted by these estimated sense priors based on Equation (4).
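For concreteness, the adjustment step of Equation (4) can be sketched as follows; the three-sense posterior and prior values below are invented purely for illustration.

    import numpy as np

    def adjust(posteriors, train_priors, est_priors):
        # Eq. (4): scale each posterior by the ratio of the estimated priors
        # to the training priors, then renormalize each row to sum to 1
        adjusted = posteriors * (est_priors / train_priors)
        return adjusted / adjusted.sum(axis=1, keepdims=True)

    # hypothetical 3-sense instance: the classifier mildly prefers sense 1,
    # but the estimated target-domain priors strongly favor sense 2
    p_L = np.array([[0.40, 0.35, 0.25]])
    train_priors = np.array([0.50, 0.30, 0.20])
    est_priors = np.array([0.20, 0.60, 0.20])
    print(adjust(p_L, train_priors, est_priors))  # ~[0.14, 0.63, 0.23]: sense 2 wins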
The resulting WSD accuracies after adjustment are listed in the column EM_NB in Table 1, representing the WSD accuracies achievable by following the approach we described in (Chan and Ng, 2005b).

Next, we used the one-against-all approach to reduce each multiclass problem into a set of binary class problems. We trained a naive Bayes classifier for each binary problem and calibrated the probabilities from these binary classifiers. The WSD accuracies of these calibrated naive Bayes classifiers (denoted by NBcal) are given in the column L under NBcal.¹ The predictions of these classifiers are then used to estimate the sense priors \hat{p}(\omega_i), before being adjusted by these estimates based on Equation (4). The resulting WSD accuracies after adjustment are listed in the column EM_NBcal in Table 1.

¹ Though not shown, we also calculated the accuracies of these binary classifiers without calibration, and found them to be similar to the accuracies of the multiclass naive Bayes shown in the column L under NB in Table 1.

Table 1: Micro-averaged WSD accuracies using the various methods. The different naive Bayes classifiers are: multiclass naive Bayes (NB) and naive Bayes with calibrated probabilities (NBcal).

                        NB                          NBcal
  Dataset     L      EM_NB   EM_LR      L      EM_NBcal   EM_LR
  DSO nouns   44.5   46.1    46.6       45.8   47.0       51.1
  DSO verbs   46.7   48.3    48.7       46.9   49.5       50.8
  SE2 nouns   61.7   62.4    63.0       62.3   63.2       63.5
  SE3 nouns   53.9   54.9    55.7       55.4   58.8       58.4

The results show that calibrating the probabilities improves WSD accuracy. In particular, EM_NBcal achieves the highest accuracy among the methods described so far. To provide a basis for comparison, we also adjusted the calibrated probabilities by the true sense priors p(\omega_i) of the test data. The increase in WSD accuracy thus obtained is given in the column True-L in Table 2. Note that this represents the maximum possible increase in accuracy achievable, provided we know these true sense priors. In the column EM_NBcal-L in Table 2, we list the increase in WSD accuracy when the calibrated probabilities are adjusted by the sense priors \hat{p}(\omega_i) which were automatically estimated using the EM procedure. The relative improvements obtained with using the estimated sense priors (compared against using the true sense priors) are given as percentages in brackets. As an example, according to Table 1 for the DSO verbs, EM_NBcal gives an improvement of 49.5% - 46.9% = 2.6% in WSD accuracy, and the relative improvement compared to using the true sense priors is 2.6/10.3 = 25.2%, as shown in Table 2.

Table 2: Relative accuracy improvement based on calibrated probabilities.

  Dataset     True-L   EM_NBcal-L     EM_LR-L
  DSO nouns   11.6     1.2 (10.3%)    5.3 (45.7%)
  DSO verbs   10.3     2.6 (25.2%)    3.9 (37.9%)
  SE2 nouns   3.0      0.9 (30.0%)    1.2 (40.0%)
  SE3 nouns   3.7      3.4 (91.9%)    3.0 (81.1%)

6 Discussion

The experimental results show that the sense priors estimated using the calibrated probabilities of naive Bayes are effective in increasing the WSD accuracy. However, using a learning algorithm which already gives well calibrated posterior probabilities may be more effective in estimating the sense priors. One possible algorithm is logistic regression, which directly optimizes for getting approximations of the posterior probabilities; hence, its probability estimates are already well calibrated (Zhang and Yang, 2004; Niculescu-Mizil and Caruana, 2005). In the rest of this section, we first conduct experiments to estimate sense priors using the predictions of logistic regression. Then, we perform significance tests to compare the various methods.

6.1 Using Logistic Regression

We trained logistic regression classifiers and evaluated them on the 4 datasets. However, the WSD accuracies of these unadjusted logistic regression classifiers are on average about 4% lower than those of the unadjusted naive Bayes classifiers. One possible reason is that, being a discriminative learner, logistic regression requires more training examples for its performance to catch up to, and possibly overtake, the generative naive Bayes learner (Ng and Jordan, 2001).
Although the accuracy of logistic regression as a basic classifier is lower than that of naive Bayes, its predictions may still be suitable for estimating sense priors. To gauge how well the sense priors are estimated, we measure the KL divergence between the true sense priors and the sense priors estimated by using the predictions of (uncalibrated) multiclass naive Bayes, calibrated naive Bayes, and logistic regression. These results are shown in Table 3; the column EM_LR shows that using the predictions of logistic regression to estimate sense priors consistently gives the lowest KL divergence.

Table 3: KL divergence between the true and estimated sense distributions.

  Dataset     EM_NB   EM_NBcal   EM_LR
  DSO nouns   0.621   0.586      0.293
  DSO verbs   0.651   0.602      0.307
  SE2 nouns   0.371   0.307      0.214
  SE3 nouns   0.693   0.632      0.408

The results of the KL divergence test motivate us to use the sense priors estimated by logistic regression on the predictions of the naive Bayes classifiers. To elaborate, we first use the probability estimates of logistic regression in Equations (2) and (3) to estimate the sense priors \hat{p}(\omega_i). These estimates and the predictions of the calibrated naive Bayes classifier are then used in Equation (4) to obtain the adjusted predictions. The resulting WSD accuracy is shown in the column EM_LR under NBcal in Table 1. Corresponding results when the predictions of the multiclass naive Bayes are used in Equation (4) are given in the column EM_LR under NB. The relative improvements against using the true sense priors, based on the calibrated probabilities, are given in the column EM_LR-L in Table 2. The results show that the sense priors provided by logistic regression are in general effective in further improving the results. In the case of DSO nouns, this improvement is especially significant.

6.2 Significance Test

Paired t-tests were conducted to see if one method is significantly better than another. The t statistic of the difference between each test instance pair is computed, giving rise to a p value. The results of significance tests for the various methods on the 4 datasets are given in Table 4, where the symbols "~", ">", and ">>" correspond to p-value > 0.05, p-value in (0.01, 0.05], and p-value <= 0.01, respectively. The methods in Table 4 are represented in the form a1-a2, where a1 denotes the classifier whose predictions are adjusted, and a2 denotes how the sense priors are estimated. As an example, NBcal-EM_LR specifies that the sense priors estimated by logistic regression are used to adjust the predictions of the calibrated naive Bayes classifier, and corresponds to the accuracies in the column EM_LR under NBcal in Table 1.
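Both quantities reported in this section are simple to compute. The sketch below shows the KL divergence used in Table 3 and the paired t-test of Table 4, via scipy.stats.ttest_rel, which implements the paired test described above; the per-instance outcome vectors are hypothetical stand-ins for real system output.

    import numpy as np
    from scipy.stats import ttest_rel

    def kl_divergence(true_priors, est_priors, eps=1e-12):
        # KL(true || estimated); senses with a true prior of 0 contribute nothing
        p = np.asarray(true_priors, dtype=float)
        q = np.maximum(np.asarray(est_priors, dtype=float), eps)
        nz = p > 0
        return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

    # paired t-test over test instances: score an instance 1 if a method
    # disambiguates it correctly and 0 otherwise, then test the paired differences
    correct_a = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # hypothetical outcomes of a1-a2
    correct_b = np.array([1, 0, 0, 1, 0, 1, 0, 0])   # hypothetical baseline outcomes
    t_stat, p_value = ttest_rel(correct_a, correct_b)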
Based on the significance tests, the adjusted accuracies of EM_NBcal and EM_LR in Table 1 are significantly better than their respective unadjusted L accuracies, indicating that estimating the sense priors of a new domain via the EM approach presented in this paper significantly improves WSD accuracy compared to just using the sense priors from the old domain.

Table 4: Paired t-tests between the various methods for the 4 datasets.

  Method comparison                  DSO nouns   DSO verbs   SE2 nouns   SE3 nouns
  NB-EM_LR vs. NB-EM_NB
  NBcal-EM_NBcal vs. NB-EM_NB
  NBcal-EM_LR vs. NB-EM_NB
  NBcal-EM_NBcal vs. NB-EM_LR
  NBcal-EM_LR vs. NB-EM_LR
  NBcal-EM_LR vs. NBcal-EM_NBcal

NB-EM_NB represents our earlier approach in (Chan and Ng, 2005b). The significance tests show that our current approach of using calibrated naive Bayes probabilities to estimate sense priors, and then adjusting the calibrated probabilities by these estimates (NBcal-EM_NBcal), performs significantly better than NB-EM_NB (refer to row 2 of Table 4). For DSO nouns, though the results are similar, the p value is a relatively low 0.06.

Using sense priors estimated by logistic regression further improves performance. For example, row 1 of Table 4 shows that adjusting the predictions of multiclass naive Bayes classifiers by sense priors estimated by logistic regression (NB-EM_LR) performs significantly better than using sense priors estimated by multiclass naive Bayes (NB-EM_NB). Finally, using sense priors estimated by logistic regression to adjust the predictions of calibrated naive Bayes (NBcal-EM_LR) in general performs significantly better than most other methods, achieving the best overall performance.

In addition, we implemented the unsupervised method of McCarthy et al. (2004), which calculates a prevalence score for each sense of a word to predict the predominant sense. As in our earlier work (Chan and Ng, 2005b), we normalized the prevalence score of each sense to obtain estimated sense priors for each word, which we then used
to adjust the predictions of our naive Bayes classifiers. We found that the WSD accuracies obtained with the method of McCarthy et al. (2004) are on average 1.9% lower than those of our NBcal-EM_LR method, and the difference is statistically significant.

7 Conclusion

Differences in sense priors between training and target domain datasets will result in a loss of WSD accuracy. In this paper, we show that using well calibrated probabilities to estimate sense priors is important. By calibrating the probabilities of the naive Bayes algorithm, and by using the probabilities given by logistic regression (which is already well calibrated), we achieved significant improvements in WSD accuracy over previous approaches.

References

Eneko Agirre and David Martinez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. In Proc. of EMNLP04.

Miriam Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and Edward Silverman. 1955. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 26(4).

Yee Seng Chan and Hwee Tou Ng. 2005a. Scaling up word sense disambiguation via parallel texts. In Proc. of AAAI05.

Yee Seng Chan and Hwee Tou Ng. 2005b. Word sense disambiguation with distribution estimation. In Proc. of IJCAI05.

Pedro Domingos and Michael Pazzani. 1996. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proc. of ICML96.

Gerard Escudero, Lluis Marquez, and German Rigau. 2000. An empirical study of the domain dependence of supervised word sense disambiguation systems. In Proc. of EMNLP/VLC00.

Adam Kilgarriff. 2001. English lexical sample task description. In Proc. of SENSEVAL-2.

Yoong Keok Lee and Hwee Tou Ng. 2002. An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In Proc. of EMNLP02.

Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2004. Finding predominant word senses in untagged text. In Proc. of ACL04.

Rada Mihalcea, Timothy Chklovski, and Adam Kilgarriff. 2004. The SENSEVAL-3 English lexical sample task. In Proc. of SENSEVAL-3.

George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In Proc. of ARPA Human Language Technology Workshop.

Andrew Y. Ng and Michael I. Jordan. 2001. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Proc. of NIPS14.

Hwee Tou Ng and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In Proc. of ACL96.

Hwee Tou Ng, Bin Wang, and Yee Seng Chan. 2003. Exploiting parallel texts for word sense disambiguation: An empirical study. In Proc. of ACL03.

Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Proc. of ICML05.

Tim Robertson, F. T. Wright, and R. L. Dykstra. 1988. Chapter 1. Isotonic Regression. In Order Restricted Statistical Inference. John Wiley & Sons.

Marco Saerens, Patrice Latinne, and Christine Decaestecker. 2002. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1).

Slobodan Vucetic and Zoran Obradovic. 2001. Classification on data with biased class distribution. In Proc. of ECML01.

Bianca Zadrozny and Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proc. of KDD02.

Jian Zhang and Yiming Yang. 2004. Probabilistic score estimation with piecewise logistic regression. In Proc. of ICML04.