Combining Fields in Known-Item Email Search

Craig Macdonald
Department of Computing Science, University of Glasgow, Scotland, UK
craigm@dcs.gla.ac.uk

Iadh Ounis
Department of Computing Science, University of Glasgow, Scotland, UK
ounis@dcs.gla.ac.uk

ABSTRACT
Emails are examples of structured documents with various fields. These fields can be exploited to enhance the retrieval effectiveness of an Information Retrieval (IR) system that searches mailing list archives. In recent experiments of the TREC 2005 Enterprise track, various fields were applied with varying degrees of success by the participants. In this work, using a field-based weighting model, we investigate the retrieval performance attainable by each field, and examine when field evidence should be combined or not.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Performance, Experimentation

Keywords: Email, retrieval, fields, structure, metadata.

1. INTRODUCTION
In a known-item (KI) task, there is only one relevant document, which must be ranked as early as possible by the retrieval system. The evaluation measure in a KI task is Mean Reciprocal Rank (MRR), which rewards retrieval systems that rank the target document as early as possible. In TREC 2005, the Enterprise track (TREC-Ent) had a known-item task for email search, using an archive of mailing list emails of the World Wide Web Consortium (W3C).

Emails are composed of two parts: the written message of the email, and various header fields such as subject and sender information. These fields may bring evidence of different importance, which can be taken into account to enhance retrieval performance. We use a field-based weighting model to combine the field evidence of emails. A research problem is to determine which fields to apply and combine in retrieval. Our objectives are two-fold: firstly, to determine how useful each separate field is for retrieval purposes; secondly, to find indications of when the combination of two fields is effective.

2. FIELDS IN EMAIL SEARCH
In the W3C collection, there are six fields that we apply, namely Subject, Sender, mailing List name, message Text, and finally the Unquoted and Quoted parts of the message text. As the W3C collection is in the form of a small Web crawl, we additionally apply Body (which contains all the email fields), Title (which contains a mix of subject, sender and date), and the Anchor Text of the incoming hyperlinks (which mostly contains the subject and sender of the email) as fields in our experiments. We denote the field that is the concatenation of the Body, Title and Anchor Text fields by All. We index each field individually, removing stopwords and applying the first two steps of Porter's stemming algorithm. In Table 1, the second column shows the average length of each field over the 174,311 email documents of the W3C collection. For our experiments, we use the topics and the W3C collection from the TREC-Ent 2005 KI task.

To rank email documents, we use the Divergence From Randomness field-based weighting model PL2F, which has shown good retrieval performance on this task [1]. For the PL2F model, the relevance score of an email document d for a query Q is given by:

score(d, Q) = \sum_{t \in Q} \frac{qtf}{qtf_{max}} \cdot \frac{1}{tfn + 1} \cdot \Big( tfn \cdot \log_2 \frac{tfn}{\lambda} + (\lambda - tfn) \cdot \log_2 e + 0.5 \cdot \log_2(2\pi \cdot tfn) \Big)    (1)

where \lambda = F/N; F is the frequency of the query term in the collection, and N is the number of documents in the whole collection. qtf is the query term frequency, and qtf_{max} is the maximum query term frequency among the query terms. tfn is given by:

tfn = \sum_f w_f \cdot tf_f \cdot \log_2 \Big( 1 + c_f \cdot \frac{avg\_l_f}{l_f} \Big), \quad (c_f > 0)    (2)

where tf_f is the frequency of term t in field f, avg\_l_f is the average length of the field, and l_f is the length of f in d. c_f is a hyper-parameter for each field controlling the term frequency normalisation, and the contribution of the field is controlled by the weight w_f. In our experiments, we set c_f and w_f by training.
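To make the scoring function concrete, the following Python sketch evaluates Eqs. (1) and (2) for a single document. It is a minimal illustration under stated assumptions, not the Terrier implementation used in our experiments: the function names and the input data structures (per-term field statistics, collection frequencies) are hypothetical choices made for readability.

import math

def pl2f_tfn(field_stats, c, w):
    # Eq. (2): per-field normalised term frequency.
    # field_stats: list of (tff, lf, avg_lf) tuples, one per field, where
    # tff is the term's frequency in the field, lf is the field's length in
    # this document, and avg_lf its average length over the collection.
    # c and w are the per-field hyper-parameters c_f (> 0) and weights w_f.
    return sum(
        w[f] * tff * math.log2(1.0 + c[f] * avg_lf / lf)
        for f, (tff, lf, avg_lf) in enumerate(field_stats)
        if tff > 0 and lf > 0
    )

def pl2f_score(query, doc_fields, coll_freq, c, w, N):
    # Eq. (1): PL2F relevance score of one document for a query.
    # query: term -> query term frequency (qtf)
    # doc_fields: term -> list of (tff, lf, avg_lf) for this document
    # coll_freq: term -> frequency F of the term in the whole collection
    # N: number of documents in the collection
    qtf_max = max(query.values())
    score = 0.0
    for t, qtf in query.items():
        tfn = pl2f_tfn(doc_fields.get(t, []), c, w)
        if tfn <= 0:
            continue  # term does not occur in any field of this document
        lam = coll_freq[t] / N  # lambda = F / N
        score += (qtf / qtf_max) * (1.0 / (tfn + 1.0)) * (
            tfn * math.log2(tfn / lam)
            + (lam - tfn) * math.log2(math.e)
            + 0.5 * math.log2(2.0 * math.pi * tfn)
        )
    return score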
3. SEPARATE FIELD PERFORMANCE
Firstly, we assess the performance of each field separately in ranking the emails. Table 1 shows the retrieval effectiveness when each field is used for retrieval on its own. In the third column of the table, the system has been trained using 25 topics that are not in the test topics. In the fourth column, the parameters of the PL2F weighting model have been trained directly on the test topics. Training for the optimal setting allows the maximum potential of each field to be assessed.

Field      avg_lf    Train/Test   Test/Test
All        394.35    0.593        0.608
Body       328.59    0.504        0.536
Title       16.04    0.504        0.508
Atext       49.72    0.439        0.461
Sender       5.90    0.029        0.031
Subject      4.09    0.468        0.468
List         2.78    0.018        0.025
Text       193.02    0.401        0.437
Unquoted   144.93    0.424        0.448
Quoted      48.08    0.026        0.036

Table 1: Average length of each field (avg_lf), and performance in MRR when each field is used separately for retrieval. Train/Test denotes that the system is trained using the training topics, and Test/Test that it is trained using the test topics. Note that the best performing official TREC 2005 run had MRR 0.621, while the median was 0.4545. All runs were statistically different from the best run in each column at p < 0.05.

We can see that the training topics are, in general, representative of the test topics, as the results are roughly similar between both trainings. The All field, which contains the most evidence, performs significantly better than all other fields (Signed Rank test, p < 0.05). Interestingly, some fields achieve an MRR of 0.4 to 0.5, namely Subject, Title and Anchor Text (Atext), even though they do not contain the actual message text of the email. As each of these fields contains the subject of the email, we can infer that the subject is useful for retrieval, and alone can outperform the median of the runs submitted for this task. When considering the fields containing the message text of the email, i.e. All, Body, Text and Unquoted, we can see that the additional evidence present in the All and Body fields increases performance over the Text field. However, the Quoted text field is of little retrieval value, and removing the Quoted text from the Text field, i.e. the Unquoted field, increases retrieval performance (from MRR 0.401 to 0.424 and from 0.437 to 0.448). Finally, the Sender and List fields are not useful for retrieval on these topics, perhaps due to the lack of personal involvement of the topic creators in the W3C.
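As a point of reference for the MRR figures in Tables 1 and 2, the following is a minimal sketch of how Mean Reciprocal Rank is computed for a known-item task, assuming exactly one target document per topic; the function name and input structures are illustrative, not part of the TREC evaluation tooling.

def mean_reciprocal_rank(rankings, targets):
    # rankings: topic_id -> ranked list of document ids (best first)
    # targets:  topic_id -> the single relevant (target) document id
    # A topic whose target is not retrieved contributes 0 to the mean.
    total = 0.0
    for topic, ranking in rankings.items():
        try:
            rank = ranking.index(targets[topic]) + 1  # ranks are 1-based
            total += 1.0 / rank
        except ValueError:
            pass  # target document not retrieved for this topic
    return total / len(rankings)

# Example: target ranked 1st for one topic and 4th for another
# gives MRR = (1/1 + 1/4) / 2 = 0.625.
print(mean_reciprocal_rank(
    {"KI1": ["d7", "d2"], "KI2": ["d9", "d1", "d5", "d3"]},
    {"KI1": "d7", "KI2": "d3"},
))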
4. COMBINING FIELDS
In this section, we investigate applying pairs of fields using the PL2F weighting model (Eq. (1)). The objective is to show when two fields should be combined. Table 2 presents the performance of pairs of fields. Note that some related pairs of fields are omitted, e.g. Text and Unquoted, because the Unquoted field is entirely contained in the Text field.

Fields              Train/Test   Test/Test
Atext Body          0.599        0.618
Atext List          0.465        0.493
Atext Quoted        0.417        0.456
Atext Sender        0.450        0.475
Atext Subject       0.453        0.460
Atext Text          0.583        0.611
Atext Title         0.481        0.501
Atext Unquoted      0.623        0.637
Body List           0.482        0.544
Body Sender         0.436        0.551
Body Subject        0.571        0.608
Body Title          0.605        0.615
List Quoted         0.033        0.045
List Sender         0.040        0.058
List Subject        0.483        0.486
List Text           0.398        0.461
List Title          0.499        0.506
List Unquoted       0.358        0.466
Quoted Sender       0.059        0.056
Quoted Subject      0.381        0.394
Quoted Title        0.441        0.471
Quoted Unquoted     0.401        0.455
Sender Subject      0.509        0.516
Sender Text         0.413        0.435
Sender Title        0.510        0.523
Sender Unquoted     0.396        0.445
Subject Title       0.507        0.514
Subject Unquoted    0.565        0.596
Text Subject        0.527        0.559
Text Title          0.590        0.637
Title Unquoted      0.621        0.637

Table 2: MRR scores for combinations of pairs of fields. Runs not statistically different from the best run in each column (p < 0.05) are denoted with underline.

From the table, we can see that several combinations of fields achieve a good MRR, including some that outperform the best official run of TREC-Ent 2005 (run uogEDates12T: MRR 0.621). In general, a field containing the unquoted text of the email and one containing the subject must be used to achieve a high MRR. Moreover, it is possible to deduce when fields are similar or independent. For instance, although the Atext and Subject fields perform relatively well in Table 1, combining them (as in Table 2) does not improve on the retrieval effectiveness of either alone. This suggests that they contain similar evidence, which matches what we know about these two fields (they both contain terms from the subject of the emails). The combination of Atext and Title exhibits similar properties. In contrast, applying fields that contain independent evidence, for instance Sender and Subject, or List and Subject, yields an increased performance roughly equal to the sum of the individual performances of both fields. On the other hand, note that combining two independent fields such as Title and Unquoted has led to one of the best performances, even though the achieved MRR is not equal to the sum of the individual performances.

5. CONCLUSIONS
Our study investigated ten possible fields that could be applied by an email search system. We showed that using more evidence from each email increases the retrieval performance of an email search system. In particular, it is essential that the chosen fields contain the subject and text of an email, though the quoted text of previous emails in the thread was not shown to be useful. Moreover, our results suggest that when different sources of evidence are combined, retrieval performance can be enhanced if the chosen sources provide independent evidence. In the future, we intend to work towards automatically assessing the usefulness of fields and their combinations.

6. REFERENCES
[1] C. Macdonald, V. Plachouras, B. He, and I. Ounis. University of Glasgow at TREC 2005: Experiments in Terabyte and Enterprise tracks with Terrier. In Proceedings of TREC 2005.