Getting Work Done on the Web: Supporting Transactional Queries

Yunyao Li, H. V. Jagadish
University of Michigan, Ann Arbor, MI 48109
yunyaol,jag@umich.edu

Rajasekar Krishnamurthy, Shivakumar Vaithyanathan
IBM Almaden Research Center, San Jose, CA 95120
rajase,shiv@almaden.ibm.com

ABSTRACT
Many searches on the web have a transactional intent. We argue that pages satisfying transactional needs can be distinguished from the more common pages that have some information and links, but cannot be used to execute a transaction. Based on this hypothesis, we provide a recipe for constructing a transaction annotator. By constructing an annotator with one corpus and then demonstrating its classification performance on another, we establish its robustness. Finally, we show experimentally that a search procedure that exploits such pre-annotation greatly outperforms traditional search for retrieving transactional pages.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation

Keywords
Information Extraction, Intranet Search, Transactional Search

1. INTRODUCTION
A user performing a web search may be interested in obtaining information regarding a topic of interest (informational search), in finding a specific web page (navigational search), or in performing a desired action, such as filing a claim, purchasing a ticket, or downloading a software package (transactional search). Search engines based on classical information retrieval assume that web searches are motivated by users' information needs. Models based on this traditional assumption have been found inadequate to serve "the need behind the query" in navigational and transactional search [8, 10, 18]. Meanwhile, it has been observed [3, 17] that an increasing fraction of all web search queries have a transactional intent. It is this class of queries that we address in this paper. (This work was supported in part by NSF grant IIS-0219513.)

Consider, for example, an employee at the University of Michigan seeking to file a "property damage report." There is a unique university-wide form available online. However, a search on the university's web site using the keyword query property damage report does not have a link to this form in the first 20 hits. Most links returned by the search are to pages that discuss property damage, and many are specific to particular departments. From several of these results it may be possible to navigate to the desired form, but the path is not obvious. Our goal in this paper is to provide better results for such a transactional search.

An obvious problem, when we consider an example such as the one above, is that most web pages are informational and not transactional. Given a set of keywords, it is likely that there are many more non-transactional pages that include the given keywords than actual transactional pages.
In fact, for many transactional searches, such as the one in the example above, there is precisely one web page of interest. The question is how to return this page with a high score. Given our thesis that only a small fraction of pages in the repository are transactional, if we have a means to identify this set of pages (with high probability), then we can make sure to return only such pages (with high score) in response to a transactional search query. This is precisely what we set out to do, identifying a number of cues on a web page that suggest that it is transactional.

It turns out that from the effort to identify transactional pages, we can get more than just whether the page is transactional. In particular, we can identify information regarding the nature of the transaction supported by the page, and terms that are associated with this transaction. Our experimental results show that identifying such information is a critical component for supporting transactional search.

In traditional IR, there is a preparatory phase, when documents are inserted into a collection and indices are created (or updated), and an operational phase, when search queries are efficiently evaluated. For transactional queries, we do some additional work in the preparatory phase: specifically, we recognize documents that are likely to be relevant to transactional queries, and annotate them suitably. We call such documents transactional pages. The set of all transactional pages is a subset of the complete document collection. These documents can then be processed in different ways (Sec. 3) to create a transactional collection for transactional search.

The recognition of transactional pages is performed by a transaction annotator, whose principles of operation are given in Sec. 2. Specifically, we identify all transactions supported by a page (an individual web page may support more than a single transaction). At the core is a classification algorithm which identifies web pages that act as gateways to forms and applications (i.e., transactions). While the notion of identifying such gateways is not new [9], we present a novel templatized procedure aimed at increasing precision. Transaction annotators have no counterparts in traditional IR, and have to be constructed from scratch by us. Thereafter, we would like to leverage standard techniques as far as possible. Specifically, in our implementation, we use the popular open-source Lucene [16] infrastructure. There are several design decisions to be made in this regard: whether to index all terms or only transactional terms, whether to index only the transactional collection or other documents also, and so forth. We consider such issues in Sec. 3. An experimental assessment of the proposed techniques is given in Sec. 4. In Sec. 5 we discuss both the reasons for the good performance observed by our system and the limits of its applicability. The final sections present a discussion of related work and then conclusions.

In this paper, we focus solely on transactional queries; in a complete system, it may be necessary to determine whether a given user query is transactional by analyzing it. Such query classification is not included in the scope of this paper, so as not to confound issues of classification with issues of retrieval for the class (transactional queries) of interest. Instead, we refer the reader to techniques for automatic user goal identification that have been proposed recently [9, 10, 11, 14].

2. TRANSACTION ANNOTATOR
The transaction annotator serves two purposes: (a) it classifies each web page as being either transactional or not; and (b) it returns those specific sections that support the transaction(s) (called transactional features).
While multiple choices for a classifier exist (rule-based, machine learning, etc.), our problem has several distinguishing characteristics that motivate our choice. Foremost, the interest is in very high-precision classification. Consequently, the lack of a large training set, combined with the fact that this is a single-class classification problem, makes transactional annotation a less appealing task for machine learning approaches [22]. Furthermore, the requirement of extracting relevant portions of the document necessitates the careful engineering of features. For the above stated reasons, the advantages of a highly optimized, hand-crafted, rule-based classifier clearly outweigh the obvious overhead involved in building such an annotator.

In this section we provide an algorithm template that captures the essential elements of a transaction annotator. We then focus on two classes of transactions, software downloads (SD) and form entry (FE), and describe the specifics of instantiating the template.

procedure TransactionAnnotator_C(W)
  O: set of transactional objects
  A: set of actions
  1. O = ObjectIdentifier_C(W)
  2. A = ActionIdentifier_C(W)
  3. if (PageClassifier_C(W, O, A))
  4.   then return (O, A)

Figure 1: Template for Transaction Annotator

Figure 1 shows the template for our transaction annotator. The first two steps identify transactional objects and actions, respectively. The former outputs the actual object of the transaction (e.g., the name of the software for SD), while the latter identifies the action to be performed (e.g., downloadable links). Both steps rely primarily on checking for the existence of positive patterns and verifying the absence of negative patterns (Figures 2 and 3). Specifically, positive pattern matches are carefully constructed regular expression patterns and gazetteer lookups, while negative pattern matches are regular expressions based on the gazetteer.

procedure ObjectIdentifier_C(W, RE_P, G_N)
  RE_P = {<re_i, f_i>} is a set of positive patterns
  G_N = {<G_i, f_i>} is a set of negative patterns
  E_RE is the pattern matching evaluation function
  E_G is the gazetteer matching evaluation function
  BE is a boolean expression for class C
  R is the set of objects identified
  1. CO = CandidateObjectIdentifier_C(W)
  2. foreach o in CO
  3.   COP(o) = {<re_j, f_j> in RE_P | E_RE(o, re_j, f_j) is true}
  4.   CON(o) = {<G_j, f_j> in G_N | E_G(o, G_j, f_j) is true}
  5.   if BE(COP(o), CON(o))
  6.     then add o to R
  7. return ConsolidateObjects_C(R)

Figure 2: Algorithm to Identify Transactional Objects

procedure ActionIdentifier_C(W, G_A)
  G_A = {g_i} is a gazetteer of negative terms
  CA is the set of candidate actions identified in W
  E is the gazetteer matching evaluation function
  1. CA = CandidateActionIdentifier_C(W)
  2. foreach a in CA
  3.   if (E(a, G_A, f) is true)
  4.     then discard a
  5. return CA

Figure 3: Algorithm to Identify Actions

Consider, for example, the classifier that identifies SD. First, candidate software names are extracted by CandidateObjectIdentifier_C (Figure 2, line 1) by looking for patterns resembling software names with version numbers (e.g., Acrobat Reader 7.1). Some of these will be false positives, such as "Chapter 1.1". For each candidate object, ObjectIdentifier evaluates patterns comprising features in portions of the web page that are pertinent to the candidate object. Each pattern comprises a regular expression re and a feature f. For SD the only feature of interest is the object-text, i.e., the text that describes the software name (e.g., Acrobat Reader and Chapter). An example positive pattern for object-text requires that the first letter be capitalized. It is important to note that complex transactions (such as FE) contain a richer set of features. False positives such as "Chapter 1.1" are pruned as negative pattern matches using a gazetteer. A boolean expression BE, over this set of positive (COP(o)) and negative (CON(o)) pattern matches, decides whether a candidate object o is relevant. The boolean expression we currently use is simple: it returns true if and only if COP(o) is non-empty and CON(o) is empty. Finally, the relevant objects are consolidated and returned by ConsolidateObjects_C (Figure 2, line 7). For example, Adobe Acrobat Reader and Acrobat Reader will be consolidated into a single object.

Similarly, ActionIdentifier (Figure 3) begins with the identification of several candidate actions. Again, through the use of several hand-crafted regular expressions and gazetteer lookups, the candidate list is pruned to get a final list of actions and corresponding features. At this point PageClassifier_C (Figure 1) classifies web pages based on the transactional objects and actions on each web page: any web page that contains at least one transactional object and an action associated with that object is classified as a transactional page.
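To make the control flow of Figures 1-3 concrete, the following is a minimal Python sketch of the template. It is not the paper's implementation: the feature map, the pattern contents, the consolidation step, and the object/action association check are simplified, illustrative stand-ins.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical feature map: feature name (e.g. "object_text", "window") -> the text it
# covers for a given candidate. In the paper this comes from page structure
# (anchor text, titles, windows of text around hyperlinks, ...).
Features = Dict[str, str]

@dataclass
class Pattern:
    regex: str     # regular expression re
    feature: str   # feature f the expression is evaluated on

def object_identifier(candidates: Dict[str, Features],
                      positive: List[Pattern],
                      negative_gazetteer: Set[str]) -> List[str]:
    """ObjectIdentifier_C: keep candidate o iff some positive pattern matches and
    no negative gazetteer entry matches (the boolean expression BE of Figure 2)."""
    kept = []
    for name, feats in candidates.items():
        cop = any(re.search(p.regex, feats.get(p.feature, "")) for p in positive)
        con = any(g in feats.get("object_text", "").lower() for g in negative_gazetteer)
        if cop and not con:
            kept.append(name)
    # ConsolidateObjects_C would merge near-duplicates such as "Adobe Acrobat Reader"
    # and "Acrobat Reader"; a trivial stand-in:
    return sorted(set(kept))

def action_identifier(candidate_actions: Dict[str, Features],
                      negative_gazetteer: Set[str]) -> List[str]:
    """ActionIdentifier_C: prune candidates whose surrounding text hits the negative gazetteer."""
    return [a for a, feats in candidate_actions.items()
            if not any(g in feats.get("window", "").lower() for g in negative_gazetteer)]

def page_classifier(objects: List[str], actions: List[str]) -> bool:
    """PageClassifier_C: transactional iff at least one transactional object and one
    action were found (object/action association checks omitted in this sketch)."""
    return bool(objects) and bool(actions)

# Example with illustrative values:
objs = object_identifier(
    {"Acrobat Reader 7.1": {"object_text": "Acrobat Reader"},
     "Chapter 1.1": {"object_text": "Chapter"}},
    positive=[Pattern(r"^[A-Z][a-zA-Z ]+$", "object_text")],
    negative_gazetteer={"chapter", "section", "version"})
acts = action_identifier(
    {"download link": {"window": "To download Acrobat Reader click here"}},
    negative_gazetteer={"tutorial", "instruction"})
print(page_classifier(objs, acts))   # True
```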
2.1 Instantiating and Optimizing the Classifier
We have described above an overall template for building transaction annotators. However, several decisions, such as the choice of transactional features, regular expression patterns, and gazetteer entries, must be made depending on the class of transactions being considered. For the SD and FE classes used in this paper, the transaction annotator was constructed and optimized using a subset of a million documents from the IBM corporate intranet. It was then used unchanged in our experimental evaluation on an entirely different dataset (the University of Michigan intranet).

Feature engineering (identifying transactional features) and defining regular expressions and gazetteers were accomplished using a manual iterative process (over the IBM intranet data). It is worth mentioning that there is a complex interaction between the choice of features and the regular expressions/gazetteers. The final set of features included hyperlinks, anchor-texts, and HTML tags, along with more specific features such as the window of text around candidate objects and actions.

In Table 1 we present several example patterns (regular expressions and gazetteers) used by the transaction annotator for SD; similarly, Table 2 presents example patterns used for FE. (For ease of exposition, the example patterns presented in Tables 1 and 2 are simplified versions of the actual patterns used.) For each entry we list where in the algorithm the pattern is used, the example regular expressions or gazetteer entries, and the feature on which they are evaluated. For example, the first entry of Table 1 is a pattern used to identify candidate transactional objects, and its regular expression is evaluated over the document text. The same template can be used to build annotators for other classes of transactions.

Table 1: Example patterns for SD
  ObjectIdentifier / CandidateObjectIdentifier:
    [a-zA-Z]{2,} [Vv]?[0-9]([0-9.])+   (feature: document text)
  ObjectIdentifier / (Positive) Regular Expr. (RE_P):
    [A-Z][a-zA-Z ]+   (feature: text of candidate objects)
  ObjectIdentifier / (Negative) Gazetteer (G_N):
    {chapter, section, version, since, ...}   (feature: text of candidate objects)
  ActionIdentifier / CandidateActionIdentifier:
    [Tt]o download \w+ (click (on)?|go to)   (feature: window of text around hyperlinks)
    download [a-zA-Z]+ from   (feature: window of text around hyperlinks)
    .(exe|zip|tar(.gz)?|jar)   (feature: hyperlinks in web page)
    ^[Dd]ownload   (feature: title of web page)
  ActionIdentifier / (Negative) Gazetteer (G_A):
    {tutorial, instruction, form, survey, ...}   (feature: window of text around hyperlinks)

Table 2: Example patterns for FE
  ObjectIdentifier / CandidateObjectIdentifier:
    (.)+   (feature: hyperlinks in the web page)
  ObjectIdentifier / (Positive) Regular Expr. (RE_P):
    click here to   (feature: window of text around hyperlink)
    following form is (used|related) to   (feature: window of text around hyperlink)
    \w+ form$   (feature: anchor-text)
    \w+ form$   (feature: heading appearing before hyperlink)
  ObjectIdentifier / (Negative) Gazetteer (G_N):
    {toolkit, presentation, cookbook, ...}   (feature: window of text around hyperlink)
  ActionIdentifier / CandidateActionIdentifier, (Positive) Gazetteer:
    {cancel, claim, register, nomination, ...}   (features: title of web page; window of text around hyperlinks; headings in web page)
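As an illustration of how the Table 1 patterns might be applied in practice, the following Python fragment runs simplified versions of the SD patterns over a toy snippet of text. The snippet, the hypothetical URL, and the handling of text windows are illustrative only; the actual annotator operates over page structure.

```python
import re

# Simplified versions of the Table 1 patterns for SD, applied to a toy snippet.
CANDIDATE_OBJECT = re.compile(r"[a-zA-Z]{2,} [Vv]?[0-9]([0-9.])+")   # name + version number
POSITIVE_OBJECT_TEXT = re.compile(r"[A-Z][a-zA-Z ]+")                # capitalized object-text
NEGATIVE_GAZETTEER = {"chapter", "section", "version", "since"}      # negative gazetteer G_N
ACTION_PATTERNS = [                                                  # candidate action patterns
    re.compile(r"[Tt]o download \w+ (click (on)?|go to)"),
    re.compile(r"download [a-zA-Z]+ from"),
    re.compile(r"\.(exe|zip|tar(\.gz)?|jar)$"),
]

text = "To download Acrobat Reader 7.1 click on the link below. See Chapter 1.1 for details."
link = "http://downloads.example.edu/AcroRd32.exe"   # hypothetical hyperlink

for m in CANDIDATE_OBJECT.finditer(text):
    candidate = m.group(0)
    object_text = re.sub(r"\s*[Vv]?[0-9.]+$", "", candidate)         # strip the version number
    positive = bool(POSITIVE_OBJECT_TEXT.match(object_text))
    negative = object_text.lower() in NEGATIVE_GAZETTEER
    print(candidate, "->", "object" if positive and not negative else "pruned")
# Reader 7.1 -> object
# Chapter 1.1 -> pruned

print(any(p.search(link) or p.search(text) for p in ACTION_PATTERNS))  # True (".exe" hyperlink)
```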
3. TRANSACTIONAL COLLECTION AND OPTIMIZATION
The result of the transaction annotator described above is a set of transactional pages, each with an associated set of transactional features. In this section, we discuss some choices with respect to how these pages are processed to create a transactional collection that is indexed by the search engine. We consider these design issues at multiple levels, in turn; a small sketch of the document- and term-level choices follows the list.

1. Collection-level (Document Filtering): Each transactional page must contain at least one transaction object. We could evaluate transactional queries solely against the transactional collection so obtained.

2. Document-level (Term Filtering): Filtering at the document level is accomplished by retaining only those portions of the document that have been identified as containing transactional features. Each transactional page is likely to contain many terms, only a small number of which are actually associated with the transaction. We could choose to index only terms that appear in the transactional features, and ignore the rest.

3. Term-level (Synonym Expansion): Transactional queries typically have the general form <action, object>. In many cases, the action has multiple synonyms and there is the possibility of a mismatch between the term appearing in the user query and that appearing in the web page (e.g., "obtain" rather than "download" some software package). The object, on the other hand, being the name of an entity, is less likely to be obfuscated by the user. To address this potential mismatch we create the transactional collection by performing synonym expansion on all verbs appearing in the transactional features. Note that performing synonym expansion over the entire document collection would dramatically increase the size of the index; restricting it to only the verbs in the transactional collection, as in our case, prevents this problem (Sec. 4.4.3).
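A minimal sketch of the document-level and term-level choices, assuming NLTK's WordNet interface as the thesaurus (the paper uses WordNet, but not necessarily through this library); the feature strings and helper names below are illustrative.

```python
# Sketch of term filtering plus verb synonym expansion, using NLTK/WordNet as a
# stand-in thesaurus. Requires nltk.download("punkt"), ("averaged_perceptron_tagger")
# and ("wordnet").
import nltk
from nltk.corpus import wordnet as wn
from typing import List

def expand_verbs(feature_text: str) -> str:
    expanded = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(feature_text)):
        expanded.append(word)
        if tag.startswith("VB"):                       # only verbs are expanded
            synonyms = {lemma.replace("_", " ")
                        for syn in wn.synsets(word.lower(), pos=wn.VERB)
                        for lemma in syn.lemma_names()}
            expanded.extend(sorted(synonyms - {word.lower()}))
    return " ".join(expanded)

def transactional_document(transactional_features: List[str]) -> str:
    """Term filtering + synonym expansion: the indexed 'document' is just the expanded
    text of the features the annotator extracted, not the full page."""
    return "\n".join(expand_verbs(f) for f in transactional_features)

# Example (illustrative feature text):
print(transactional_document(["Download the Remedy client here",
                              "Submit the completed reimbursement form"]))
```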
4. EXPERIMENTAL EVALUATION
In this section, we report experiments conducted to evaluate the effectiveness of transactional search and to compare the processing choices discussed in Sec. 3.

4.1 Dataset and Queries
Ideally, we would have liked to use a standard corpus and query set for our evaluation. Unfortunately, we found no suitable standard document collection available for transactional search, let alone a query set. As such, we had to create our own (available at http://www.eecs.umich.edu/db/transactionalSearch). In our experiments, we used a collection of intranet web pages obtained using Wget [20]: given a single start point (the URL of the entry page of a research university), the software recursively collected textual documents with a small set of MIME types (e.g., html, php) within the domain of umich.edu in November 2005. This web page collection includes 434,211 documents with a total size of 6.49GB.

A set of 15 transactional search tasks was derived from an informal survey conducted among administrative staff and graduate students in the university: 10 of the tasks are to find a particular form (or forms); the rest are for software downloads. The list of tasks was then given to 26 subjects, all of whom are either current (graduate) students or recent graduates. Each subject was asked to write at least one keyword query for each search task. We obtained an average of 1.48 queries per task from each subject. After removing duplicates, a total of 394 unique queries were collected for the 15 tasks. Example tasks are listed in Table 3 and statistics on the query collection are reported in Table 4.

Table 3: Example tasks (sample user queries)
  Task 1: Get recent trip reimbursed (travel reimburse, reimbursement form)
  Task 3: Get the official letter of recommendation form (recommendation letter form, candidate evaluation)
  Task 6: Alter some of the courses you have registered for (modify course registration, add/drop course)
  Task 12: Obtain Remedy client for your Windows machine (Remedy client windows, Remedy client download)
  Task 14: Obtain the virus scan software available locally (download virus scan, virus scan software)

Table 4: Query statistics: mean (std. deviation)
  Avg. number of unique queries/task: 26.3 (6.34)
  Avg. number of words/query: 2.95 (1.14)

4.2 Experimental Set Up
We used Apache Lucene [16], a high-performance, full-featured text search engine library, to index and search the following data collections:

· S-DOC: The original dataset described in Sec. 4.1.
· S-TDC: Based on the existence of transactional objects, each document in S-DOC was classified as being a transactional page or not. We separately indexed the collection of transactional pages. Note that this collection is a strict subset of the pages in S-DOC.
· S-ANT-NE: This collection is created by writing all the transactional features (for both SD and FE) found on the same document into a single file. The identifier associated with each file is the original document.
· S-ANT: This collection is generated similarly to S-ANT-NE, but with term-level synonym expansion. We used WordNet [21] as our general thesaurus to expand the verbs in the transactional features.

4.3 Evaluation Metrics
Multiple metrics, including precision, recall, and F-measure, can be used to evaluate the quality of the results returned by a search. In the case of a transactional search, it is most often the case that the user is only interested in one way to perform the transaction; that there are other additional ways matters less. In other words, the user is likely to care most about the top-ranked relevant match returned. In light of the above, the results of most experiments are reported in terms of the mean reciprocal rank (MRR) measure. For each unique query of each task, the reciprocal value (1/n) of the rank (n) of the highest-ranked correct result is obtained. This value is then averaged over all the queries corresponding to the same task. The reciprocal rank of a query is set to 0 if no correct result is found in the first 100 pages returned. In our study, correct answers are web pages that can support the desired transaction tasks. For example, a correct answer for "download Remedy Client" must be a web page from which the software Remedy Client can be downloaded directly. As such, there is little subjectivity in determining relevance. For completeness, we also present recall measurements, but only for the main experiment in Section 4.4.1.
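The MRR measure described above can be sketched as follows; the query strings and document identifiers are hypothetical.

```python
from typing import Dict, List, Set

def reciprocal_rank(ranked_ids: List[str], correct: Set[str], cutoff: int = 100) -> float:
    """1/rank of the highest-ranked correct result, or 0 if none in the top `cutoff`."""
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in correct:
            return 1.0 / rank
    return 0.0

def task_mrr(results_per_query: Dict[str, List[str]], correct: Set[str]) -> float:
    """Average the reciprocal rank over all unique queries of one task."""
    rr = [reciprocal_rank(r, correct) for r in results_per_query.values()]
    return sum(rr) / len(rr)

# Hypothetical example: two queries for one task, one correct page "p7".
print(task_mrr({"reimbursement form": ["p3", "p7", "p9"],
                "travel reimburse":   ["p1", "p2", "p5", "p7"]},
               correct={"p7"}))   # (1/2 + 1/4) / 2 = 0.375
```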
4.4 Experimental Results

4.4.1 Transactional Search
As a first test, we wish to judge how effective our system is for transactional search. The MRR for each task over S-DOC and S-ANT is reported in Figure 4. As can be seen, search based on S-ANT almost always outperforms search based on S-DOC. For nearly two-thirds of the tasks, S-ANT achieves an MRR higher than 0.5, while S-DOC achieves similar performance for only 3 of them. In particular, for 5 of the tasks (Tasks 1, 2, 4, 5, 10), the MRR of S-ANT is significantly higher than that of S-DOC. For Tasks 1, 2, 4, and 5, the MRR of S-ANT is higher than 0.5, indicating that S-ANT on average returned a correct answer in its top two results. In contrast, the MRR of S-DOC for the same set of tasks is lower than 0.05. For Task 10, S-ANT performed slightly worse, with its MRR just below 0.3; however, it still significantly outperforms S-DOC, whose MRR is merely 0.009.

Figure 4: MRR for S-DOC and S-ANT.

Figure 5: Recall@100 on S-DOC and S-ANT.

Based on the results reported in Figure 4, the search tasks can essentially be divided into two categories: (i) tasks where S-ANT significantly outperforms S-DOC (Tasks 1, 2, 4, 5, 10), and (ii) tasks where S-ANT achieves a better MRR than S-DOC by a smaller margin (such as Tasks 3 and 6). Figure 6 presents the reciprocal rank of S-ANT and S-DOC on individual queries for selected tasks in these two categories. For tasks in the first category (Figure 6(a), (b)), S-ANT performs extremely well over individual queries: it returned correct results at the first rank for more than half of the queries, and within the fifth rank for more than three-fourths of the queries. In contrast, S-DOC never returned correct results at the first rank for any query, and returned correct results within the tenth rank for less than one-tenth of the queries for the same tasks. For tasks in the second category, the advantage of S-ANT over S-DOC is still obvious but smaller (Figure 6(c), (d)). S-ANT returned correct results at the first or second rank for about half of the queries but failed to return correct results in the top ten ranks for the others. Meanwhile, S-DOC returned correct results within the tenth rank for nearly one-third of the queries, and even returned correct results at the first rank for about one-tenth of them.

Figure 6: Reciprocal rank on individual queries ((a) Task 1, (b) Task 2, (c) Task 3, (d) Task 6).

Note that for Task 9, both S-ANT and S-DOC perform badly, with S-ANT performing worse than S-DOC. This result is not surprising. Task 9 asks for a form used to request telephone repair service, but there is no such form in our dataset (at the University of Michigan, telephone repair is requested by telephoning a repair number, not by filling in a web form). For this task, web pages with the correct contact information for the service were counted as correct results. One such page was included in S-ANT due to a different transaction being identified on that page. Recall, however, that transaction annotators only keep features related to transactions. Since there is no transaction object related to this task, keyword search for the task on S-ANT is essentially a search over document content less related to the task, explaining the worse results.

While it is straightforward to see how many relevant answers there are in the result set, it is hard to know how many answers there are in the document collection as a whole. As such, it is difficult to measure recall. Instead, for each task we approximate the total number of relevant web pages by the union of the correct answers in the top 100 results for S-DOC and S-ANT (specifically, the two most frequent keyword queries for each task were used to obtain the top 100 results). The recall of S-DOC and S-ANT at cutoff rank 100 (Recall@100) is reported in Figure 5. As can be seen, even though S-ANT (12.6MB) is only a tiny fraction of S-DOC (6.49GB), it actually retrieved more correct results in its top 100 results for two-thirds of the tasks, and returned fewer correct results for only three tasks. (There frequently are multiple pages linking to the same form, and each such page is considered a correct result in our study.)
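A small sketch of the pooled Recall@100 approximation used above; the judged result lists are hypothetical.

```python
from typing import List, Set

def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 100) -> float:
    """Fraction of the (pooled) relevant pages found in the top-k results."""
    return len(set(ranked_ids[:k]) & relevant) / len(relevant) if relevant else 0.0

# Hypothetical judged results for one task:
correct_in_sdoc = {"p7"}                  # correct answers seen in S-DOC's top 100
correct_in_sant = {"p7", "p12", "p15"}    # correct answers seen in S-ANT's top 100
pooled_relevant = correct_in_sdoc | correct_in_sant

print(recall_at_k(["p7", "p3"], pooled_relevant))                 # S-DOC: 1/3
print(recall_at_k(["p12", "p15", "p7", "p4"], pooled_relevant))   # S-ANT: 3/3
```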
4.4.2 Term Filtering vs. Document Filtering
We also wish to compare the effectiveness of transactional collections generated via term filtering and document filtering. The result of the study between S-ANT (term filtering) and S-TDC (document filtering) is shown in Figure 7. As can be seen, S-ANT performs better than S-TDC in 13 out of 15 tasks. This result shows that when the transaction annotator does a good job of identifying relevant features, retaining only these features in the transactional collection improves search performance, as opposed to retaining the entire document. The cases where unrelated content may help involve queries that are less well defined, such as those for Tasks 6 and 14. For example, none of the keyword queries specified for Task 14 contains the actual software name "VirusScan"; rather, more general terms such as "anti-virus" and "virus scan" are used. In future work, we intend to investigate techniques to obtain useful context for such cases.

Figure 7: MRR for S-TDC and S-ANT.

4.4.3 Synonym Expansion
Finally, to evaluate the effectiveness of synonym expansion for transactional features, we compare the performance of S-ANT and S-ANT-NE in Figure 8(a) for queries containing a verb (our query set includes 154 such queries, out of a total of 394 unique queries). As can be seen, for such queries in FE tasks (Tasks 1-10), the advantage of synonym expansion is evident, while for SD tasks synonym expansion does not provide better results. For SD tasks, almost all user queries with a verb use the same word, "download", which is contained in all transactional features of SD pages. S-ANT performs worse than S-ANT-NE for Tasks 9 and 14 due to the abnormalities discussed previously: (i) there is no form for Task 9 in the dataset; and (ii) no query specified for Task 14 contains the actual software name. When we consider all the user queries for each task (including those without a verb), the comparison of S-ANT and S-ANT-NE is shown in Figure 8(b). Among the FE tasks, Tasks 2, 6, 7, and 10 still show considerable improvement of S-ANT over S-ANT-NE, while the difference is marginal for the rest. This can be explained by the fact that a larger fraction of the user queries for these four tasks contain a verb, as opposed to the other tasks. The cost of synonym expansion in terms of index size is 89.3%, with the size of the Lucene index increasing from 1.77MB for S-ANT-NE to 3.35MB for S-ANT.

Figure 8: MRR for S-ANT and S-ANT-NE ((a) over queries containing a verb, (b) over all queries).

5. DISCUSSION
It is clear from the experimental results above that pre-identifying and annotating transactional pages leads to vastly superior performance in transactional search. In this section, we consider why that might be the case.

5.1 Collection Size
We begin by noting that for the experimental dataset, as expected, the transactional collection is only a very small fraction of the total: 22,967 of the 434,211 pages in the collection were identified as transactional. A fundamental question to ask is whether the transactional collection as identified by our classifier is indeed the correct set. Could there be errors in classification? We manually took a random sample of 100 documents from the transactional collection and found that each of these was indeed transactional. We then took a random sample of 500 documents from the full document collection. Of these, we found that all documents that had been identified as transactional were indeed transactional, but not vice versa. Specifically, we found 30 transactional pages in total, of which 23 were included in the transactional collection by the classifier and 7 had been missed. In other words, our transaction annotator is identifying roughly 3/4 of the transactional pages. This gives us more confidence in our superior search performance, as it indicates that a fair-sized fraction of the transactional pages from S-DOC are present in S-ANT.
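The sample-based estimates above amount to the following small calculation (numbers taken from the text).

```python
# Numbers from the two manual samples described above.
precision_sample_size = 100     # random sample drawn from the transactional collection
precision_correct = 100         # all of them were judged transactional
full_sample_transactional = 30  # transactional pages found in a 500-page random sample
caught_by_annotator = 23        # of those, the ones the annotator had included

precision_estimate = precision_correct / precision_sample_size       # 1.0
recall_estimate = caught_by_annotator / full_sample_transactional    # ~0.77, roughly 3/4
print(precision_estimate, round(recall_estimate, 2))
```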
5.2 Probabilistic Model
For the most part in our experiments the retrieval engine, Lucene, has been treated as a black box, making it difficult to isolate and evaluate the different aspects of transactional search. To understand the value of identifying transactional pages, we implemented a simple language-model retrieval engine where all assumptions can be made explicit. A standard language model, in response to a query, ranks documents based on

    P(Q | d) · P(d)                                   (1)

where Q is the keyword query and P(Q | d) is the likelihood of producing the query Q given d, the language model for document d. (Strictly speaking, we are interested in the posterior P(d | Q) = P(Q | d) · P(d) / P(Q), where P(Q | d) has been replaced by the model for d.) P(d) is the prior probability for document d, often assumed to be uniform. Since our goal is transactional search, a priori we have a bias towards pages that are transactional. It is precisely this belief that we encode in the computation of P(d), where each page has a prior probability that is proportional to the number of transactional objects identified in Section 2. Similar reasoning for home page search can be found in [13].

    P(d) = (1 + O(d)) / (N + O(N_d))                  (2)

where O(d) is the number of transactional objects in document d, O(N_d) is the total number of transactional objects identified in the collection, and N is the total number of documents in the collection. Smoothing has been added to account for documents that have no transactional objects. d was modeled as a multinomial, and the following mixture probability was used to compute P(Q | d) (some details, such as minimal feature selection using stop-word lists and the removal of very frequent terms, are omitted in the interests of space):

    P(Q | d) = ∏_{t ∈ Q} ( λ · P(t | d) + (1 − λ) · P(t) )        (3)

where t refers to an individual term and P(t) is the collection-level probability for term t, computed as the ratio of the number of times the term appears in the collection to the total number of terms in the collection. Term-level synonym expansion was not performed for the language model implementation.

The results obtained are shown in Figure 9.

Figure 9: MRR for S-DOC with uniform prior and with transactional prior.

As can be seen, the transactional prior helps to increase search effectiveness for almost all tasks on the document collection (S-DOC). These results are very pleasing and validate our intuition that pre-identifying transactional pages is important. Certainly, more powerful language models incorporating other assumptions (transactional terms and important parts of speech) are possible but beyond the scope of this paper. We intend to continue this line of research further to better understand transactional search.
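A minimal sketch of this ranking scheme, assuming toy term counts and an illustrative smoothing weight; it is not the engine used in the experiments.

```python
from collections import Counter
from typing import Dict, List

def prior(num_objects: int, total_objects: int, num_docs: int) -> float:
    """Transactional prior of Eq. (2)."""
    return (1 + num_objects) / (num_docs + total_objects)

def query_likelihood(query: List[str], doc_tf: Counter, coll_tf: Counter,
                     lam: float = 0.5) -> float:
    """Mixture query likelihood of Eq. (3); lam is an illustrative smoothing weight."""
    doc_len, coll_len = sum(doc_tf.values()), sum(coll_tf.values())
    score = 1.0
    for t in query:
        p_td = doc_tf[t] / doc_len if doc_len else 0.0
        p_t = coll_tf[t] / coll_len if coll_len else 0.0
        score *= lam * p_td + (1 - lam) * p_t
    return score

def rank(query: List[str], docs: Dict[str, Counter], objects: Dict[str, int]) -> List[str]:
    """Rank documents by P(Q|d) * P(d), Eq. (1), with the transactional prior."""
    coll_tf = sum(docs.values(), Counter())
    total_objects, n = sum(objects.values()), len(docs)
    scores = {d: query_likelihood(query, tf, coll_tf) * prior(objects[d], total_objects, n)
              for d, tf in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection: d1 is a download page with two transactional objects, d2 is not.
docs = {"d1": Counter("download remedy client windows".split()),
        "d2": Counter("remedy ticket status discussion windows".split())}
print(rank("download remedy".split(), docs, objects={"d1": 2, "d2": 0}))   # ['d1', 'd2']
```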
6. RELATED WORK
This research has its roots in web page search, web genre detection, and information extraction. The following paragraphs discuss related work in these fields.

Web Genre Detection: Automatic web genre identification (AWGI) has recently been studied as a key factor in improving search results [5, 12]. For example, genre classification could allow users to sort search results according to their immediate interests [12]. While current AWGI techniques could improve navigational search (e.g., home pages [5]), we are not aware of similar work for transactional search.

Web Page Search: Structural information on a web page, such as the URL, anchor text, and title, has been used with great success to improve entry page search and link-based page classification [4, 7, 9, 13, 19]. We have shown that structural information can also be exploited in transaction annotators to improve transactional search. Most work on web page search is focused on Internet search. A notable exception is [6]. In that study, the authors identified a few fundamental differences between intranets and the Internet. Among these differences, the following directly motivated our work: (i) "intranet documents are often created for simple dissemination of information;" (ii) "a large fraction of queries tend to have a small set of correct answers (often unique), and the unique answer pages do not usually have any special characteristic;" (iii) "intranets are spam-free." Our example annotators are also designed by taking such differences into consideration.

Search Goal Detection: Techniques based on classification and user behavior [9, 10, 11, 14] have been proposed to support automatic user goal identification in web search. These techniques are important for transactional search, as it is often necessary to identify transactional queries before search. The work most relevant to ours is [9], in which Kang proposed the notion of Service Link and used it to improve query type classification. The idea behind Service Link bears great similarity to that of our transaction annotator: both classify pages based on structural information within each page. One key difference is that [9] does not support careful extraction of transactional features. On the Internet, the context of [9], simple adjustments based on Service Link to the search results of a commercial search engine seem to serve the purpose well. However, as our comparison study has shown, such page-level annotation is not adequate for intranets, which are far less search-engine-friendly. For such environments, supporting the extraction of transactional features is necessary for better transactional search.

Information Extraction: Transaction annotators often depend heavily on information extraction techniques for page classification and transactional feature extraction. Depending on the type of transaction and/or the dataset of interest, different information extraction techniques can be applied. For example, in object identification for software download pages (Sec. 2), finding the software name on each web page is essentially a Named Entity Recognition (NER) task [2], i.e., identifying proper names in text, such as geographic names [1, 15]. Our current implementation of the transaction annotator mostly relies on structural pattern matching, but more advanced information extraction techniques can easily be adopted to improve transaction annotators and benefit transactional search.
7. CONCLUSIONS
In this paper we introduced a methodology for transactional search that is based upon identifying and pre-annotating transactional pages. We developed a template-driven architecture for a rule-based transaction annotator that is capable of highly specific labeling of many distinct transaction types. We showed the robustness of our annotator design by constructing and optimizing it on one data set (from a corporate intranet) and reporting results of its use, with no modifications, on a different data set (from a university intranet). Even without dataset-specific optimization, the new methodology was able to show dramatic improvement in the search results produced.

Transactional search is very important, and is growing in importance on the web. On the extranet web, better serving transactional needs has significant potential economic benefit associated with it. It is typically the case that a user is willing to spend $500 to book a hotel on the web, including a $50 commission to the web site, but is unwilling to spend even an amount smaller than $50 for information to help choose the hotel. The business model for many web businesses is to provide information for free in the hope of making money from a transaction that eventually results. Not surprisingly, generic search engines (as far as we can tell) already special-case selected transactional searches. For instance, a Google query naming a pair of cities will bring up a link to flight booking services as the top hit, before providing the standard list of matches. It is therefore in the interests of companies like Orbitz and Expedia to provide appropriate additional information to search engines, in order to enable them to better support transactional requests.

On the other hand, intranets are usually managed in a decentralized fashion (e.g., by departments in universities, and by geographical boundaries or product divisions in large multinational companies). Groups within the organization that support various transactions (such as filing a claim or downloading software) have little coordination when exposing these services. Since the economic imperatives for the intranet are simply not as strong as for the extranet, it is unlikely that anyone will manually identify and manage web pages that serve transactional requests. Furthermore, only a small fraction of pages in the intranet are typically relevant to transactional search, and ordinary search is not good at finding such pages. Therefore we expect that the techniques developed in this paper, while useful for all transactional web searches, will be of greatest value for intranet searches.
8. REFERENCES
[1] E. Amitay et al. Web-a-where: Geotagging web content. In SIGIR, 2004.
[2] D. M. Bikel et al. An algorithm that learns what's in a name. Machine Learning, 34(1-3):211-231, 1999.
[3] A. Broder. A taxonomy of web search. SIGIR Forum, 2002.
[4] N. Craswell et al. Effective site finding using link anchor information. In SIGIR, 2001.
[5] A. Dillon and B. Gushrowski. Genres and the web: Is the home page the first digital genre? JASIS, 51(2):202-205, 2000.
[6] R. Fagin et al. Searching the workplace web. In WWW, 2003.
[7] J. Furnkranz. Exploiting structural information for text classification on the WWW. In IDA (Advances in Intelligent Data Analysis), 1999.
[8] D. Hawking et al. Which search engine is best at finding online services? In WWW, 2000.
[9] I.-H. Kang. Transactional query identification in web search. In AIRS, 2005.
[10] I.-H. Kang and G. Kim. Query type classification for web document retrieval. In SIGIR, 2003.
[11] I.-H. Kang and G. C. Kim. Integration of multiple evidences based on a query type for web search. Inf. Process. Manage., 40(3):459-478, 2004.
[12] B. Kessler et al. Automatic detection of text genre. In ACL, 1997.
[13] W. Kraaij et al. The importance of prior probabilities for entry page search. In SIGIR, 2002.
[14] U. Lee et al. Automatic identification of user goals in web search. In WWW, 2005.
[15] H. Li et al. InfoXtract location normalization: A hybrid approach to geographic references in information extraction. In Workshop on the Analysis of Geographic References, 2003.
[16] Lucene. http://lucene.apache.org/java/docs/.
[17] D. E. Rose and D. Levinson. Understanding user goals in web search. In WWW, 2004.
[18] T. Upstill et al. Buying bestsellers online: A case study in search & searchability. In ADCS, 2002.
[19] T. Westerveld et al. Retrieving web pages using content, links, URLs and anchors. In TREC, 2001.
[20] Wget. http://www.gnu.org/software/wget/wget.html.
[21] WordNet. http://wordnet.princeton.edu/.
[22] H. Zhu et al. Topic learning from few examples. In PKDD, 2003.