DeWild: A Tool for Searching the Web Using Wild Cards

Haobin Li
Computing Science Department
University of Alberta
haobin@cs.ualberta.ca

Davood Rafiei
Computing Science Department
University of Alberta
drafiei@cs.ualberta.ca

Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Search and Retrieval
General Terms: Algorithms
Keywords: DeWild, Data Extraction, Web Search

1. OVERVIEW

A large volume of facts is available on the Web, and manually extracting these facts is time consuming and often impractical. Example extraction tasks include compiling a list of scientists, a list of a company's acquisitions, etc. Unless such lists have already been compiled and made available on the Web, one has to query a search engine, examine the pages returned, and extract a handful of instances from each page. Consider the case of extracting researchers: many bona fide researchers are not referred to as researchers. Instead, they are often described as scientists, experts, professors, etc. If only the term "researchers" is used in the query, many qualified instances will not be extracted.

We demonstrate DeWild, a domain-independent system for searching and data extraction on the Web. A search in DeWild is expressed using a simple query with some wild cards, and the result of a query is a ranked list of rows that match the wild cards. For instance, given the query "Oracle acquired %", the output is expected to be a ranked list of companies that were purchased by Oracle, preferably with the real Oracle acquisitions ranked the highest.

One type of wild card in DeWild is an extractor. An extractor indicates a probable position of the desired data that needs to be extracted. Another type of wild card, used for query relaxation, indicates that terms semantically similar to the given one should also be considered. For instance, the wild card can specify that words similar to "researchers" (e.g. scientists) should be part of the search.
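To make the role of the extractor wild card concrete, the following is a minimal sketch of how a query such as "Oracle acquired %" could be matched against text snippets. It is not DeWild's implementation. The names CANDIDATE, query_to_regex and extract are hypothetical, the snippets are made-up toy text, and two simplifying assumptions are made: a candidate instance is a run of capitalized words, and instances are ordered by raw frequency, which stands in for DeWild's own ranking algorithm. Only the first % in a query is extracted, and the similarity wild card is not modeled.

```python
import re
from collections import Counter

# Assumption: a candidate instance is a run of capitalized words,
# e.g. "Siebel Systems". DeWild's actual candidate detection is not shown here.
CANDIDATE = r"([A-Z][A-Za-z0-9&]*(?:\s+[A-Z][A-Za-z0-9&]*)*)"


def query_to_regex(query):
    """Compile a query into a regex; each standalone % becomes a capture group."""
    pieces = [CANDIDATE if tok == "%" else re.escape(tok) for tok in query.split()]
    return re.compile(r"\s+".join(pieces))


def extract(query, snippets):
    """Return (instance, count) pairs for the first %, most frequent first."""
    pattern = query_to_regex(query)
    counts = Counter()
    for text in snippets:
        for match in pattern.finditer(text):
            counts[match.group(1)] += 1
    return counts.most_common()


# Toy snippets, invented for illustration only.
SNIPPETS = [
    "Oracle acquired PeopleSoft after a long takeover battle.",
    "Analysts say Oracle acquired PeopleSoft to gain customers.",
    "Oracle acquired Siebel Systems last year.",
]

if __name__ == "__main__":
    for instance, count in extract("Oracle acquired %", SNIPPETS):
        print(instance, count)
```

Frequency counting is used here only because it is the simplest way to produce a ranked list; a purely syntactic match like this one also shows why some form of ranking is needed, since any capitalized phrase following the query terms would be accepted.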
Building a unified query interface for a large number of extraction tasks is challenging. A problem with phrase queries, especially long ones, is that they can retrieve very few or no matches. Query relaxation techniques (e.g. [2]) are not generally applicable to phrase queries. DeWild uses an online query rewriting engine to improve both recall and the quality of the results. For example, the query "Oracle acquired %" can be rewritten as "Oracle purchased %", "%, acquired by Oracle", etc.

Figure 1: Instances extracted and their weights for the query "% is a car manufacturer"

Using only syntactic matching for an extraction task often retrieves results that a domain expert would consider incorrect. DeWild uses a new algorithm for ranking extracted data. The ranking algorithm, which exploits the mutually reinforcing relationship between extracted data and rewritings, has been shown to outperform a comparable system [1] that uses Mutual Information for ranking. In particular, DeWild achieves higher precision at almost all recall levels for the same extraction tasks. DeWild has also been shown to perform well on extracting answers for questions in the Question Answering Track of TREC [3].

In this demo, we will show how extraction tasks can be expressed using DeWild's simple and declarative user interface (dewild.cs.ualberta.ca) and that the results are returned in seconds. A user can click on each extracted row and examine the context where the row appears. The query rewriting component can be customized in order to incorporate expert knowledge into the system.

2. REFERENCES

[1] O. Etzioni et al. Web-scale information extraction in KnowItAll: (preliminary results). In Proc. of the WWW Conf., pages 100-110, New York, 2004.
[2] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proc. of the SIGIR Conf., pages 206-214, 1998.
[3] E. M. Voorhees. Overview of the TREC 2004 question answering track. In Text REtrieval Conf., 2004.

Copyright is held by the author/owner(s). SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.