WWW 2007 / Poster Paper    Topic: Search

Automatic Searching of Tables in Digital Libraries

Ying Liu, Kun Bai, Prasenjit Mitra, C. Lee Giles
The College of Information Sciences and Technology
Pennsylvania State University
University Park, PA 16802
{yliu, kbai, pmitra, giles}@ist.psu.edu

ABSTRACT
Tables are ubiquitous. Unfortunately, no search engine supports table search. In this paper, we propose a novel table-specific search engine, TableSeer, to facilitate table extraction, indexing, searching, and sharing. In addition, we propose an extensive set of medium-independent metadata to represent tables precisely. Given a query, TableSeer ranks the returned results using an innovative ranking algorithm, TableRank, with a tailored vector space model and a novel term weighting scheme. Experimental results show that TableSeer outperforms existing search engines on table search. In addition, incorporating multiple weighting factors can significantly improve the ranking results.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Search and Retrieval - search process

General Terms
Algorithms, Experimentation, Documentation, Performance, Design

Keywords
Table search, table crawler, table metadata, table extraction, table indexing, table ranking

Copyright is held by the author/owner(s). WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION
Tables appear everywhere, from web pages to scientific publications, from financial reports to newspapers. Scientists routinely use tables to display their latest experimental results or statistical data. With the explosive development of the Internet, tables have accumulated a huge amount of valuable information. However, current search engines do not support table search. When issuing a table search query, end-users receive a flood of unwanted and sometimes unsolicited results. Moreover, among the returned documents, the ranking order of the top n results does not precisely reflect their relevance to the query. Table search is challenging for three reasons: the inability of current search engines to recognize table contents, inappropriate ranking schemes, and the lack of a standard table representation scheme.

Our paper has three main contributions: a table search engine, TableSeer; an innovative table ranking algorithm, TableRank; and an extensive set of table metadata. Although table-related research has received considerable attention, most of it focuses on table extraction from a specific document medium. Although some researchers have tried to associate table extraction with question answering (QA) or information retrieval (IR) [5] [6], none of them provides a real table search engine. To the best of our knowledge, TableSeer is the first search engine for tables. Empirical results show that TableSeer achieves encouraging results.

The remainder of the paper is organized as follows. Section 2 presents the architecture of TableSeer and explains each component. Section 3 discusses the experiments and analyzes the results. Section 4 concludes.

2. THE ARCHITECTURE OF TABLESEER
Figure 1 shows how TableSeer handles table search queries. TableSeer crawls documents from the web, classifies them into two groups (documents with and without tables) and discards the latter, extracts the metadata [4][3] for each table, and ranks the tables in response to the user query with the TableRank algorithm.

Figure 1: TableRank in the TableSeer System

2.1 Table Crawler
TableSeer harvests online scientific documents by crawling open-access digital libraries and scientists' web pages. The crawler supports a number of document media, such as PDF, HTML, Word, and PowerPoint. In this paper, we focus on scientific documents in PDF format because PDF is gaining popularity in digital libraries and has been overlooked in the table extraction and information retrieval fields.

2.2 Table Metadata Extraction and Indexing
We design a universal table metadata representation scheme by classifying the table metadata into six mutually exclusive categories: 1) table environment/geography (document-level) metadata, 2) table-frame metadata, 3) table-affiliated metadata, 4) table-layout metadata, 5) table cell-content metadata, and 6) table-type metadata. For each identified table, a corresponding table metadata file is created. We design a page box-cutting method to detect and extract table metadata (see details in [4]). The table metadata indexer adopts the Lucene Index Toolbox¹ to index and rate the <query, table> pairs instead of the <query, document> pairs. To index a table, a "document" is created in which the table metadata fill the "fields".

2.3 Table Ranking
The TableSeer search engine adopts a novel table ranking algorithm, TableRank. TableRank tailors the classical vector space model [1] to calculate the relevance of each <query, table> pair. As shown in Table 1, each row represents the vector of a table tb_j or a query q. Together, the table vectors and query vectors form a vector matrix. Each table row is composed of k metadata, and each metadata is composed of a set of alphabetically ordered terms.
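The vector layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TableSeer implementation: the field names (`caption`, `cell_content`), the helper names, and the sample tables are hypothetical, and raw term counts stand in for the paper's term weights, which are not reproduced here. Rows are compared with the cosine measure that the classical vector space model uses.

```python
import math
from collections import Counter

# Hypothetical metadata fields; the paper defines six categories, two are used here.
FIELDS = ["caption", "cell_content"]


def terms_of(row):
    """Yield (field, term) pairs for one table or query row."""
    for field in FIELDS:
        for term in row.get(field, "").lower().split():
            yield (field, term)


def build_vocabulary(rows):
    """Return all (field, term) pairs, grouped by field and alphabetically ordered."""
    return sorted({pair for row in rows for pair in terms_of(row)})


def to_vector(row, vocab):
    """Map a row to a weight vector aligned with vocab.

    Assumption: raw counts are used as weights purely for illustration.
    """
    counts = Counter(terms_of(row))
    return [float(counts[pair]) for pair in vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_tables(tables, query):
    """Return (table name, score) pairs sorted by decreasing similarity to the query."""
    vocab = build_vocabulary(list(tables.values()) + [query])
    q_vec = to_vector(query, vocab)
    scored = [(name, cosine(to_vector(row, vocab), q_vec))
              for name, row in tables.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


tables = {
    "tb1": {"caption": "melting point of alloys", "cell_content": "alloy temperature"},
    "tb2": {"caption": "survey results", "cell_content": "age income"},
}
query = {"caption": "melting point", "cell_content": "alloy"}
ranking = rank_tables(tables, query)
```

Keeping the field name in each vocabulary key means that a term matching in one metadata field is a different dimension from the same term in another field, which mirrors the per-metadata column groups of Table 1.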
w_{i,j,k} is the term weight of the i-th term in the k-th metadata of table tb_j, and w_{i,q,k} is the term weight of the i-th query term in the k-th metadata. To determine w_{i,j,k}, we design a novel term weighting scheme: Table Term Frequency - Inverse Table Term Frequency (TTF-ITTF), a tailored TF-IDF [2] weighting scheme. Compared with TF-IDF, TTF-ITTF has two major advantages. First, it calculates the term frequency in the table metadata file instead of the whole document. Second, it calculates the weight of a term by considering three levels: the term, the table, and the document. The cosine measure is used to determine the similarity between the query vectors and the table vectors. Details of the ranking algorithm can be found in [3].

Table 1: The Vector Space for Tables and Queries

         m_1 (MW_1)                    ...   m_k (MW_k)                    tlb   dlb
         t_{1,1}     ...  t_{x,1}      ...   t_{1,k}     ...  t_{z,k}
  tb_1   w_{1,1,1}   ...  w_{x,1,1}    ...   w_{1,1,k}   ...  w_{z,1,k}    ...   ...
  tb_2   w_{1,2,1}   ...  w_{x,2,1}    ...   w_{1,2,k}   ...  w_{z,2,k}    ...   ...
  ...    ...         ...  ...          ...   ...         ...  ...          ...   ...
  tb_b   w_{1,b,1}   ...  w_{x,b,1}    ...   w_{1,b,k}   ...  w_{z,b,k}    ...   ...
  q      w_{1,q,1}   ...  ...          ...   w_{1,q,k}   ...  ...          ...   ...
  ittf   ...         ...  ...          ...   ...         ...  ...          ...   ...

2.4 Query Interface
TableSeer offers two levels of search: basic search and advanced search. Basic search accepts one or more simple keywords. Advanced search lets users compose more complex queries. To facilitate result browsing, TableSeer provides a user-friendly interface that presents the ranked results (see Figure 2). For each matched table, it lists the basic document information (e.g., the document title, the authors, and the affiliation), highlights the text in the document that references the table, and provides links to the original PDF document, the table metadata file, and a snapshot of the matched table.

Figure 2: An Example of the Query Results by Basic Search

3. EXPERIMENTAL RESULTS
The 10,000 crawled PDF documents come from three sources: scientific digital libraries (Royal Society of Chemistry), the web pages of research scientists in university chemistry departments, and the CiteSeer archive. We performed a five-user study to evaluate the performance of TableSeer. The evaluation metrics are precision and recall. The experiment on table detection is conducted on a document set of 200 randomly selected PDF documents. According to the testers, the precision and recall of table metadata extraction are both over 95%. To evaluate TableRank, we established a "gold standard" that defines the "correct" ranking based on human judgement, and we apply pairwise accuracy to evaluate the ranking quality. We also set up a common test bed with the manual "bottom-up" method and the custom search engine method. Experimental results show that TableSeer outperforms existing search engines on table search. In addition, incorporating multiple weighting factors can significantly improve the ranking results (see details in [3]).

4. CONCLUSIONS AND FUTURE WORK
In this paper, we present the TableSeer system, which is equipped with a novel table ranking algorithm, TableRank, to retrieve tables contained in the Web and digital libraries. There are several areas in which we still hope to make progress. First, we currently focus on scientific documents in PDF format; next, we will extend TableSeer to handle other kinds of Web documents. Second, although we present preliminary results showing the effect of the proposed impact factors, many parameter settings are based on empirical studies. In the future, more extensive experiments are needed to determine more suitable parameter settings.

5. REFERENCES
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.
[2] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), pages 513–523, 1988.
[3] Y. Liu, K. Bai, P. Mitra, and C. L. Giles. TableSeer: Automatic table metadata extraction and searching in digital libraries. Technical Report, 2006.
[4] Y. Liu, P. Mitra, C. L. Giles, and K. Bai. Automatic extraction of table metadata from digital documents. In JCDL, pages 339–340, 2006.
[5] P. Pyreddy and W. Croft. Tintin: A system for retrieval in text tables. In Proceedings of the Second International Conference on Digital Libraries, pages 193–200, 1997.
[6] J. Wang and J. Hu. A machine learning based approach for table detection on the web. In Proceedings of the 11th Int'l Conf. on World Wide Web (WWW'02), pages 242–250, Nov 2002.

¹ http://lucene.apache.org/java/docs/index.html
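The pairwise accuracy used to evaluate TableRank is not defined in this paper. A minimal sketch of one common definition follows: the fraction of item pairs that a system ranking orders the same way as the human-judged ranking. The function name and the sample rankings are hypothetical, and the exact variant used in the evaluation may differ.

```python
from itertools import combinations


def pairwise_accuracy(gold_ranking, system_ranking):
    """Fraction of item pairs ordered identically in both rankings.

    Assumption: both rankings are lists of unique table identifiers, best first.
    Items missing from the system ranking are ignored.
    """
    gold_pos = {item: i for i, item in enumerate(gold_ranking)}
    sys_pos = {item: i for i, item in enumerate(system_ranking)}
    shared = [item for item in gold_ranking if item in sys_pos]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    agree = sum(
        1 for a, b in pairs
        if (gold_pos[a] < gold_pos[b]) == (sys_pos[a] < sys_pos[b])
    )
    return agree / len(pairs)


gold = ["tb3", "tb1", "tb2", "tb4"]    # hypothetical human-judged order
system = ["tb3", "tb2", "tb1", "tb4"]  # hypothetical system output
score = pairwise_accuracy(gold, system)
```

In this example only the pair (tb1, tb2) is ordered differently, so five of the six pairs agree.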