Cheshire3: Retrieving from Tera-Scale Grid-Based Digital Libraries

Ray R. Larson
School of Information
University of California, Berkeley
Berkeley, California, USA, 94720-4600
ray@sims.berkeley.edu

Robert Sanderson
Department of Computer Science
University of Liverpool
Liverpool, L69 3BX, U.K.
azaroth@liverpool.ac.uk

Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Search and Retrieval--retrieval models, search process; H.3.7 [Information Systems]: Digital Libraries--systems issues

General Terms: Algorithms, Performance, Design

Keywords: Grid-Based Digital Libraries

1. EXTENDED ABSTRACT

Recent research in designing and developing digital library services has focused on approaches to indexing and searching in a steadily increasing range of genres and materials. An important aspect of this research is concerned with providing effective and scalable IR services for digital libraries as these diverse collections grow to sizes measured in terabytes and petabytes. The Cheshire project has had a central research focus on large-scale digital library collections for more than a decade, with a current focus on supporting distributed digital libraries in a Grid environment. At the same time we have been prototyping systems for very long-term digital preservation, and examining how grid-scale information retrieval systems can interoperate with petabytes of diverse data stored over many years.

In order for Information Retrieval (IR) in the evolving "Grid" parallel distributed computing environment [1] to work effectively, there must be a single flexible and extensible set of "Grid Services" with identifiable objects and a known API to handle the IR functions needed for Digital Libraries or other retrieval tasks. The Cheshire3 system builds on the work of the Cheshire project [4] over the past decade to define and implement an easy-to-use set of IR objects with precisely defined roles that can effectively provide a Grid Service for IR.

This demonstration will show how the Cheshire3 system is being applied in distributed information retrieval tasks in large-scale grid-based digital libraries. We will show how the system has been integrated with "Datagrid" services provided by distributed storage systems like the Storage Resource Broker (SRB) [3], and how this enables very large scale storage and retrieval systems with support for data preservation services using the "Multivalent Document" framework [2].

In this demo we will present the results of testing Grid-based parallel approaches in indexing and retrieval for a variety of information resources, ranging from small test collections like the TREC and INEX collections, to medium-scale metadata collections like Medline and a test version of the University of California Online Union Catalog, MELVYL (with 15 million and 9.5 million records respectively), up to large-scale collections like the US National Archives and Records Administration (NARA) Preservation Prototype. We will describe our approaches to indexing and retrieving from these collections and the architecture of the system that supports them, and provide live demonstrations of Grid-based retrieval over these collections.

We will also show how the system is managed, in both standalone and distributed processing, by a set of user-defined workflow descriptions (which are specified simply through XML configuration files). We will demonstrate how Cheshire3 databases are created and how each database is actually a logical collection of records that can be easily split across many nodes in a Grid environment or combined at a single location. We will show performance statistics for indexing and retrieval using standard test collections, and examine the performance issues in a distributed environment.

Copyright is held by the author/owner.
SIGIR '06, August 6-11, 2006, Seattle, Washington, USA.
ACM 1-59593-369-7/06/0008.

2. ACKNOWLEDGMENTS

Development of the Cheshire3 system was supported in part by the Joint Information Systems Committee (U.K.).

3. REFERENCES

[1] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Elsevier, Amsterdam, 2nd edition, 2004.
[2] T. A. Phelps and P. B. Watry. A no-compromises architecture for digital document preservation. In Research and Advanced Technology for Digital Libraries: 9th European Conference, ECDL 2005, Proceedings, pages 266-277, 2005.
[3] A. Rajasekar, M. Wan, R. Moore, W. Schroeder, G. Kremenek, A. Jagatheesan, C. Cowart, B. Zhu, S.-Y. Chen, and R. Olschanowsky. Storage resource broker - managing distributed data in a grid. Computer Society of India Journal, 33(4):42-54, 2003.
[4] R. Sanderson and R. R. Larson. Indexing and searching tera-scale grid-based digital libraries. In INFOSCALE 2006: First International Conference on Scalable Information Systems, 2006 (in press).