WWW 2007 / Poster Paper Topic: Systems Towards a Scalable Search and Query Engine for the Web Aidan Hogan, Andreas Har th, Jurgen Umbrich and Stefan Decker ¨ National University of Ireland, Galway Digital Enterprise Research Institute Galway, Ireland firstname.lastname@deri.org ABSTRACT Current search engines do not fully leverage semantically rich datasets, or sp ecialise in indexing just one domainsp ecific dataset. We present a search engine that uses the RDF data model to enable interactive query answering over richly structured and interlinked data collected from many disparate sources on the Web. Categories and Subject Descriptors H.3.0 [Information Storage and Retrieval]: General General Terms Algorithms, Design hence attention is given to tracking the provenance of data. Complementary data on entities is consolidated for more succinct results. The user interface is intended to remain close to current search engine interfaces; casual users p ose queries without a priori schema knowledge and explore datasets of any schema using uniform processes. Results navigation and browsing is offered through relationships from the current result entities to related entities. Presentation of results is ordered according to relevancy and imp ortance. Our initial use-case for the system is to provide search, query and browsing functionality over a dataset relevant to computer science research. 2. SYSTEM OVERVIEW In this section we will briefly describ e our data model and present the SWSE architecture shown in Figure 1. We use a graph-based data model established in the area of databases for the descriptions of entities. One commonly used framework for semantic graphs is RDF, which maps to triples (sub ject, predicate, ob ject). We deem entities to b e resources characterised by URIs and their description to consist of the triples containing that URI in the sub ject or ob ject p osition. Given that we deal with web data which may originate from a plethora of sources, we need to keep track of the provenance of information. Thus, we extend the classic RDF data model with the notion of context [1]; sometimes called named graphs. The context part of the quadruple (sub ject, predicate, ob ject, context) encodes the URL from hence the triple originated. Keywords entity-centric, semantic web, web search 1. INTRODUCTION Google and Yahoo! index vast amounts of documents on many topics, and rely on keyword user input to offer links to relevant documents. In recent years, there has b een a trend towards domain sp ecific, database-backed sites such as CiteSeer, DBLP, IMDB and Wikip edia, which supp ort browsing of information within the interface and offer structured search functionality. Instead of only links to relevant documents, these domain sp ecialised sites return answers to user queries. However, the sites are isolated and have a finite coverage of data. The search engine presented draws up on technologies from the research area of the Semantic Web to provide search and query functionality over data integrated from a large numb er of sources. We call our system Semantic Web Search Engine (SWSE). SWSE aims to present the user with the information they require within the interface. SWSE is an entity-centric application; results are structured descriptions of entities, including p eople, conferences, pap ers and documents, as opp osed to just HTML pages. SWSE integrates data on entities from multiple sources, This work has b een supp orted by Science Foundation Ireland (SFI/02/CE1/I131), EU pro jects DIP (FP6-507483) and TripCom (IST-4-0027324-STP). Copyright is held by the author/owner(s). WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. Ranking Query Proc. Object Consolidation W eb (RDF/ X M L/ HTML) Crawler Lucene Quad Store Results Query Index UI Pre - Runtime Runtime Figure 1: High-level architecture For our initial test-case of computer science research data, we require extraction of relevant data from the Web (e.g., Friend Of A Friend data) for which we use the crawling architecture MultiCrawler [2]. The crawler is able to syntactically transform data from a variety of sources (e.g., HTML, 1301 WWW 2007 / Poster Paper Topic: Systems Figure 2: SWSE prototype screenshot XML) into RDF, to allow for further integration. As the web of crawled data expands, the collected graph b ecomes more densely interlinked as URIs match up, e.g., the crawler fetches and extracts the title of a HTML page which is a homepage prop erty of a p erson in the graph. Thus, we observe b etter connectivity in our data graph versus crawling of the RDF web alone. Our computer science research dataset also includes the DBLP and CiteSeer knowledge bases. These required manual conversion to RDF via XSLT. We also created a research publications ontology which we used as the common target sp ecification, along with current sp ecifications such as FOAF descriptions for authors. The RDF data model allows for good data integration b oth on a schema and instance level. However in practice, on the instance level, there exists p oor agreement on common URIs for the same entities across data sources. Using object consolidation, we merge equivalent instances under one identifier. We detect equivalence by analysing shared values of inverse functional prop erties for multiple instances. For example, we can merge equivalent Person instances from the web-crawl data and the DBLP author data through having the same value for the homepage prop erty. In merging the instances, we improve in-degree and out-degree in the graph and move towards resolving the fragmentation of metadata on a particular entity over multiple instances. For indexing the data we use JARS2; a hybrid indexing structure to provide b oth information retrieval and database query functionality necessary for SWSE. The information retrieval functionality stems from Apache Lucene, an inverted on-disk keyword index which returns relevant entities given a keyword query. For graph based lookups, we employ an optimised custom built quad index structure. We are currently finalising a distributed architecture which uses distributed hash tables and multiple Lucene indexes; distribution and parallelisation provide scalability and sp eed. To score imp ortance of results, we use ReConRank[3]. ReConRank is a links based analysis for deriving ranks of resources and contexts which incorp orates tf-idf scores from Lucene with link weighting mechanisms. Thus, for example, pap ers with a high keyword query tf-idf match that are welllinked (e.g., cited) will app ear higher in the results listings. The web user interface offers an interaction model within the comfort zone of the current average user. A user b egins by sp ecifying a keyword query to narrow in on a sp ectrum of entities relevant to a topic. A list of summaries of the relevant entities are then displayed. The user can click on an entity in the list view and get all of the information available ab out that entity. Additionally, the user is offered guided exploration. Users may restrict the typ e of entities in the results from a list of p ossibles, e.g., Person, Publication. They may then optionally traverse the data-graph through the relationships b etween entities, i.e., the inlinks or outlinks of the current entity(s) such as knows inlinks, authorOf outlinks; selection is from a list of p ossibles. Such traversal is offered from one or many current results. The p ossible restrictions and navigations offered comprise the guided incremental complex query creation system which we coin the 'nodebrowsing compass'. Figure 2 shows a screenshot of the prototyp e system. At the top, the query trail is displayed and denotes that the user has issued the keyword query "rudy studer", restricted results to typ e Person and from these results has followed the knows relation outwards. Also shown in Figure 2 is the offer of p ossible outlinks to traverse through next. The user can click on any entity in the results pane, getting the entity's information and allowing the user to browse from that entity. 3. CONCLUSION We have given the overview of a system to acquire, integrate, index and provide advanced search over information from a large numb er of data sources on the Web. The user interface, which is supp orted by a scalable and distributable index structure, provides novel ways of exploring a graph-based dataset, including incremental query formulation. An early prototyp e of the system is available online at http://swse.deri.org/. 4. REFERENCES [1] A. Harth and S. Decker. Optimized Index Structures for Querying RDF from the Web. In Procs of 3rd Latin American Web Congress, 2005. [2] A. Harth, J. Umbrich, and S. Decker. MultiCrawler: A Pip elined Architecture for Crawling and Indexing Semantic Web Data. In 5th International Semantic Web Conference, 2006. [3] A. Hogan, A. Harth, and S. Decker. ReConRank: A Scalable Ranking Method for Semantic Web Data with Context. In 2nd Workshop on Scalable Semantic Web Know ledge Base Systems, 2006. 1302