WWW 2007 / Poster Paper, Topic: Search

Electoral Search Using the VerkiezingsKijker: An Experience Report

Valentin Jijkoun, Maarten Marx, Maarten de Rijke, Frank van Waveren
ISLA, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
jijkoun,marx,mdr,fwaveren@science.uva.nl

ABSTRACT
The Netherlands had parliamentary elections on November 22, 2006. We built a system which helped voters to make an informed choice among the many participating parties. One of the most important pieces of information in the Dutch election and subsequent coalition government formation is the party program, a text document with an average length of 45 pages. Our system provides the voter with focused access to party programs, enabling her to make a topic-wise comparison of parties' viewpoints. We complemented this type of access ("What do the parties promise?") with access to news ("What happens around these topics?") and blogs ("What do people say about them?"). We describe the system, including design, technical details, and user statistics.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous; D.2 [Software]: Software Engineering

General Terms
Design, Experimentation

Keywords
Elections, democracy, domain specific search

Copyright is held by the author/owner(s). WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION
We describe the VerkiezingsKijker ("election watcher"), an electoral search engine aimed at helping the general public in its electoral decision making. VerkiezingsKijker is available at www.verkiezingskijker.nl/2006. Based on interest from real users, on user feedback and media coverage, we believe that this application of search and language technology is one of wide interest. We motivate the choices made in our design, describe the technical challenges and our solutions, and argue that our findings are applicable to the more general problem of making decisions when faced with competing solutions (offered as poorly structured textual information).

2. REQUIREMENTS AND DESIGN
The most recent Dutch parliamentary elections were held on November 22, 2006. Members of the Dutch parliament (150 seats) are chosen according to the principle of proportional representation. In 2006, 65,591 votes were needed for a single seat (http://www.kiesraad.nl). This system leads to a proliferation of political parties; in general, some twenty parties participate in the national elections (in 2006: 24 parties), each with its own party program (in 2006: with an average length of 45 pages).

Asked to set up a search engine for party programs by the Instituut voor Publiek en Politiek (IPP, http://www.publiek-politiek.nl/english), a public-private non-profit organization aimed at bringing politics and the general public closer together, we identified three groups of requirements. User's requirements included paragraph-based access to party programs, providing both thematic search (with themes based on previous elections and current issues) and free-text search; facilities to compare parties' viewpoints on topics; integration with additional sources of information (news and blogs), and ways of identifying important events and trends in the latter sources. Developer's requirements concerned the gathering of domain knowledge (specifically, the themes for the thematic search facility) and the data preprocessing effort (with the party programs becoming available at a late stage). System requirements boiled down to the use of open source, off-the-shelf technology, the provision of a simple API to the search engine, and robustness. We decided on the design given in Figure 1.

[Figure 1: Architecture of the system. Components shown: News RSS, News scraper, Elections filter, News; Blogs RSS, Blogs; Raw party programs, Paragraph splitter, Programs; IR engine, Trend analyzer, User interface.]

3. IMPLEMENTATION
We describe the implementation of the VerkiezingsKijker search engine in two steps: components and overall.

Preprocessing and Indexing Components. We processed three types of data (party programs, news feeds and blog feeds) and had to generate expansion terms to enable thematic search. Sixteen (of the 24 participating) parties made available their programs, 2 in HTML, and the rest in Word or PDF. Programs were automatically split into paragraphs based on layout, yielding a corpus of 4618 paragraphs. On average, half an hour of extra manual work per program was needed.

We decided to implement thematic search as regular free text search, but with queries consisting of a theme (e.g., education) and a number of additional terms. For each of the 179 themes proposed by IPP (our customer) we asked a domain expert to (use the search engine to) identify at least 5 relevant paragraphs. For each theme we collected the top-15 overused terms as characteristic for the topic. Overusage was determined using the log-likelihood statistical test [2], comparing the paragraphs marked relevant with the set of all paragraphs in the index. Terms likely to lead to topic drift were manually removed.

The other data sources used by the VerkiezingsKijker (news and blogs) needed frequent and repeated harvesting, extraction and indexing. As we were dealing with national elections, we restricted ourselves to feeds of nation-wide daily newspapers, and included eight such newspapers, covering the entire political spectrum. For these we obtained the HTML articles, extracted the text content from the HTML, classified the contents into election-related vs non-election-related, and indexed it. For extraction, a robust, unsupervised method based on block length was used [4]. For classification, we used a Naive Bayes classifier, which helped us increase the proportion of election-related articles from around 20% (prior to classification) to well over 90%.

As our source of (Dutch) blogs we used http://web-log.nl, one of the largest Dutch weblog hosts. At the time of the elections, 43,984 blogs were hosted with an average of 4,179 postings per day. Within the measured time-period there were 7,768 active bloggers (having at least one post a week). We did not perform election-related filtering on blog posts. Because we obtained clean data from the blog host, no additional cleanup was needed. Similar to news items, blog posts were indexed for retrieval and stored in a database, along with meta-data (blogger, URL, publication date and time, etc.).

Putting Things Together. VerkiezingsKijker allows users to search the three sources (party programs, news, and blogs), either by theme or free text. In addition, the system provides trend functionality for news and blogs: a visualization of the volume of news items or blog posts relevant to a theme or query, peak detection and explanation. For search on programs, VerkiezingsKijker responds with a list of paragraphs ordered by party (all, or a selection) or relevance. For news and blogs, results can be ordered by relevance or publication date.

The system is implemented using Lucene [3] for retrieval in programs, news and blogs, and a MySQL database for data storage. As to trends, the system displays counts of news items or blog posts relevant to topics, identifies peaks (comparing actual counts against expected counts based on earlier observations), and provides explanations of unusual peaks in blog post counts on a topic by generating links from blog posts in peak periods to related news items, using the method described in [1].

4. RESULTS
VerkiezingsKijker went online on October 23, 2006 (about a month before the elections). Here are some statistics for the period of five weeks between October 23, 2006 and November 30, 2006:

· 109,954: the number of unique IP hosts accessing the system; 20,624 unique hosts (19% of the total) accessed the system on the day of the elections;
· 76,360: the number of unique IP hosts that used the search facilities of the system;
· 148,026: the total number of searches made in the system, in particular:
    - 117,132: the number of free text searches;
    - 28,025: the number of thematic searches;
    - 2,788: the number of free text trend requests;
    - 81: the number of thematic trend requests;
· 6,014: the number of distinct free text queries;
· 175: the number of distinct thematic queries (out of 179 available themes).

The distribution of the actual frequencies of search queries follows a power law and, moreover, the 40 most frequent free text queries (1% of all distinct queries) account for 80% of all free text searches in the system. Table 1 lists frequencies of the most popular free text queries (targeting party programs).

Table 1: Ten most popular free-text search queries.

Query                         English                     #
kinderbijslag                 child allowance             5314
minister-president gekozen    elected prime-minister      5252
kinderopvang                  kindergarten                4464
Turkije                       Turkey                      3969
ontslagrecht                  law governing dismissal     3284
bijstand                      social security             3123
meningsuiting                 freedom of speech           3069
dieren                        animals                     2754
nationaliteit                 nationality                 2664

5. CONCLUSION
The main contribution of the poster is best summarized as a recipe describing how to use off-the-shelf technology to quickly build a web accessible search engine for cases which resemble our scenario: i.e., to create support for users that need to make an informed choice among several competitors which drown the choice-maker in textual, mostly unstructured, information, with multiple perspectives.

This research was supported by the Netherlands Organization for Scientific Research (NWO) under project numbers 017.001.190, 220-80-001, 264-70-050, 354-20-005, 600.065.120, 612-13-001, 612.000.106, 612.066.302, 612.069.006, 640.001.501, 640.002.501, and by the E.U. IST programme of the 6th FP for RTD under project MultiMATCH contract IST-033104.

6. REFERENCES
[1] K. Balog, G. Mishne, and M. de Rijke. Why are they excited? In Proceedings EACL 2006, April 2006.
[2] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Comput. Ling., 19(1):61-74, 1993.
[3] Lucene. The Lucene search engine. http://lucene.apache.org/.
[4] F. van Waveren. Extracting and classifying election-related news items from the world wide web. Master's thesis, University of Amsterdam, 2006.
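The selection of overused terms for thematic search (Section 3) can be illustrated with a minimal sketch. It assumes lowercased whitespace tokenization and uses the two-corpus form of the log-likelihood statistic that is common in keyword extraction; the function names are hypothetical, and the exact formulation used in the system may differ.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 score for one term.

    a, b: occurrences of the term in the focus and the reference text.
    c, d: total number of tokens in the focus and the reference text.
    """
    # Expected counts if the term were equally frequent in both texts.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


def overused_terms(relevant, everything, k=15):
    """Return up to k terms overused in `relevant` compared with `everything`."""
    # Assumption: lowercased whitespace tokenization, no stemming or stopwords.
    focus = Counter(t for p in relevant for t in p.lower().split())
    ref = Counter(t for p in everything for t in p.lower().split())
    c, d = sum(focus.values()), sum(ref.values())
    scored = []
    for term, a in focus.items():
        b = ref[term]
        # Keep only terms that are relatively more frequent in the focus text.
        if a / c > b / d:
            scored.append((log_likelihood(a, b, c, d), term))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]
```

Here `relevant` would hold the paragraphs a domain expert marked for a theme and `everything` all indexed paragraphs. The returned terms, after manual removal of those likely to cause topic drift, would be added to the theme name to form the thematic query.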
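The election filter for news articles (Section 3) is described only as a Naive Bayes classifier. The following is a minimal sketch of one such classifier, assuming a multinomial bag-of-words model with add-one smoothing; the class name, the labels and the feature choice are illustrative assumptions, not details taken from the system.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens (assumed setup)."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per label
        self.term_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            denom = sum(self.term_counts[label].values()) + len(self.vocab)
            for tok in text.lower().split():
                score += math.log((self.term_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In a pipeline like the one described, each extracted article text would be passed to `classify` and only articles labeled as election-related would be indexed.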
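Peak identification in the trend analyzer (Section 3) compares actual counts against expected counts based on earlier observations. The sketch below shows one simple way to do this, assuming daily counts, a trailing-window mean as the expectation and a standard-deviation threshold; the window size, the threshold and the function name are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def find_peaks(counts, window=7, threshold=2.0):
    """Return indices of days whose count far exceeds the recent average."""
    peaks = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        expected = mean(history)
        # Floor the spread at 1 so a flat history does not flag tiny changes.
        spread = max(pstdev(history), 1.0)
        if counts[i] > expected + threshold * spread:
            peaks.append(i)
    return peaks
```

For a series of daily blog post counts on a topic, the returned indices mark the days that would be candidates for explanation by linking to related news items.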