Supporting End-Users in the Creation of Dependable Web Clips

Sandeep Lingam (University of Nebraska-Lincoln, Lincoln, USA) slingam@cse.unl.edu
Sebastian Elbaum (University of Nebraska-Lincoln, Lincoln, USA) elbaum@cse.unl.edu

ABSTRACT

Web authoring environments enable end-users to create applications that integrate information from other web sources. Users can create web sites that include built-in components to dynamically incorporate, for example, weather information, stock quotes, or the latest news from different web sources. Recent surveys conducted among end-users indicate an increasing interest in creating such applications. Unfortunately, web authoring environments do not provide support beyond a limited set of built-in components. This work addresses that limitation by providing end-user support for "clipping" information from a target web site and incorporating it into the end-user's site. The support consists of a mechanism that identifies the target clipping with multiple markers to increase robustness, and a dynamic assessment of the retrieved information to quantify its reliability. The clipping approach has been integrated as a feature into a popular web authoring tool, on which we present the results of two preliminary studies.

Categories and Subject Descriptors

D.2.6 [Software Engineering]: Interactive Environments; D.2.5 [Software Engineering]: Testing and Debugging; H.5.2 [Information Interfaces and Presentation]: User Interfaces

General Terms

Experimentation, Reliability, Human Factors

Keywords

Web authoring tools, Dependability, End-users

1. INTRODUCTION

Web authoring environments have enabled end-users who are non-programmers to design and quickly construct web pages. Today, end-users can create web pages of increasing sophistication through simple drag-and-drop operations, while the authoring environment keeps the underlying code infrastructure hidden.

Consider the example of Tom, the owner and sole employee of a travel agency, who arranges airline tickets, car rentals, hotel reservations and package tours for his local customers. Tom has designed his own site, and since he cannot afford to pay travel consolidators and specialized programmers, his daily routine consists of monitoring various airlines' and travel web sites for low-price alerts, checking the availability and prices of rental cars and hotel rooms for his customers, and maintaining updated information about package tours from various operators in his web site. To obtain that information, Tom has to go through the tedious process of navigating to the appropriate web page, filling in all the relevant details, extracting sections of desired information, and integrating them into his own site.

Several aspects of this process could be automated, freeing the end-user to perform other activities while reducing the chances of introducing errors in this repetitive task. At a high level, we envision Tom specifying what data to obtain from which web sites and how those pieces of data should be included in his web site; an automated process would then create Tom's site and the necessary components so that the target sites' content is automatically retrieved and integrated.

Tom's situation is a common one, shared by many end-users who wish to repeatedly retrieve, consolidate, and tailor content such as search results, weather, and financial data from other web sources. In a recent survey conducted by Rode and Rosson [21] to determine the types of web applications non-professional programmers would be interested in, 34% of the subjects surveyed expressed interest in applications that required facilities for custom collections of data generated by other web applications. Still, the process of finding, extracting, and integrating data from multiple web sites remains laborious.

Several commercial web-authoring environments such as Microsoft Frontpage [7] and Dreamweaver [6] enable end-users to create increasingly sophisticated web applications capable of integrating information from various web sources in the form of built-in components. These web-authoring tools, however, do not support integration of information beyond a limited set of components. End-users attempting to retrieve information from sites not covered by these built-in components receive no further support from web authoring tools.
Furthermore, deploying such functionality requires a level of technical expertise beyond that of an end-user like Tom. Our work addresses this limitation by assisting end-users in the creation of "web clips": components within the end-user's web site that dynamically extract information from other web sources. In the example described earlier, Tom's site would include several web clips, each extracting and incorporating information from a target web site, as shown in Figure 1.

Figure 1: Tom's travel site.

There are several challenges that must be addressed to provide adequate support for end-users like Tom. First, although we can assume that the end-user is familiar with web authoring tools, we cannot expect end-users to have any programming experience or a desire to learn programming. This implies that the support mechanisms for the creation of web clips must be transparently integrated within web-authoring tools, while the underlying support infrastructure and the code necessary to deploy the web clips must be automatically generated and hidden from the end-user.

Second, it is reasonable to expect that the target web sources will change in ways that go beyond content. For example, a web source may present data in a new format, in a different sequence, or not at all. A deployed web clip must, at a minimum, be able to detect such changes and alert the end-user about the degree to which they may impact the web site. Returning to Tom's scenario, if his web clip is not robust enough to detect changes in the structure of the source web site, it might end up retrieving "cruise deals" instead of "flight deals", resulting in inconvenience, delays, or even lost transactions because of incorrect information. At the same time, we would like web clips to be resilient to, for example, cosmetic changes, to reduce the number of times a web clip must be re-created. Hence, it is desirable to develop clips that not only accurately identify what to clip but are also robust in the presence of change.

To address these challenges, we have designed and prototyped an approach to support an end-user in creating a dependable web clip. The approach is unique in three fundamental aspects: 1) it maintains the web authoring tool interface metaphor, so that the end-user keeps operating as usual through the standard directives while facing no additional programming requirements; 2) it increases the robustness of the web clip through a training procedure in which the end-user is iteratively queried to infer multiple markers that can help identify the correct data in the presence of change; and 3) it deploys multiple filters to increase confidence in the correctness of the retrieved information, together with assessment code that provides a measure of correctness associated with the retrieved and integrated data. A minimal sketch of these last two ideas appears below.
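To make aspects 2) and 3) concrete, the following is a minimal, hypothetical sketch of how multiple inferred markers might be combined to locate clipped content and score the confidence of an extraction. It is our own illustration, not the prototype's implementation; the marker names, the voting scheme, and the sample page are all assumptions.

```python
# Illustrative sketch only (not the prototype's code): several hypothetical
# "markers" vote on where the clipped content is, and the fraction of markers
# that agree serves as a crude confidence measure for the extraction.
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Marker:
    name: str
    locate: Callable[[str], Optional[str]]  # returns matched fragment or None

def extract_with_confidence(page_html: str, markers: list[Marker]):
    votes: dict[str, list[str]] = {}
    for m in markers:
        fragment = m.locate(page_html)
        if fragment is not None:
            votes.setdefault(fragment, []).append(m.name)
    if not votes:
        return None, 0.0  # nothing matched: flag the clip as unreliable
    best, supporters = max(votes.items(), key=lambda kv: len(kv[1]))
    return best, len(supporters) / len(markers)

# Two assumed markers for a "Sample Airfares" clip: a literal anchor string
# and a regular expression; a real tool would infer such cues during training.
markers = [
    Marker("anchor-text",
           lambda h: h[h.find("Sample Airfares"):][:80]
                     if "Sample Airfares" in h else None),
    Marker("regex",
           lambda h: (lambda m: m.group(0) if m else None)(
               re.search(r"Sample Airfares.{0,65}", h, re.S))),
]
clip, confidence = extract_with_confidence(
    "<html>...Sample Airfares from $99...</html>", markers)
print(clip, confidence)  # fragment plus a 0.0-1.0 agreement score
```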
2. RELATED WORK

There have been numerous research efforts addressing automated extraction and customization of content from web sites, which we now summarize. Multiple toolkits and specialized languages enabling content extraction through manual and automated wrapper generation are surveyed by Kuhlins et al. [22] and Laender et al. [23]. On the user end, stand-alone languages such as Jedi [18] and W4F (World Wide Web Wrapper Factory) [31] provide sophisticated modules for generating wrappers that enable web content extraction. Furthermore, GUI-based toolkits such as Lapis [26] and Sphinx [25] enable end-users who lack programming skills to create wrappers through enhanced user interaction.

A different type of extraction mechanism is provided by web site companies in the form of web feeds and web services. Web feeds such as RSS [8] provide web content, or summaries of web content, together with links to the full versions. Web feeds originated with news and blog sites, but more web sites are now making content available through them. Information is usually delivered as XML content, organized in a pre-defined format, which can be collected by feed aggregators. Web services [9], on the other hand, enable the creation of mashups [20], which combine information from multiple web sites into a single site using the public interfaces and APIs provided by web site companies.

There are several instances of techniques and tools that operate as web-browser extensions and provide end-users with what are essentially web macro recorders of varying sophistication. Smart bookmarks are shortcuts to web content that would require multiple steps of navigation to be reached. Systems supporting smart bookmarks, such as WebVCR [11], are capable of recording the steps taken during navigation, which can later be replayed automatically to reach the target web page. Some, such as Webviews [16], also enable a level of content customization to, for example, fit smaller screens. Tools such as Greasemonkey [2], Chickenfoot [14] and Robofox [33] enable end-users to, for example, customize the look and content of a web page, associate actions with triggers in observed web sites, fill and submit forms, and filter content. Essentially, these tools provide programming-language capabilities at different levels of abstraction, each for a slightly distinct range of users, to manipulate and utilize web page content as perceived by the user through the browser.

Stand-alone research systems such as Hunter Gatherer [32], Internet Scrapbook [34] and Lixto [13] enable end-users to create "within-web-page collections" which can contain text, images, links and other active components such as forms. These tools let users select the desired content within a web page as it is visited. A slightly different approach is taken by web-integration frameworks such as InfoBeans [12], CCC [17], and Robosuite [4], which enable users to create configurable components that encapsulate information or specify an information-gathering process. Several such components can be integrated into a common interface to create an application in which information obtained from one component drives other components. One feature worth noting in these systems is example-based clipping guided by a training process, where the user demonstrates the selection while the system generates a unique characterization of the selected content. This approach partly inspired our training component. Characteristics such as document structure, XPATH and page fragmentation have been used by the systems discussed in [27, 15] to mark the clipped content. Alternate approaches use visual patterns within web pages [19] and tree-edit distances [30] (based on the tree structure of a web page) to identify the selected content. The component of our approach that identifies and marks the entities to be clipped combines the merits of the above approaches.

It is worth noting that although sufficient correctness and robustness have been cited as critical aspects of applications that integrate information from other web sources [29], the approaches just described have not directly addressed these issues. Furthermore, most of the approaches discussed above are either stand-alone applications, have been integrated into web browsers, or are constrained by the availability of a specific API or content. Our work addresses that particular niche: the dependability aspects of clips developed through web authoring environments that are intended to be published by end-users.

3. OUR APPROACH: WEB CLIPPER

The primary goal of Web Clipper is to enable an end-user to create a dependable web clip without imposing additional requirements on, or expecting higher levels of expertise from, the end-user. The following objectives were defined to achieve this goal:

1. To develop an intuitive, user-friendly selection process, integrated into the web-authoring environment, that enables an end-user to create a clipping from any given web site with ease.
2. To enable training of the web clip to better identify the clipped content within the structure of the target web site.
3. To provide a measure of the validity of the extracted information once the web clips are deployed.
4. To alert the user about structural changes in the "clipped" web site that may affect the reliability of the information collected.

Figure 2 gives an overview of the steps involved in web clip creation: clipping, training, deployment, filtering and assessment. Clipping enables the end-user to make a selection within the target web page and is followed by a training session which generates several valid extraction patterns capable of identifying the clipped content. The deployment stage results in the creation of several scripts which dynamically retrieve and filter content to create the target web clip. It is followed by an assessment of the validity of the information within the generated web clip, as sketched below.

Figure 2: Outline of the web clip creation process.
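As a reading aid, here is a skeletal sketch, under names and types of our own choosing, of how the five stages in Figure 2 might fit together. It outlines the workflow only and is not the prototype's actual code.

```python
# Sketch of the web clip pipeline stages (hypothetical names and types).
from dataclasses import dataclass, field

@dataclass
class ExtractionPattern:
    html_path: str            # an XPATH-like locator for the clipped node
    markers: list[str] = field(default_factory=list)  # extra cues from training

def clip(url: str, selection_path: str) -> ExtractionPattern:
    """Record the user's selection as an initial extraction pattern."""
    return ExtractionPattern(html_path=selection_path)

def train(pattern: ExtractionPattern, user_answers: list[str]) -> ExtractionPattern:
    """Refine the pattern with additional markers gathered by querying the user."""
    pattern.markers.extend(user_answers)
    return pattern

def deploy(pattern: ExtractionPattern) -> str:
    """Emit the retrieval script (here just a placeholder string)."""
    return f"fetch + extract via {pattern.html_path} with {len(pattern.markers)} markers"

def filter_content(raw_html: str, pattern: ExtractionPattern) -> str:
    """Keep only the fragment matching the pattern; drop everything else."""
    return raw_html  # real logic would apply html_path and markers here

def assess(fragment: str, pattern: ExtractionPattern) -> float:
    """Return a 0.0-1.0 estimate of how trustworthy the retrieved fragment is."""
    hits = sum(1 for m in pattern.markers if m in fragment)
    return hits / len(pattern.markers) if pattern.markers else 0.0
```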
3.1 Clipping

The clipping process enables the end-user to select data from a target web page. Consider the example of Tom discussed earlier: one of the web clips Tom could be interested in is the "sample airfares" section on travelocity.com. There are two basic requirements for the creation of this web clip, described in the following sections.

3.1.1 Target Clip Selection

Since we have prototyped our approach in Microsoft Frontpage, we will use this tool's context to explain our approach. Clipping is available in the tool in the same manner as several other HTML controls, so the incorporation of the target web clip into the user's web page is provided through a simple drag-and-drop operation or by a menu selection. The web clipper control has a custom browser associated with it, which enables the user to navigate to the desired web page and clip content from it, as shown in Figure 3. The selection operation is supported by visual feedback in the form of a border around the HTML element over which the user is hovering. As the user moves the mouse, every extractable document element is highlighted, and the user can make a selection by clicking on it. Options to refine a selection, such as zoom in and zoom out, are available through a context menu. These operations are based on the DOM [1] structure of the HTML document; for example, a zoom-out operation selects the first visible HTML element that encloses the selected element. Move up and move down operations are also available to enable a more detailed DOM traversal.

Figure 3: Clipping process.

3.1.2 Extraction Pattern

Once a selection is made, an extraction pattern is generated. An extraction pattern constitutes a unique characterization of the user's selection that can later be used to dynamically extract it from the target web page. During the clipping process, the user's selection is uniquely identified by its HTML-Path. An HTML-Path is a specialized XPATH [10] expression which uniquely identifies the HTML-document portion corresponding to a DOM tree node. For example, the HTML-Path corresponding to the "sample airfares" section on travelocity.com shown in Figure 3 enables the identification and extraction of that section within the web page. The URL of the clipped web page and the HTML-Path of the clipped section are extracted during the clipping process and later embedded (among other components) in the user's web page for dynamic extraction of content after deployment.
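To illustrate what extraction via an HTML-Path-like locator could look like, here is a small self-contained example using Python's lxml library. The XPath expression and page fragment are invented stand-ins rather than travelocity.com's actual structure, and the getparent() call mirrors the zoom-out refinement described in Section 3.1.1.

```python
# Sketch: locating a clipped section via an XPATH-like locator using lxml.
# The HTML below is a made-up stand-in for a "sample airfares" section.
from lxml import html

page = html.fromstring("""
<html><body>
  <div id="content">
    <table class="airfares">
      <tr><th>Route</th><th>Fare</th></tr>
      <tr><td>OMA-DEN</td><td>$99</td></tr>
    </table>
  </div>
</body></html>
""")

# An HTML-Path-style locator: an XPath expression pinpointing the clipped node.
html_path = "/html/body/div[@id='content']/table[@class='airfares']"
matches = page.xpath(html_path)

if matches:
    clipped = matches[0]
    print(html.tostring(clipped, pretty_print=True).decode())
    # The "zoom out" refinement corresponds to moving to the enclosing element:
    parent = clipped.getparent()   # here, the div with id="content"
    print(parent.tag, parent.get("id"))
else:
    print("HTML-Path no longer matches; the page structure may have changed.")
```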
3.2 Training

Although an extraction pattern based on an HTML-Path is capable of uniquely identifying a web page element, it is not very robust: even a small change in the structure of the web page might affect the content retrieved by a web clip that relies on the HTML-Path alone. For instance, in the example described in the previous section, if a