WWW 2007 / Track: Search Session: Web Graphs

Web Projections: Learning from Contextual Subgraphs of the Web

Jure Leskovec
Carnegie Mellon University, Pittsburgh, PA, USA
jure@cs.cmu.edu

Susan Dumais
Microsoft Research, Redmond, WA, USA
sdumais@microsoft.com

Eric Horvitz
Microsoft Research, Redmond, WA, USA
horvitz@microsoft.com

ABSTRACT
Graphical relationships among web pages have been leveraged as sources of information in methods for ranking search results. To date, specific graphical properties have been used in these analyses. We introduce web projections, a methodology that generalizes prior efforts on exploiting graphical relationships of the web in several ways. With the approach, we create subgraphs by projecting sets of pages and domains onto the larger web graph, and then use machine learning to construct predictive models that consider graphical properties as evidence. We describe the method and present experiments that illustrate the construction of predictive models of search result quality and user query reformulation.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Experimentation.
Keywords: web graph, web search, web projection, contextual subgraph, query reformulation

Figure 1: Web projection methodology. Given a query and respective search results, a query projection graph is generated and graph-theoretic features are then used for building predictive models. [Panels: (a) query; (b) results; projection on the web graph; (c) query projection graph; (d) query connection graph; (e) generate graphical features, construct case library, predictions.]

1. INTRODUCTION
Information retrieval methods have traditionally considered documents as independent.
A key insight coming to the fore with the pursuit of effective web search is that inferences about relevance can be enhanced by considering the hyperlink relationships among documents [13, 21]. We present a methodology we refer to as web projections that centers on the creation and use of graphical properties of subgraphs of the web. With the approach, we project a set of web pages of interest, such as the results generated by a search engine for a query, onto the larger web graph to extract a subgraph, which we call the web projection graph. We then identify and exploit graph-centric properties of this subgraph for a variety of search-related tasks. The method can be viewed as a general approach of using context-sensitive collections of web pages to define and focus attention on relevant subsets of the web graph, and then using graph-theoretic features within this subgraph as input to statistical models that can provide predictions about content, relevance, and user behavior.

Research performed during an internship at Microsoft Research.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

We highlight in this paper the use of the subgraphs for analyzing search result quality and for predicting user behavior in reformulating queries. Specifically, we investigate the following questions:
- How do query search results project onto the underlying web graph?
- What can we say about the quality of search results, given the properties of their projection on the web graph?
- Can we predict the difficulty of a query given its projection graph?
- Can we predict users' behaviors in issuing and reformulating queries given the query projection graph?
- How do query reformulations reflect on the projection graphs?
The rest of the paper is organized as follows. In Section 2, we introduce web projections and explain the methodology and the attributes used to model query projection graphs. In Section 3, we describe the data used in our studies. We then describe applications of our approach to predict the quality of sets of search results (Section 4), and to model user behavior when reformulating queries (Section 5). In Section 6, we compare our work to prior research. Finally, we summarize and conclude in Section 7.

Figure 2: Query projection graph and query connection graph for the top 20 results of the query Yahoo search engine projected on the URL graph. [Panels: (a) projection graph; (b) connection graph.]

Figure 3: Query projection graph and query connection graph for the top 20 results of the query Subaru projected on the domain graph. [Panels: (a) projection graph; (b) connection graph.] Notice that the projection on the domain graph is denser than the projection on the URL graph (Figure 2).

2. QUERY PROJECTIONS
Figure 3 shows the projection of results for the query Subaru onto the domain graph rather than the URL graph. For both projections, the most relevant nodes (colored dark) appear in central locations in the graph and are pointed at by other search results. In contrast, the least relevant nodes (colored bright) are usually not as well connected, often requiring connection nodes to join them to the subgraph. We begin by describing the main steps in building web projections and then provide formal definitions of the key components. Figure 1 shows the basic steps of applying the method to analyze search results.
We start with a query and a set of results for the query, generated by some procedure, typically via the use of a preexisting search engine (a). We project the search results onto the web graph (b) by finding the search results (square nodes) in the larger web graph and then inducing a subgraph based on these identified nodes (c). We name the induced subgraph the query projection graph. Given the typical distances among search results, the query projection graph often contains disconnected components. We connect the nodes of the query projection graph to create a query connection graph (d). The disconnected components of the query projection graph are connected by identifying web pages that join the components via shortest paths. The connection nodes that are introduced during the connecting of the projection graph (circular nodes in Fig. 1) are not drawn from the search result set. Given the query projection graph and query connection graph, we generate a set of evidential features describing the topology of the graphs for use in the creation of predictive models via machine learning (e). We provide details about sample topological features in Section 2.2. Finally, we build a case library from a consideration of the topological properties of multiple queries with different outcomes (e.g., high-quality versus low-quality sets of results), and use the case library of graph-theoretic relationships and outcomes to train models that can make predictions about the outcomes. We shall focus in this paper on tasks that harness graphical properties of web projections generated from sets of results. Two such tasks are the construction of statistical models for predicting the quality of a set of search results and modeling user behavior in reformulating queries. As we shall discuss later, there are also opportunities to use the web projection approach to assist with the ranking of individual search results.
In such applications, properties and relationships of single results to the subset of pages in the query projection are of interest.

Figure 2 shows an example of a query projection graph and query connection graph for the query Yahoo search engine. Square nodes represent web pages and directed edges represent hyperlinks. Circular nodes represent connection nodes. The number in each square represents the rank of the search result in the list returned by a search engine. The color (monochromatic shade) of each node indicates a human-evaluated relevance score of the result to the query. The most relevant results are colored dark (red), the next most relevant orange, then yellow, green, blue, and purple.

2.1 Query projection and connection graphs
We now present formal definitions of the query projection graph and query connection graph. Consider a directed web graph G(N, E) with node set N and directed edge set E, and a given set of search results S. First, we project the results on the web graph G to obtain a set of projection nodes Np, where Np = S ∩ N. Note that, ideally, we would like Np = S, but since the coverage of the web graph may not be complete, some search results may not be found in the graph. Thus, in general, Np ⊆ S. We define:
- The query projection graph is the subgraph Gp(Np, Ep) of G induced on the nodes Np, i.e., the edge set is Ep = {(u, v) ∈ E : u ∈ Np ∧ v ∈ Np}.
- The query connection graph is the subgraph Gc(Nc, Ec) of G induced on the nodes Nc = Np ∪ C, i.e., the edge set is Ec = {(u, v) ∈ E : u ∈ Nc ∧ v ∈ Nc}. The set C is a set of connection nodes, i.e., a minimal set of nodes that makes the graph Gp connected.

Note that finding the minimal set of connection nodes C is NP-hard, since the problem of finding a Steiner tree [11] reduces to it. In our experiments, we used a heuristic policy to find the set C, i.e., to connect the components of Gp. We found that the heuristic policy was reliable and performed well on the datasets that we considered.
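Under the definitions above, inducing the query projection graph from an adjacency-list web graph takes only a few lines. The sketch below is a toy illustration, not the authors' implementation; the graph and result set are hypothetical:

```python
def project(graph, results):
    """Induce the query projection graph Gp on Np = S ∩ N.

    graph: dict mapping each node to its set of out-neighbors (directed web graph G)
    results: iterable of search-result nodes S
    Returns (Np, Ep): the projection nodes and the induced directed edge set.
    """
    np_nodes = set(results) & set(graph)                    # Np = S ∩ N
    ep = {(u, v) for u in np_nodes
          for v in graph[u] if v in np_nodes}               # both endpoints in Np
    return np_nodes, ep

# Toy web graph; result "x" was not crawled, so it drops out of Np (Np ⊆ S).
G = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
Np, Ep = project(G, {"a", "b", "x"})
```

Note how the missing result "x" illustrates why, in general, Np ⊆ S rather than Np = S.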
The policy is as follows: Let Di denote the node sets of the connected components of Gp, ordered by decreasing size (|Di| ≥ |Di+1|). We connect each component via a shortest path to the largest component and continue until all components are connected. More precisely, we start with D2 and connect it via a shortest path on nodes C2 to D1. This creates a new largest component D12 = D1 ∪ C2 ∪ D2. We then proceed and connect D3 to D12 via a shortest path C3, creating D123 = D12 ∪ C3 ∪ D3, and so on until all components are connected. The set of connection nodes C is then C = ∪i Ci. We define a shortest path between sets of nodes U and V as the shortest undirected path over all pairs of nodes (u, v), u ∈ U, v ∈ V.

GF-PROJ: query projection graph (Gp) features (12)
  GpNodes      number of nodes in Gp
  GpEdges      number of edges in Gp
  GpComponents number of connected components
  GpGccNodes   nodes in the largest component
  GpGccEdges   edges in the largest component
  GpMxDeg      maximal node degree
  GpDeg0Nodes  number of isolated nodes
  GpDeg1Nodes  number of degree-1 nodes
  GpTriads     number of triangles in Gp
  GpDensity    density of Gp (|Ep|/(|Np|(|Np| - 1)))
  GpGccSize    size of the largest component (|D1|/|Np|)
  GpClustering clustering coefficient of Gp

GF-CONN: query connection graph (Gc) features (16)
  GcNodes      number of nodes in Gc
  GcEdges      number of edges in Gc
  GcCNodes     number of connector nodes C
  GcCEdges     number of edges incident to C
  GcMxCnDeg    maximal connector node (C) degree
  GcMxCnOutDeg maximal connector node (C) out-degree
  GcMxPnDeg    maximal projection node (Np) degree in Gc
  GcAvgPnPath  mean path length of Np nodes in Gc
  GcMxPnPath   max path length of Np nodes in Gc
  GcAvgPath    mean path length of Nc nodes in Gc
  GcMxPath     max path length of Nc nodes in Gc
  GcTriads     number of triangles in Gc
  GcDensity    density of Gc (|Ec|/(|Nc|(|Nc| - 1)))
  GcClustering clustering coefficient of Gc

GF-COMB: combined features (17 features)
  DomsToUrls   ratio of domains to URLs in the result set
  Coverage     coverage of the projection (|Np|/|S|)
  GpGcNodes    node ratio (|Np|/|Nc|)
  GpGcEdges    edge ratio (|Ep|/|Ec|)
  GpGcAvgPath  path ratio (GcAvgPnPath/GcAvgPath)
  GpGcMxPath   path ratio (GcMxPnPath/GcMxPath)

F-QUERY: query features (10 features)
  QueryChLen   number of characters in the query
  QueryWrdLen  number of query words
  QuerySrcRes  number of search results
  QueryNDoms   number of domains in the result set
  QueryNUrl    number of URLs in the result set
  QueryNRated  number of results with a human rating

Table 1: Sample features used to represent the query projection and connection graphs, and the query.

These features measure various aspects of the connectivity of Gp. Similarly, the query connection graph features (GF-CONN, 16 features) are obtained from Gc and aim to capture the relations between the projection nodes Np in the context of the connection nodes C. We also consider combination features (GF-COMB, 17 features), defined as compositions of features from the other groups. They largely include various ratios and normalizations of the more atomic features contained in the other categories. Last, query features (F-QUERY, 10 features) represent non-graphical properties of the result set, calculated from the text of the query and the list of returned search results, including the number of results and domains in the result set.

3. GENERATING CASE LIBRARIES
We now present details on constructing libraries of cases consisting of sets of topological properties that characterize the projections of queries onto the web graph. We used two different representations of the web as a graph. For the study of relevance, we constructed projections from nearly 30 thousand queries, each with a corresponding set of search results. Most of the search results were labeled with a human-assigned relevancy score.
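The greedy connection policy can be sketched as follows. This is a simplified reimplementation under the stated definitions; the BFS-based multi-source shortest path and the toy graph are illustrative, not the paper's code:

```python
from collections import deque

def connect_components(graph, np_nodes):
    """Greedy heuristic: repeatedly attach the next-largest component of Gp
    to the growing largest component via a shortest undirected path through
    the full web graph. Returns the connector node set C."""
    # Undirected view of the web graph, used for path finding.
    und = {}
    for u, nbrs in graph.items():
        und.setdefault(u, set()).update(nbrs)
        for v in nbrs:
            und.setdefault(v, set()).add(u)

    # Connected components of Gp (undirected, restricted to Np).
    comps, seen = [], set()
    for s in np_nodes:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(v for v in und.get(u, ()) if v in np_nodes and v not in comp)
        seen |= comp
        comps.append(comp)
    comps.sort(key=len, reverse=True)       # D1, D2, ... by decreasing size

    def shortest_path(src_set, dst_set):
        # Multi-source BFS from every node of src_set to any node of dst_set.
        q = deque((u, (u,)) for u in src_set)
        visited = set(src_set)
        while q:
            u, path = q.popleft()
            if u in dst_set:
                return path
            for v in und.get(u, ()):
                if v not in visited:
                    visited.add(v)
                    q.append((v, path + (v,)))
        return ()

    connectors = set()
    largest = set(comps[0]) if comps else set()
    for comp in comps[1:]:
        path = shortest_path(largest, comp)
        connectors |= set(path) - np_nodes  # C_i: intermediate nodes only
        largest |= comp | set(path)
    return connectors

# Toy example: components {a, b} and {c} of Gp join through connector "x".
G = {"a": {"b"}, "b": set(), "c": set(), "x": {"a", "c"}}
C = connect_components(G, {"a", "b", "c"})
```

Because paths are taken as undirected, the edges x→a and x→c suffice to join the two components here, yielding C = {"x"}.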
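Several of the GF-PROJ features in Table 1 can be computed directly from the node and edge sets of Gp. The sketch below covers an illustrative subset; computing degrees and components on the undirected view of Gp is our assumption, as the paper does not spell this out:

```python
def gf_proj_features(np_nodes, ep):
    """Compute an illustrative subset of the GF-PROJ features of Table 1.

    np_nodes: projection nodes Np; ep: directed edge set Ep of Gp.
    Degrees and components use the undirected view of Gp (our assumption).
    """
    n = len(np_nodes)
    adj = {u: set() for u in np_nodes}      # undirected adjacency
    for u, v in ep:
        adj[u].add(v)
        adj[v].add(u)
    # Connected components via DFS.
    comp_sizes, seen = [], set()
    for s in np_nodes:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comp_sizes.append(len(comp))
    return {
        "GpNodes": n,
        "GpEdges": len(ep),
        "GpComponents": len(comp_sizes),
        "GpMxDeg": max((len(nb) for nb in adj.values()), default=0),
        "GpDeg0Nodes": sum(1 for nb in adj.values() if not nb),
        "GpDensity": len(ep) / (n * (n - 1)) if n > 1 else 0.0,
        "GpGccSize": max(comp_sizes, default=0) / n if n else 0.0,
    }

# Toy projection: edges a<->b and a->c; node "d" is isolated.
feats = gf_proj_features({"a", "b", "c", "d"},
                         {("a", "b"), ("b", "a"), ("a", "c")})
```

The remaining features of Table 1 (triangles, clustering coefficient, and the path-length statistics over Gc) follow the standard graph-theoretic definitions reviewed in [26].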
For our study of query reformulations, we employed a set of 42 million query-to-query transitions with corresponding lists of search results generated at each step in the search session. For all of the experiments, the search results were obtained from a state-of-the-art ranking algorithm which considers a large number of content features as well as some topological properties of the web as a whole.

3.1 Web as a graph
We now present the web graphs that provided the substrate for the query-focused projections. We use two variants of the web graph: a URL graph and a domain graph.

3.1.1 URL graph
URL graphs are the most commonly used representation of the web as a directed graph. Nodes represent web pages, and there is a directed edge from node u to node v if there is a hyperlink from web page u to web page v. We created our web graph based on a sample of 22 million web pages from a crawl of the web performed in March 2006. We used a sample of the web considered to be of high quality: we started crawling from a seed set of popular, high-quality web pages with good reputations. The graph contains 345 million edges and is well connected; the largest weakly connected component contains 21 million nodes, while the second largest has fewer than a thousand nodes. The largest strongly connected component is also large, containing 14 million nodes; the second largest has 7,000 nodes. The graph has a diameter of 8, and node degrees follow a power-law distribution. For some prediction tasks, we focused on subsets of these URLs, e.g., those for which we have human relevance judgments. When we project the URLs tagged with relevance judgments onto this URL graph, results may be missing. URLs may be absent for several reasons, including the limited nature of the sample of URLs that we worked with, changes in the pages that are returned and judged over time, and the volatile nature of dynamically generated pages.
For some tasks (e.g., predicting the top 20 versus bottom 20 result set, described in detail in Section 4.3), the difference in coverage alone can be a good predictor of class. As we wanted to focus more on the graphical properties than on coverage per se, we normalized the number of projected results in the graph for the different classes. We did this by first noting how many URLs from the top 20 results were in the projection graph, then considering as many results as needed from the bottom to get the same coverage.

2.2 From graph to evidential features
Given the query projection and connection graphs, we seek to extract a set of features that captures key properties of the topology of the graphs. In all, we considered 55 features to describe the characteristics of the projection and connection graphs, and of the query. Table 1 presents representative features drawn from the larger set of attributes that we used in our experiments. The majority of the features are graphical properties, including the number of nodes and edges, the number and size of connected subgraphs, etc. See [26] for a review of basic graph-theoretic concepts and detailed definitions of the features we use in this work. In one set of experiments, we also considered non-topological properties derived solely from the text of the query and results. We group the evidential features into four classes. Query projection graph features (GF-PROJ, 12 features) are calculated from the projection graph.

Figure 4: Query projection graph for the top 20 results of the query encyclopedia projected on the URL and domain graphs. [Panels: (a) URL graph; (b) domain graph.]

We obtained query-query transitions from the logs as described in [23].
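The coverage-matching normalization described above can be sketched as follows. This is a minimal illustration; the function name and toy inputs are ours, not the paper's:

```python
def matched_bottom_set(ranked, in_graph, k=20):
    """Pick a bottom result set whose projection coverage matches the top-k set.

    ranked: results ordered best to worst; in_graph: set of nodes present in
    the web graph; k: size of the top set. Walks upward from the worst result
    until as many covered results are collected as the top-k set has.
    """
    top = ranked[:k]
    target = sum(1 for r in top if r in in_graph)   # top-k coverage
    bottom, covered = [], 0
    for r in reversed(ranked[k:]):                  # from the worst result up
        if covered == target:
            break
        bottom.append(r)
        covered += r in in_graph
    return top, bottom

# Toy ranking of eight results; only some appear in the crawled graph.
top, bottom = matched_bottom_set(list("abcdefgh"), {"a", "b", "f", "h"}, k=2)
```

With two covered results in the top set, the bottom set grows until it also contains two covered results.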
For each query qi, we measured ni, the number of times the query was observed. For a pair of queries (qi, qj), we also measured the probability of reformulation or transition from query i to j, pij. If we let nij be the number of times that qi was followed by qj within a thirty-minute window, then pij = nij/ni is the probability of qi being followed by qj. Similarly, the probability pi of query qi participating in a transition is defined as pi = Σj nij/ni. We started with a set of 35 million queries and 80 million query transitions as defined above. For the analysis described below, we considered only queries and reformulations that appeared at least 10 times in our corpus. Our analyses used 48,458 queries and 120,914 query transitions. We then used the top 20 search results for each of the 48 thousand queries and projected them on the URL and the domain graphs.

3.1.2 Domain graph
In the domain graph, nodes represent domain names (e.g., cmu.edu or microsoft.com), and there is a directed edge from node u to node v if there are web pages inside domain u that contain a hyperlink to web pages inside domain v. It is important to note that nodes in a domain graph are not arbitrary domains; all sub-domains are collapsed into a second-level domain name. For example, web pages from the domains cs.cmu.edu, ml.cmu.edu, and lti.cs.cmu.edu are merged into a single node (domain name) cmu.edu. We considered a complete domain graph of the web from February 2006. The graph contains 39 million domain names and 720 million directed edges. The graph is densely connected, has a diameter of 4, and the largest component contains 99.9% of the nodes. Since this is a complete domain graph, we have no problems with projection coverage: the domain of every search result in our dataset can be found in this graph. Figure 4 shows the differences between projections on the URL and the domain graphs when projecting the top 20 results for the query encyclopedia.
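The transition statistics pij and pi can be computed directly from windowed session counts. A minimal sketch; the input dictionaries are illustrative:

```python
def transition_probs(counts, n):
    """Compute query-reformulation probabilities.

    counts: {(qi, qj): n_ij}, transitions of qi to qj within a
            thirty-minute window
    n:      {qi: n_i}, total number of times each query was observed
    Returns (p_ij, p_i) with p_ij = n_ij / n_i and p_i = sum_j n_ij / n_i.
    Queries with no observed transitions simply do not appear in p_i.
    """
    p_ij = {(qi, qj): nij / n[qi] for (qi, qj), nij in counts.items()}
    p_i = {}
    for (qi, _), nij in counts.items():
        p_i[qi] = p_i.get(qi, 0.0) + nij / n[qi]
    return p_ij, p_i

# Toy counts: "a" was seen 10 times, followed by "b" 3 times and "c" once.
p_ij, p_i = transition_probs({("a", "b"): 3, ("a", "c"): 1},
                             {"a": 10, "b": 5, "c": 2})
```

A frequency filter like the paper's (keeping only queries and reformulations seen at least 10 times) would simply be applied to `counts` and `n` before this step.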
Domain graph projections are usually denser and much better connected than URL graph projections. Domain graphs also have better coverage of the search results, with very few missing nodes.

4. QUALITY OF SEARCH RESULTS
For predicting the quality of search results, we asked the following questions: By analyzing the projection of a query onto the web graph, what can we tell about the quality of the returned result set? What can we tell about the difficulty of the query? More specifically, we explored the following two tasks:
1. Discriminate good (top 20) versus poor (bottom 20) search result sets.
2. Given a set of results, predict how good the set is, i.e., predict the highest human relevancy rating in the set.
We now describe the problem setting and experimental setup as well as the baseline methods.

4.1 Problem definition
We focus on the following general setting: We are given a query qi with a set of search results Si. Each query qi belongs to a class yi. A class is a categorical value that can be, as an example, the rating of the most relevant search result in the set, or an indicator of whether the result set Si is composed of the top-ranked or bottom-ranked search results. We start with Si and project it onto both the URL and the domain graphs (see Section 3.1), create both projection and connection graphs, and extract the attributes described in Section 2.2. This means that we project every query qi onto two different graphs of the web, and for each projection, we extract a set of features as defined in Table 1. We generate a case library of training examples qi described with these features, and we learn a model to predict the class yi via a machine learning procedure.

3.2 Human-rated search results
In one set of experiments, we explored the use of the web-projection methodology for the task of predicting the quality of a set of search results. This task requires assessments of result quality, which we obtained from human judges.
For each query, the top k results from one or more systems were presented to the judges for evaluation. The quality of a query-result pair was explicitly labeled by the judges using a six-point scale ranging from "Perfect" to "Bad". We note that the labeling was performed over results already highly ranked by a web search engine, and thus corresponds to a typical user experience. Out of 30,000 total available queries, we focused our experiments on a set of 13,000 queries with at least 40 rated results, with averages of 57 results (URLs) and 46 domains per query.

4.2 Experimental setup
For learning the models, we used the WinMine toolkit [3], which uses the GES algorithm [4] in a Bayesian structure search to learn a Bayesian network. We model the continuous features as Gaussians and discrete features with a multinomial distribution. For all experiments we report the average classification accuracy over 10-fold cross-validation. We compare the predictive power of the learned models with two baseline methods. The first baseline model is the marginal model, which predicts the most common class. The second baseline algorithm we use is based on a ranking algorithm that uses a large number of textual and global graph features to rank search results.

3.3 Query reformulation corpus
In a second set of experiments, we used web projections to explore patterns of query reformulation. We examined a sample of query logs captured over a six-week period by a popular web search engine.

Figure 5: Domain graph projections of good and poor result sets for the query med line. [Panels: (a) good result set; (b) poor result set.]

Figure 6: Projections of good and poor result sets for the query Wisconsin projected on the URL graph. [Panels: (a) good result set; (b) poor result set.]
For the classification tasks, we learn a threshold on the score to predict the class. The baseline ranking algorithm we used is RankNet [1], a supervised machine-learning technique developed to learn a ranking function. The learning methodology is a neural-net algorithm that optimizes feature weights to best match explicitly provided pairwise user preferences. Over 350 input features are used to train RankNet; these include various aspects of document content, anchor-text features, and basic hyperlink features. The output of RankNet is used to rank results. Since the output of RankNet is a combination of state-of-the-art features for ranking, we use it as a discriminatory feature that serves as input to the Bayesian-network classifier. We think this serves as a strong baseline.

Feature set         | URL graph | Domain graph
Baseline-Marginals  |   0.50    |    0.50
Baseline-RankNet    |   0.74    |    0.74
GF-PROJ             |   0.62    |    0.82
GF-CONN             |   0.60    |    0.86
GF-PROJ+GF-CONN     |   0.87    |    0.90
GF-ALL              |   0.88    |    0.88

Table 2: Classification accuracy for predicting good versus poor result sets.

4.3 Relative quality of result sets
The first task we consider is the classification of good (top 20) versus poor (bottom 20) result sets. For this task, we used the explicit relevance judgments described in Section 3.2. For each query, we order the search results from best to worst using the human judgments. We note that this ordering can be different from the output of the search engine. We then project the top 20 search results and the bottom 20 results, ordered by human judgments, onto the URL and domain graphs, and compute the features described in Table 1 for the two graphs. We learn a model that can predict, for a previously unseen query with a set of search results, whether it is good (top 20) or poor (bottom 20). Given the average number of judged search results per query, we are effectively learning to discriminate the top 20 results versus the search results with ranks 40 to 60.
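Turning a ranker's score into a binary classifier by learning a score threshold, as done for the RankNet baseline above, can be sketched as a simple accuracy-maximizing sweep. The paper does not specify how the threshold was fit, so this particular procedure is our assumption:

```python
def best_threshold(scores, labels):
    """Pick the score cutoff that maximizes training accuracy.

    scores: per-example ranker scores; labels: True for the positive class.
    Predicts positive when score >= threshold. A sketch of learning a
    threshold on a ranking score for classification.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == y for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy scores: the two high-scoring examples are the positive (good) sets.
t, acc = best_threshold([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
```

In practice one would fit the threshold on training folds only, consistent with the 10-fold cross-validation setup described in Section 4.2.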
Note that this task involves predicting the relative quality of results for a given query. We examine predicting the absolute quality of results in the next section. First, we point to Figure 5, which displays examples of projections of good and poor result sets for the query med line on the domain graph. We can see that results in the good set are tightly clustered, while those in the poor result set are more spread out, requiring many connection nodes to join the components of the projection graph. Also notice that the results marked by humans as most relevant (darkest nodes) appear as the central (high-degree) nodes in the graph (Figure 5(a)). Similarly, Figure 6 shows the good and poor result sets for the query Wisconsin projected on the URL graph. The central node in the good result set (panel a) is one of the search results, whereas in the poor set (panel b) the central node is a derived connector node. Table 2 shows the results for the task of predicting good versus poor result sets using several different methods and feature sets. Results are shown separately for the URL and domain graph projections. The Baseline-Marginals row displays the classification accuracy of predicting the most common class. Baseline-RankNet is the second baseline, where only the RankNet score, itself a combination of about 350 textual and hyperlink features, is used for learning to discriminate the good and the poor result sets. GF-PROJ uses the 12 features extracted from the projection graph, GF-CONN uses the 16 connection graph features, GF-PROJ+GF-CONN uses both of these feature sets (28 features), and GF-ALL refers to the case where all 55 features, most of which are described in Table 1, are used for learning. We were not surprised to find that RankNet and the new graphical features outperformed the marginal baseline. The RankNet output reflects the extent to which human judgments agree with the output of the learned ranking function.
The "GF-" results reflect the extent to which graphical features of the result subset are predictive of human relevance judgments. For the URL graph, the RankNet baseline outperforms models trained only on projection or connection graph features, but the model trained on both sets of features shows a substantial improvement over RankNet (18% relative improvement). For the domain graph, all models trained on graphical features outperform RankNet. We obtained the best classification accuracy of 90% when combining projection and connection graph features. We note that we obtained higher classification accuracies when projecting on the domain graph than for URL projections. This is likely due to the sparser coverage of the URL graph. We found the performance of the "GF-" models interesting, as they consider only topological features of the web projections, bypassing analysis of content matches between the query and web pages. Figure 7 shows a simple model learned from the query projections on the domain graph using the GF-PROJ feature set. The model has a classification accuracy of 0.82. The figure shows the decision tree for the output variable Top 20 vs. Bottom 20. Nodes correspond to input features.

Feature set         | URL graph | Domain graph
Baseline-Marginals  |   0.55    |    0.55
Baseline-RankNet    |   0.63    |    0.60
GF-PROJ             |   0.80    |    0.64
GF-CONN             |   0.79    |    0.66
GF-PROJ+GF-CONN     |   0.82    |    0.69
GF-ALL              |   0.83    |    0.71

Table 4: Result set quality classification accuracy for the binary classification problem.

For the URL graph, we obtained a 15% relative improvement over using RankNet to predict the quality when using all attributes. For the domain graph, the improvement was even larger, 25%. Note that all methods using any combination of graphical attributes outperform both baseline methods. The model (not displayed due to space limitations) for the 6-level result set quality classification problem is more complex.
The first split of the induced decision tree is on the node ratio of the projection and connection graphs. If the connection graph is much larger than the projection graph, the results are likely to be of poor quality. Moving down the tree, we see that if the maximum degree in a graph is relatively small, the results are likely to be of medium quality, with results getting worse as the number of domains in a top-20 set increases. The model revealed that high-quality search result sets are associated with projection nodes of large degree, few domains, small domain-to-URL ratios, and good connectivity. Next, we examine the same problem at a coarser granularity. The task is to predict whether the set contains a result with a rating in the top or the bottom half of the 6-point rating scale. Table 4 shows the classification accuracies for this binary problem. We note that the difference in performance between the domain and URL graph projections increased even further, and that the relative increase in performance over the RankNet baseline also grew (31% for the URL graph and 18% for the domain graph). This task is similar to that of discriminating good versus poor result sets (described in Section 4.3). However, it is also more difficult, since we are working only with the top 20 results for each query and predicting the absolute quality of the set; the good-versus-poor prediction requires only a relative judgment. For the task of distinguishing good versus poor result sets, we found that projections on the domain graph outperformed projections on the URL graph. For the case of predicting the exact quality of a result set, projections on the URL graph generally performed better, even though the URL graph has problems with coverage. This may be explained by the difference in the goals and representations.
For good-versus-poor discriminations, the quality of the whole set is important, and the domain graph likely represents an appropriate level of abstraction for handling this challenge. In contrast, the quality of a result set is a single-result (single-node) property. Here the domain graph may be too coarse to capture fine-grained properties of the high-quality nodes (search results); a projection on the URL graph may be needed to capture the necessary properties.

Figure 7: Learned model for discriminating good versus poor search result sets based on query projection on the domain graph.

Feature set         | URL graph | Domain graph
Baseline-Marginals  |   0.36    |    0.36
Baseline-RankNet    |   0.48    |    0.44
GF-PROJ             |   0.51    |    0.53
GF-CONN             |   0.50    |    0.52
GF-PROJ+GF-CONN     |   0.54    |    0.54
GF-ALL              |   0.55    |    0.55

Table 3: Result set quality classification accuracy for the 6-way classification problem.

In Figure 7, each leaf node shows the probability distribution for the output variable, displayed as a histogram. In this case, the variable has only two possible values; the green (darker) area indicates the proportion of good (top 20) sets and the grey area the proportion of poor (bottom 20) sets. Labels on the edges show the splitting criteria of the parent node variable, and the numbers in parentheses show the number of training examples routed over each edge. The projection graphs of good result sets (shown as a large green (dark) area of the histograms) have few isolated nodes (low values of GpDeg0Nodes) and results coming from a few domains (low values of GpNodes). On the other hand, poor result sets have many domains and many isolated nodes.

4.4 Absolute quality of a result set
In the previous section, we considered the problem of discriminating between good and poor result sets. Now we focus only on the top-rated (top 20) results for each query and aim to predict the absolute quality of the query's result set. More specifically, we label each query with the highest human-assigned rating for any result in the set.
We note that we could use other measures to summarize the quality of result sets; the highest human rating is easy to describe and is of practical importance. Since the human relevance judgments were on a 6-point scale (Section 3.2), we can examine the problem at several granularities. Here, we present results for the 6-class problem (predict the top label exactly) and the 2-class problem (predict whether the top label is from the three highest or three lowest rating categories). First, we consider the 6-class task of predicting the exact rating of the highest-rated document in the set of top 20 results. Table 3 shows the classification results.

5. QUERY REFORMULATIONS

As a second illustration of the use of web projections, we explore the learning of models to predict users' query-reformulation behavior and characteristics. Web searchers often refine their queries one or more times as they seek information on the web. Prior research has explored query reformulations, considering such issues as the timing and type of reformulation seen. For example, Lau and Horvitz [16] build models to predict the likelihood that searchers will specialize, generalize, or reformulate queries within a search session, considering the history and timing of actions. Jones et al. [10] examine substitutions that searchers make to their queries.

Feature set          URL graph  Domain graph
Baseline-Marginals     0.54       0.56
GF-PROJ                0.59       0.58
GF-CONN                0.63       0.59
GF-PROJ+GF-CONN        0.63       0.60
GF-ALL                 0.71       0.67

Table 5: Classification accuracy of predicting whether the query is likely to be reformulated.

We also found that cases drawn from projections on the URL graph provided better performance than cases generated from projections on the domain graph. We note that the baselines for predicting the most common class differ slightly between the URL and domain graphs, since we discarded a few queries that produced very small URL projection graphs. Examining the model (not shown for brevity), we see that queries that are likely to be reformulated come from many domains, are generally longer than queries that are not reformulated, and have connection nodes with relatively high degree (> 4). These findings again suggest that queries whose results are tightly knit together on the web are of higher quality, given that such queries are less likely to be reformulated. The findings also suggest that result sets with central high-degree nodes and a small number of connector nodes are of higher quality.

In another set of experiments, we explored transitions between queries. For this task, we took pairs of queries where there is a strong tendency of transition in only one direction, and then trained a model that learns whether a given query is likely to be the source or the destination of the transition. Figure 8 shows two examples of source and destination graphs. Our models were able to predict whether a given query is a source or a destination of the transition with 85% classification accuracy. The learned model provided insights about the relationship between topological properties and the likely direction of a transition. We saw that sources of query transitions tend to have some isolated nodes, short query strings, many connected components, and nodes that lie far apart in the connection graph, which indicates that the returned search results are not satisfactory. In contrast, reformulation destinations (especially specializations) tend to be better connected and to have higher in-degrees of projection nodes. Intuitively, these results make sense: a searcher probably wants to specify a new query if the search results are somewhat "random", i.e., scattered widely around the web. The results of this experiment led to another question, which we explore in the following section.
We explore the use of web projections to build models that predict if and how users reformulate their queries. We used a set of 48 thousand queries that were reformulated at least 10 times. For every query, we took the top 20 search results returned by the search engine, created the query projection and connection graphs, extracted the graph features, and trained predictive models. More specifically, we consider the following tasks:

1. Distinguish queries with high versus low reformulation probability.
2. Given a transition from query qs to query qd, predict whether it is a specialization or a generalization.
3. Given a query that is likely to be reformulated, predict whether it is going to be generalized or specialized.

Next, we describe the experimental setup and give a more detailed description of our results and findings.

5.1 Experimental setup

Using the query reformulation data described in Section 3.3, we defined several binary classification tasks. For each selected query, we took the top 20 search results as returned by the search engine, projected them on the domain and URL graphs, and extracted the features. For some of the tasks, the training datasets were quite imbalanced, with one outcome significantly more likely than the other. To focus on the key discriminations rather than on basic marginal frequencies, we sub-sampled the majority class so that both classes had roughly the same number of training examples.

5.2 Probability of query reformulation

First, we considered the problem of learning whether a query is likely to be reformulated. We split our set of queries into two classes: queries with high reformulation probability (p >= 0.6) and queries with low reformulation probability (p <= 0.15). We selected these values so that the two classes were about the same size. Table 5 shows the classification accuracies when projecting on the URL and domain graphs. We found gradual improvement as we increased the number of topological features under consideration.

5.3 Query specialization versus generalization

We have just described how we can reliably learn whether a given query is the source or the destination of a reformulation. Now, we pursue models that can predict the nature of reformulations. In particular, we explore whether a reformulation is likely to be a specialization or a generalization of the source query. Given a pair of queries where qs is often reformulated into qd, we want to learn the characteristics of projection graphs for queries that are specialized versus generalized. For this task, we define query specialization as the addition of words to an existing query, and, similarly, generalization as the removal of words from the query (richer characterizations have been considered by Lau and Horvitz [16] and by Jones et al. [10]). Given the query transition data, we extracted all pairs of queries where a specialization or generalization transition had occurred at least 10 times. Then we separately projected the source query qs and the destination query qd and extracted the features. We created transition features by simply taking the difference of the corresponding feature values: Fi(qs) - Fi(qd), where Fi() denotes the i-th feature. Note that in this experiment we do not use the query text attributes (such as the length of the query string), as they would make it possible to identify the type of transition directly from the change in the length of the query.
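The difference-based transition features described above are straightforward to compute once per-query feature vectors exist. A minimal sketch, assuming features are stored as name-to-value dicts (the function name is ours):

```python
def transition_features(f_src, f_dst):
    """Represent a query transition qs -> qd by per-feature differences
    Fi(qs) - Fi(qd), as described in the text.

    f_src, f_dst: dicts mapping feature name -> value, computed from the
    projection/connection graphs of the source and destination queries.
    """
    # Assumes both queries share the same feature set.
    return {name: f_src[name] - f_dst[name] for name in f_src}
```

A positive value for, say, GpComponents then means the source graph had more connected components than the destination graph, which is exactly the kind of signal the learned trees split on.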
Figure 8: Sources and destinations of query transitions. Projections (a) and (b) show an example of a generalization from the query free house plans to the query house plans. Projections (c) and (d) show the specialization from strawberry shortcake to strawberry shortcake pictures. Notice how the reformulated queries result in more connected graphs and bring result nodes into the center.

Feature set          URL graph  Domain graph
Baseline-Marginals     0.50       0.50
GF-PROJ                0.71       0.84
GF-CONN                0.69       0.83
GF-PROJ+GF-CONN        0.71       0.85
GF-ALL                 0.80       0.87

Table 6: Classification accuracy of predicting whether a given query transition is a specialization or a generalization.

We show the classification performance in Table 6. Here, using only the projection graph features performs slightly better than using only the connection graph features. We see consistent increases in performance from combining the projection graph and derived features. We obtain the best accuracy of 87% using projections on the domain graph and all features for learning the predictive models. The decision tree learned for predicting the reformulation type using GF-PROJ+GF-CONN features with projections on the URL graph is displayed in Figure 9. Note that the splitting criteria here (e.g., GpComponents) are not the values of the attributes but rather the changes in attribute values, i.e., the differences between the attribute values at the source and the destination of the transition. The model shows that query specializations are characterized by a decrease in the number of connected components of the projection graph (the first split of the tree).
It also shows that the number of nodes and edges in the projection graph increases for generalizations. For specializations, we see that the number of isolated nodes decreases, the results gather into a few connected components, and the size of the largest connected component increases, while the number of connector nodes decreases. These results correspond with the intuition that, when a query is generalized, the list of results becomes richer and more diverse. The findings revealed in the learned models suggest that the projection graphs associated with generalizations are sparser and less connected than those associated with specializations, where the projection is likely to be more concentrated and denser, requiring fewer connector nodes. It may seem that these results do not accord with those in Section 5.2, where we identified characteristics of query projection graphs that lead to query reformulation, i.e., characteristics of badly formulated queries and of the transitions made as queries are reformulated. On the contrary, the findings are consistent: in general, specializations narrow down the search, while generalizations tend to lead to higher diversity and larger coverage.

Figure 9: Model learned on URL graph projections for predicting whether transitions between queries are generalizations or specializations.

5.4 Predicting type of query reformulation

Finally, we examine the type of reformulation associated with a query. We seek to predict whether specific queries are more likely to be generalized or specialized, and how this is reflected in the properties of the query projections. For this task, we learn models that consider specific properties of queries that are reformulated in a certain way. Again, we do not use the features derived from the query string (length of the query, number of words, etc.), as the change in the length of the query provides information about the type of reformulation. Table 7 gives the classification performance for these models.
Feature set          URL graph  Domain graph
Baseline-Marginals     0.50       0.50
GF-PROJ                0.71       0.68
GF-CONN                0.62       0.65
GF-PROJ+GF-CONN        0.70       0.68
GF-ALL                 0.78       0.76

Table 7: Classification accuracy of predicting whether a reformulation will likely lead to a specialization or a generalization.

We found that the performance of models learned from cases generated from the URL and domain graph projections is about the same. The projection graphs provided models with better performance than those using only the features of the connection graph. Using all features, we obtained a classification performance of 78%. The learned probabilistic decision tree shows that the most discriminative property for this task is the maximum degree of a node in the projection graph. If the maximum degree is low, the query is likely to be specialized: if there is no central node in the projection graph, the user will likely specialize the query. On the other hand, generalizations occur when the largest connected component of the projection graph is large (more than 10 nodes, for the top 20 results) and the nodes are close together (low average path length in the connection graph).

6. RELATED WORK

In prior research, investigators have explored the use of query terms to identify properties of and relationships among specific parts of the web graph. In the HITS work by Kleinberg [13], eigenvectors are used to identify authoritative nodes, using the notion of focused subgraphs defined by a query and its associated links, and mutually reinforcing hubs and authorities. The work is similar to ours in that it extracts and operates on a query-defined subset of the web graph. In contrast to the methods we have presented, HITS calculates a single property of a node (the corresponding component of the first singular vector of the graph adjacency matrix), which is used to rank search results; we use a much wider range of graphical features that characterize whole subgraphs.

Variants of PageRank that work with subgraphs of the web have also been explored in prior research. This work includes explorations of domain-specific or person-specific PageRank [8, 24], and of the use of non-random jump vectors for personalization [9]. Related work on identifying web spam has examined methods for propagating trust from trusted pages [27]. Recent work by Nie et al. [19] focused on eigenvectors or counts to set node priors. In the content domain, Cronen-Townsend et al. [5] have looked at techniques for predicting the quality of results (what they call query difficulty) by computing the entropy between the language model of the results and that of the collection as a whole, but they do not consider any graphical properties.

Several efforts have combined links and content in different ways. For example, Chakrabarti et al. [2] use link information about the classes of neighbors to improve the text classification accuracy of a target page. Dean and Henzinger [6] use links and content to find related pages, making use of the simple existence of links rather than the rich topological characteristics of subgraphs that we represent and exploit. There has been research considering multiple object types, for example examining relationships among papers, authors, and institutions; in that work, relationships are computed globally, based on one or more sets of similarity measures [28, 20]. Related work also includes research on citation analysis, including Garfield's early work on the impact factor [7] and later refinements by Pinski and Narin [22]. Vassilvitskii and Brill have recently explored the use of distance (and direction) in the web graph for relevance feedback in web search [25]. Minkov et al. [18] have examined contextual search and name disambiguation in email messages using graphs, employing random walks on graphs to disambiguate names.

In the context of machine learning on graphs, researchers have sought to predict the labels of vertices in a graph, given the known labels of vertices in training data [14, 17]. A similar formulation has recently been explored in the context of kernel methods [12], as well as in approaches based on extracting subgraphs as features [15]. In contrast to previous efforts, we examine a broad set of graph-theoretic properties of subgraphs, rather than, for example, only a single feature such as the eigenvalue associated with individual nodes. The richer characterization of the topological properties of subgraphs as a whole, or of individual nodes relative to the subgraph, allows us to investigate the discriminability of multiple features, to learn models for diverse classification goals, and, more generally, to provide useful analytical tools for exploring the web.

7. SUMMARY AND DIRECTIONS

We presented a methodology for learning predictive models and attributes from a rich set of topological characteristics of sets of web pages, based on their projection onto the larger web graph. First, we considered patterns of connectivity among the set of pages returned as search results by projecting them onto the web graph, and analyzed the topological properties of the induced subgraphs. Using these graph-theoretic features, we performed machine learning to build classifiers that discriminate good from poor sets of results. Then, we learned models that predict user behavior when reformulating queries, including whether queries are likely to be reformulated and the nature of the reformulation. The experimental results for the two problem domains highlight the potential value of employing contextual subgraphs for understanding search-related tasks.

The web-projection method is scalable, as demonstrated by the sizes of the datasets used in our analyses. Calculating the graph features is fast, since the projected graphs are fairly small. The most computationally expensive operation is obtaining the query connection graph. Here, a shortest-path algorithm is performed to connect the components of the query projection graph. If the components are far apart and the graph is densely linked, the shortest-path algorithm will have to traverse most of the graph. In rare cases, this problem arises in the domain graph, and the algorithm can take up to a minute to complete. On average, however, less than three seconds were required to project a query and produce the projection and connection graphs. In the case of the URL graph, which is not as densely connected, we did not see visible delays in producing the graph projections.

The method of projecting sets of results onto the underlying graph of the web, and then using machine learning with graph-theoretic features, can be applied in many settings. For example, prior work has noted that spam web pages have distinct linkage patterns. We can use the web-projection method to identify sets of results that are likely to contain many spam pages, or, more specifically, to identify pages that are likely to be spam (using graphical features of individual nodes, as we discuss next). Applying our ideas to enhance ranking is especially interesting, as it involves looking at features of individual nodes relative to a subset and has wider applicability than ranking itself. There are several ways one could approach this opportunity. One approach we have been pursuing considers the inclusion of additional node-centric features from the projection and connection graphs, including those describing the position of the node with regard to the rest of the graph. These node-centric features can then be used to train existing ranking algorithms (e.g., RankNet).
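The component-joining step behind the connection graph, described in the summary as the method's most expensive operation, can be sketched with a breadth-first search through the background web graph. This is an illustrative variant, not the paper's exact heuristic (which also breaks ties among equal-length paths randomly); the representation and function name are assumptions:

```python
from collections import deque

def shortest_connecting_path(background, sources, targets):
    """Find a shortest path in the (large) background graph from any
    node in one projection component to any node in another; interior
    nodes of the returned path become 'connector' nodes.

    background: dict node -> iterable of neighbor nodes.
    sources, targets: node sets of the two components to connect.
    Returns the path as a list, or None if no path exists.
    """
    targets = set(targets)
    queue = deque(sources)
    parent = {s: None for s in sources}  # also serves as the visited set
    while queue:
        v = queue.popleft()
        if v in targets:
            # Reconstruct the path back to whichever source reached v.
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for w in background.get(v, ()):
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None  # components not connected in the background graph
```

Because BFS explores level by level, the first target reached yields a shortest connecting path, which matches the summary's observation that densely linked graphs with distant components force the search to traverse much of the graph.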
Another application of our work is to infer missing human relevance scores. Given a projection graph in which we have human relevance judgments for a few nodes, we would like to predict the judgments that would be assigned to the rest of the nodes. Another promising direction for research is exploring the role of the connector nodes. We used these nodes to connect disconnected components of the projection graph, but we observed that connector nodes often become hubs holding the network together; see, for example, Figure 5(b), where the central connector node is a hub. There are also cases, however, where no such pattern emerges (e.g., Figure 6(b)). To explore these questions, a better understanding is needed of the quality of our heuristics for connecting the components of the projection graph. In the analyses described, we used a greedy heuristic that randomly chooses connections from the set of competing paths of the same length. We do not yet understand the sensitivity of the overall node-selection procedure to this heuristic.

With regard to modeling users and their queries, we are excited about exploring clusters of queries and query transitions based on the graphical properties of their projection and connection graphs. The rich evidential patterns provided by graph-theoretic features promise to be valuable in describing different characteristics of queries, determining classes of queries, and mapping query transitions to these classes. One could also apply these ideas to modeling web searchers' behaviors in other ways. Web projections might be used to construct predictive models that infer the paths searchers will likely take when reformulating an initial query. Such models could be used to suggest likely query reformulations, specializations, and generalizations, and to avoid transitions that would not likely lead to good sets of results, as captured by the graphical properties of the projections associated with those sets.
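The paper does not specify how missing relevance judgments would be inferred; one simple baseline for that task is iterative neighbor averaging (a basic form of label propagation), sketched here under the assumption that judgments are numeric and the graph is undirected:

```python
def propagate_scores(adj, known, iterations=50):
    """Estimate missing node scores by repeatedly averaging neighbor
    scores while holding known judgments fixed. A simple illustrative
    baseline, not a method proposed in the paper.

    adj: dict node -> set of neighbors (undirected graph).
    known: dict node -> fixed human relevance score.
    Returns a dict with a score estimate for every node in adj.
    """
    # Initialize unknown nodes to the mean of the known scores.
    mean = sum(known.values()) / len(known)
    scores = {v: known.get(v, mean) for v in adj}
    for _ in range(iterations):
        updated = {}
        for v in adj:
            if v in known:
                updated[v] = known[v]      # labeled nodes stay fixed
            elif adj[v]:
                updated[v] = sum(scores[w] for w in adj[v]) / len(adj[v])
            else:
                updated[v] = scores[v]     # isolated node: keep estimate
        scores = updated
    return scores
```

On a path graph with judged endpoints, for instance, the middle node's estimate settles at the average of its neighbors, which matches the intuition that a result's quality resembles that of the results it links to.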
Other applications include using graph projections to explore the dynamics and evolution of the web, where models learned from features capturing topology and topological dynamics could help us understand how sites and pages that are created or removed over time relate to the rest of the web. Such models promise to be valuable in predicting the conceptual links and the quality of new content and sites. We believe that the work presented here is a starting point for the use of web projections. We found that the methods complement existing textual and graphical analyses and provoke tantalizing questions and interesting directions for future research in web search and retrieval, including efforts on enhancing result quality and on better understanding and supporting human behavior. We hope that these ideas will be of value to others pursuing insights about the nature and use of graphical properties of the web.

8. REFERENCES

[1] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML '05, 2005.
[2] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. In WWW '99, 1999.
[3] D. M. Chickering. The WinMine toolkit. Technical Report MSR-TR-2002-103, Microsoft, 2002.
[4] D. M. Chickering. Optimal structure identification with greedy search. JMLR, 3:507-554, 2003.
[5] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In SIGIR '02, 2002.
[6] J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. In WWW '99, 1999.
[7] E. Garfield. Citation analysis as a tool in journal evaluation. Science, 178:471-479, November 1972.
[8] T. H. Haveliwala. Topic-sensitive PageRank. In WWW '02, 2002.
[9] G. Jeh and J. Widom. Scaling personalized web search. In WWW '03, 2003.
[10] R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In WWW '06, 2006.
[11] R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85-103, 1972.
[12] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In ICML '03, 2003.
[13] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[14] R. I. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML '02, 2002.
[15] T. Kudo, E. Maeda, and Y. Matsumoto. An application of boosting to graph classification. In NIPS '05, 2005.
[16] T. Lau and E. Horvitz. Patterns of search: analyzing and modeling web query refinement. In UM '99, 1999.
[17] J. Leskovec, N. Milic-Frayling, and M. Grobelnik. Impact of linguistic analysis on the semantic graph coverage and learning of document extracts. In AAAI '05, 2005.
[18] E. Minkov, W. W. Cohen, and A. Y. Ng. Contextual search and name disambiguation in email using graphs. In SIGIR '06, 2006.
[19] L. Nie, B. D. Davison, and X. Qi. Topical link analysis for web search. In SIGIR '06, 2006.
[20] Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: bringing order to web objects. In WWW '05, 2005.
[21] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[22] G. Pinski and F. Narin. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing and Management, 12:297-312, 1976.
[23] F. Radlinski and S. Dumais. Improving personalized web search using result diversification. In SIGIR '06, 2006.
[24] M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in PageRank. In NIPS '02, 2002.
[25] S. Vassilvitskii and E. Brill. Using web-graph distance for relevance feedback in web search. In SIGIR '06, 2006.
[26] S. Wasserman, K. Faust, and D. Iacobucci. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[27] B. Wu, V. Goel, and B. D. Davison. Topical TrustRank: using topicality to combat web spam. In WWW '06, 2006.
[28] W. Xi, B. Zhang, Z. Chen, Y. Lu, S. Yan, W.-Y. Ma, and E. A. Fox. Link fusion: a unified link analysis framework for multi-type interrelated data objects. In WWW '04, 2004.