WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Fast Algorithms for Top-k Personalized PageRank Queries Manish Gupta IIT Bombay Amit Pathak IIT Bombay Soumen Chakrabar ti IIT Bombay manishg@cse.iitb.ac.in ABSTRACT amit@cse.iitb.ac.in soumen@cse.iitb.ac.in In entity-relation (ER) graphs (V , E ), nodes V represent typ ed entities and edges E represent typ ed relations. For dynamic p ersonalized PageRank queries, nodes are ranked by their steady-state probabilities obtained using the standard random surfer model. In this work, we prop ose a framework to answer top-k graph conductance queries. Our top-k ranking technique leads to a 4× sp eedup, and overall, our system executes queries 200­1600× faster than whole-graph PageRank. Some queries might contain hard predicates i.e. predicates that must b e satisfied by the answer nodes. E.g., we may seek authoritative pap ers on public key cryptography, but only those written during 1997. We extend our system to handle hard predicates. Our system achieves these substantial query sp eedups while consuming only 10­20% of the space taken by a regular text index. Categories and Sub ject Descriptors: H.3.1[Information Systems]:Information Storage and Retrieval ­Content Analysis and Indexing ; H.3.3[Information Systems]:Information Search and Retrieval General Terms: Algorithms, Exp erimentation, Measurement, Performance Keywords: top-k, Pagerank, HubRank, Node-deletion, p ersonalized the answer nodes with the top-k p ersonalized PageRank values. At some time during the execution of algorithm 1, let u1 , u2 , · · · b e the nodes sorted in non-increasing order of their scores (pr ) then by prop osition 1, we can say ^ ^ that u1 , u2 , · · · , uk are the b est k answer nodes iff pr (uk ) pr (uk+1 )+ q 1. This is the basic idea b ehind early termi^ nation of the algorithm while guaranteeing top-k answers. Proposition 1. In algorithm 1, at any time, nodes u, ^ pr (u) pr (u) pr (u)+ q 1 ^ In search applications, if the user asked for K = 20 resp onses, it is typically acceptable if the system returns a few more (say K = 40). So, we prop ose that the numb er of resp onses b e bracketed in [K , K ]. This substantially increases the success rate for termination check. Ab out half the queries terminate through algorithm 2. Also, the actual rank K at which the termination check succeeds is typically very close to K . Further, we refine prop osition 1 as: Proposition 2. In algorithm 1, at any time, nodes u, ^ pr (u) pr (u) pr (u) + (1 - )2 q (u) + q 1 ^ Algorithm 1 Basic Push Algorithm 1: q r,pr 0 ^ 2: while q 1 > push 3: pick node u with largest q(u) > 0 {delete-max} 4: Node deletion check {Only for Delete-Push; Alg. 3} 5: q q(u), q(u) 0 ^ 6: if (u H ) pr pr + q PPV u ^ ^ ^ 7: if (u H ) 8: pr (u) pr (u) + (1 - )q ^ ^ ^ 9: for each out-neighbor v of u 10: q(v) q(v) + C (v, u)q {increase-key} ^ 11: top-k quit check {For basic topK+DeletePush;Alg. 2} 12: return pr ^ 1. INTRODUCTION AND RELATED WORK Graph proximity queries of the form "entities related to a set of keywords " can b e answered by searching for answer nodes in the vicinity of nodes matching the keywords. Such graph conductance queries assume that a keywordoriginated prestige starts at the nodes matching the query words and flows through the graph following edges, and the rank of a node dep ends on the amount of prestige that reaches it. There are different techniques prop osed in literature to compute search rankings; one of which is HubRank [3]. HubRank builds on Personalized Pagerank [4] and Berkhin's [1] Bookmark Coloring Algorithm (BCA). Consider a graph G = (V , E ) where each edge (u, v ) E is associated with a conductance C (v , uP probability of ): a "random surfer" walking from u to v ; v C (v , u) = 1. Personalized PageRank Vector (PPV) for a telep ort vector (r ) is then defined as: (1) pr = C pr + (1 - )r = (1 - )(I - C )-1 r where 1 - is telep ort probability (typically 0.2­0.25). r is set such that r (u) > 0 only if u is a match node and P u r (u) = 1. When r (u) = 1 for a single no de u, we call its P P V as P P Vu . Given a hubset (H ) of nodes with precomputed P P Vh h H , we can use Berkhin's [1] asynchronous push to compute pr for a general r as shown in algorithm 1. Maximum increase in pr (u) due to flow of q (u) that can ^ happ en (till push-algorithm termination) is (1 - )q (u) + ^ 2 q (u). Similarly, maximum increase in pr (u) due to flow of residual from nodes in V other than u is ( q 1 -q (u)). q hus, pr (u) can increase atmost by (1- )q (u)+ 2 q (u)+ ( T ^ 1 -q (u)) = (1 - )2 q (u) + q 1. However, this b etter upp er b ound is dep endent on q (u) and so we will need to maintain lower and upp er b ounds separately as against in prop osition 1. So, we relax the b ound as: Proposition 3. In algorithm 1, at any time, nodes u, ^ pr (u) pr (u) pr (u) + (1 - )2 max q (u) + q 1 ^ u Algorithm 2 Top-k termination check for Basic Push 1:initialize an empty min-heap M 2: for each node u in the score map pr ^ 3: if (|M | < K ) insert u into M 4: else 5: let v M have the smallest score 6: if (pr (u) > pr (v)) replace v with u ^ ^ 7: Let u1 , · · · , u M such that pr (u1 ) · · · pr (u ) ^ ^ K K 8: for b = K + 1, . . . , K 9: if (pr (ub ) > pr (ub+1 ) + q 1 ) ^ ^ 10: return can terminate push loop from line 2 to 11: line 11 of algorithm 1 with K = b 12: return cannot yet terminate push loop 2. BASIC TOPK FRAMEWORK For ad-hoc search applications, it is adequate to rep ort Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. Prop osition 3 needs less b ook-keeping than prop osition 2. These b etter b ounds provide a 6% reduction in query time without any change in accuracy; more queries quit earlier and at lower K . 1225 WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China flows to far-off nodes) in the (out)neighb orhood of node u and delete a non-target non-hubset node only if its deletion does not blowup numb er of new edges. 3. HARD PREDICATES Consider the query "find top-k pap ers related to XML published in 2008". This enforces that the answer nodes returned should b e pap ers, published in 2008. In this section, we extend basic top-k framework to answer queries with such hard predicates, efficiently. The target nodes returned as answer nodes, should strictly satisfy the hard predicates. One of the ways is to modify "basic top-k for soft predicate queries", such that a node is considered to b e put in heap M (see algorithm 2) only if it b elongs to target set. We call this as naiveTopk . We prop ose a node deletion algorithm which builds on the idea that ranking non-target nodes is not needed in presence of hard predicates. It can delete nodes without affecting authority flow in the remaining graph. We then use it to delete non-target nodes in the graph while executing push. 4. EXPERIMENTS We used a 1994 snapshot of CiteSeer graph which has 74000 nodes and 289000 edges with a 55 MB text index. 1.9M CiteSeer queries have an average of 2.68 words p er query. All but the first 100K queries were used to train and tune our index. We fixed [K , K ] as [20, 40]. We used samples of typical size 10000 from the first 100000 queries as test data and hubset of size 15000 generated using "naive one-shot" hub inclusion p olicy [3]. For DeletePush expts, we selected slowest 1000 queries from the ab ove sample. 3.1 Node Deletion Algorithm Let V contain a sp ecial sink node s with self-loop of C (s, s) = 1. Aim of node deletion algorithm is to delete a node u from graph G and adapt the graph structure to create G = | (V , E ) such that for any telep ort r |V ×1 over G , pr (v) = pr (v ) for all nodes v V - s where pr (v ) is computed over / G , r (v ) = r (v ) for v V and r (v ) = 0 for v V . Let u b e a node to b e deleted, v b e one of the in-neighb ors of u and w b e one of the out-neighb ors of u. Let q (v ) b e the residual at node v at some time instant during execution of push algorithm. Consider the simple case when u does not have a self-loop. v passes on q (v )C (u, v ) to u then u keeps (1 - ) [q (v )C (u, v )] with itself and passes on [q (v )C (u, v )] C (w, u) to w. This can b e achieved by increasing C (w, v ) by C (u, v )C (w, u). To consider the self endorsement by u, (1 - ) [q (v )C (u, v )] is grounded by sending it to sink node by increasing C (s, v ) by (1-)C (u, v ). Algorithm 3 considers the case with self-loop at node u. Algorithm 3 Node Deletion Algorithm 1: Input:Node u to be deleted 2: for each in-neighbor v of u 3: if (u = v) continue 4: for each out-neighbor w of u 5: if (u = w) continue C (u,v)C (w,u) 6: C (w, v) C (w, v) + 1-C (u,u) 7: C (w, u) 0 {Delete u w edge} 8: C (s, v) C (s, v) + (C (u, v) (1 - )(1 + C (u, u))) 9: C (u, v) 0 {Delete v u edge} Figure 1: Push times averaged across queries vs. fraction of push time allowed in termination checks.(The top line uses no termination checks.) As figure 1 shows, algorithm 2 is fast and effective. Quit checks take a very small amount of time and typically give a 4× sp eed b oost. (To control the fraction of time sp ent in quit checks, here we timed recent quit checks and invoked them only when enough time has b een sp ent in push loops.) Also reassuring is that as little as 4% time invested in quit checks result in robust gains. Now, let us compare DeletePush with NaiveTop-k for hard predicates. We varied target set size by having different hard predicates on publication years. Note that the time re1200 1000 queryTimes(ms) 800 600 400 200 0 7778(54) <68 (109) NaiveTopK(avgTime=582ms, avgNumPush=4578) DeletePush(avgTime=443ms ,avgNumPush=2580) <83(923) 88-90 (5192) 91-92 <93 (10607) (43224) The algorithm takes O(inDeg r ee(u)× outDeg r ee(u)) time and can p otentially increase #edges by inDeg r ee(u)× outDeg r ee(u) - (inDeg r ee(u) + outDeg r ee(u)). 3.2 Ranking only target nodes (DeletePush) While p erforming DeletePush, we first pick a node u (step 3 of algorithm 1), delete all the "deletable" non-target entity nodes reachable from u (step 4 of algorithm 1) and then p erform push from u. Deleting a non-target node avoids any further pushes from it; thereby saving some work. A single node deletion can bloat #edges, so we need to judiciously pick the victim nodes (non-target entities) to b e deleted, keeping in mind these observations ab out social networks. 1. Block structure phenomena [5], observed in social networks, partitions the graph into clusters, ensuring that edges to b e added due to a node deletion already exist in the graph. 2. The indegree and outdegree of the nodes in the graph follow p ower law [2]. Since large numb er of nodes have very small indegree or outdegree they can b e deleted safely. As we do not want a large amount of time to b e sp ent in node-deletion, we use a conservative approach where we do a local search (as push may get terminated b efore authority Figure 2: Comparison of top-k algorithms quired by DeletePush does not decrease in prop ortion with the decrease in numb er of pushes b ecause of deletion overheads. Figure 2 shows that DeletePush works b etter when the target set sizes are not too large. Thus, by applying a top-k framework over the basic push algorithm, we try to efficiently answer graph conductance queries with hard predicates, achieving b etter query processing times with low indexing space. Year Predicates (Selectivity) 5. REFERENCES [1] P. Berkhin. Bo okmark-coloring approach to p ersonalized pagerank computing. Internet Mathematics, 3(1):41­62, Jan. 2007. [2] A. Z. Bro der, R. Kumar, F. Maghoul, P. Raghavan, S. Ra jagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. Computer Networks, 33(1-6):309­320, 2000. [3] S. Chakrabarti. Dynamic p ersonalized PageRank in entity-relation graphs. In www, Banff, May 2007. [4] G. Jeh and J. Widom. Scaling p ersonalized web search. In WWW Conference, pages 271­279, 2003. [5] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploiting the blo ck structure of the web for computing, Mar. 12 2003. 1226