SIGIR 2007 Proceedings Poster Protecting Source Privacy in Federated Search Wei Jiang Luo Si Jing Li Dept. of Computer Science Purdue University {wjiang, lsi, li66}@cs.purdue.edu ABSTRACT Many information sources contain information that can only b e accessed through search-sp ecific search engines. Federated search provides search solutions of this typ e of hidden information that cannot b e searched by conventional search engines. In many scenarios of federated search, such as the search among health care providers or among intelligence agencies, an individual information source does not want to disclose the source of the search results to users or other sources. Therefore, this pap er prop oses a two-step federated search protocol that protects the privacy of information sources. As far as we know, this is the first attempt to address the research problem of protecting source privacy in federated text search. Categories and Sub ject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models General Terms: Security Keywords: Federated search, Source privacy b e develop ed. However, some hospitals may not b e willing to share their rep orts b ecause of p ossible damage on their reputations and obligations to the public. In order to b enefit from these information, source privacy needs to b e protected in the federated search environment, while traditional federated search techniques are not applicable here. We next investigate how to achieve this ob jective. 2. Proposed Protocol The following notations are adopted extensively throughout the pap er: · S = {S1 , . . . , Sm }: a collection of m sources; Si represents a source or a collection (dep ends on the context). · u, Qu : u is a user, and Qu is a user query issued by u. · Epu , Dpr : a public key encryption scheme (e.g., RSA [4]) with a public-private key pair (pu, pr ). · Eki : a private key encryption scheme (e.g., AES [2]) with some private key ki . The protocol consists of two sub-protocols: ranked list generation (RLG) and content retrieval (CR). The RLG protocol simulates the b ehavior of conventional search engines to generate a global ranked lists. Once a user receives the ranked list, the CR protocol can b e used to retrieve the desired content one at a time. Before detailing the RLG protocol, we first describ e two functions used as building blocks for RLG: Gen List and M er g e. The Gen List function generates an encrypted ranked list Li based on Qu . Supp ose pu is the public key of a user u, ki is the private key of source Si , tij is the snipp et of document dij at Si and lij is the corresp onding hyp er link. Let Lij denote the j th element in Li , then Lij is formatted as Epu (kij )||Ekij (tij , Eki (lij ))|| , where indicates some similarity score. Given Li and Lj , the M er g e function returns a new ranked list containing the top K elements b etween Li and Lj according to . Algorithm 1 highlights the key steps of the RLG protocol. At step 1, a user broadcasts his or her query to all the sources in a sp ecific federated search environment. At steps 2-4, each source indep endently generates an encrypted ranked list. (For now, we assume all information sources use the same retrieval algorithm). At steps 5-8, the ranked lists are merged sequentially from S1 to Sm , and finally, Sm sends the global ranked list to the user. Since the snipp ets are encrypted by private keys, the snipp ets generated at a sp ecific source are not disclosed to any other sources during the merging process. Once the user receives the global ranked list Lm , using his or her public key, the user can get all the keys used to encrypt the snipp ets and their corresp onding encrypted links. Based on the contents of the snipp ets and 1. Introduction Information plays crucial roles in today's fast pace-driven environment. How to find relevant information has b een investigated in the field of information retrieval. Information can b e copied (or crawled) and accessed by conventional search engines like Google or Yahoo!. When information is hidden from conventional engines, techniques applied to conventional search engine cannot b e utilized directly. Federated search [1] has emerged as an answer to this problem. Imagine the following two scenarios: 1) For b etter understanding of the spread of a sp ecific disease, researchers may want to know if it has b een rep orted by some health monitoring agencies. Due to p ossible liability, the agencies may not b e willing to disclose their identities to the researchers. 2) Surgical accidents often occur across hospitals and health care centers. Rep orts are kept for the occurrences of such events. Some researchers may b e interested in analyzing common mistakes among hospitals in a sp ecific region so that more effective means to prevent these mistakes could Copyright is held by the author/owner(s). SIGIR'07, July 23­27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007. 761 SIGIR 2007 Proceedings Poster Algorithm 1 The RLG Protocol 1: u sends Qu to S1 , . . . , Sm 2: for all Si S do 3: Li Gen List(Si , Qu , K) 4: end for 5: for i 2 to m do 6: Si-1 sends Li-1 to Si 7: Li M er g e(Li-1 , Li , K) 8: end for 9: Sm sends Lm to u Algorithm 2 The CR Protocol 1: u sends Ek (l) to S1 2: if Retr i(S1 , Ek (l)) = N U LL then 3: S1 sets d Epu (k1 )||Ek1 (c1 ) 4: else 5: S1 randomly generates d 6: end if 7: for i 2 to m do 8: Si-1 sends d, Ek (l) to Si 9: if Retr i(Si , Ek (l)) = N U LL then 10: Si sets d Epu (ki )||Eki (ci ) 11: end if 12: end for 13: Sm sends d to u subset by S2 , then the retrieval path can b e constructed as: Epu (S1 ||Epu1 (S2 ||Epu2 (S5 ||Epu5 (N U LL)))). The path is app ended after the encrypted document link. When the document is chosen by the user, he or she can decrypt and find the first source on the retrieval path and then send the encrypted link to the source. In the previous example, the user sends Ek (l) and Epu1 (S2 ||Epu2 (S5 ||Epu5 (N U LL))) to S1 . After p erforming the computation sp ecified by the CR protocol, S1 can decrypt and find the next source on the path. Then S1 sends local computation results, Ek (l) and Epu2 (S5 ||Epu5 (N U LL)) to S2 . Since S5 is the last source on the path, S5 directly sends the user the retrieved document. At each source, the computation complexity of RLG is similar to any commonly used IR algorithm to retrieve the top K documents in the local database. The communication complexity of RLG is b ounded by O(c · m), where c is the constant communication b etween any two adjacent sources assuming the public key and snipp et sizes are fixed across the system. The computation complexity of CR at each source is b ounded by the time to encrypt a document, and the communication complexity of CR is b ounded by O(d·m), where d is the size of the document. 3. Related Work scores, the user can decide which document to b e retrieved via the CR protocol. Note that public key encryption is used here to transp ort private keys (this is one of the most common uses of a public key encryption system in practice). Algorithm 2 highlights the key steps of CR. Supp ose Ek (l) is the encrypted link from which the user wants to retrieve the desired document. First, the user sends Ek (l) to S1 . At step 3, the Retr i function returns N U LL if the link cannot b e decrypted. If S1 can prop erly decrypt Ek (l) (i.e., Retr i(, ) = N U LL), it can use l to obtain the desired docu ment, denoted as c1 . S1 then encrypts c1 with a key k1 , and k1 is encrypted using the user's public key. At the end of this step, S1 passes the encrypted document to the next source. If S1 cannot decrypt Ek (l), S1 randomly generates a document (pseudo-document) and passes it to the next source. At steps 7-12, sources S2 , . . . , Sm p erform similar op erations as S1 does at steps 2-6, except that whenever Retr i returns N U LL, the source just passes the received document to the next source. At the end, Sm sends the encrypted document to the user. The user then decrypts to get the private key that allows the user to obtain the actual document. Note that only one source has the intended document, and CR returns one document only. Supp ose b oth S1 and S2 do not have the document. If S1 does not pass a pseudo-document to S2 , then S2 can infer that S1 does not p ossess the document desired by the user. Therefore, the pseudo-document is utilized to prevent such inference attack. When the numb er of sources is large, the CR protocol b ecomes inefficient b ecause it needs to visit every available source. Next, we briefly discuss how to solve this issue. The basic idea is that when each source generates the local ranked list in the RLG protocol, it randomly selects a subset t sources in S to generate a retrieval path. Let pui denote the public key of Si and pu denote the user's public key. Supp ose t = 3 and {S1 , S2 , S5 } is the selected Anonymization protocols prop osed in [3, 5] are not suitable to achieve our ob jective. First of all, in our problem domain, the user identity needs not to b e protected. As a result, we do not need to anonymize the communication from the user to the sources. Secondly, the source privacy needs to b e preserved when they collab oratively produce the global ranked list. Since anonymizing the communication among the sources alone does not prevent a source from observing the contents of other sources' local ranked lists, source privacy is violated. Nevertheless, if preventing the disclosure of sources' IP addresses is a part of our ob jective, the aforementioned anonymization protocols could b e utilized together with our prop osed protocols. 4. Conclusion / Future Work In this pap er, we have prop osed a federated search protocol to protect source privacy. One issue we have yet to discuss is the need for score global normalization. One may argue that global normalization can lead to consistent similarity scores. However, global normalization may distort the distribution of local source's data. We will address this issue in the future. In addition, we will add a resource selection phase to increase source privacy by hiding similarity scores and to make the retrieval protocol more efficient. 5. References [1] W.B. Croft, editor. Advances in Information Retrieval, chapter Distributed Information Retrieval, J. Callan, pages 127­150. Kluwer Academic Publisher, 2000. [2] NIST. Advanced encryption standard (aes). Technical Rep ort NIST Sp ecial Publication FIPS-197, National Institute of Standards and Technology, 2001. [3] Michael K. Reiter and Aviel D. Rubin. Crowds: Anonymity for web transactions. ACM Transactions on Information and System Security, 1(1):66­92, 1998. [4] R. L. Rivest, A. Shamir, and L. Adleman. A metho d for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120­126, 1978. [5] Paul F. Syverson, David M. Goldschlag, and Michael G. Reed. Anonymous connections and onion routing. In Proceedings of the 18th Annual Symposium on Security and Privacy, pages 44­54, Oakland, CA, May 1997. 762