Efficient sentence retrieval based on syntactic structure

Ichikawa Hiroshi, Hakoda Keita, Hashimoto Taiichi and Tokunaga Takenobu
Department of Computer Science, Tokyo Institute of Technology
{ichikawa,hokoda,taiichi,take}@cl.cs.titech.ac.jp

Abstract

This paper proposes an efficient method of sentence retrieval based on syntactic structure. Collins proposed Tree Kernel to calculate structural similarity. However, structural retrieval based on Tree Kernel is not practicable because the index table required by Tree Kernel becomes impractically large. We propose two more efficient algorithms that approximate Tree Kernel: Tree Overlapping and Subpath Set. These algorithms are more efficient than Tree Kernel because indexing is possible with practical computational resources. Experiments comparing the three algorithms showed that structural retrieval with Tree Overlapping and Subpath Set is faster than retrieval with Tree Kernel by factors of 100 and 1,000 respectively.

1 Introduction

Retrieving similar sentences has attracted much attention in recent years, and several methods have already been proposed. Such methods are useful for many applications, including information retrieval and machine translation. Most of them are based on frequencies of surface information such as words and parts of speech. These methods may work well for judging similarity of topics or content. However, even when the surface information of two sentences is similar, their syntactic structures can be completely different (Figure 1). If a translation system regards such sentences as similar, the translation will fail. This is because conventional retrieval techniques exploit only the similarity of surface information such as words and parts of speech, not more abstract information such as syntactic structure.

[Figure 1: Sentences similar in appearance but different in syntactic structure — parse trees of "He beats a dog with a stick" and "He knows the girl with a ribbon"]

Collins et al. (Collins, 2001a; Collins, 2001b) proposed Tree Kernel, a method to calculate the similarity between syntactic structures. Tree Kernel defines the similarity between two syntactic structures as the number of shared subtrees. Retrieving similar sentences from a large corpus requires calculating the similarity between a given query and every sentence in the corpus. Building an index table in advance could improve retrieval efficiency, but indexing with Tree Kernel is impractical due to the size of its index table.

In this paper, we propose two efficient algorithms for calculating the similarity of syntactic structures: Tree Overlapping and Subpath Set. These algorithms are more efficient than Tree Kernel because an index table of reasonable size can be built for them. Experiments comparing the three algorithms showed that, when used for structural retrieval, Tree Overlapping is 100 times faster and Subpath Set is 1,000 times faster than Tree Kernel.

After briefly reviewing Tree Kernel in section 2, we describe the two algorithms in sections 3 and 4. Section 5 describes experiments comparing the three algorithms and discusses the results. Finally, we conclude the paper and look at future directions of our research in section 6.
2 Tree Kernel

2.1 Definition of similarity

Tree Kernel was proposed by Collins et al. (Collins, 2001a; Collins, 2001b) as a method to calculate the similarity between tree structures. Tree Kernel defines the similarity between two trees as the number of shared subtrees. A subtree S of tree T is defined as any tree subsumed by T that consists of more than one node; if any child of a node is included, all of that node's children are included.

Tree Kernel is not always suitable because the desired properties of a similarity measure differ depending on the application. Takahashi et al. proposed three types of similarity based on Tree Kernel (Takahashi, 2002). We use one of these measures (equation (1)):

    K_C(T_1, T_2) = max_{n_1 ∈ N_1, n_2 ∈ N_2} C(n_1, n_2)    (1)

where N_1 and N_2 are the sets of nodes of T_1 and T_2, and C(n_1, n_2) is the number of subtrees shared by the two trees rooted at nodes n_1 and n_2.

2.2 Algorithm to calculate similarity

Collins et al. (Collins, 2001a; Collins, 2001b) proposed an efficient method to calculate Tree Kernel by computing C(n_1, n_2) as follows.

· If the productions at n_1 and n_2 are different, C(n_1, n_2) = 0.

· If the productions at n_1 and n_2 are the same, and n_1 and n_2 are pre-terminals, then C(n_1, n_2) = 1.

· Else, if the productions at n_1 and n_2 are the same and n_1 and n_2 are not pre-terminals,

    C(n_1, n_2) = ∏_{i=1}^{nc(n_1)} (1 + C(ch(n_1, i), ch(n_2, i)))    (2)

where nc(n) is the number of children of node n and ch(n, i) is the i-th child node of n. Equation (2) recursively calculates C on the child nodes, and calculating the values of C in postorder avoids recalculation. Thus, the time complexity of K_C(T_1, T_2) is O(mn), where m and n are the numbers of nodes in T_1 and T_2 respectively.

2.3 Algorithm to retrieve sentences

Neither Collins nor Takahashi discussed retrieval algorithms using Tree Kernel. We use the following simple algorithm: we calculate the similarity K_C(T_1, T_2) between the query tree and every tree in the corpus and rank the trees in descending order of K_C.

Tree Kernel exploits all subtrees shared by two trees. It therefore requires a considerable amount of time in retrieval, because the similarity calculation must be performed for every pair of trees. To improve retrieval time, an index table can be used, as in general retrieval. However, indexing by all subtrees is difficult because a tree often includes millions of subtrees. For example, one sentence in the Titech Corpus (Noro et al., 2005) with 22 words and 87 nodes includes 8,213,574,246 subtrees. The number of subtrees in a tree with N nodes is bounded above by 2^N.
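For concreteness, the recursion in section 2.2 can be written down directly. The following is a minimal Python sketch (the paper's own implementation was in Ruby, as noted in section 5.2); the Node class and the names production, is_preterminal and K_C are assumptions of this sketch, not part of the original system. Memoizing C over node pairs reproduces the O(mn) behaviour described above.

    from itertools import product

    class Node:
        def __init__(self, label, children=()):
            self.label = label
            self.children = list(children)

    def production(n):
        # The production rule at n: its label plus the sequence of child labels.
        return (n.label, tuple(c.label for c in n.children))

    def is_preterminal(n):
        # A pre-terminal dominates exactly one leaf (a word).
        return len(n.children) == 1 and not n.children[0].children

    def nodes(t):
        yield t
        for c in t.children:
            yield from nodes(c)

    def C(n1, n2, memo):
        """Number of shared subtrees rooted at n1 and n2 (section 2.2)."""
        key = (id(n1), id(n2))
        if key in memo:
            return memo[key]
        if production(n1) != production(n2):
            result = 0
        elif is_preterminal(n1):
            result = 1
        else:
            result = 1
            for c1, c2 in zip(n1.children, n2.children):
                result *= 1 + C(c1, c2, memo)  # equation (2)
        memo[key] = result
        return result

    def K_C(t1, t2):
        """Equation (1): the largest C over all pairs of internal nodes."""
        memo = {}
        return max(C(n1, n2, memo)
                   for n1, n2 in product(nodes(t1), nodes(t2))
                   if n1.children and n2.children)

The shared memo dictionary plays the role of the postorder evaluation mentioned above: each node pair is computed at most once.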
3 Tree Overlapping

3.1 Definition of similarity

When an arbitrary node n_1 of tree T_1 is put on a node n_2 of tree T_2, some production rules may overlap in T_1 and T_2. We define C_TO(n_1, n_2) as the number of such overlapping production rules when n_1 is put on n_2 (Figure 2).

We now define C_TO(n_1, n_2) more precisely. First we define L(n_1, n_2) for node n_1 of T_1 and node n_2 of T_2. L(n_1, n_2) is the set of pairs of nodes which overlap each other when n_1 is put on n_2. For example, in Figure 2, L(b^1_1, b^2_1) = {(b^1_1, b^2_1), (d^1_1, d^2_1), (e^1_1, e^2_1), (g^1_1, g^2_1), (i^1_1, j^2_1)}, where a superscript identifies a tree and a subscript a position within the tree. L(n_1, n_2) is defined as follows, where m_i is a node of tree T_i and ch(n, i) is the i-th child of node n:

1. (n_1, n_2) ∈ L(n_1, n_2)

2. If (m_1, m_2) ∈ L(n_1, n_2), then (ch(m_1, i), ch(m_2, i)) ∈ L(n_1, n_2)

3. If (ch(m_1, i), ch(m_2, i)) ∈ L(n_1, n_2), then (m_1, m_2) ∈ L(n_1, n_2)

4. L(n_1, n_2) includes only the pairs generated by applying rules 2 and 3 recursively.

C_TO(n_1, n_2) is defined using L(n_1, n_2) as follows:

    C_TO(n_1, n_2) = |{(m_1, m_2) | m_1 ∈ NT(T_1), m_2 ∈ NT(T_2), (m_1, m_2) ∈ L(n_1, n_2), PR(m_1) = PR(m_2)}|    (3)

where NT(T) is the set of nonterminal nodes in tree T and PR(n) is the production rule rooted at node n.

The Tree Overlapping similarity S_TO(T_1, T_2) is defined using C_TO(n_1, n_2) as follows:

    S_TO(T_1, T_2) = max_{n_1 ∈ NT(T_1), n_2 ∈ NT(T_2)} C_TO(n_1, n_2)    (4)

This formula corresponds to equation (1) of Tree Kernel.

As an example, we calculate S_TO(T_1, T_2) for the trees in Figure 2 (1). Putting b^1_1 on b^2_1 gives Figure 2 (2), in which the two production rules b → d e and e → g overlap. Thus, C_TO(b^1_1, b^2_1) becomes 2. Overlapping g^1_1 and g^2_1 gives Figure 2 (3), in which only one production rule, g → i, overlaps. Thus, C_TO(g^1_1, g^2_1) becomes 1. Since there are no other node pairs that give a larger C_TO than 2, S_TO(T_1, T_2) becomes 2.

[Figure 2: Example of similarity calculation — (1) trees T_1 and T_2; (2) putting b^1_1 on b^2_1, giving C_TO(b^1_1, b^2_1) = 2; (3) putting g^1_1 on g^2_1, giving C_TO(g^1_1, g^2_1) = 1]
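To make the definition concrete, here is a minimal brute-force sketch of equations (3) and (4), reusing Node, nodes and production from the previous sketch; C_TO_down and S_TO are our names, and the code is an illustration rather than the paper's algorithm (the indexed algorithm follows in section 3.2). It relies on one observation: rule 3 extends an overlap upward, but the full count of any overlap is already obtained at its topmost node pair, and that pair is itself considered by the maximization in equation (4), so closing downward from every pair (rule 2 only) suffices.

    def C_TO_down(n1, n2):
        """Overlapping production rules when n1 is put on n2,
        walking only the downward closure of the overlap (rule 2)."""
        count = 0
        stack = [(n1, n2)]
        while stack:
            m1, m2 = stack.pop()
            if m1.children and m2.children and production(m1) == production(m2):
                count += 1
            # rule 2: children at the same position overlap as well
            stack.extend(zip(m1.children, m2.children))
        return count

    def S_TO(t1, t2):
        """Equation (4): maximize over all pairs of nonterminal nodes."""
        return max(C_TO_down(n1, n2)
                   for n1 in nodes(t1) if n1.children
                   for n2 in nodes(t2) if n2.children)

This brute force is as slow as Tree Kernel retrieval; the point of the indexed algorithm in section 3.2 is precisely to avoid it.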
3.2 Algorithm

Let us take the example in Figure 3 to explain the algorithm. Suppose that T_0 is a query tree and the corpus contains only two trees, T_1 and T_2.

[Figure 3: Example of Tree Overlapping-based retrieval — (1) query tree T_0 and corpus trees T_1 and T_2; (2) overlapping T_0 and T_1: score 2 pt.; (3) overlapping T_0 and T_2: score 1 pt.]

The method to find the tree most similar to a given query tree is basically the same as Tree Kernel's (section 2.2). Unlike with Tree Kernel, however, Tree Overlapping-based retrieval can be accelerated by indexing the corpus in advance. Thus, given a tree corpus, we build an index table I[p] which maps a production rule p to its occurrences. Occurrences of production rules are represented by their left-hand side symbols, and are distinguished with respect to the tree containing the rule and the position in the tree. I[p] is defined as follows:

    I[p] = {m | T ∈ F, m ∈ NT(T), p = PR(m)}    (5)

where F is the corpus (here {T_1, T_2}) and the other symbols have the same meaning as in the definition of C_TO (equation (3)).

Table 1 shows the index table generated from T_1 and T_2 in Figure 3 (1). In Table 1, the superscript of a nonterminal symbol identifies a tree, and the subscript identifies a position in the tree.

Table 1: Example of the index table

    p          I[p]
    a → b c    {a^1_1}
    b → d e    {b^1_1, b^2_1}
    e → g      {e^1_1, e^2_1}
    g → i      {g^1_1, g^2_1}
    a → g b    {a^2_1}
    g → j      {g^2_2}

By using the index table, we calculate C[n, m] with the following algorithm:

    for all (n, m) do C[n, m] := 0 end
    foreach n in NT(T_0) do
      foreach m in I[PR(n)] do
        (n', m') := top(n, m)
        C[n', m'] := C[n', m'] + 1
      end
    end

where top(n, m) returns the upper-most pair of overlapped nodes when nodes n and m overlap. The value of top uniquely identifies a situation of overlapping two trees. The function top(n, m) is computed by the following algorithm:

    function top(n, m);
    begin
      (n', m') := (n, m)
      while order(n') = order(m') do
        n' := parent(n')
        m' := parent(m')
      end
      return (n', m')
    end

where parent(n) is the parent node of n, and order(n) is the position of node n among its siblings.

Table 2 shows example values of top(n, m) generated by overlapping T_0 and T_1 in Figure 3. Note that top maps every pair of corresponding nodes in a given overlap situation to the pair of upper-most nodes of that situation. This enables us to use the value of top as an identifier of an overlap situation.

Table 2: Examples of top(n, m)

    (n, m)            top(n, m)
    (a^0_1, a^1_1)    (a^0_1, a^1_1)
    (b^0_1, b^1_1)    (a^0_1, a^1_1)
    (c^0_1, c^1_1)    (a^0_1, a^1_1)

Now C[top(n, m)] = C_TO(n, m), so the tree similarity between a query tree T_0 and each tree T in the corpus, S_TO(T_0, T), can be calculated by:

    S_TO(T_0, T) = max_{n ∈ NT(T_0), m ∈ NT(T)} C[top(n, m)]    (6)

3.3 Comparison with Tree Kernel

The value of S_TO(T_1, T_2) roughly corresponds to the number of production rules included in the largest subtree shared by T_1 and T_2. This value therefore represents the size of the subtree shared by both trees, like Tree Kernel's K_C, although the definition of subtree size is different.

One difference is that Tree Overlapping counts shared subtrees even when they are split by a non-shared node, as shown in Figure 4. In Figure 4, T_1 and T_2 share two subtrees rooted at b and c, but the parent nodes of these subtrees are not identical. While Tree Kernel does not consider the superposition putting node a on h, Tree Overlapping does, and assigns count 2 to this superposition.

[Figure 4: Example of counting two separated shared subtrees as one — (1) T_1; (2) T_2; (3) putting a on h, giving S_TO(T_1, T_2) = 2]

Another, more important, difference is that Tree Overlapping retrieval can be accelerated by indexing the corpus in advance. The number of index entries is bounded above by the number of production rules, which is a practical index size.
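The indexing algorithm of section 3.2 translates almost directly into code. The following Python sketch stays under the same assumptions as the earlier snippets (the paper's implementation was in Ruby); build_index, parent_order and retrieve are our names. It builds I[p], then scores each corpus tree against a query by bucketing production-rule hits under their top pair, as in equation (6).

    from collections import defaultdict

    def parent_order(root):
        """parent(n) and order(n) for every node, as used by top(n, m)."""
        info = {id(root): (None, 0)}
        stack = [root]
        while stack:
            node = stack.pop()
            for i, c in enumerate(node.children):
                info[id(c)] = (node, i)
                stack.append(c)
        return info

    def top(n, m, info_q, info_t):
        """Upper-most pair of overlapped nodes when n is put on m."""
        while True:
            pn, on = info_q[id(n)]
            pm, om = info_t[id(m)]
            if pn is None or pm is None or on != om:
                return (n, m)
            n, m = pn, pm

    def build_index(corpus):
        """Equation (5): production rule -> (tree, node) occurrences."""
        index = defaultdict(list)
        for t in corpus:
            for m in nodes(t):
                if m.children:
                    index[production(m)].append((t, m))
        return index

    def retrieve(query, corpus, index):
        """S_TO(query, T) for every corpus tree T, best match first."""
        info_q = parent_order(query)
        infos = {id(t): parent_order(t) for t in corpus}
        counts = defaultdict(int)  # (tree, top pair) -> C[n', m']
        best = defaultdict(int)    # tree -> running max, i.e. S_TO
        for n in nodes(query):
            if not n.children:
                continue
            for t, m in index.get(production(n), ()):
                n2, m2 = top(n, m, info_q, infos[id(t)])
                key = (id(t), id(n2), id(m2))
                counts[key] += 1
                best[id(t)] = max(best[id(t)], counts[key])
        return sorted(corpus, key=lambda t: -best[id(t)])

Each production rule shared by the query and a corpus tree contributes one increment to exactly one top pair, so the final per-tree maximum equals equation (6) without ever touching trees that share no rule with the query.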
4 Subpath Set

4.1 Definition of similarity

The Subpath Set similarity between two trees is defined as the number of subpaths shared by the trees. Given a tree, its subpaths are defined as the set of every path from the root node to a leaf together with all of their partial paths. Figure 5 (2) shows all subpaths of T_1 and T_2 in Figure 5 (1). Here we denote a path as a sequence of node names such as (a, b, d). The Subpath Set similarity of T_1 and T_2 is 15, the number of subpaths the two trees share.

[Figure 5: Example of subpaths — (1) trees T_1 and T_2; (2) their subpaths.
Subpaths of T_1: (a), (b), (d), (e), (g), (i), (a,b), (b,d), (b,e), (e,g), (g,i), (a,b,d), (a,b,e), (b,e,g), (a,b,e,g), and also (c), (a,c), (e,g,i), (b,e,g,i), (a,b,e,g,i).
Subpaths of T_2: the first fifteen subpaths of T_1 above, and also (j), (a,g), (g,j), (a,g,i), (e,g,j), (b,e,g,j), (a,b,e,g,j).
S_SS(T_1, T_2) = 15]

4.2 Algorithm

Suppose T_0 is a query tree, TS is the set of trees in the corpus and P(T) is the set of subpaths of T. We build an index table I[p] for each subpath p as follows:

    I[p] = {T | T ∈ TS ∧ p ∈ P(T)}    (7)

Using the index table, we can calculate S[T], the number of subpaths shared by T_0 and T, with the following algorithm:

    for all T do S[T] := 0 end
    foreach p in P(T_0) do
      foreach T in I[p] do
        S[T] := S[T] + 1
      end
    end

4.3 Comparison with Tree Kernel

Like Tree Overlapping, Subpath Set retrieval can be accelerated by indexing the corpus. The number of index entries per tree is bounded above by L × D^2, where L is the maximum number of leaves of a tree (the number of words in a sentence) and D is the maximum depth of the syntactic trees. Moreover, by treating a subpath as an index term, we can use existing retrieval tools.

Subpath Set uses less structural information than Tree Kernel and Tree Overlapping: it distinguishes neither the order nor the number of child nodes. The retrieval results therefore tend to be noisy. However, Subpath Set is faster than Tree Overlapping because the algorithm is simpler.
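The index and counting loop of section 4.2 can be sketched as follows, again reusing Node and nodes from the earlier snippets; subpaths, build_subpath_index and count_shared are our names, and the code is an illustration rather than the paper's Ruby implementation. Each subpath is a contiguous slice of a root-to-leaf label path.

    from collections import defaultdict

    def subpaths(tree):
        """P(T): all contiguous segments of root-to-leaf label paths."""
        result = set()
        def walk(node, path):
            path = path + [node.label]
            # every suffix of the current prefix is a subpath ending here
            for i in range(len(path)):
                result.add(tuple(path[i:]))
            for c in node.children:
                walk(c, path)
        walk(tree, [])
        return result

    def build_subpath_index(corpus):
        """Equation (7): I[p] = the set of trees containing subpath p."""
        index = defaultdict(set)
        for k, t in enumerate(corpus):
            for p in subpaths(t):
                index[p].add(k)
        return index

    def count_shared(query, corpus, index):
        """S[T] for every tree sharing at least one subpath, best first."""
        scores = defaultdict(int)
        for p in subpaths(query):
            for k in index.get(p, ()):
                scores[k] += 1
        return sorted(scores.items(), key=lambda kv: -kv[1])

Because a subpath is just a token sequence, the same index could be held by an off-the-shelf inverted-index engine, which is the point made in section 4.3 about reusing existing retrieval tools.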
5 Experiments

This section describes experiments conducted to compare the performance of structural retrieval based on Tree Kernel, Tree Overlapping and Subpath Set.

5.1 Data

We conducted two experiments using different annotated corpora. The Titech Corpus (Noro et al., 2005) consists of about 20,000 sentences from Japanese newspaper articles (Mainichi Shimbun). Each sentence has been syntactically annotated by hand. Due to the limitation of our computational resources, we used 2,483 randomly selected sentences as the data collection.

The Iwanami dictionary (Nishio et al., 1994) is a Japanese dictionary. We extracted 57,982 sentences from its glosses. Each sentence was analyzed with the morphological analyzer ChaSen (Asahara and Matsumoto, 2000) and the MSLR parser (Shirai et al., 2000) to obtain syntactic structure candidates. The most probable structure with respect to the PGLR model (Inui et al., 1996) was selected from the output of the parser. Since the structures were not checked manually, some sentences might have been assigned incorrect structures.

5.2 Method

We conducted two experiments, Experiment I and Experiment II, with different corpora. The queries were extracted from these corpora. The algorithms described in the preceding sections were implemented in Ruby 1.8.2. Table 3 outlines the experiments.

Table 3: Summary of experiments

                      Experiment I           Experiment II
    Target corpus     Titech Corpus          Iwanami dict.
    Corpus size       2,483 sent.            57,982 sent.
    No. of queries    100                    1,000
    CPU               Intel Xeon (2.4GHz)    PowerPC G5 (2.3GHz)
    Memory            2GB                    2GB

5.3 Results and discussion

Since each query is selected from the target corpus, the query itself is always ranked first in the retrieval result. In what follows, we therefore exclude the query tree from the results. We evaluated the algorithms on two factors: the average retrieval time (CPU time) per query (Table 4), and the rank given to the tree that was top-ranked by another algorithm (Table 5). For example, in Experiment I of Table 5, the column "5th" of the row "TO/TK" means that in 73% of the cases the tree top-ranked by Tree Kernel (TK) was ranked 5th or above by Tree Overlapping (TO).

Table 4: Average retrieval time per query [sec]

    Algorithm    Experiment I    Experiment II
    TK           529.42          3796.1
    TO           6.29            38.3
    SS           0.47            5.1

Table 5: The rank of the top-ranked tree by the other algorithm [%]

    Experiment I
    A/B      1st     5th     10th
    TO/TK    34.0    73.0    82.0
    SS/TK    16.0    35.0    45.0
    TK/TO    29.0    41.0    51.0
    SS/TO    27.0    49.0    58.0
    TK/SS    17.0    29.0    37.0
    TO/SS    29.0    58.0    69.0

    Experiment II
    A/B      1st     5th     10th
    TO/TK    74.6    88.0    92.0
    SS/TK    65.3    78.8    84.1
    TK/TO    71.1    81.0    84.6
    SS/TO    73.4    86.0    89.8
    TK/SS    65.5    75.9    79.7
    TO/SS    76.1    87.7    92.0

We consider Tree Kernel (TK) the baseline method because it is a well-known existing similarity measure and exploits more information than the others. Table 4 shows that on both corpora the retrieval speed of Tree Overlapping (TO) is about 100 times that of Tree Kernel, and the retrieval speed of Subpath Set (SS) is about 1,000 times that of Tree Kernel. These results show that we have successfully accelerated retrieval.

The retrieval times of Tree Overlapping, 6.29 and 38.3 sec. per query, may seem rather long. However, they could be shortened by tuning the implementation in a compiled language; the current implementation uses Ruby, an interpreted language.

Comparing Tree Overlapping and Subpath Set with respect to Tree Kernel (see rows "TO/TK" and "SS/TK" in Table 5), the trees top-ranked by Tree Kernel are ranked higher by Tree Overlapping than by Subpath Set. This means Tree Overlapping approximates Tree Kernel better than Subpath Set does.

Although the corpus in Experiment II is 20 times larger than that in Experiment I, the figures for Experiment II in Table 5 are better than those for Experiment I. This can be explained as follows: in Experiment II we used sentences from dictionary glosses, which tend to be formulaic and short, so similar sentences are easier to find than in Experiment I.

To summarize the results: when used for similarity calculation in tree structure retrieval, Tree Overlapping approximates Tree Kernel better than Subpath Set does, while Subpath Set is faster than Tree Overlapping.

6 Conclusion

We proposed two fast algorithms to retrieve sentences with a similar syntactic structure, Tree Overlapping (TO) and Subpath Set (SS), and compared them with Tree Kernel (TK), obtaining the following results:

· Tree Overlapping-based retrieval outputs results similar to Tree Kernel-based retrieval and is 100 times faster.

· Subpath Set-based retrieval is not as good at approximating Tree Kernel-based retrieval, but is 1,000 times faster.

Structural retrieval is useful for annotating corpora with syntactic information (Yoshida et al., 2004). We are developing a corpus annotation tool named "eBonsai" which supports humans in annotating corpora with syntactic information and in retrieving syntactic structures. Integrating annotation and retrieval enables annotators to annotate a new instance while looking back at already annotated instances that share a similar syntactic structure with the current one. For this purpose, the Tree Overlapping and Subpath Set algorithms speed up the retrieval process and thus make annotation more efficient.

However, the "similarity" of sentences is affected by semantic as well as structural aspects, and the output of the algorithms does not always conform to human intuition. For example, the two sentences in Figure 6 have very similar structures, including the particles, but they are hardly considered similar from a human viewpoint. In this respect, it is hard to say which algorithm is superior to the others.

[Figure 6: Example of a retrieved similar sentence — query: "A young man of a teaching material company came to the classroom"; top-ranked result: "A piece of the exploded bombshell hit his head" (two Japanese sentences with nearly identical parse trees)]

As future work, we need to develop a method that integrates content-based and structure-based similarity measures. To this end, we have to evaluate the algorithms in real application environments (e.g. information retrieval and machine translation), because the desired properties of a similarity measure differ depending on the application.
References

Asahara, M. and Matsumoto, Y. Extended Models and Tools for High-performance Part-of-Speech Tagger. Proceedings of COLING 2000, 2000.

Collins, M. and Duffy, N. Parsing with a Single Neuron: Convolution Kernels for Natural Language Problems. Technical report UCSC-CRL-01-01, University of California at Santa Cruz, 2001.

Collins, M. and Duffy, N. Convolution Kernels for Natural Language. In Proceedings of NIPS 2001, 2001.

Inui, K., Shirai, K., Tokunaga, T. and Tanaka, H. The Integration of Statistics-based Techniques in the Analysis of Japanese Sentences. Special Interest Group of Natural Language Processing, Information Processing Society of Japan, Vol. 96, No. 114, 1996.

Nagao, M. A framework of a mechanical translation between Japanese and English by analogy principle. In Alick Elithorn and Ranan Banerji, editors, Artificial and Human Intelligence, pages 173-180. Amsterdam, 1984.

Nishio, M., Iwabuchi, E. and Mizutani, S. (eds.) Iwanami Kokugo Jiten, Iwanami Shoten, 5th Edition, 1994.

Noro, T., Koike, C., Hashimoto, T., Tokunaga, T. and Tanaka, H. Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with respect to Dependency Measures. The 5th Workshop on Asian Language Resources, pp. 9-16, 2005.

Shirai, K., Ueki, M., Hashimoto, T., Tokunaga, T. and Tanaka, H. MSLR Parser Tool Kit - Tools for Natural Language Analysis. Journal of Natural Language Processing, Vol. 7, No. 5, pp. 93-112, 2000. (in Japanese)

Somers, H., McLean, I. and Jones, D. Experiments in multilingual example-based generation. CSNLP 1994: 3rd Conference on the Cognitive Science of Natural Language Processing, Dublin, 1994.

Takahashi, T., Inui, K. and Matsumoto, Y. Methods of Estimating Syntactic Similarity. Special Interest Group of Natural Language Processing, Information Processing Society of Japan, NL-150-7, 2002. (in Japanese)

Yoshida, K., Hashimoto, T., Tokunaga, T. and Tanaka, H. Retrieving annotated corpora for corpus annotation. Proceedings of the 4th International Conference on Language Resources and Evaluation: LREC 2004, pp. 1775-1778, 2004.