Scaling Distributional Similarity to Large Corpora

James Gorman and James R. Curran
School of Information Technologies
University of Sydney
NSW 2006, Australia
{jgorman2,james}@it.usyd.edu.au

Abstract

Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naïve nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly (O(n^2) in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.

1 Introduction

It is a general property of Machine Learning that increasing the volume of training data increases the accuracy of results. This is no more evident than in Natural Language Processing (NLP), where massive quantities of text are required to model rare language events. Despite the rapid increase in computational power available for NLP systems, the volume of raw data available still outweighs our ability to process it. Unsupervised learning, which does not require the expensive and time-consuming human annotation of data, offers an opportunity to use this wealth of data. Curran and Moens (2002) show that synonymy extraction for lexical semantic resources using distributional similarity produces continuing gains in accuracy as the volume of input data increases.

Extracting synonymy relations using distributional similarity is based on the distributional hypothesis that similar words appear in similar contexts. Terms are described by collating information about their occurrence in a corpus into vectors. These context vectors are then compared for similarity. Existing approaches differ primarily in their definition of "context", e.g. the surrounding words or the entire document, and their choice of distance metric for calculating similarity between the context vectors representing each term.

Manual creation of lexical semantic resources is open to the problems of bias, inconsistency and limited coverage. It is difficult to account for the needs of the many domains in which NLP techniques are now being applied and for the rapid change in language use. The assisted or automatic creation and maintenance of these resources would be of great advantage.

Finding synonyms using distributional similarity requires a nearest-neighbour search over the context vectors of each term. This is computationally intensive, scaling to O(n^2 m) for the number of terms n and the size of their context vectors m. Increasing the volume of input data will increase the size of both n and m, decreasing the efficiency of a naïve nearest-neighbour approach.

Many approaches to reduce this complexity have been suggested. In this paper we evaluate state-of-the-art techniques proposed to solve this problem. We find that the Spatial Approximation Sample Hierarchy (Houle and Sakuma, 2005) provides the best accuracy/efficiency trade-off.

2 Distributional Similarity

Measuring distributional similarity first requires the extraction of context information for each of the vocabulary terms from raw text. These terms are then compared for similarity using a nearest-neighbour search or clustering based on distance calculations between the statistical descriptions of their contexts.

2.1 Extraction

A context relation is defined as a tuple (w, r, w') where w is a term which occurs in some grammatical relation r with another word w' in some sentence. We refer to the tuple (r, w') as an attribute of w. For example, (dog, direct-obj, walk) indicates that dog was the direct object of walk in a sentence.

In our experiments, context extraction begins with a Maximum Entropy POS tagger and chunker. The SEXTANT relation extractor (Grefenstette, 1994) produces context relations that are then lemmatised. The relations for each term are collected together and counted, producing a vector of attributes and their frequencies in the corpus.
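A minimal sketch of this counting step, assuming context relations have already been extracted as (term, relation, word) tuples; the example tuples and helper name are illustrative, not taken from the paper's implementation.

```python
from collections import Counter, defaultdict

def build_context_vectors(relations):
    """Collect (w, r, w') tuples into a frequency vector of attributes (r, w') per term."""
    vectors = defaultdict(Counter)
    for w, r, w2 in relations:
        vectors[w][(r, w2)] += 1
    return vectors

# Hypothetical relations, e.g. from a POS-tagged, chunked and lemmatised corpus.
relations = [
    ("dog", "direct-obj", "walk"),
    ("dog", "subject", "bark"),
    ("dog", "direct-obj", "walk"),
]
vectors = build_context_vectors(relations)
print(vectors["dog"][("direct-obj", "walk")])  # 2
```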
2.2 Measures and Weights

Both nearest-neighbour and cluster analysis methods require a distance measure to calculate the similarity between context vectors. Curran (2004) decomposes this into measure and weight functions. The measure calculates the similarity between two weighted context vectors and the weight calculates the informativeness of each context relation from the raw frequencies. For these experiments we use the Jaccard (1) measure and the t-test (2) weight functions, found by Curran (2004) to have the best performance:

\[ \frac{\sum_{(r,w')} \min(w(w_m, r, w'),\, w(w_n, r, w'))}{\sum_{(r,w')} \max(w(w_m, r, w'),\, w(w_n, r, w'))} \quad (1) \]

\[ \frac{p(w, r, w') - p(*, r, w')\, p(w, *, *)}{\sqrt{p(*, r, w')\, p(w, *, *)}} \quad (2) \]

2.3 Nearest-neighbour Search

The simplest algorithm for finding synonyms is a k-nearest-neighbour (k-NN) search, which involves pair-wise vector comparison of the target term with every term in the vocabulary. Given an n term vocabulary and up to m attributes for each term, the asymptotic time complexity of nearest-neighbour search is O(n^2 m). This is very expensive, with even a moderate vocabulary making the use of huge datasets infeasible. Our largest experiments used a vocabulary of over 184,000 words.
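The sketch below shows how the weighted Jaccard measure (1) can drive the brute-force k-NN search just described, operating on the attribute-to-weight dictionaries built earlier. For brevity the t-test weighting is omitted and raw frequencies stand in as weights, so this illustrates the O(n^2 m) baseline rather than the authors' implementation.

```python
import heapq

def jaccard(u, v):
    """Weighted Jaccard measure over two attribute -> weight dicts (Equation 1)."""
    attrs = set(u) | set(v)
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in attrs)
    return num / den if den else 0.0

def naive_knn(target, vectors, k=100):
    """Compare the target against every other term: the O(n^2 m) brute-force search."""
    scores = ((jaccard(vectors[target], vec), term)
              for term, vec in vectors.items() if term != target)
    return heapq.nlargest(k, scores)
```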
3 Dimensionality Reduction

Using a cut-off to remove low frequency terms can significantly reduce the value of n. Unfortunately, reducing m by eliminating low frequency contexts has a significant impact on the quality of the results. There are many techniques to reduce dimensionality while avoiding this problem. The simplest methods use feature selection techniques, such as information gain, to remove the attributes that are less informative. Other techniques smooth the data while reducing dimensionality.

Latent Semantic Analysis (LSA, Landauer and Dumais, 1997) is a smoothing and dimensionality reduction technique based on the intuition that the true dimensionality of data is latent in the surface dimensionality. Landauer and Dumais admit that, from a pragmatic perspective, the same effect as LSA can be generated by using large volumes of data with very long attribute vectors. Experiments with LSA typically use attribute vectors of a dimensionality of around 1000. Our experiments have a dimensionality of 500,000 to 1,500,000. Decompositions on data this size are computationally difficult. Dimensionality reduction is often used before applying LSA to improve its scalability.

3.1 Heuristics

Another technique is to use an initial heuristic comparison to reduce the number of full O(m) vector comparisons that are performed. If the heuristic comparison is sufficiently fast and a sufficient number of full comparisons are avoided, the cost of the additional check is easily absorbed by the savings made.

Curran and Moens (2002) introduce a vector of canonical attributes (of bounded length k ≪ m), selected from the full vector, to represent the term. These attributes are the most strongly weighted verb attributes, chosen because they constrain the semantics of the term more and partake in fewer idiomatic collocations. If a pair of terms share at least one canonical attribute then a full similarity comparison is performed, otherwise the terms are not compared. They show an 89% reduction in search time, with only a 3.9% loss in accuracy.

There is a significant improvement in the computational complexity. If a maximum of p positive results are returned, our complexity becomes O(n^2 k + npm). When p ≪ n, the system will be faster as many fewer full comparisons will be made, but at the cost of accuracy as more possibly near results will be discarded out of hand.

4 Randomised Techniques

Conventional dimensionality reduction techniques can be computationally expensive: a more scalable solution is required to handle the volumes of data we propose to use. Randomised techniques provide a possible solution to this. We present two techniques that have been used recently for distributional similarity: Random Indexing (Kanerva et al., 2000) and Locality Sensitive Hashing (LSH, Broder, 1997).

4.1 Random Indexing

Random Indexing (RI) is a hashing technique based on Sparse Distributed Memory (Kanerva, 1993). Karlgren and Sahlgren (2001) showed RI produces results similar to LSA using the Test of English as a Foreign Language (TOEFL) evaluation. Sahlgren and Karlgren (2005) showed the technique to be successful in generating bilingual lexicons from parallel corpora.

In RI, we first allocate a d length index vector to each unique attribute. The vectors consist of a large number of 0s and a small number (ε) of randomly distributed ±1s. Context vectors, identifying terms, are generated by summing the index vectors of the attributes for each non-unique context in which a term appears. The context vector for a term t appearing in contexts c1 = [1, 0, 0, -1] and c2 = [0, 1, 0, -1] would be [1, 1, 0, -2]. The distance between these context vectors is then measured using the cosine measure:

\[ \cos(\theta(u, v)) = \frac{u \cdot v}{|u|\,|v|} \quad (3) \]

This technique allows for incremental sampling, where the index vector for an attribute is only generated when the attribute is encountered. Construction complexity is O(nmd) and search complexity is O(n^2 d).
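A minimal Random Indexing sketch under the description above: each attribute receives a sparse ±1 index vector of fixed length d, and a term's context vector is the frequency-weighted sum of the index vectors of its attributes. The parameter names, defaults and use of NumPy are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def index_vector(d=1000, eps=5):
    """Sparse index vector: mostly zeros with eps randomly placed +1/-1 entries."""
    v = np.zeros(d)
    positions = rng.choice(d, size=eps, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=eps)
    return v

def random_index(vectors, d=1000, eps=5):
    """Build d-dimensional RI context vectors from attribute -> frequency dicts."""
    index = {}    # incremental sampling: allocate an index vector on first sight
    context = {}
    for term, attrs in vectors.items():
        cv = np.zeros(d)
        for attr, freq in attrs.items():
            if attr not in index:
                index[attr] = index_vector(d, eps)
            cv += freq * index[attr]
        context[term] = cv
    return context

def cosine(u, v):
    """Cosine measure (Equation 3) between two RI context vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```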
4.2 Locality Sensitive Hashing

LSH is a probabilistic technique that allows the approximation of a similarity function. Broder (1997) proposed an approximation of the Jaccard similarity function using min-wise independent functions. Charikar (2002) proposed an approximation of the cosine measure using random hyperplanes. Ravichandran et al. (2005) used this cosine variant and showed it to produce over 70% accuracy in extracting synonyms when compared against Pantel and Lin (2002).

Given we have n terms in an m dimensional space, we create d ≪ m unit random vectors, also of m dimensions, labelled {r_1, r_2, ..., r_d}. Each vector is created by sampling a Gaussian function m times, with a mean of 0 and a variance of 1. For each term w we construct its bit signature using the function

\[ h_r(w) = \begin{cases} 1 & r \cdot w \geq 0 \\ 0 & r \cdot w < 0 \end{cases} \]

where r is a spherically symmetric random vector of length d. The signature, \bar{w}, is the d length bit vector:

\[ \bar{w} = \{h_{r_1}(w), h_{r_2}(w), \ldots, h_{r_d}(w)\} \]

The cost to build all n signatures is O(nmd). For terms u and v, Goemans and Williamson (1995) approximate the angular similarity by

\[ p(h_r(u) = h_r(v)) = 1 - \frac{\theta(u, v)}{\pi} \quad (4) \]

where θ(u, v) is the angle between u and v. The angular similarity gives the cosine by

\[ \cos(\theta(u, v)) = \cos\big((1 - p(h_r(u) = h_r(v)))\,\pi\big) \quad (5) \]

The probability can be derived from the Hamming distance:

\[ p(h_r(u) = h_r(v)) = 1 - \frac{H(\bar{u}, \bar{v})}{d} \quad (6) \]

By combining equations 5 and 6 we get the following approximation of the cosine distance:

\[ \cos(\theta(u, v)) = \cos\left(\frac{H(\bar{u}, \bar{v})}{d}\,\pi\right) \quad (7) \]

That is, the cosine of two context vectors is approximated by the cosine of the Hamming distance between their two signatures, normalised by the size of the signatures. Search is performed using Equation 7 and scales to O(n^2 d).
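A short sketch of the random hyperplane signatures and the Hamming-based cosine approximation of Equation 7, assuming each term's context vector has been materialised as a dense NumPy array of dimension m; the array shapes and function names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def signatures(context, d=3000):
    """d-bit signatures: the sign of the projection onto d random Gaussian hyperplanes."""
    terms = list(context)
    X = np.stack([context[t] for t in terms])   # n x m matrix of context vectors
    R = rng.standard_normal((X.shape[1], d))    # m x d random hyperplane normals
    return terms, (X @ R >= 0)                  # n x d boolean bit signatures

def approx_cosine(sig_u, sig_v):
    """Equation 7: cos(pi * Hamming(u, v) / d) from two d-bit signatures."""
    d = sig_u.shape[0]
    hamming = np.count_nonzero(sig_u != sig_v)
    return float(np.cos(np.pi * hamming / d))
```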
5 Data Structures

The methods presented above fail to address the n^2 component of the search complexity. Many data structures have been proposed that can be used to address this problem in similarity searching. We present three: the vantage point tree (VPT, Yianilos, 1993), which indexes points in a metric space, Point Location in Equal Balls (PLEB, Indyk and Motwani, 1998), a probabilistic structure that uses the bit signatures generated by LSH, and the Spatial Approximation Sample Hierarchy (SASH, Houle and Sakuma, 2005), which approximates a k-NN search.

Another option inspired by IR is attribute indexing (INDEX). In this technique, in addition to each term having a reference to its attributes, each attribute has a reference to the terms referencing it. Each term is then only compared with the terms with which it shares attributes. We give a theoretical comparison against the other techniques.

5.1 Vantage Point Tree

Metric space data structures provide a solution to near-neighbour searches in very high dimensions. These rely solely on the existence of a comparison function that satisfies the conditions of metricality: non-negativity, equality, symmetry and the triangle inequality. VPT is typical of these structures and has been used successfully in many applications.

The VPT is a binary tree designed for range searches. These are searches limited to some distance from the target term, but it can be modified for k-NN search.

VPT is constructed recursively. Beginning with a set of U terms, we take any term to be our vantage point p. This becomes our root. We now find the median distance m_p of all other terms to p: m_p = median{dist(p, u) | u ∈ U}. Those terms u such that dist(p, u) ≤ m_p are inserted into the left sub-tree, and the remainder into the right sub-tree. Each sub-tree is then constructed as a new VPT, choosing a new vantage point from within its terms, until all terms are exhausted.

Searching a VPT is also recursive. Given a term q and radius r, we begin by measuring the distance to the root term p. If dist(q, p) ≤ r we enter p into our list of near terms. If dist(q, p) - r ≤ m_p we enter the left sub-tree and if dist(q, p) + r > m_p we enter the right sub-tree. Both sub-trees may be entered. The process is repeated for each entered sub-tree, taking the vantage point of the sub-tree to be the new root term.

To perform a k-NN search we use a backtracking decreasing radius search (Burkhard and Keller, 1973). The search begins with r = ∞, and terms are added to a list of the closest k terms. When the kth closest term is found, the radius is set to the distance between this term and the target. Each time a new, closer element is added to the list, the radius is updated to the distance from the target to the new kth closest term.

Construction complexity is O(n log n). Search complexity is claimed to be O(log n) for small radius searches. This does not hold for our decreasing radius search, whose worst case complexity is O(n).
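A compact sketch of the VPT construction and the backtracking decreasing-radius k-NN search described above, assuming a metric dist function (for distributional similarity this could be one minus a similarity measure). It is an illustration of the data structure, not the evaluated implementation.

```python
import heapq
from statistics import median

class VPTree:
    """Vantage point tree over hashable items with a metric dist(a, b)."""

    def __init__(self, items, dist):
        self.dist = dist
        self.root = self._build(list(items))

    def _build(self, items):
        if not items:
            return None
        p = items.pop()                                   # any term can serve as the vantage point
        if not items:
            return (p, 0.0, None, None)
        ds = [self.dist(p, u) for u in items]
        m = median(ds)                                    # median distance m_p to the vantage point
        left = [u for u, d in zip(items, ds) if d <= m]
        right = [u for u, d in zip(items, ds) if d > m]
        return (p, m, self._build(left), self._build(right))

    def knn(self, q, k):
        """Backtracking decreasing-radius k-NN search (Burkhard and Keller, 1973)."""
        best = []                                         # max-heap of (-distance, term)

        def search(node):
            if node is None:
                return
            p, m, left, right = node
            d = self.dist(q, p)
            if len(best) < k or d < -best[0][0]:
                heapq.heappush(best, (-d, p))
                if len(best) > k:
                    heapq.heappop(best)
            r = -best[0][0] if len(best) == k else float("inf")
            if d - r <= m:                                # the left sub-tree may hold closer terms
                search(left)
            if d + r > m:                                 # and/or the right sub-tree
                search(right)

        search(self.root)
        return sorted((-nd, t) for nd, t in best)
```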
5.2 PLEB

Point Location in Equal Balls (PLEB) is a randomised structure that uses the bit signatures generated by LSH. It was used by Ravichandran et al. (2005) to improve the efficiency of distributional similarity calculations.

Having generated our d length bit signatures for each of our n terms, we take these signatures and randomly permute the bits. Each vector has the same permutation applied. This is equivalent to a column reordering in a matrix where the rows are the terms and the columns the bits. After applying the permutation, the list of terms is sorted lexicographically based on the bit signatures. The list is scanned sequentially, and each term is compared to its B nearest neighbours in the list. The choice of B will affect the accuracy/efficiency trade-off, and need not be related to the choice of k. This is performed q times, using a different random permutation function each time. After each iteration, the current closest k terms are stored.

For a fixed d, the complexity of the permutation step is O(qn), the sorting O(qn log n) and the search O(qBn).

5.3 Spatial Approximation Sample Hierarchy

SASH approximates a k-NN search by precomputing some near neighbours for each node (terms in our case). This produces multiple paths between terms, allowing SASH to shape itself to the data set (Houle, 2003). The following description is adapted from Houle and Sakuma (2005).

The SASH is a directed, edge-weighted graph with the following properties (see Figure 1):

- Each term corresponds to a unique node.
- The nodes are arranged into a hierarchy of levels, with the bottom level containing n/2 nodes and the top containing a single root node. Each level, except the top, will contain half as many nodes as the level below.
- Edges between nodes link consecutive levels. Each node will have at most p parent nodes in the level above, and c child nodes in the level below.
- Every node must have at least one parent so that all nodes are reachable from the root.

[Figure 1: A SASH, where p = 2, c = 3 and k = 2]

Construction begins with the nodes being randomly distributed between the levels. SASH is then constructed iteratively by each node finding its closest p parents in the level above. The parent will keep the closest c of these children, forming edges in the graph, and reject the rest. Any nodes without parents after being rejected are then assigned as children of the nearest node in the previous level with fewer than c children.

Searching is performed by finding the k nearest nodes at each level, which are added to a set of near nodes. To limit the search, only those nodes whose parents were found to be nearest at the previous level are searched. The k closest nodes from the set of near nodes are then returned. The search complexity is O(ck log n).

In Figure 1, the filled nodes demonstrate a search for the near-neighbours of some node q, using k = 2. Our search begins with the root node A. As we are using k = 2, we must find the two nearest children of A using our similarity measure. In this case, C and D are closer than B. We now find the closest two children of C and D. E is not checked as it is only a child of B. All other nodes are checked, including F and G, which are shared as children by B and C. From this level we choose G and H. The final levels are considered similarly. At this point we have the list of near nodes A, C, D, G, H, I, J, K and L. From this we choose the two nodes nearest q, H and I (marked in black in Figure 1), which are then returned.

k can be varied at each level to force a larger number of elements to be tested at the base of the SASH using, for instance, the equation:

\[ k_i = \max\left\{ k^{1 - \frac{h - i}{\log n}},\; \tfrac{1}{2} p c \right\} \quad (8) \]

This changes our search complexity to:

\[ O\!\left( \frac{k^{1 + \frac{1}{\log n}}}{k^{\frac{1}{\log n}} - 1} + \frac{p c^2}{2} \log n \right) \quad (9) \]

We use this geometric function in our experiments.

Gorman and Curran (2005a; 2005b) found the performance of SASH for distributional similarity could be improved by replacing the initial random ordering with a frequency based ordering. In accordance with Zipf's law, the majority of terms have low frequencies. Comparisons made with these low frequency terms are unreliable (Curran and Moens, 2002). Creating the SASH with high frequency terms near the root produces more reliable initial paths, but comparisons against these terms are more expensive.

The best accuracy/efficiency trade-off was found when using more reliable initial paths rather than the most reliable. This is done by folding the data around some mean number of relations. For each term, if its number of relations m_i is greater than some chosen number of relations M, it is given a new ranking based on the score M^2/m_i. Otherwise its ranking is based on its number of relations. This has the effect of pushing very high and very low frequency terms away from the root.
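A small sketch of the two tuning ideas just described: the per-level search breadth of Equation 8 and the folded frequency score used to order terms. The natural logarithm, the rounding up to an integer, and the assumption that terms are then sorted by this key in decreasing order (so terms with roughly M relations sit nearest the root) are illustrative choices, not details given in the paper.

```python
import math

def level_k(k, level, height, n, p=4, c=16):
    """Equation 8: widen the search towards the base of the SASH."""
    geometric = k ** (1 - (height - level) / math.log(n))
    return max(int(math.ceil(geometric)), p * c // 2)

def folded_rank_key(num_relations, M=1000):
    """Folded ordering score: M^2 / m_i for terms with more than M relations,
    otherwise the number of relations itself, pushing very high and very low
    frequency terms away from the root."""
    m_i = num_relations
    return M * M / m_i if m_i > M else m_i
```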
6 Evaluation Measures

The simplest method for evaluation is the direct comparison of extracted synonyms with a manually created gold standard (Grefenstette, 1994). To reduce the problem of limited coverage, our evaluation combines three electronic thesauri: the Macquarie, Roget's and Moby thesauri.

We follow Curran (2004) and use two performance measures: direct matches (DIRECT) and inverse rank (INVR). DIRECT is the percentage of returned synonyms found in the gold standard. INVR is the sum of the inverse rank of each matching synonym, e.g. matches at ranks 3, 5 and 28 give an inverse rank score of 1/3 + 1/5 + 1/28. With at most 100 matching synonyms, the maximum INVR is 5.187. INVR is more fine-grained as it incorporates both the number of matches and their ranking.

The same 300 single-word nouns were used for evaluation as were used by Curran (2004) for his large scale evaluation. These were chosen randomly from WordNet such that they covered a range over the following properties: frequency, number of senses, specificity and concreteness. For each of these terms, the closest 100 terms and their similarity scores were extracted.
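A few lines making the two measures concrete; the maximum INVR of 5.187 is simply the 100th harmonic number, and representing the gold standard as a set of synonyms is an assumption for illustration.

```python
def direct(returned, gold):
    """DIRECT: percentage of returned synonyms found in the gold standard."""
    return 100.0 * sum(1 for term in returned if term in gold) / len(returned)

def invr(returned, gold):
    """INVR: sum of inverse ranks (1-based) of the matching synonyms."""
    return sum(1.0 / rank for rank, term in enumerate(returned, start=1) if term in gold)

# With 100 returned terms all matching, INVR is the 100th harmonic number.
print(round(sum(1.0 / r for r in range(1, 101)), 3))  # 5.187
```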
7 Experiments

We use two corpora in our experiments: the smaller is the non-speech portion of the British National Corpus (BNC), 90 million words covering a wide range of domains and formats; the larger consists of the BNC, the Reuters Corpus Volume 1 and most of the English news holdings of the LDC in 2003, representing over 2 billion words of text (LARGE, Curran, 2004).

The semantic similarity system implemented by Curran (2004) provides our baseline. This performs a brute-force k-NN search (NAIVE). We present results for the canonical attribute heuristic (HEURISTIC), RI, LSH, PLEB, VPT and SASH.

We take the optimal canonical attribute vector length of 30 for HEURISTIC from Curran (2004). For SASH we take the optimal values of p = 4 and c = 16 and use the folded ordering taking M = 1000 from Gorman and Curran (2005b). For RI, LSH and PLEB we found optimal values experimentally using the BNC. For LSH we chose d = 3,000 (LSH_3,000) and 10,000 (LSH_10,000), showing the effect of changing the dimensionality. The frequency statistics were weighted using mutual information, as in Ravichandran et al. (2005):

\[ \log\left( \frac{p(w, r, w')}{p(w, *, *)\, p(*, r, w')} \right) \quad (10) \]

PLEB used the values q = 500 and B = 100.

The initial experiments on RI produced quite poor results. The intuition was that this was caused by the lack of smoothing in the algorithm. Experiments were performed using the weights given in Curran (2004). Of these, mutual information (10), evaluated with an extra log2(f(w, r, w') + 1) factor and limited to positive values, produced the best results (RI_MI). The values d = 1000 and ε = 5 were found to produce the best results.

All experiments were performed on 3.2GHz Xeon P4 machines with 4GB of RAM.

8 Results

As the accuracy of comparisons between terms increases with frequency (Curran, 2004), applying a frequency cut-off will both reduce the size of the vocabulary (n) and increase the average accuracy of comparisons. Table 1 shows the reduction in vocabulary and the increase in average context relations per term as the cut-off increases.

CORPUS   CUT-OFF   TERMS     AVERAGE RELATIONS PER TERM
BNC      0         246,067   43
BNC      5         88,926    116
BNC      100       14,862    617
LARGE    0         541,722   97
LARGE    5         184,494   281
LARGE    100       35,618    1,400

Table 1: Extracted Context Information

For LARGE, the initial 541,722 word vocabulary is reduced by 66% when a cut-off of 5 is applied and by 86% when the cut-off is increased to 100. The average number of relations increases from 97 to 1,400.

The work by Curran (2004) largely uses a frequency cut-off of 5. When this cut-off was used with the randomised techniques RI and LSH, it produced quite poor results. When the cut-off was increased to 100, as used by Ravichandran et al. (2005), the results improved significantly. Table 2 shows the INVR scores for our various techniques using the BNC with cut-offs of 5 and 100.

CUT-OFF   NAIVE   HEURISTIC   RI     LSH_10,000   SASH
5         1.72    1.65        0.80   1.26         1.73
100       1.71    1.66        0.93   1.31         1.71

Table 2: INVR vs frequency cut-off

Table 3 shows the results of a full thesaurus extraction using the BNC and LARGE corpora with a cut-off of 100. The average DIRECT score and INVR are from the 300 test words. The total execution time is extrapolated from the average search time of these test words and includes the setup time. For LARGE, extraction using NAIVE takes 444 hours: over 18 days. If the 184,494 word vocabulary were used, it would take over 7,000 hours, or nearly 300 days. This gives some indication of the scale of the problem.

              BNC                            LARGE
              DIRECT   INVR   Time           DIRECT   INVR   Time
NAIVE         5.23     1.71   38.0hr         5.70     1.93   444.3hr
HEURISTIC     4.94     1.66   2.0hr          5.51     1.93   30.2hr
RI            2.97     0.93   0.4hr          2.42     0.85   1.9hr
RI_MI         3.49     1.41   0.4hr          4.58     1.75   1.9hr
LSH_3,000     2.00     0.76   0.7hr          2.92     1.07   3.6hr
LSH_10,000    3.68     1.31   2.3hr          3.77     1.40   8.4hr
PLEB_3,000    2.00     0.76   1.2hr          2.85     1.07   4.1hr
PLEB_10,000   3.66     1.30   3.9hr          3.63     1.37   11.8hr
VPT           5.23     1.71   15.9hr         5.70     1.93   336.1hr
SASH          5.17     1.71   2.0hr          5.29     1.89   23.7hr

Table 3: Full thesaurus extraction

The only technique to become less accurate when the corpus size is increased is RI; it is likely that RI is sensitive to high frequency, low information contexts that are more prevalent in LARGE. Weighting reduces this effect, improving accuracy.

The importance of the choice of d can be seen in the results for LSH. While much slower, LSH_10,000 is also much more accurate than LSH_3,000, while still being much faster than NAIVE. Introducing the PLEB data structure does not improve the efficiency while incurring a small cost in accuracy. We are not using large enough datasets to show the improved time complexity of PLEB.

VPT is only slightly faster than NAIVE. This is not surprising in light of the original design of the data structure: decreasing radius search does not guarantee search efficiency.

A significant influence on the speed of the randomised techniques, RI and LSH, is the fixed dimensionality. The randomised techniques use a fixed length vector which is not influenced by the size of m. The drawback of this is that the size of the vector needs to be tuned to the dataset.

It would seem at first glance that HEURISTIC and SASH provide very similar results, with HEURISTIC slightly slower, but more accurate. This misses the difference in time complexity between the methods: HEURISTIC is n^2 and SASH n log n. The improvement in execution time over NAIVE decreases as corpus size increases and this would be expected to continue. Further tuning of SASH parameters may improve its accuracy.

RI_MI produces similar results using LARGE to SASH using the BNC. This does not include the cost of extracting context relations from the raw text, so the true comparison is much worse. SASH allows the free use of weight and measure functions, but RI is constrained by having to transform any context space into an RI space. This is important when considering that different tasks may require different weights and measures (Weeds and Weir, 2005). RI also suffers n^2 complexity, whereas SASH is n log n. Taking these into account, and that the improvements are barely significant, SASH is a better choice.

The results for LSH are disappointing. It performs consistently worse than the other methods except VPT. This could be improved by using larger bit vectors, but there is a limit to the size of these as they represent a significant memory overhead, particularly as the vocabulary increases.

CUT-OFF (LARGE)   NAIVE     SASH     INDEX
0                 541,721   10,599   5,844
5                 184,493   8,796    13,187
100               35,617    6,231    32,663

Table 4: Average number of comparisons per term

Table 4 presents the theoretical analysis of attribute indexing. The average numbers of comparisons made for various cut-offs of LARGE are shown. NAIVE and INDEX are the actual values for those techniques. The values for SASH are the worst case, where the maximum number of terms is compared at each level. The actual number of comparisons made will be much less.

The efficiency of INDEX is sensitive to the density of attributes, and increasing the cut-off increases the density. This is seen in the dramatic drop in performance as the cut-off increases. This problem of density will increase as the volume of raw input data increases, further reducing its effectiveness. SASH is only dependent on the number of terms, not the density.

Where the need for computational efficiency outweighs the need for accuracy, RI_MI provides better results. SASH is the most balanced of the techniques tested and provides the most scalable, high quality results.
9 Conclusion

We have evaluated several state-of-the-art techniques for improving the efficiency of distributional similarity measurements. We found that, in terms of raw efficiency, Random Indexing (RI) was significantly faster than any other technique, but at the cost of accuracy. Even after our modifications to the RI algorithm to significantly improve its accuracy, SASH still provides a better accuracy/efficiency trade-off. This is more evident when considering the time to extract context information from the raw text. SASH, unlike RI, also allows us to choose both the weight and the measure used. LSH and PLEB could not match either the efficiency of RI or the accuracy of SASH.

We intend to use this knowledge to process even larger corpora to produce more accurate results. Having set out to improve the efficiency of distributional similarity searches while limiting any loss in accuracy, we are producing full nearest-neighbour searches 18 times faster, with only a 2% loss in accuracy.
Acknowledgements

We would like to thank our reviewers for their helpful feedback and corrections. This work has been supported by the Australian Research Council under Discovery Project DP0453131.

References

Andrei Broder. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21–29, Salerno, Italy.

Walter A. Burkhard and Robert M. Keller. 1973. Some approaches to best-match file searching. Communications of the ACM, 16(4):230–236, April.

Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pages 380–388, Montreal, Quebec, Canada, 19–21 May.

James Curran and Marc Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon, pages 59–66, Philadelphia, PA, USA, 12 July.

James Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, University of Edinburgh.

Michel X. Goemans and David P. Williamson. 1995. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the Association for Computing Machinery, 42(6):1115–1145, November.

James Gorman and James Curran. 2005a. Approximate searching for distributional similarity. In ACL-SIGLEX 2005 Workshop on Deep Lexical Acquisition, Ann Arbor, MI, USA, 30 June.

James Gorman and James Curran. 2005b. Augmenting approximate similarity searching with lexical information. In Australasian Language Technology Workshop, Sydney, Australia, 9–11 November.

Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston.

Michael E. Houle and Jun Sakuma. 2005. Fast approximate similarity search in extremely high-dimensional data sets. In Proceedings of the 21st International Conference on Data Engineering, pages 619–630, Tokyo, Japan.

Michael E. Houle. 2003. Navigating massive data sets via local clustering. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–552, Washington, DC, USA.

Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604–613, New York, NY, USA, 24–26 May. ACM Press.

Pentti Kanerva, Jan Kristoferson, and Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, page 1036, Mahwah, NJ, USA.

Pentti Kanerva. 1993. Sparse distributed memory and related models. In M. H. Hassoun, editor, Associative Neural Memories: Theory and Implementation, pages 50–76. Oxford University Press, New York, NY, USA.

Jussi Karlgren and Magnus Sahlgren. 2001. From words to understanding. In Y. Uesaka, P. Kanerva, and H. Asoh, editors, Foundations of Real-World Intelligence, pages 294–308. CSLI Publications, Stanford, CA, USA.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240, April.

Patrick Pantel and Dekang Lin. 2002. Discovering word senses from text. In Proceedings of ACM SIGKDD-02, pages 613–619, 23–26 July.

Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering. In Proceedings of the 43rd Annual Meeting of the ACL, pages 622–629, Ann Arbor, USA.

Magnus Sahlgren and Jussi Karlgren. 2005. Automatic bilingual lexicon acquisition using random indexing of parallel corpora. Journal of Natural Language Engineering, Special Issue on Parallel Texts, 11(3), June.

Julie Weeds and David Weir. 2005. Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4):439–475, December.

Peter N. Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 311–321, Philadelphia.