Using lexical and relational similarity to classify semantic relations
´ e Diarmuid O S´ aghdha Computer Laboratory University of Cambridge 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom do242@cl.cam.ac.uk Abstract
Many methods are available for computing semantic similarity between individual words, but certain NLP tasks require the comparison of word pairs. This paper presents a kernel-based framework for application to relational reasoning tasks of this kind. The model presented here combines information about two distinct types of word pair similarity: lexical similarity and relational similarity. We present an efficient and flexible technique for implementing relational similarity and show the effectiveness of combining lexical and relational models by demonstrating state-ofthe-art results on a compound noun interpretation task.

Ann Copestake Computer Laboratory University of Cambridge 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom aac10@cl.cam.ac.uk
has received a great deal of attention in recent years (Girju et al., 2005; Turney, 2006; Butnariu and Veale, 2008). In English (and other languages), the process of producing new lexical items through compounding is very frequent and very productive. Furthermore, the noun-noun relation expressed by a given compound is not explicit in its surface form: a steel knife may be a knife made from steel but a kitchen knife is most likely to be a knife used in a kitchen, not a knife made from a kitchen. The assumption made by similarity-based interpretation methods is that the likely meaning of a novel compound can be predicted by comparing it to previously seen compounds whose meanings are known. This is a natural framework for computational techniques; there is also empirical evidence for similaritybased interpretation in human compound processing (Ryder, 1994; Devereux and Costello, 2007). This paper presents an approach to relational reasoning based on combining information about two kinds of similarity between word pairs: lexical similarity and relational similarity. The assumptions underlying these two models of similarity are sketched in Section 2. In Section 3 we describe how these models can be implemented for statistical machine learning with kernel methods. We present a new flexible and efficient kernelbased framework for classification with relational similarity. In Sections 4 and 5 we apply our methods to a compound interpretation task and demonstrate that combining models of lexical and relational similarity can give state-of-the-art results on a compound noun interpretation task, surpassing the performance attained by either model taken alone. We then discuss previous research on relational similarity, and show that some previously proposed models can be implemented in our framework as special cases. Given the good performance achieved for compound interpretation, it seems likely that the methods presented in this pa-

1

Introduction

The problem of modelling semantic similarity between words has long attracted the interest of researchers in Natural Language Processing and has been shown to be important for numerous applications. For some tasks, however, it is more appropriate to consider the problem of modelling similarity between pairs of words. This is the case when dealing with tasks involving relational or analogical reasoning. In such tasks, the challenge is to compare pairs of words on the basis of the semantic relation(s) holding between the members of each pair. For example, the noun pairs (steel,knife) and (paper,cup) are similar because in both cases the relation N2 is made of N1 frequently holds between their members. Analogical tasks are distinct from (but not unrelated to) other kinds of "relation extraction" tasks where each data item is tied to a specific sentence context (e.g., Girju et al. (2007)). One such relational reasoning task is the problem of compound noun interpretation, which

Proceedings of the 12th Conference of the European Chapter of the ACL, pages 621­629, Athens, Greece, 30 March ­ 3 April 2009. c 2009 Association for Computational Linguistics

621

per can also be applied successfully to other relational reasoning tasks; we suggest some directions for future research in Section 7.

2

Two models of word pair similarity

While there is a long tradition of NLP research on methods for calculating semantic similarity between words, calculating similarity between pairs (or n-tuples) of words is a less well-understood problem. In fact, the problem has rarely been stated explicitly, though it is implicitly addressed by most work on compound noun interpretation and semantic relation extraction. This section describes two complementary approaches for using distributional information extracted from corpora to calculate noun pair similarity. The first model of pair similarity is based on standard methods for computing semantic similarity between individual words. According to this lexical similarity model, word pairs (w1 , w2 ) and (w3 , w4 ) are judged similar if w1 is similar to w3 and w2 is similar to w4 . Given a measure wsim of word-word similarity, a measure of pair similarity psim can be derived as a linear combination of pairwise lexical similarities: psim((w1 , w2 ), (w3 , w4 )) = [wsim(w1 , w3 )] + [wsim(w2 , w4 )] A great number of methods for lexical semantic similarity have been proposed in the NLP literature. The most common paradigm for corpusbased methods, and the one adopted here, is based on the distributional hypothesis: that two words are semantically similar if they have similar patterns of co-occurrence with other words in some set of contexts. Curran (2004) gives a comprehensive overview of distributional methods. The second model of pair similarity rests on the assumption that when the members of a word pair are mentioned in the same context, that context is likely to yield information about the relations holding between the words' referents. For example, the members of the pair (bear, f orest) may tend to co-occur in contexts containing patterns such as w1 lives in the w2 and in the w2 ,. . . a w1 , suggesting that a LOCATED IN or LIVES IN relation frequently holds between bears and forests. If the contexts in which fish and reef co-occur are similar to those found for bear and forest, this is evidence that the same semantic relation tends to (1)

hold between the members of each pair. A relational distributional hypothesis therefore states that two word pairs are semantically similar if their members appear together in similar contexts. The distinction between lexical and relational similarity for word pair comparison is recognised by Turney (2006) (he calls the former attributional similarity), though the methods he presents focus ´ e on relational similarity. O S´ aghdha and Copestake's (2007) classification of information sources for noun compound interpretation also includes a description of lexical and relational similarity. Approaches to compound noun interpretation have tended to use either lexical or relational similarity, though rarely both (see Section 6 below).

3
3.1

Kernel methods for pair similarity
Kernel methods

The kernel framework for machine learning is a natural choice for similarity-based classification (Shawe-Taylor and Cristianini, 2004). The central concept in this framework is the kernel function, which can be viewed as a measure of similarity between data items. Valid kernels must satisfy the mathematical condition of positive semidefiniteness; this is equivalent to requiring that the kernel function equate to an inner product in some vector space. The kernel can be expressed in terms of a mapping function  from the input space X to a feature space F: k(xi , xj ) = (xi ), (xj )
F

(2)

where ·, · F is the inner product associated with F. X and F need not have the same dimensionality or be of the same type. F is by definition an inner product space, but the elements of X need not even be vectorial, so long as a suitable mapping function  can be found. Furthermore, it is often possible to calculate kernel values without explicitly representing the elements of F; this allows the use of implicit feature spaces with a very high or even infinite dimensionality. Kernel functions have received significant attention in recent years, most notably due to the successful application of Support Vector Machines (Cortes and Vapnik, 1995) to many problems. The SVM algorithm learns a decision boundary between two data classes that maximises the minimum distance or margin from the training points in each class to the boundary. The geometry of the space in which this boundary is set depends on the

622

kernel function used to compare data items. By tailoring the choice of kernel to the task at hand, the user can use prior knowledge and intuition to improve classification performance. One useful property of kernels is that any sum or linear combination of kernel functions is itself a valid kernel. Theoretical analyses (Cristianini et al., 2001; Joachims et al., 2001) and empirical investigations (e.g., Gliozzo et al. (2005)) have shown that combining kernels in this way can have a beneficial effect when the component kernels capture different "views" of the data while individually attaining similar levels of discriminative performance. In the experiments described below, we make use of this insight to integrate lexical and relational information for semantic classification of compound nouns. 3.2 Lexical kernels

3.3

String embedding functions

´ e O S´ aghdha and Copestake (2008) demonstrate how standard techniques for distributional similarity can be implemented in a kernel framework. In particular, kernels for comparing probability distributions can be derived from standard probabilistic distance measures through simple transformations. These distributional kernels are suited to a data representation where each word w is identified with the a vector of conditional probabilities (P (c1 |w), . . . , P (c|C| |w)) that defines a distribution over other terms c co-occurring with w. For example, the following positive semi-definite kernel between words can be derived from the wellknown Jensen-Shannon divergence: k jsd (w1 , w2 ) = -
c

The necessary starting point for our implementation of relational similarity is a means of comparing contexts. Contexts can be represented in a variety of ways, from unordered bags of words to rich syntactic structures. The context representation adopted here is based on strings, which preserve useful information about the order of words in the context yet can be processed and compared quite efficiently. String kernels are a family of kernels that compare strings s, t by mapping them into feature vectors String (s), String (t) whose non-zero elements index the subsequences contained in each string. A string is defined as a finite sequence s = (s1 , . . . , sl ) of symbols belonging to an alphabet . l is the set of all strings of length l, and  is set of all strings or the language. A subsequence u of s is defined by a sequence of indices i = (i1 , . . . , i|u| ) such that 1  i1 < · · · < i|u|  |s|, where |s| is the length of s. len(i) = i|u| - i1 + 1 is the length of the subsequence in s. An embedl ding String :   R|| is a function that maps a string s onto a vector of positive "counts" that correspond to subsequences contained in s. One example of an embedding function is a gap-weighted embedding, defined as gapl (s) = [
i:s[i]=u

len(i) ]ul

(4)

[P (c|w1 ) log2 ( + P (c|w2 ) log2 (

P (c|w1 ) ) P (c|w1 ) + P (c|w2 )

P (c|w2 ) )] (3) P (c|w1 ) + P (c|w2 )

A straightforward method of extending this model to word pairs is to represent each pair (w1 , w2 ) as the concatenation of the co-occurrence probability vectors for w1 and w2 . Taking kjsd as a measure of word similarity and introducing parameters  and  to scale the contributions of w1 and w2 respectively, we retrieve the lexical model of pair similarity defined above in (1). Without prior knowledge of the relative importance of each pair constituent, it is natural to set both scaling parameters to 0.5, and this is done in the experiments below.

 is a decay parameter between 0 and 1; the smaller its value, the more the influence of a discontinuous subsequence is reduced. When l = 1 this corresponds to a "bag-of-words" embedding. Gap-weighted string kernels implicitly compute the similarity between two strings s, t as an inner product (s), (t) . Lodhi et al. (2002) present an efficient dynamic programming algorithm that evaluates this kernel in O(l|s||t|) time without explicitly representing the feature vectors (s), (t). An alternative embedding is that used by Turney (2008) in his PairClass system (see Section 6). For the PairClass embedding P C , an n-word context [0-1 words] N1|2 [0-3 words] N1|2 [0-1 words] containing target words N1 , N2 is mapped onto the 2n-2 patterns produced by substituting zero or more of the context words with a wildcard . Unlike the patterns used by the gap-weighted embedding these are not truly discontinuous, as each wildcard must match exactly one word.

623

3.4

Kernels on sets

String kernels afford a way of comparing individual contexts. In order to compute the relational similarity of two pairs, however, we do not want to associate each pair with a single context but rather with the set of contexts in which they appear together. In this section, we use string embeddings to define kernels on sets of strings. One natural way of defining a kernel over sets is to take the average of the pairwise basic kernel values between members of the two sets A and B. Let k0 be a kernel on a set X , and let A, B  X be sets of cardinality |A| and |B| respectively. The averaged kernel is defined as kave (A, B) = 1 |A||B| k0 (a, b)
aA bB

(5)

This kernel was introduced by G¨ rtner et a al. (2002) in the context of multiple instance learning. It was first used for computing relational sim´ e ilarity by O S´ aghdha and Copestake (2007). The efficiency of the kernel computation is dominated by the |A| × |B| basic kernel calculations. When each basic kernel calculation k0 (a, b) has significant complexity, as is the case with string kernels, calculating kave can be slow. A second perspective views each set as corresponding to a probability distribution, and takes the members of that set as observed samples from that distribution. In this way a kernel on distributions can be cast as a kernel on sets. In the case of sets whose members are strings, a string embedding String can be used to estimate a probability distribution over subsequences for each set by taking the normalised sum of the feature mappings of its members: Set (A) = 1 Z String (s)
sA

(6)

where Z is a normalisation factor. Different choices of String yield different relational similarity models. In this paper we primarily use the gap-weighted embedding gapl ; we also discuss the PairClass embedding P C for comparison. Once the embedding Set has been calculated, any suitable inner product can be applied to the resulting vectors, e.g. the linear kernel (dot product) or the Jensen-Shannon kernel defined in (3). In the latter case, which we term kjsd below, the natural choice for normalisation is the sum of the entries in sA String (s), ensuring that Set (A)

has unit L1 norm and defines a probability dis1 tribution. Furthermore, scaling Set (A) by |A| , applying L2 vector normalisation and applying the linear kernel retrieves the averaged set kernel kave (A, B) as a special case of the distributional framework for sets of strings. Instead of requiring |A||B| basic kernel evaluations for each pair of sets, distributional set kernels only require the embedding Set (A) to be computed once for each set and then a single vector inner product for each pair of sets. This is generally far more efficient than the kernel averaging method. The significant drawback is that representing the feature vector for each set demands a large amount of memory; for the gap-weighted embedding with subsequence length l, each vector potentially contains up to |A| |smax | entries, l where smax is the longest string in A. In practice, however, the vector length will be lower due to subsequences occurring more than once and many strings being shorter than smax . One way to reduce the memory load is to reduce the lengths of the strings used, either by retaining just the part of each string expected to be informative or by discarding all strings longer than an acceptable maximum. The PairClass embedding function implicitly restricts the contexts considered by only applying to strings where no more than three words occur between the targets, and by ignoring all non-intervening words except single ones adjacent to the targets. A further technique is to trade off time efficiency for space efficiency by computing the set kernel matrix in a blockwise fashion. To do this, the input data is divided into blocks of roughly equal size ­ the size that is relevant here is the sum of the cardinalities of the sets in a given block. Larger block sizes b therefore allow faster computation, but they require more memory. In the experiments described below, b was set to 5,000 for embeddings of length l = 1 and l = 2, and to 3,000 for l = 3.

4
4.1

Experimental setup for compound noun interpretation
Dataset

´ e The dataset used in our experiments is O S´ aghdha and Copestake's (2007) set of 1,443 compound nouns extracted from the British National Corpus (BNC).1 Each compound is annotated with one of
1 The data are available from http://www.cl.cam. ac.uk/~do242/resources.html.

624

six semantic relations: BE, HAVE, IN, AGENT, INSTRUMENT and ABOUT. For example, air disaster is labelled IN (a disaster in the air) and freight train is labelled INSTRUMENT (a train that carries freight). The best previous classification result ´ e on this dataset was reported by O S´ aghdha and Copestake (2008), who achieved 61.0% accuracy and 58.8% F-score with a purely lexical model of compound similarity. 4.2 General Methodology

constituents equally and ensure that the new vector sums to 1. To perform classification with these features we use the Jensen-Shannon kernel (3).3 4.4 Relational features

All experiments were run using the LIBSVM Support Vector Machine library.2 The one-versus-all method was used to decompose the multiclass task into six binary classification tasks. Performance was evaluated using five-fold cross-validation. For each fold the SVM cost parameter was optimised in the range (2-6 , 2-4 , . . . , 212 ) through crossvalidation on the training set. All kernel matrices were precomputed on nearidentical machines with 2.4 Ghz 64-bit processors and 8Gb of memory. The kernel matrix computation is trivial to parallelise, as each cell is independent. Spreading the computational load across multiple processors is a simple way to reduce the real time cost of the procedure. 4.3 Lexical features

Our implementation of the lexical similarity ´ e model uses the same feature set as O S´ aghdha and Copestake (2008). Two corpora were used to extract co-occurrence information: the written component of the BNC (Burnard, 1995) and the Google Web 1T 5-Gram Corpus (Brants and Franz, 2006). For each noun appearing as a compound constituent in the dataset, we estimate a cooccurrence distribution based on the nouns in coordinative constructions. Conjunctions are identified in the BNC by first parsing the corpus with RASP (Briscoe et al., 2006) and extracting instances of the conj grammatical relation. As the 5-Gram corpus does not contain full sentences it cannot be parsed, so regular expressions were used to extract coordinations. In each corpus, the set of co-occurring terms is restricted to the 10,000 most frequent conjuncts in that corpus so that each constituent distribution is represented with a 10,000dimensional vector. The probability vector for the compound is created by appending the two constituent vectors, each scaled by 0.5 to weight both
2 http://www.csie.ntu.edu.tw/~cjlin/ libsvm

To extract data for computing relational similarity, we searched a large corpus for sentences in which both constituents of a compound co-occur. The corpora used here are the written BNC, containing 90 million words of British English balanced across genre and text type, and the English Gigaword Corpus, 2nd Edition (Graff et al., 2005), containing 2.3 billion words of newswire text. Extraction from the Gigaword Corpus was performed at the paragraph level as the corpus is not annotated for sentence boundaries, and a dictionary of plural forms and American English variants was used to expand the coverage of the corpus trawl. The extracted contexts were split into sentences, tagged and lemmatised with RASP. Duplicate sentences were discarded, as were sentences in which the compound head and modifier were more than 10 words apart. Punctuation and tokens containing non-alphanumeric characters were removed. The compound modifier and head were replaced with placeholder tokens M:n and H:n in each sentence to ensure that the classifier would learn from relational information only and not from lexical information about the constituents. Finally, all tokens more than five words to the left of the leftmost constituent or more than five words to the right of the rightmost constituent were discarded; this has the effect of speeding up the kernel computations and should also focus the classifier on the most informative parts of the context sentences. Examples of the context strings extracted for the modifier-head pair (history,book) are the:a 1957:m pulitizer:n prize-winning:j H:n describe:v event:n in:i american:j M:n when:c elect:v official:n take:v principle:v and he:p read:v constantly:r usually:r H:n about:i american:j M:n or:c biography:n. This extraction procedure resulted in a corpus of 1,472,798 strings. There was significant variation in the number of context strings extracted for each compound: 288 compounds were associated with 1,000 or more sentences, while 191 were as´ e O S´ aghdha and Copestake (2008) achieve their single best result with a different kernel (the Jensen-Shannon RBF kernel), but the kernel used here (the Jensen-Shannon linear kernel) generally achieves equivalent performance and presents one fewer parameter to optimise.
3

625

Length 1 2 3 12 23 123 P C

kjsd Acc F 47.9 45.8 51.7 49.5 50.7 48.4 51.5 49.6 52.1 49.9 51.3 49.0 44.9 43.3

kave Acc F 43.6 40.4 49.7 48.3 50.1 48.6 48.3 46.8 50.9 49.5 50.5 49.1 40.9 40.0

Table 1: Results for combinations of embedding functions and set kernels sociated with 10 or fewer and no sentences were found for 45 constituent pairs. The largest context sets were predominantly associated with political or economic topics (e.g., government official, oil price), reflecting the journalistic sources of the Gigaword sentences. Our implementation of relational similarity applies the two set kernels kave and kjsd defined in Section 3.4 to these context sets. For each kernel we tested gap-weighted embedding functions with subsequence length values l in the range 1, 2, 3, as well as summed kernels for all combinations of values in this range. The decay parameter  for the subsequence feature embedding was set to 0.5 throughout, in line with previous recommendations (e.g., Cancedda et al. (2003)). To investigate the effects of varying set sizes, we ran experiments with context sets of maximal cardinality q  {50, 250, 1000}. These sets were randomly sampled for each compound; for compounds associated with fewer strings than the maximal cardinality, all associated strings were used. For q = 50 we average results over five runs in order to reduce sampling variation. We also report some results with the PairClass embedding P C . The restricted representative power of this embedding brings greater efficiency and we were able to use q = 5, 000; for all but 22 compounds, this allowed the use of all contexts for which the P C embedding was defined.

l = 1 and the summed kernels k12 = kl=1 +kl=2 . The best performance of 52.1% accuracy, 49.9% F-score is obtained with the Jensen-Shannon kernel kjsd computed on the summed feature embeddings of length 2 and 3. This is significantly lower ´ e than the performance achieved by O S´ aghdha and Copestake (2008) with their lexical similarity model, but it is well above the majority class baseline (21.3%). Results for the PairClass embedding are much lower than for the gap-weighted embedding; the superiority of gapl is statistically significant in all cases except l = 1. Results for combinations of lexical cooccurrence kernels and (gap-weighted) relational set kernels are given in Table 2. With the exception of some combinations of the length-1 set kernel, these results are clearly better than the best results obtained with either the lexical or the relational model taken alone. The best result is obtained by the combining the lexical kernel computed on BNC conjunction features with the summed Jensen-Shannon set kernel k23 ; this combination achieves 63.1% accuracy and 61.6% F-score, a statistically significant improvement (at the p < 0.01 level) over the lexical kernel alone and the best result yet reported for this dataset. Also, the benefit of combining set kernels of different subsequence lengths l is evident; of the 12 combinations presented Table 2 that include summed set kernels, nine lead to statistically significant improvements over the corresponding lexical kernels taken alone (the remaining three are also close to significance). Our experiments also show that the distributional implementation of set kernels (6) is much more efficient than the averaging implementation (5). The time behaviour of the two methods with increasing set cardinality q and subsequence length l is illustrated in Figure 1. At the largest tested values of q and l (1,000 and 3, respectively), the averaging method takes over 33 days of CPU time, while the distributional method takes just over one day. In theory, kave scales quadratically as q increases; this was not observed because for many constituent pairs there are not enough context strings available to keep adding as q grows large, but the dependence is certainly superlinear. The time taken by kjsd is theoretically linear in q, but again scales less dramatically in practice. On the other hand kave is linear in l, while kjsd scales exponentially. This exponential dependence may

5

Results

Table 1 presents results for classification with relational set kernels, using q = 1, 000 for the gapweighted embedding. In general, there is little difference between the performance of kjsd and kave with gapl ; the only statistically significant differences (at p < 0.05, using paired t-tests) are between the kernels kl=1 with subsequence length

626

kjsd BNC Length 1 2 3 12 23 123 No Set Acc 60.6 61.9* 62.5* 62.6* 63.1** 62.9** 59.9 F 58.6 60.4* 60.8* 61.0** 61.6** 61.3** 57.8 5-Gram Acc F 60.3 58.1 62.6 60.8 61.7 59.9 62.3* 60.6* 62.3* 60.5* 62.6 60.8* 60.2 58.1 BNC Acc 59.5 62.0 62.8* 62.0* 62.2* 61.9* 59.9

kave F 57.6 60.5* 61.2** 60.3* 60.7* 60.4* 57.8 5-Gram Acc F 59.1 56.5 61.3 59.1 62.3** 60.8** 61.5 59.2 62.0 60.3 62.4* 60.6* 60.2 58.1

Table 2: Results for set kernel and lexical kernel combination. */** indicate significant improvement at the 0.05/0.01 level over the corresponding lexical kernel alone, estimated by paired t-tests.
10
8

10 kave time/s

8

10 kave time/s

8

10 time/s

6

10

6

10

6

kave kjsd

10

4

10

4

kjsd

10

4

kjsd 10
2

10

2

10

2

10

0

50

250 q

1000

10

0

50

250 q

1000

10

0

50

250 q

1000

(a) l = 1

(b) l = 2

(c) l = 3

Figure 1: Timing results (in seconds, log-scaled) for averaged and Jensen-Shannon set kernels seem worrying, but in practice only short subsequence lengths are used with string kernels. In situations where set sizes are small but long subsequence features are desired, the averaging approach may be more appropriate. However, it seems likely that many applications will be similar to the task considered here, where short subsequences are sufficient and it is desirable to use as much data as possible to represent each set. We note that calculating the PairClass embedding, which counts far fewer patterns, took just 1h21m. For optimal efficiency, it seems best to use a gapweighted embedding with small set cardinality; averaged across five runs kjsd with q = 50 and l = 123 took 26m to calculate and still achieved 47.6% Accuracy, 45.1% F-score. ´ e distributional model of O S´ aghdha and Copestake (2008). The idea of using relational similarity to understand compounds goes back at least as far as Lebowitz' (1988) RESEARCHER system, which processed patent abstracts in an incremental fashion and associated an unseen compound with the relation expressed in a context where the constituents previously occurred. Turney (2006) describes a method (Latent Relational Analysis) that extracts subsequence patterns for noun pairs from a large corpus, using query expansion to increase the recall of the search and feature selection and dimensionality reduction to reduce the complexity of the feature space. LRA performs well on analogical tasks including compound interpretation, but has very substantial resource requirements. Turney (2008) has recently proposed a simpler SVM-based algorithm for analogical classification called PairClass. While it does not adopt a set-based or distributional model of relational similarity, we have noted above that PairClass implicitly uses a feature representation similar to the one presented above as (6) by extracting subsequence patterns from observed cooccurrences of word pair members. Indeed, PairClass can be viewed as a special case of our frame-

6

Related work

Turney et al. (2003) suggest combining various information sources for solving SAT analogy problems. However, previous work on compound interpretation has generally used either lexical similarity or relational similarity but not both in combination. Previously proposed lexical models include the WordNet-based methods of Kim and Baldwin (2005) and Girju et al. (2005), and the

627

work; the differences from the model we have used consist in the use of a different embedding function P C and a more restricted notion of context, a frequency cutoff to eliminate less common subsequences and the Gaussian kernel to compare vectors. While we cannot compare methods directly as we do not possess the large corpus of 5 × 1010 words used by Turney, we have tested the impact of each of these modifications on our model.4 None improve performance with our set kernels, but the only statistically significant effect is that of changing the embedding model as reported in section Section 5. Implementing the full PairClass algorithm on our corpus yields 46.2% accuracy, 44.9% F-score, which is again significantly worse than all results for the gap-weighted model with l > 1. In NLP, there has not been widespread use of set representations for data items, and hence set classification techniques have received little attention. Notable exceptions include Rosario and Hearst (2005) and Bunescu and Mooney (2007), who tackle relation classification and extraction tasks by considering the set of contexts in which the members of a candidate relation argument pair co-occur. While this gives a set representation for each pair, both sets of authors apply classification methods at the level of individual set members rather than directly comparing sets. There is also a close connection between the multinomial probability model we have proposed and the pervasive bag of words (or bag of n-grams) representation. Distributional kernels based on a gapweighted feature embedding extend these models by using bags of discontinuous n-grams and downweighting gappy subsequences. A number of set kernels other than those discussed here have been proposed in the machine learning literature, though none of these proposals have explicitly addressed the problem of comparing sets of strings or other structured objects, and many are suitable only for comparing sets of small cardinality. Kondor and Jebara (2003) take a distributional approach similar to ours, fitting multivariate normal distributions to the feature space mappings of sets A and B and comparing the mappings with the Bhattacharrya vector inner product. The model described above in (6) implicitly fits multinomial distributions in the feature space F;
4 Turney (p.c.) reports that the full PairClass model achieves 50.0% accuracy, 49.3% F-score.

this seems more intuitive for string kernel embeddings that map strings onto vectors of positivevalued "counts". Experiments with Kondor and Jebara's Bhattacharrya kernel indicate that it can in fact come close to the performances reported in Section 5 but has significantly greater computational requirements due to the need to perform costly matrix manipulations.

7

Conclusion and future directions

In this paper we have presented a combined model of lexical and relational similarity for relational reasoning tasks. We have developed an efficient and flexible kernel-based framework for comparing sets of contexts using the feature embedding associated with a string kernel.5 By choosing a particular embedding function and a particular inner product on subsequence vectors, the previously proposed set-averaging and PairClass algorithms for relational similarity can be retrieved as special cases. Applying our methods to the task of compound noun interpretation, we have shown that combining lexical and relational similarity is a very effective approach that surpasses either similarity model taken individually. Turney (2008) argues that many NLP tasks can be formulated in terms of analogical reasoning, and he applies his PairClass algorithm to a number of problems including SAT verbal analogy tests, synonym/antonym classification and distinction between semantically similar and semantically associated words. Our future research plans include investigating the application of our combined similarity model to analogical tasks other than compound noun interpretation. A second promising direction is to investigate relational models for unsupervised semantic analysis of noun compounds. The range of semantic relations that can be expressed by compounds is the subject of some controversy (Ryder, 1994), and unsupervised learning methods offer a data-driven means of discovering relational classes.

Acknowledgements
We are grateful to Peter Turney, Andreas Vlachos and the anonymous EACL reviewers for their helpful comments. This work was supported in part by EPSRC grant EP/C010035/1.
5 The treatment presented here has used a string representation of context, but the method could be extended to other structural representations for which substructure embeddings exist, such as syntactic trees (Collins and Duffy, 2001).

628

References
Thorsten Brants and Alex Franz, 2006. Web 1T 5-gram Corpus Version 1.1. Linguistic Data Consortium. Ted Briscoe, John Carroll, and Rebecca Watson. 2006. The second release of the RASP system. In Proceedings of the ACL-06 Interactive Presentation Sessions. Razvan C. Bunescu and Raymond J. Mooney. 2007. Learning to extract relations from the Web using minimal supervision. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-07). Lou Burnard, 1995. Users' Guide for the British National Corpus. British National Corpus Consortium. Cristina Butnariu and Tony Veale. 2008. A conceptcentered approach to noun-compound interpretation. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08). Nicola Cancedda, Eric Gaussier, Cyril Goutte, and Jean-Michel Renders. 2003. Word-sequence kernels. Journal of Machine Learning Research, 3:1059­1082. Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Proceedings of the 15th Conference on Neural Information Processing Systems (NIPS-01). Corinna Cortes and Vladimir Vapnik. 1995. Support vector networks. Machine Learning, 20(3):273­ 297. Nello Cristianini, Jaz Kandola, Andre Elisseeff, and John Shawe-Taylor. 2001. On kernel target alignment. Technical Report NC-TR-01-087, NeuroCOLT. James Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, School of Informatics, University of Edinburgh. Barry Devereux and Fintan Costello. 2007. Learning to interpret novel noun-noun compounds: Evidence from a category learning experiment. In Proceedings of the ACL-07 Workshop on Cognitive Aspects of Computational Language Acquisition. Thomas G¨ rtner, Peter A. Flach, Adam Kowalczyk, a and Alex J. Smola. 2002. Multi-instance kernels. In Proceedings of the 19th International Conference on Machine Learning (ICML-02). Roxana Girju, Dan Moldovan, Marta Tatu, and Daniel Antohe. 2005. On the semantics of noun compounds. Computer Speech and Language, 19(4):479­496. Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. 2007. SemEval-2007 Task 04: Classification of semantic relations between nominals. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-07).

Alfio Gliozzo, Claudio Giuliano, and Carlo Strapparava. 2005. Domain kernels for word sense disambiguation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05). David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda, 2005. English Gigaword Corpus, 2nd Edition. Linguistic Data Consortium. Thorsten Joachims, Nello Cristianini, and John ShaweTaylor. 2001. Composite kernels for hypertext categorisation. In Proceedings of the 18th International Conference on Machine Learning (ICML-01). Su Nam Kim and Timothy Baldwin. 2005. Automatic interpretation of noun compounds using WordNet similarity. In Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05). Risi Kondor and Tony Jebara. 2003. A kernel between sets of vectors. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). Michael Lebowitz. 1988. The use of memory in text processing. Communications of the ACM, 31(12):1483­1502. Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419­444. ´ e Diarmuid O S´ aghdha and Ann Copestake. 2007. Cooccurrence contexts for noun compound interpretation. In Proceedings of the ACL-07 Workshop on A Broader Perspective on Multiword Expressions. ´ e Diarmuid O S´ aghdha and Ann Copestake. 2008. Semantic classification with distributional kernels. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08). Barbara Rosario and Marti A. Hearst. 2005. Multiway relation classification: Application to proteinprotein interactions. In Proceedings of the 2005 Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP-05). Mary Ellen Ryder. 1994. Ordered Chaos: The Interpretation of English Noun-Noun Compounds. University of California Press, Berkeley, CA. John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge. Peter D. Turney, Michael L. Littman, Jeffrey Bigham, and Victor Shnayder. 2003. Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the 2003 International Conference on Recent Advances in Natural Language Processing (RANLP-03). Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379­416. Peter D. Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08).

629