Hierarchical, Perceptron-like Learning for Ontology-Based Information Extraction

Yaoyong Li
University of Sheffield
211 Portobello Street
Sheffield, S1 4DP, UK
yaoyong@dcs.shef.ac.uk

Kalina Bontcheva
University of Sheffield
211 Portobello Street
Sheffield, S1 4DP, UK
kalina@dcs.shef.ac.uk

ABSTRACT

Recent work on ontology-based Information Extraction (IE) has tried to make use of knowledge from the target ontology in order to improve semantic annotation results. However, very few approaches exploit the ontology structure itself, and those that do so have some limitations. This paper introduces a hierarchical learning approach for IE, which uses the target ontology as an essential part of the extraction process, by taking into account the relations between concepts. The approach is evaluated on the largest available semantically annotated corpus. The results demonstrate clearly the benefits of using knowledge from the ontology as input to the information extraction process. We also demonstrate the advantages of our approach over other state-of-the-art learning systems on a commonly used benchmark dataset.

Categories and Subject Descriptors: I.2.7 [Natural Language Processing]: Text analysis; I.5.2 [Pattern Recognition]: Classifier design and evaluation.

General Terms: Algorithms; Performance.

Keywords: Ontology-based Information Extraction, semantic annotation, hierarchical learning.

1. INTRODUCTION

At present there is a large gap between Knowledge Management (KM) systems and the natural language materials that form something approaching 80% of corporate data stores [22]. Similarly, Gartner reported in 2002 (http://www3.gartner.com/DisplayDocument?id=379859) that for at least the next decade more than 95% of human-to-computer information input will involve textual language. They also report that by 2012 taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications. There is a tension here: between the increasingly rich semantic models in KM systems on the one hand, and the continuing prevalence of human language materials on the other.

The process of tying semantic models and natural language together is referred to as semantic annotation. This process may be characterised as the dynamic creation of interrelationships between ontologies and unstructured and semi-structured documents in a bidirectional manner.

Information Extraction (IE), a form of natural language analysis, is becoming a central technology for bridging the gap between unstructured text and formal knowledge expressed in ontologies. Ontology-Based IE (OBIE) is IE which is adapted specifically for the semantic annotation task. One of the important differences between traditional IE and OBIE is in the use of a formal ontology as one of the system's inputs and as the target output.

The main contribution of this paper is in investigating the application of hierarchical classification learning to semantic annotation, as part of an ontology-based IE system. Hierarchical classification takes into account the relations between concepts, thus benefiting directly from the ontology. In particular, this paper studies the large margin hierarchical classification learning algorithm Hieron proposed in [8], because it is very efficient during both training and classification. Computational efficiency is of major importance for OBIE, because, depending on the size of the ontology, the system may need to train hundreds of classifiers.

However, it should be noted that the work presented in this paper is not a simple application of the Hieron algorithm, as proposed in [8]. In fact, OBIE specifics lead to two important modifications: the introduction of multi-loop learning and of a parameter which makes the algorithm applicable to any IE corpus. Both of these resulted in a quantifiable improvement in performance (see Table 6 in Section 5). In addition, semantic annotation is very different from document classification, which is what Hieron was originally developed for. Consequently, another contribution of this work is in showing how OBIE can be decomposed into several hierarchical classification tasks, which can then be approached with an algorithm such as Hieron.

Another important contribution of this work is in the use of the Sekt ontology-annotated news corpus for evaluation (for further information on the Sekt project, see http://www.sekt-project.com; the corpus itself is available on request from the second author). To the best of our knowledge, it is the only corpus suitable for evaluating OBIE which has a non-trivial number of classes (146 classes) from an independently created ontology. The problem with using only the Sekt corpus is that no other systems have been evaluated on it, apart from the SVM and Perceptron ones reported here. Unfortunately, other recent corpora for IE evaluation (e.g., the Pascal challenge, http://nlp.shef.ac.uk/pascal/Corpus.html, and CONLL'03, http://www.cnts.ua.ac.be/conll2003/ner/) are either not fully available or use a small, flat set of labels (fewer than 20), thus making them inappropriate for experimenting with semantic annotation on a realistic scale.

The experimental results demonstrate that our hierarchical classification algorithm obtains clearly better results than SVM and Perceptron, both in terms of conventional precision and recall and also according to an ontology-induced metric. Additionally, in order to provide some comparison against state-of-the-art learning IE systems, we also evaluate our Hieron-based system on the seminar corpus, where it achieves the best overall performance.

The paper is structured as follows. Related work on ontology-based IE is discussed in Section 2 and our work is placed in that context. Section 3 introduces the large margin hierarchical classification algorithm Hieron, our modifications to it, the hierarchy-based cost measure which Hieron requires, and the way in which the OBIE task is formalised as a classification problem. The Sekt ontology-annotated gold standard is described next (Section 4), followed by a comprehensive set of experimental results and their analysis (Section 5). The paper concludes with a discussion and plan for future work.
2. RELATED OBIE SYSTEMS

Previous work can be regarded as being either ontology-oriented or ontology-based. Ontology-oriented IE systems do not incorporate the target ontology into the IE process, but typically have a mapping between the IE outputs and classes and instances from the ontology. Ontology-based IE systems go one step further by also using the ontology as one of the inputs to the IE algorithms.

We review first some ontology-oriented systems. AeroDAML [16] applies IE techniques to automatically generate DAML annotations from web pages. AeroDAML uses an ontology to translate the extraction results into a corresponding RDF model. Amilcare [6] is an IE system which has been integrated in several different semantic annotation tools (MnM [21] and S-CREAM [14]). It uses supervised rule learning to adapt to new domains and applications. It treats the semantic annotations as a flat set of labels, thus ignoring knowledge from the ontology.

One of the problems with these annotation tools is that they do not provide the user with a way to customise the integrated language technology directly. While many users would not need or want such customisation facilities, users who already have ontologies with rich instance data will benefit if they can make this data available to the IE components. However, this is not possible when "traditional" IE methods like Amilcare are used, because they are not aware of the existence of the user's ontology.

The more serious problem, however, as discussed in the S-CREAM system [14], is that there is often a gap between the IE output annotations and the classes and properties in the user's ontology. The proposed solution is to write some kind of rules, such as logical rules, to resolve this. For example, an IE system would typically annotate London and UK as locations, but extra rules are needed to specify that there is a containment relationship between the two (for other examples see [14]). However, rule writing of the proposed kind is too difficult for most users and therefore ontology-based IE is needed, as it annotates directly with the classes and instances from the user's ontology.

In response to these problems, a number of OBIE systems have been developed. Magpie [10] is a suite of tools which supports semantic annotation of web pages. It is fully automatic and works by matching the text against instances in the ontology. The SemTag system [9] is similar in approach to Magpie as it annotates texts by performing lookup against the TAP ontology. It also has a second, disambiguation phase, where SemTag uses a vector-space model to assign the correct ontological class. The problem with both systems is that they are not able to discover new instances and are thus restricted in terms of recall.

The C-PANKOW system [5] exploits surface patterns and the redundancy on the Web to automatically categorise named entities with respect to a given ontology. The advantage is that it is an unsupervised method, which requires little or no training data. However, this is also its main disadvantage, because the accuracy of its semantic annotation is not yet sufficiently close to that of systems such as KIM and ours. Nevertheless, it would be interesting to explore in future the possibility of combining our approach with a web-based system such as C-PANKOW.

OntoSyphon [20] is similar to C-PANKOW: it uses the ontology as the starting point and carries out web mining to populate the ontology with instances. It uses the ontology structure to determine the relevance of the candidate instances. However, it does not carry out semantic annotation of documents, which is the problem addressed here.

The KIM OBIE system [15] produces annotations linked both to the ontological class and to the exact individual in the instance base. For new (previously unknown) entities, new instances are added to the semantic repository. KIM has a rule-based, human-engineered IE system, which uses the ontology structure during pattern matching and instance disambiguation. The only shortcoming of this approach is that it requires human intervention in order to adapt it to new ontologies.

To summarise, the difference between our approach and the ontology-oriented systems is that we use the ontology as input to the IE process, not just as an output target. In comparison to other ontology-based approaches, our system addresses the limitations of earlier work by using machine learning for easy retargeting, exploiting the ontology structure, and carrying out semantic annotation, in addition to ontology population.

3. EXPLOITING THE ONTOLOGY

Conventional IE uses labels that have no specific relation among each other, i.e., they are treated as independent by the learning algorithms (e.g., Person, Location). In contrast, as concepts in an ontology are related to each other (at the very least through the subsumption hierarchy), it is beneficial to feed this knowledge into the OBIE algorithms. This section covers two aspects of using the ontology structure for OBIE. First, it discusses ontology-induced cost measures, which are used both for calculating distance-based evaluation metrics and by the learning algorithm itself. Secondly, it introduces the Perceptron-based learning algorithm Hieron, which has a mechanism for handling hierarchical classification effectively, and our adaptations of Hieron to the OBIE requirements.

3.1 Ontology-Based Performance Metric

As concepts in ontologies are related to each other in a subsumption hierarchy, the cost (or loss) for an instance of a concept X being wrongly classified as belonging to another concept Y is defined as dependent on the two particular concepts (denoted as c(X, Y)). Given this cost measure, one OBIE requirement is that if the system misclassifies a mention in the text as belonging to class Y instead of X, the cost of that misclassification should be as small as possible (e.g., classify it as a super-class of the correct class).

IE systems traditionally use evaluation metrics, such as precision, recall and F1, which are computed for each category, independently of all other categories. An overall performance measure is obtained by averaging the performances for all categories (macro-averaged) or by putting together the results of all classifications (micro-averaged). However, these measures do not take into account the hierarchical relations between classes in the ontology and their associated misclassification costs. Consequently, another OBIE requirement is to have performance metrics which are sensitive to the structure of the target ontology. Therefore, next we generalise the commonly used IE metrics, precision, recall and F1, to ontology-based ones.

In order to evaluate an OBIE system on a corpus annotated with a given ontology, we first compute the following three numbers on the system-annotated corpus:

· n -- the number of mentions which are instances of concepts in the ontology and have been found by the OBIE system (regardless of whether or not OBIE assigned them to the correct class).
· n_missing -- the number of entities in the corpus which are instances of concepts in the ontology but are not found by the system.
· n_spurious -- the number of entities recognised by the system which actually are not instances of any concept in the ontology.

Then for each pair of concepts X and Y we define the cost measure c(X, Y) as a non-negative number, equal to 0 if X = Y. If we assume that C is the largest cost for a given ontology, then we can define a cost-based error as $e_{cost}(X, Y) = c(X, Y)/C$, satisfying $e_{cost}(X, Y) \in [0, 1]$ and $e_{cost}(X, Y) = 0$ if X = Y. Based on this cost-based error, an overall accuracy for the n entities identified by the system is defined as follows:

    $a_{cost} = \sum_{i=1}^{n} (1 - e_{cost}(A_i, B_i))$    (1)

where A_i and B_i are two classes in the ontology and $e_{cost}(A_i, B_i)$ is the cost of misclassifying the i-th entity as an instance of class B_i instead of its correct class A_i (class B_i is the same as A_i if the i-th entity is classified correctly). Using the overall accuracy $a_{cost}$ we define ontology-based precision and recall, respectively, as

    $P_o = a_{cost}/(a_{cost} + n_{spurious}), \quad R_o = a_{cost}/(a_{cost} + n_{missing})$

Then, as with the conventional F-measure, the ontology-based F1 is defined as the harmonic mean of ontology-based precision and recall:

    $F_{o1} = 2 P_o R_o / (P_o + R_o)$    (2)

In other words, the ontology-based $F_{o1}$ is a generalisation of the standard F1. In fact, if we define the cost c(X, Y) as the binary function

    $c(X, Y) = \begin{cases} 0 & \text{if } X = Y \\ 1 & \text{otherwise} \end{cases}$    (3)

then $F_{o1}$ reduces to the standard overall F-measure.

Recent studies of hierarchical classification (see e.g. [8, 2]) typically define c(X, Y) as the distance between the two nodes X and Y in the classification hierarchy. In our experiments we used the distance between two classes in the ontology as their misclassification cost, and henceforth this is referred to as the distance-based metric.

It should be noted that there are some improvements on the distance-based cost metric. For example, [23] propose a measure based on information content and experimentally showed that the new measure is better than the cost-based one for measuring the similarity of nouns. [19] present the BDM measure which, besides the shortest distance between the two concepts, takes into account the depth of the most specific common concept of the two concepts and the size of the ontology. [25] present an overview of cost measures used in hierarchical classification research. Our ongoing work experiments with such improved versions of cost-based metrics with Hieron and OBIE.
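To make these definitions concrete, the following Python sketch computes $P_o$, $R_o$ and $F_{o1}$ from formulas (1)-(2). It is a minimal sketch, not the authors' code: it assumes the n found entities arrive as (gold, predicted) class pairs and that the cost function c(X, Y) and the largest cost C are supplied by the caller.

    def ontology_f1(pairs, n_missing, n_spurious, cost, max_cost):
        # Formula (1): overall accuracy, accumulating 1 - e_cost(A_i, B_i)
        # over the n entities found by the system.
        a_cost = sum(1.0 - cost(a, b) / max_cost for a, b in pairs)
        p_o = a_cost / (a_cost + n_spurious)   # ontology-based precision
        r_o = a_cost / (a_cost + n_missing)    # ontology-based recall
        return 2 * p_o * r_o / (p_o + r_o)     # formula (2): F_o1

With the binary cost of formula (3) (cost(a, b) = 0 if a == b, else 1, and C = 1), this reduces to the standard F1, as noted above.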
3.2 Hierarchical Learning with Hieron

The Hieron large margin learning algorithm for hierarchical classification was defined by [8], based on the margin Perceptron algorithm. Hierarchical classification refers to a specific multi-class classification problem where the class labels are organised in a hierarchical fashion. One example is document categorisation, where categories belong to a taxonomy. Here we briefly describe the learning algorithm, discuss our modifications to the original algorithm, and then discuss how to apply it to OBIE.

As defined, the Hieron algorithm exploits the hierarchical structure of the class labels. It learns one Perceptron model for each class and meanwhile ensures that the difference between two models is in proportion to the distance between the two classes in the tree. The philosophy of the learning algorithm is that, if we have to misclassify one example as class X instead of Y, then class X should be close to the correct class Y in the hierarchical structure.

Let us suppose a hierarchical classification problem with instance domain X ⊆ R^n and label set Y. The labels in the set Y can be arranged as nodes in a rooted tree T. For any pair of labels u, v ∈ Y, let γ(u, v) denote their distance in the tree, namely the number of edges along the (unique) path from u to v in T. For every label v in the tree, P(v) is defined as the set of labels along the path from the root to v, inclusive. The training set S = {(x_i, y_i) : i = 1, ..., m} contains instance-label pairs, where each x_i ∈ X and each y_i ∈ Y. The Hieron learning algorithm aims to learn a classification function f : X → Y which has a small tree induction error.
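The path sets P(v) and the tree distance γ(u, v) can be computed directly from a parent map. The helpers below are a sketch under the assumption (ours, not the paper's) that the tree is given as a child-to-parent dictionary with the root absent from it:

    def path_to_root(v, parent):
        """P(v): labels on the path from the root to v, inclusive."""
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return list(reversed(path))   # root first, v last

    def gamma(u, v, parent):
        """Tree distance: number of edges on the unique path from u to v."""
        pu, pv = path_to_root(u, parent), path_to_root(v, parent)
        k = 0                          # length of the shared ancestor prefix
        for a, b in zip(pu, pv):
            if a != b:
                break
            k += 1
        return (len(pu) - k) + (len(pv) - k)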
The classifier f has the following form: each label v ∈ Y has a matching prototype W_v ∈ R^n, and the classifier f makes its predictions according to the following rule:

    $f(x) = \arg\max_{v \in Y} \langle W_v, x \rangle$    (4)

where ⟨·, ·⟩ represents the inner product of two vectors. Hence, the task of learning f is reduced to learning a set of prototypes {W_v : v ∈ Y}.

However, Hieron does not deal directly with the set of prototypes but rather with the difference between the prototype of each node and the prototype of its parent. Formally, we denote A(v) as the parent node of v in the tree and assume that the parent of the root node is the root itself. Then the difference weight vector is defined as w_v = W_v - W_{A(v)}. Each prototype is now decomposed into the sum

    $W_v = \sum_{u \in P(v)} w_u$    (5)

Since the learning algorithm requires that adjacent vertices in the label tree have similar prototypes, by representing each prototype as a sum of vectors from {w_v : v ∈ Y}, adjacent prototypes W_v and W_{A(v)} can be kept close by simply keeping the norm of the weight vector w_v = W_v - W_{A(v)} small.

The Hieron algorithm assumes that there exists a set of weight vectors {ω_v : v ∈ Y} such that the following holds:

    $\sum_{v \in P(y_i)} \langle \omega_v, x_i \rangle - \sum_{u \in P(r)} \langle \omega_u, x_i \rangle \ge \sqrt{\gamma(y_i, r)}, \quad \forall (x_i, y_i) \in S \text{ and } r \in Y \setminus \{y_i\}$    (6)

However, note that this assumption can be relaxed if we introduce a regularization parameter into the learning algorithm, as discussed below.

The Hieron learning algorithm is similar to the Perceptron algorithm as it learns one classifier per class. But, unlike Perceptron, where each model is learnt independently of the others, it learns the Perceptrons for all classes in a collective way. The algorithm initialises each of the Perceptron weight vectors {w_v : v ∈ Y} as a zero vector and updates a weight vector only if a prototype related to it makes a wrong prediction. By doing so the learning algorithm tries to keep the norm of the weight vector small, which is one of the requirements discussed above. The learning algorithm also tries to satisfy the margin requirement for the weight vectors and training set shown in (6). Formally, for each instance-label pair (x_i, y_i) ∈ S, the learning algorithm checks if the current weight vectors satisfy the margin requirement for each label y ≠ y_i by computing the following loss function:

    $L(\{w_v\}, x_i, y_i, y) = \sqrt{\gamma(y_i, y)} + \sum_{u \in P(y)} \langle w_u, x_i \rangle - \sum_{v \in P(y_i)} \langle w_v, x_i \rangle$    (7)

The margin requirement for (x_i, y_i) and y is satisfied if and only if the above function is less than or equal to 0. If the margin requirement is satisfied for all training examples, then the learning stops and returns the current learnt model. Otherwise, from all training examples (x_i, y_i) for which the margin requirement (6) is violated by the current model, it chooses the label ŷ_i that violates the margin requirement the most (namely, the one with the maximal value of the function (7)), and updates the current weight vectors comprising the two prototypes W_{y_i} and W_{ŷ_i}, respectively, as illustrated in Figure 1.

[Figure 1: Illustration of Hieron's update.]

As shown in Figure 1, when a training example x with label y is predicted mistakenly as label ŷ, only the weight vectors associated with the nodes in the shortest path linking nodes y and ŷ are updated, except for the nodes which are common ancestors of the two nodes considered. In other words, only the nodes depicted using solid lines are updated, where the symbol '+' means increasing the corresponding weight vector by the example x and the symbol '-' means decreasing the weight vector by x.

As already discussed above, in order to ensure that adjacent vertices in the label tree have similar prototypes, the Hieron algorithm needs to keep the norms of the weight vectors w as small as possible. This is achieved by initialising all weight vectors to zero and only updating them when necessary.
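To illustrate the learning mechanism, here is a sketch of the prediction rule (4) and of one check-and-update step, following the '+x'/'-x' description above. Here w maps each label to its difference vector w_v, paths[v] is P(v), and dist(u, v) is the tree distance (for instance, gamma above with the parent map fixed). All names are ours; this is a sketch, not the reference implementation of [8].

    import numpy as np

    def score(w, paths, v, x):
        # <W_v, x> with W_v decomposed as in formula (5)
        return sum(np.dot(w[u], x) for u in paths[v])

    def predict(w, paths, x):
        return max(paths, key=lambda v: score(w, paths, v, x))  # rule (4)

    def hieron_step(w, paths, dist, x, y):
        """One check/update for example (x, y); True if weights changed."""
        def loss(r):   # formula (7)
            return (dist(y, r) ** 0.5
                    + score(w, paths, r, x) - score(w, paths, y, x))
        y_hat = max((r for r in paths if r != y), key=loss)
        if loss(y_hat) <= 0:
            return False           # margin requirement (6) satisfied
        # Update only the nodes on the path between y and y_hat,
        # excluding their common ancestors (Figure 1's solid nodes).
        for v in set(paths[y]) - set(paths[y_hat]):
            w[v] = w[v] + x        # '+' nodes: move towards x
        for v in set(paths[y_hat]) - set(paths[y]):
            w[v] = w[v] - x        # '-' nodes: move away from x
        return True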
3.3 Our Hieron Modifications

The learning algorithm described above is the original Hieron batch learning algorithm presented in [8]. In order to achieve better performance, some modifications were introduced in our implementation, as follows (a code sketch of the resulting training loop is given after formula (8) below):

1. Our learning algorithm learns from the training set until no error is made on the training examples, which means that more than one learning loop may be needed over the training set. In contrast, the original Hieron batch learning allowed only one learning loop. It will be shown in our experiments below that multi-loop learning has better generalisation performance than single-loop learning.

2. The original Hieron learning algorithm requires that the training set is compatible with the margin conditions described in equation (6), so that the algorithm stops after a finite number of loops. This is a problem because in OBIE it is often not known in advance whether or not a training set satisfies the margin condition. Therefore, we introduce a regularization parameter into the algorithm such that the learning stops after some number of loops on any training set. The value of the regularization parameter is a positive number (although its optimal value may depend on the data used, in the experiments here we simply set it to 1). The regularization parameter is similar to that used for Perceptron (see e.g. [18]). In more detail, its role is to add an extra dimension to the feature vector of each training example and set the component of that dimension to the regularization parameter. By doing so, the set of training examples with extended feature vectors becomes linearly separable regardless of the linear separability of the original training set, meaning that the Hieron algorithm stops after a finite number of loops on any training data.

In addition, the original algorithm [8] distinguishes two types of learning models. One type is the weight vectors obtained at the end of learning, namely {w_m^v : v ∈ Y}. The other is the mean of all weight vectors used during training. Let us assume that we apply the weight vectors m times to training examples during learning and the weight vectors used were {w_i^v : v ∈ Y, i = 1, ..., m}; then for every v ∈ Y we define the mean of the weight vectors as

    $\bar{w}_v = \frac{1}{m} \sum_{i=1}^{m} w_i^v$    (8)
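The two modifications combine into an outer training loop. The sketch below (hypothetical names, reusing hieron_step from the sketch above) appends the constant regularization component to every feature vector, loops until no margin violation occurs, and accumulates the running mean of formula (8); the 300-loop cap matches the experiment reported in Section 5.

    import numpy as np

    def train_hieron(examples, labels, paths, dist, reg=1.0, max_loops=300):
        # Modification 2: append one constant component equal to the
        # regularization parameter, making the extended set separable.
        data = [(np.append(x, reg), y) for x, y in examples]
        dim = data[0][0].shape[0]
        w = {v: np.zeros(dim) for v in labels}       # zero initialisation
        mean_w = {v: np.zeros(dim) for v in labels}
        m = 0
        for _ in range(max_loops):                   # modification 1
            mistakes = 0
            for x, y in data:
                mistakes += hieron_step(w, paths, dist, x, y)
                for v in labels:                     # accumulate formula (8)
                    mean_w[v] += w[v]
                m += 1
            if mistakes == 0:        # no margin violation: stop early
                break
        return w, {v: mean_w[v] / m for v in labels}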
It was shown in [8] that the averaged weight vectors had better results than the last weight vectors in most cases. In our OBIE experiments we also compare the two types of weight vectors (see Section 5).

3.4 The Hieron-based OBIE System

As already discussed, the goal of OBIE is to identify and classify information entities in text as instances of concepts in an ontology. On the other hand, Hieron is essentially a learning algorithm which classifies every example into categories organised in a tree structure. In order to apply Hieron to OBIE, we need to adapt the OBIE task to make it similar to hierarchical classification.

First, we convert the OBIE task into two hierarchical classification problems. Traditionally, when classifiers are used for information extraction, each token in the text is treated as one example and classified as belonging to one of the target IE labels or as having no label (see e.g., [12, 17]). In particular, the annotation task can be viewed as two binary classification problems: one for recognising the start tokens of information entities and the other for the end tokens. Similarly, we transform the OBIE task into two hierarchical classification problems. For each class in the ontology, two hierarchical classifiers are trained: one for recognising the start token of the class instances and one for the end.

Secondly, as there are tokens in the text which do not belong to any class in the ontology, in order to apply Hieron to OBIE, the ontology is extended notionally with a new child node of the root node, which represents the concept of non-relevant token. However, this added concept is not considered in the evaluation metrics, as it is only required for the proper functioning of the classification algorithm.

Thirdly, note that the Hieron algorithm requires that classes are organised in a tree. However, for some ontologies the class subsumption hierarchy is not a tree, but a graph where some concepts occur as subclasses of two or more classes. In other words, some nodes in the ontology may have more than one parent. The Proton ontology used in our experiments (see below) is one example of this kind of ontology. For example, the concept SportBuilding is a sub-concept of both Building and SportFacility. Consequently, the Hieron algorithm had to be adapted to deal with graph-like hierarchies, such as the Proton ontology. The modification made to Hieron is simple. We compute one prototype vector for each path from the root node to the given class using formula (5). Then, given one example during training or application, the inner products between that example and each prototype vector are compared with each other, and the example is assigned the class whose prototype is most relevant.

Finally, we replace the distance γ(X, Y) in the Hieron algorithm with the cost measure c(X, Y) between two concepts in the ontology. Therefore, we can learn classifiers which are optimised according to a distance-based cost measure, which encodes structural knowledge from the ontology.
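As an illustration of the first adaptation, the sketch below (with hypothetical input structures of our own naming) turns an annotated token sequence into the two hierarchical classification tasks: one labelling start tokens and one labelling end tokens, with the notional non-relevant concept for all other tokens.

    NON_RELEVANT = "NonRelevant"   # notional child of the ontology root

    def start_end_examples(tokens, annotations):
        """annotations: list of (start_index, end_index, concept)."""
        starts = {s: c for s, e, c in annotations}
        ends = {e: c for s, e, c in annotations}
        start_data, end_data = [], []
        for i, tok in enumerate(tokens):
            start_data.append((tok, starts.get(i, NON_RELEVANT)))
            end_data.append((tok, ends.get(i, NON_RELEVANT)))
        return start_data, end_data   # one hierarchical task each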
4. EXPERIMENTAL DATASET

The corpus used in our experiments consists of semantically annotated news articles. The articles were divided into three subsets according to the article's theme, namely business, international politics and UK politics, with 91, 99 and 100 articles, respectively. The corpus was annotated manually according to the Proton ontology (see http://proton.semanticweb.org), and the experiments below use its psys:Entity branch. As already discussed above, some concepts in Proton have more than one parent class. The hierarchical structure of Proton has 10 levels, with a maximum path length of 14. The news corpus was annotated with 146 concepts from the Proton ontology. The concepts span from the 3rd to the 10th level of the class hierarchy. As the corpus was created within the Sekt project, hereafter we refer to it as the Sekt ontology-annotated news corpus.

Table 1 presents the distribution of concepts with different numbers of instances in the corpus. We can see that there are 57 concepts each of which has 5 instances or fewer, thus presenting a data sparseness problem.

Table 1: Distribution of concepts with different numbers of instances in the Sekt ontology-annotated corpus.

  #instances    1    2    3    4    5    6-10    11-20    >20
  #concepts    23   14   12    6    2      13       21     55

In order to examine the effect of data sparseness on the algorithm's performance, we also created a version of the corpus where all classes were generalised into 7 high-level classes, which are broadly equivalent to labels used in traditional IE systems (e.g., Person, Location, etc.). Table 2 presents the numbers of instances of each of the seven concepts in each part of the corpus.

Table 2: Numbers of instances of the seven concepts in the three parts of the Sekt ontology-annotated corpus.

                  #Doc   Person    Loc    Org   Money   Contact   Temporal   Time
  Business          91      336    601   1385     490         2        119    743
  International     99      948   1857    840      75         1        112    550
  UK               100      845    847    845     100         0        106    670

The corpus is pre-processed with the open-source ANNIE system, which is part of GATE [7]. This enabled us to use a number of linguistic features, in addition to information already present in the document, such as words and capitalisation information. The linguistic features are domain-independent and include token kind (word, number, punctuation), lemma, part-of-speech (POS) tag, gazetteer class, and named entity type according to ANNIE's rule-based named entity recogniser (its entity labels are: Person, Organisation, Location, Date, Money, and Address). Table 3 shows an example of text with its associated features. Note that a token may not have values for all features; e.g., the token "Time" has no value for the Lookup feature because it does not occur in ANNIE's gazetteer lists.

Table 3: Example features for the text "Time: 3:30 PM". The Unknown value for the word "Time" means that ANNIE identified the word "Time" as a named entity but could not attribute a more specific entity type.

  Token   Case           Tokenkind     Lemma   Pos   Lookup   Entity
  Time    upperInitial   word          time    NNP            Unknown
  :                      punctuation   :       :
  3                      number        3       CD             Time
  :                      punctuation   :       :              Time
  30                     number        3       CD             Time
  PM      allCaps        word          pm      NNP   time     Time

Our experimental results using SVM (not presented here due to space limitations) showed that simple features alone, such as token form, morphological features and simple token types, can achieve quite good results, and that using more sophisticated features such as POS tags and named entities from ANNIE or a gazetteer only obtained a small improvement (less than 2%).

Since in IE the context of the token is usually as important as the token itself, the learning algorithm takes into account features of the preceding and following tokens, in addition to those of the current token. In our experiments the same number of left and right tokens was taken as a context window. The window size was set to 4, which means that the algorithm uses features derived from 9 tokens: the 4 preceding, the current, and the 4 following tokens. Due to the use of a context window, the input vector is the combination of the feature vector of the current token and those of its neighbouring tokens.
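A sketch of the resulting input vector, assuming a hypothetical per-token featuriser of fixed dimensionality (the dimension value here is illustrative, not from the paper); positions falling outside the document contribute zero vectors:

    import numpy as np

    def window_vector(tokens, i, feature_vector, size=4, dim=1000):
        parts = []
        for j in range(i - size, i + size + 1):
            if 0 <= j < len(tokens):
                parts.append(feature_vector(tokens[j]))   # dim-long vector
            else:
                parts.append(np.zeros(dim))               # padding
        return np.concatenate(parts)    # 9 * dim input vector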
5. EXPERIMENTAL RESULTS

The experiments reported here evaluate our modified implementation of the Hieron learning algorithm on the Sekt ontology-annotated news corpus. As there are no previously reported results on this corpus, we compare Hieron against two state-of-the-art "traditional" learning algorithms for IE: SVM and Perceptron. Since Hieron exploits the relationships among classes in the ontology, the expectation is that it would perform better on OBIE than SVM and Perceptron, which were designed basically for flat classification. In other words, these algorithms ignore the relationships between the classes and treat them as independent of each other.

As already discussed in Section 3.2, the Hieron algorithm is very similar to the uneven margins Perceptron, except that Hieron takes into account the relationships among classes as well as the misclassification cost measure c(X, Y). In these experiments, we used the uneven margins SVM and the Perceptron with uneven margins (PAUM), instead of the standard algorithms, because the uneven margins algorithms have better performance than the respective standard models for IE (see [17]). We used the SVM software SVMlight (see http://svmlight.joachims.org/) with a linear kernel, and the parameter C was set to 1. The default settings of SVMlight were adopted for all other SVM parameters.

For each algorithm, we ran three experiments, each using one of the three subsets of the corpus for testing and the other two for training. Table 4 presents the results of the three learning algorithms on the Sekt ontology-annotated news corpus, measured both by conventional micro-averaged F1 and by the distance-based F_o1 defined in formula (2). Recall that the micro-averaged F-measure treats concepts in the ontology as independent, whereas the distance-based metric takes into account the relationships in the ontology. In other words, if a mention in the text is found but mistakenly assigned the wrong ontology class, then no credit is given by the conventional micro-averaged F1, but some credit (inversely proportional to the distance between the two concepts in the ontology) is given by the distance-based measure.

Table 4: Comparison of the performance of Hieron, SVM and Perceptron learning on OBIE: micro-averaged F1 (%) and distance-induced F_o1 (%).

         Micro-averaged F1          Distance-induced F_o1
         PAUM    SVM    Hieron      PAUM    SVM    Hieron
  Bu.    74.1    75.3   82.7        78.8    79.3   91.2
  Int    77.1    80.1   83.3        83.0    85.9   91.3
  UK     82.0    82.9   82.5        83.6    84.4   90.1

Firstly, as shown in Table 4, Hieron outperforms the non-ontology-based IE methods measured by conventional F1. The 5 to 10% improvement in results demonstrates that the IE algorithm does benefit from taking into account the relationships between classes in the ontology and using this information during the learning process to optimise performance.
In addition, Hieron consistently achieves significantly higher performance than SVM and Perceptron according to the distance-based F_o1, which gives partial credit for ontological "near misses". This increased performance is due to the optimisation mechanism built into Hieron, where the distance between the classifiers for two concepts is proportional to the distance between the two concepts in the ontology. In contrast, SVM and other traditional IE learning methods are trained to treat each concept independently of the rest. Consequently, when Hieron misclassifies a mention, it is much more likely than the flat algorithms to assign a close concept if it cannot identify the correct one exactly. This is the reason for the big difference in performance between the two metrics.

Table 5 compares the computation times for training and application of the three learning algorithms on the ontology corpus. The business and international politics parts of the corpus were used for training and the UK politics part for testing. We ran these experiments on a Linux server with one Pentium 4 CPU (3.0GHz) and 2GB RAM. The results show that Perceptron is very fast, whereas SVM takes much longer than both Perceptron and Hieron. At the same time, SVM's performance in all experiments is only marginally better than that of Perceptron.

Table 5: Training and test times (in seconds) of the learning algorithms: Perceptron, SVM, and Hieron.

              PAUM    SVM      Hieron
  Training    552s    11450s   3815s
  Test        33s     111s     109s

The Hieron algorithm takes longer than Perceptron, but performs substantially better than both traditional IE algorithms. As the test set consisted of 100 documents, Hieron effectively took on average 1 second per document.

As already discussed in Section 3.2, we modified the Hieron algorithm presented in [8] by introducing multiple learning cycles on the training set. We also introduced a regularization parameter for each weight vector to guarantee that the training finishes after a finite number of learning cycles on any data. Table 6 compares the performance of the original Hieron version against our modified algorithm, using the business and international politics subsets for training and the UK politics subset for testing. Results for the last weights, as well as for the mean of all weights obtained during learning, are reported for all three Hieron variants.

Table 6: Comparison of the three variants of Hieron: micro-averaged (MA) F1 (%) and distance-induced F_o1 (%). Training time is shown in seconds.

          Single loop       Multi-loop          Regularization
          Last     Mean     Last      Mean      Last     Mean
  MA      79.8     79.1     81.3      82.2      82.5     82.5
  Dist.   89.0     88.3     89.3      89.8      90.1     90.0
  Time    510s     989s     54173s    97460s    3815s    11416s

As shown in Table 6, multi-loop learning (300 loops were used in the experiment) on its own achieves better results than single-loop learning, in particular with respect to the conventional micro-averaged F1, showing that multi-loop learning can better exploit regularities in the data. Secondly, when the regularization parameter is added to multi-loop learning, the performance improves even further, while the training time is only about one fourteenth of that of multi-loop learning alone. Finally, averaged weights perform slightly better than the last weights in some cases and a bit worse in others. However, averaging also requires much more computation time than the last weights, so there is a trade-off between performance and training time, which means that for practical systems using the last weights would probably be more practical. Last but not least, it should be noted that the differences in computation times between these variants of the Hieron algorithm affect only training, whereas application times remain the same.
5.1 Scoring Partially Correct Results

When IE systems are evaluated, the entities predicted by the system are compared to the entities in the human-annotated gold standard. In the above experiments we used Hieron to recognise separately the start and end tokens of the instances. Therefore, if both the start and end boundaries are recognised correctly, we say the entity is recognised correctly, or is an exact match to the gold standard. On the other hand, if only the start or the end token matches the gold standard, then it is referred to as a partial match or a partially correct result. Some IE systems (e.g. [6]) are evaluated using exact matches only, discarding partially correct results. However, traditional large-scale evaluations of IE systems, e.g., the Message Understanding Conferences (MUC), also give credit to partial matches [4]. Consequently, in the experiments reported above, we took into account both correct and partially correct results and, again following traditional IE scoring methods, partial results are assigned half the score of an exact match.

Table 7 compares the results for exact match against those for correct and partial match combined. Again we use the business and international politics subsets for training and the UK politics subset for testing. For exact match, Hieron performs worse than SVM with respect to the conventional F-measure. However, if partially correct results are taken into account, Hieron's results are higher, showing that the hierarchical classifier is better than the "flat" SVM at recognising the presence of instances in the text, even when it is not able to recognise their boundaries exactly.

Table 7: Comparison of exact match and partial match for SVM and Hieron: conventional micro-averaged F1 (%) and distance-based F_o1 (%).

                       F1                Distance F_o1
                       SVM     Hieron    SVM     Hieron
  Exact                79.3    76.7      80.5    82.0
  Exact and partial    82.9    82.5      84.4    90.1
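This scoring scheme amounts to a three-valued match function; a minimal sketch (our own formulation), treating entities as (start, end) token offsets:

    def match_score(gold, predicted):
        if predicted == gold:
            return 1.0     # exact match: both boundaries correct
        if predicted[0] == gold[0] or predicted[1] == gold[1]:
            return 0.5     # partially correct: one boundary matches
        return 0.0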
5.2 Heterogeneous and Homogeneous Data

As already discussed above, the Sekt ontology-annotated corpus contains documents of three different news types, with around 90 documents of each type. In the experiments discussed above, we used two of these parts for training and the third one for testing, resulting in heterogeneous training and test data. In addition, we ran an experiment where two thirds of the documents from each of the three parts were put together as training data and the remaining documents were used for testing. Consequently, the training and test data in this case are homogeneous. Using 3-fold cross-validation, the mean results were: conventional F1 = 0.850 and distance-based F1 = 0.932. Compared to the results for heterogeneous training and test data listed in Table 4, the homogeneous training and test data helped us achieve better results, but not by as much as expected. We attribute this to the fact that the documents in the corpus are annotated with classes from a general ontology (Proton), which does not contain domain-specific concepts that occur in only one or two of the subsets.

5.3 Learning Curves

In order to establish the effect of data sparseness on the system's performance, we carried out experiments to compare the learning curves for the top 7 classes (listed in Table 2) with those for all classes in the Sekt ontology-annotated corpus. The top 7 classes are high-level concepts in the Proton ontology and have sub-concepts. For example, Organisation subsumes CommercialOrganisation, EducationOrganisation, PoliticalEntity, etc.

Figure 2 shows the learning curves, using standard IE micro-averaged F1, on the top 7 classes and on all classes, respectively. Each experiment consisted of ten runs, and in each run N documents were selected at random from the Sekt ontology-annotated corpus as training data and all other documents were used for testing (in each run we selected the same number of documents from each of the three parts of the corpus, so the training and test data were homogeneous). The results for each experiment are the means over the ten runs, with confidence intervals at the 95% level.

[Figure 2: Hieron performance on different amounts of training data for the top 7 classes and all classes, respectively.]

The results show, unsurprisingly, that the more training documents are used, the better the results are. Secondly, on small training sets (10 or 20 documents), the results for the 7 classes are much better than for all classes. This is due to data sparseness, as each document contains many more examples when the class labels are generalised to the top 7. The training time for 7 classes is also much lower than that for all classes, due to the need to train fewer classifiers. Hence, if detailed information is not required, then learning fewer high-level concepts is better, as it needs less training data and is also much faster.

In addition, it is also interesting to compare the different performance measures as well as the different learning algorithms on small training sets, because in many applications there is often only a small amount of annotated training data. Figure 3 presents the results of Hieron, SVM and PAUM for all classes on the Sekt ontology-annotated corpus, measured both with conventional and distance-based F1.

[Figure 3: Comparison of Hieron, SVM and PAUM performance on small training sets.]

Firstly, for Hieron the difference between the conventional measure and the ontology-induced measure is much larger on small training data. On the other hand, for SVM (and PAUM) the difference between the measures changes slowly as the dataset grows. This reflects the different learning mechanisms adopted by Hieron and SVM, respectively. While the only aim of SVM is to classify every instance correctly, the Hieron classifier at first tries to classify an instance as correctly as it possibly can, and if it cannot classify the instance correctly, it then tries to classify it at the lowest possible cost, relative to the correct concept. Therefore, given a sufficiently large training set, Hieron can learn a good classifier for each class, and when the learnt model is applied, it classifies many instances correctly and minimises the cost on the small number of incorrectly classified ones. This leads to a small difference between the conventional measure, which considers classes separately, and the ontology-based measure, which takes into account the relations between classes. In contrast, when using a small training set, Hieron might not be able to learn a good classifier for some individual classes due to data sparsity. So when this model is applied, it classifies some examples correctly and minimises the cost on many of the misclassified instances. Consequently, the ontology-based measure is much higher than the conventional one for small training sets.

Secondly, Hieron has significantly better results than SVM on almost every training set. The ontology-based measures for Hieron are much higher than those for SVM, which in turn outperforms PAUM. On the other hand, PAUM is much faster in training and testing than both SVM and Hieron, while SVM takes much longer than Hieron, in particular for testing. Hence, overall we can conclude that Hieron is a good learning algorithm for OBIE, because it balances performance and computational complexity.
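For reference, one point of such a learning curve can be sketched as follows; run_once is a hypothetical function that trains on the sampled documents and returns F1 on the rest, and the normal-approximation confidence interval is our assumption, as the paper does not state how the intervals were computed:

    import random
    import statistics

    def learning_curve_point(docs, n, run_once, runs=10):
        scores = []
        for _ in range(runs):
            train = random.sample(docs, n)
            test = [d for d in docs if d not in train]
            scores.append(run_once(train, test))
        mean = statistics.mean(scores)
        # 95% confidence interval under a normal approximation
        half = 1.96 * statistics.stdev(scores) / (len(scores) ** 0.5)
        return mean, (mean - half, mean + half)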
6. EVALUATION AGAINST OTHER METHODS

Since there are no existing publicly available corpora annotated with more than 10 classes (the most recent Pascal evaluation challenge for IE systems has not yet released its human-annotated test data, so it cannot be used to compare against other systems), we evaluated our Hieron-based system on the recently created Sekt corpus. Unfortunately, this makes it difficult to compare our approach to previous systems, as this corpus is new and no previous evaluations have been reported on it, except those reported here. On the other hand, there exist several benchmarking corpora on which many IE systems have been evaluated. In order to compare our Hieron-based learning algorithm to other state-of-the-art learning systems for IE, we applied our algorithm to the CMU Seminars corpus, which is a standard benchmark for conventional IE. Since the Seminars corpus has only four types of entity labels with no relations between them, we were unsure whether the Hieron algorithm could outperform SVM, because Hieron depends heavily on the hierarchical relations between the labels.

The Seminars corpus has been used for evaluating many IE systems. These include rule learning based systems such as Rapier [1], BWI [13], SNoW [24] and (LP)2 [6], as well as statistical learning systems such as HMM [11] and maximum entropy (MaxEnt) [3]. The Seminars corpus contains 485 seminar announcement posts and is annotated with four types of information entities, namely the start time (stime), end time (etime), speaker and location of each seminar.

In order to make a fair comparison with other state-of-the-art learning algorithms for IE, our experimental settings followed the methodology of Rapier and (LP)2. First, the results are the average over 10 runs; in each run 243 documents were randomly selected from the corpus as training data and the remaining 242 were used for testing. Secondly, as far as possible, we used the same features as all other systems to enable a more informative comparison. In particular, the results discussed here, including those of our system, do not use any gazetteer information or named entity recogniser input. The only features used in this comparison are words, capitalisation information, token types, lemmas, and POS tags. Finally, we use exact match for evaluating the results and do not consider partially correct annotations (see Section 5.1), because the results of the other systems were reported in this way.

Since Hieron needs relations among the entity types, in order to apply it to this corpus we constructed a trivial ontology which consists of one concept Thing as the root node, Entity as the only child of Thing, and the four entity types Stime, Etime, Speaker and Location as the child concepts of Entity.
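As a sketch, this trivial ontology can be written as a child-to-parent map compatible with the helpers in Section 3.2, including the non-entity concept discussed below (the concept names here are ours):

    parent = {
        "Entity": "Thing",
        "NonEntity": "Thing",    # non-relevant concept under the root
        "Stime": "Entity",
        "Etime": "Entity",
        "Speaker": "Entity",
        "Location": "Entity",
    }
    # gamma("Stime", "Etime", parent) == 2, while
    # gamma("Stime", "NonEntity", parent) == 3: the four target classes
    # are closer to each other than to the non-entity class.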
Table 8 presents the results of our Hieron-based algorithm compared to other IE systems on the Seminars corpus. The SVM results are from an earlier experiment with an uneven margins SVM [17]. The results of all other systems were obtained from the papers cited at the start of this section.

Table 8: Comparison of our Hieron-based system to other learning algorithms on the Seminars corpus: F1 (%) on each slot and macro-averaged F1. The 95% confidence interval for MA F1 is given for Hieron.

            speaker   location   stime   etime   MA F1
  Hieron    75.5      83.1       95.4    93.7    87.0±1.1
  SVM       69.0      81.3       94.8    92.7    84.5
  (LP)2     77.6      75.0       99.0    95.5    86.8
  SNoW      73.8      75.2       99.6    96.3    86.2
  MaxEnt    65.3      82.3       99.6    94.5    85.4
  BWI       67.7      76.7       99.6    93.9    84.6
  HMM       71.1      83.9       99.1    59.5    78.4
  Rapier    53.1      73.4       95.9    94.6    79.1

The results show that on each of the four slots our system's performance is not far from the best per-slot measures, which are achieved by different systems. As a result, our Hieron-based system obtained the best overall performance, although it is not significantly better than the next two systems, (LP)2 and SNoW. From this we can conclude that our algorithm is also a very good, state-of-the-art learning algorithm for information extraction.

In our view, the excellent performance of our Hieron-based system is due to its exploiting the trivial seminar ontology which we created to meet the algorithm's requirements. In this ontology, the four classes of interest have the same distance to each other, whereas the distance is larger between each of them and the non-entity concept, which we added to the ontology as a child concept of the top concept Thing. Consequently, due to Hieron's learning mechanism, the classifier for the non-entity class differs more from the classifiers for the 4 target classes than those classifiers differ among themselves. This may lead to Hieron making fewer misclassifications than the other learning algorithms (e.g., SVM), which do not exploit the asymmetric similarities between the classes and non-entities in the Seminars corpus. However, the good performance of Hieron on the Seminars corpus with flat entity labels is worth some further investigation.

7. CONCLUSION

This paper investigated the adaptation and evaluation for ontology-based information extraction of Hieron, a large margin, Perceptron-like hierarchical classification algorithm. The algorithm takes into account the relations among concepts, thus benefiting directly from the ontology structure. We made several modifications to the original Hieron algorithm presented in [8], which, as shown by our evaluation, led to improved performance. The algorithm's performance was evaluated on the biggest available semantically annotated corpus, covering 146 concepts from the Proton ontology. Our system was compared to SVM and Perceptron, which are two state-of-the-art learning algorithms for IE.
The results showed that our hierarchical classification algorithm obtained clearly better results than Perceptron and SVM. The approach was evaluated on the Sekt ontology-annotated news corpus, as it is the only OBIE corpus annotated with classes from a non-trivial ontology. However, the problem is that no other systems have been evaluated on it, apart from the SVM and Perceptron comparisons reported here. Unfortunately, other recent corpora for IE evaluation use a flat set of labels, thus making them inappropriate for OBIE. Nevertheless, in order to enable some comparison of our approach to other work, we also ran Hieron on the Seminars corpus, despite it having a flat set of labels. The experimental results showed that our system again outperformed other state-of-the-art learning algorithms, indicating that Hieron is well suited for information extraction tasks.

Our implementation of Hieron for OBIE is based on a distance-based cost. We are currently working on using some improved cost measures instead (see the discussion in Section 3.1). Other areas of ongoing work are integrating active learning with the Hieron algorithm and further evaluation on related tasks, e.g., opinion mining.

8. ACKNOWLEDGEMENTS

This work is supported by the EU-funded SEKT (IST-2004-506826) and KnowledgeWeb (IST-2004-507482) projects. The authors also wish to thank the reviewers for their useful suggestions on improving the paper.

9. REFERENCES

[1] M. E. Califf. Relational Learning Techniques for Natural Language Information Extraction. PhD thesis, University of Texas at Austin, 1998.
[2] N. Cesa-Bianchi, C. Gentile, A. Tironi, and L. Zaniboni. Incremental Algorithms for Hierarchical Classification. In Neural Information Processing Systems, 2004.
[3] H. L. Chieu and H. T. Ng. A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 786-791, 2002.
[4] N. Chinchor. MUC-4 evaluation metrics. In Proceedings of the Fourth Message Understanding Conference, pages 22-29, 1992.
[5] P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In Proceedings of WWW'04, 2004.
[6] F. Ciravegna and Y. Wilks. Designing Adaptive Information Extraction for the Semantic Web in Amilcare. In S. Handschuh and S. Staab, editors, Annotation for the Semantic Web. IOS Press, Amsterdam, 2003.
[7] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), 2002.
[8] O. Dekel, J. Keshet, and Y. Singer. Large Margin Hierarchical Classification. In Proceedings of the 21st International Conference on Machine Learning (ICML-2004), Canada, 2004.
[9] S. Dill, J. A. Tomlin, J. Y. Zien, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, and A. Tomkins. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. In Proceedings of the 12th International Conference on World Wide Web (WWW2003), pages 178-186, Budapest, Hungary, May 2003.
[10] J. Domingue, M. Dzbor, and E. Motta. Magpie: Supporting Browsing and Navigation on the Semantic Web. In N. Nunes and C. Rich, editors, Proceedings of the ACM Conference on Intelligent User Interfaces (IUI), pages 191-197, 2004.
[11] D. Freitag and A. K. McCallum. Information Extraction with HMMs and Shrinkage. In Proceedings of the Workshop on Machine Learning for Information Extraction, pages 31-36, 1999.
[12] D. Freitag. Machine Learning for Information Extraction in Informal Domains. Machine Learning, 39(2/3):169-202, 2000.
[13] D. Freitag and N. Kushmerick. Boosted Wrapper Induction. In Proceedings of AAAI 2000, 2000.
[14] S. Handschuh, S. Staab, and F. Ciravegna. S-CREAM -- Semi-automatic CREAtion of Metadata. In 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02), pages 358-372, Siguenza, Spain, 2002.
[15] A. Kiryakov, B. Popov, D. Ognyanoff, D. Manov, A. Kirilov, and M. Goranov. Semantic annotation, indexing and retrieval. Journal of Web Semantics, ISWC 2003 Special Issue, 1(2):671-680, 2004.
[16] P. Kogut and W. Holmes. AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. In First International Conference on Knowledge Capture (K-CAP 2001), Workshop on Knowledge Markup and Semantic Annotation, Victoria, B.C., 2001.
[17] Y. Li, K. Bontcheva, and H. Cunningham. Using Uneven Margins SVM and Perceptron for Information Extraction. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), 2005.
[18] Y. Li, H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kandola. The Perceptron Algorithm with Uneven Margins. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), pages 379-386, 2002.
[19] D. Maynard, W. Peters, and Y. Li. Metrics for evaluation of ontology-based information extraction. In WWW 2006 Workshop on "Evaluation of Ontologies for the Web" (EON), Edinburgh, Scotland, 2006.
[20] L. K. McDowell and M. Cafarella. Ontology-Driven Information Extraction with OntoSyphon. In 5th International Semantic Web Conference (ISWC'06). Springer, 2006.
[21] E. Motta, M. Vargas-Vera, J. Domingue, M. Lanzoni, A. Stutt, and F. Ciravegna. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. In 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02), pages 379-391, Siguenza, Spain, 2002.
[22] J. Perna and A. Spector. Introduction to the Special Issue on Unstructured Information Management. IBM Systems Journal, 43(3), 2004.
[23] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448-453, Montreal, Canada, 1995.
[24] D. Roth and W. T. Yih. Relational Learning via Propositional Algorithms: An Information Extraction Case Study. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1257-1263, 2001.
[25] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Learning Hierarchical Multi-Category Text Classification Models. Journal of Machine Learning Research, 7:1601-1626, 2006.