WWW 2007 / Poster Paper

Topic: Semantic Web

Ontology Engineering Using Volunteer Labor
Benjamin M Good and Mark D Wilkinson
iCAPTURE Centre for Cardiovascular and Pulmonary Research The University of British Columbia St. Paul's Hospital, Vancouver, BC, V6Z 1Y6 Canada

goodb@interchange.ubc.ca, mwilkinson@mrl.ubc.ca

ABSTRACT
We describe an approach designed to reduce the costs of ontology development through the use of untrained, volunteer knowledge engineers. Results are provided from an experiment in which volunteers were asked to judge the correctness of automatically inferred subsumption relationships in the biomedical domain. The experiment indicated that volunteers can be recruited fairly easily but that their attention is difficult to hold, that most do not understand the subsumption relationship without training, and that incorporating learned estimates of trust into voting systems is beneficial to aggregate performance.

Categories and Subject Descriptors
H.5.3 [Group and Organization Interfaces]: Collaborative computing, Evaluation/methodology, Web-based interaction.

General Terms
Design, Economics, Experimentation, Human Factors.

Keywords
ontology engineering, knowledge acquisition, semantic web

Copyright is held by the author/owner(s). WWW 2007, May 8-12, 2007, Banff, Alberta, Canada.

1. INTRODUCTION
Ontologies are a fundamental component of the incipient Semantic Web. For its vision to be achieved, ontologies need to be written in Semantic Web compatible languages such as OWL and used to annotate the resources of the Web. However, as with many previous efforts in the domain of artificial intelligence, ontology development faces the problem of the knowledge acquisition bottleneck. Given current approaches, ontology development is a slow, expertise-heavy, labor-intensive, and thus costly enterprise.

The work presented here is part of a larger project that seeks to dramatically reduce the costs associated with ontology development by altering the process of knowledge acquisition such that it may be distributed across a very large number of volunteers simultaneously via the Internet. The process starts with a seed ontology that may be generated automatically or semi-automatically; for example, from text [1], or from a translation of an existing structured resource such as a thesaurus [4]. The putative classes and relations in the inferred ontology are then validated and refined based on answers to questions about them posed to a large pool of volunteers. The simplest form of these questions asks whether a given ontological statement is `true' or `false'. Each question is posed to multiple volunteers. To make improvements to the ontology, the responses are combined using methods that attempt to incorporate estimates of trust in each volunteer.

The goal of the work presented here is to estimate how well our system can detect errors in auto-generated statements of the subsumption relationship without any training for the volunteers. The relationships that we use are drawn from the biomedical domain. For example, how well can volunteers (individually or in aggregate) answer questions such as "is a nipple a kind of breast" or "is a lymphocyte a sub-class of a lymphatic system"?

2. Creating an OWL version of MeSH
MeSH, which stands for `Medical Subject Headings', is the thesaurus used by the United States National Library of Medicine to index the millions of biomedical journal articles described in the MEDLINE database (http://www.nlm.nih.gov/bsd/disted/mesh/index.html). MeSH has been automatically converted to OWL using a simple, but problematic, mapping from the `narrower than' thesaural relation to the rdfs:subClassOf relation (http://www.berkeleybop.org/ontologies/). By our estimation, about 40% of the predicted sub-class relations are incorrect. Many are statements of meronymy, as in the nipple-breast example above, but there are many more subtle problems in the mix as well [3]. The experiment described below tests our volunteer-driven system's ability to detect these errors.
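As an illustration of the thesaurus-to-OWL mapping discussed in Section 2, the following minimal sketch turns `narrower than' pairs into rdfs:subClassOf triples. It is not the conversion tool that produced MeSH.owl; the namespace `BASE`, the function name `narrower_to_subclass` and the input pairs are assumptions made for the example. The sketch shows why the mapping is problematic: a part-of pair such as nipple/breast is emitted as a sub-class statement exactly like any other pair.

```python
# Real RDFS property IRI; everything else below is hypothetical.
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
BASE = "http://example.org/mesh#"  # assumed namespace, not the real MeSH.owl one


def narrower_to_subclass(narrower_pairs):
    """Map each (narrower_term, broader_term) pair to a subClassOf triple."""
    triples = []
    for narrower, broader in narrower_pairs:
        triples.append((BASE + narrower, RDFS_SUBCLASS_OF, BASE + broader))
    return triples


# Toy input: the first pair comes from an example question in the paper,
# the second is the meronymy (part-of) example from the introduction.
pairs = [
    ("mast_cell", "connective_tissue_cell"),
    ("nipple", "breast"),  # part-of, yet mapped to subClassOf all the same
]

triples = narrower_to_subclass(pairs)
for subject, predicate, obj in triples:
    print(f"<{subject}> <{predicate}> <{obj}> .")
```

Because the mapping never inspects what kind of relation `narrower than' stands for in a given pair, detecting the incorrect statements has to happen afterwards, which is the task given to the volunteers.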

3. Experiment
Following from previous work that utilized scientific conferences as settings for focused knowledge capture efforts [2], this experiment took place at the annual meeting of a large research project directed at identifying biomarkers of allograft rejection (http://www.allomark.ubc.ca/). The setting of the meeting made it easy to identify volunteers from the biomedical domain and to provide motivation for their participation in the form of a small prize awarded to the most prolific contributor at the end of the conference. The volunteers were asked to log in to a website and answer a series of questions about subsumption relations from MeSH.owl. These questions were provided in one of two forms: "Is it true that a `mast_cell' is also a `connective_tissue_cell'?" or "Is it true that all instances of the class `b-lymphocyte' are also instances of the class `antibody_producing_cell'?"

3.1 Test data
To measure the performance of the volunteer system on this task, we used a sample of 130 MeSH.owl sub-class relations, which we manually labeled as either true or false. The sample relation set was generated by extracting the subgraph of the MeSH term `immune system', which included all of its parents, all of its subclasses and all of the parents of all of its subclasses. The term `immune system' was chosen because the topic of the meeting where the experiment was conducted was closely related to immunology.

ACM 978-1-59593-654-7/07/0005.
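The subgraph extraction described in Section 3.1 can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that `parents' means direct parents and `subclasses' means all transitive subclasses, so the sample is every sub-class edge whose child is the focus term or one of its descendants. The function name `extract_sample_relations` and the toy edges are invented for the example and are not actual MeSH content.

```python
from collections import defaultdict


def extract_sample_relations(edges, term):
    """Return all (child, parent) edges whose child is `term` or a descendant.

    This covers the term's parents, its subclasses, and the parents of
    those subclasses (assumed interpretation of the sampling procedure).
    """
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)

    # Collect all transitive subclasses of the focus term.
    descendants = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in descendants:
                descendants.add(child)
                stack.append(child)

    focus = descendants | {term}
    return sorted((c, p) for c, p in edges if c in focus)


# Invented toy hierarchy, for illustration only.
toy_edges = [
    ("immune_system", "body_system"),
    ("lymphocyte", "immune_system"),
    ("lymphocyte", "blood_cell"),
    ("b_lymphocyte", "lymphocyte"),
    ("blood_cell", "cell"),
    ("neuron", "cell"),
]

sample = extract_sample_relations(toy_edges, "immune_system")
for child, parent in sample:
    print(child, "subClassOf", parent)
```

In the toy hierarchy the edge from `blood_cell` to `cell` is left out, because `blood_cell` is only a parent of a subclass and not itself a subclass of the focus term.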

Table 1. Performance on subclass-assessment task using the different aggregation methods. The F-measure is the harmonic mean of precision P and recall R, where P = tp/(fp+tp), R = tp/(fn+tp), F-measure = 2*P*R/(P+R).

Aggregation Method                   % correct   F-false   F-true
A Single Volunteer                   .62         .17       .75
Majority Vote (MV)                   .64         .23       .77
MV weighted by time between votes    .63         .47       .71
1R                                   .71         .56       .78
SVM                                  .75         .64       .78
Naive Bayes                          .75         .64       .81
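The F-false and F-true columns apply the F-measure formula from the caption of Table 1 twice, once with `false' and once with `true' treated as the positive class. The sketch below shows that computation on a small invented set of labels; the function name `f_measure` and the example data are assumptions, and the numbers are unrelated to the results in Table 1.

```python
def f_measure(gold, predicted, positive):
    """F-measure for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)

    precision = tp / (fp + tp) if (fp + tp) else 0.0
    recall = tp / (fn + tp) if (fn + tp) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented labels: gold is the manual annotation, predicted the aggregate vote.
gold = [True, True, False, False, True]
predicted = [True, True, True, False, False]

f_true = f_measure(gold, predicted, positive=True)
f_false = f_measure(gold, predicted, positive=False)
print(f"F-true = {f_true:.2f}, F-false = {f_false:.2f}")
```

Reporting both values matters here because the task is error detection: a method can score well on F-true simply by accepting most relations while still missing most of the false ones.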

4. Results
Over the course of the 2-day experiment, responses from 25 volunteers were recorded. All but two of these were from the 50 attendees of the Biomarkers annual meeting; the others were from external IP addresses. As observed in previous experiments of this nature and displayed in Figure 1, the amount of labor provided per volunteer exhibited a characteristic long-tail distribution, with a few volunteers contributing the large majority of the work. Overall, only 5 of the volunteers responded to more than 25% of the questions and only one volunteer responded with an assertion of `true' or `false' to more than 90% of the questions in the test set.

5. Discussion and future work
Due to the relatively small number of volunteers and test cases, the work presented here should be considered preliminary. However, it did reiterate previous results indicating that volunteers can be found, but that this kind of task and this kind of incentive strategy are sufficient to keep the attention of only a small fraction of the recruits. It also suggested that learning algorithms can aid in forming intelligent aggregates of multiple voters on ontology evaluation tasks. Future experiments will test the effects of various training mechanisms and incentive strategies for improving individual volunteer performance and will continue to evaluate different approaches to combining the assertions of multiple volunteers of varying levels of knowledge and reliability.

Figure 1. Volume of participation per volunteer (x-axis: volunteer, 1 to 25; y-axis: fraction of subclass judgments made).

4.1 Performance of aggregated responses
Five methods were tested for combining the multiple volunteer assertions about each putative MeSH sub-class relation. The simplest method was to take the majority vote for each potential sub-class. The next method weighted each vote based on the time taken between it and the previous vote. Weighting by time spent per vote provides an extremely simple estimate of how much to `trust' each vote, on the premise that more time spent might imply more careful thought and thus better performance. As they did not use any training, both of these aggregation methods were evaluated on the entire set of samples.

The other three methods involved machine learning algorithms (1R, Support Vector Machines, and Naive Bayes) that attempted to learn how best to combine the votes using the data collected. If, for example, one voter consistently voted correctly, these algorithms should detect that voter and weight their responses above others. Each row of training/testing data for these methods consisted of the target class (the true/false label for one sub-class relation), the votes for that relation from each volunteer who voted on it (including assertions of `I don't know'), and the ratio of the true versus false votes gathered from all volunteers for that relation. These methods were evaluated using 10-fold cross-validation over the whole set of samples.

Table 1 provides a summary of the results obtained for the various methods. It is problematic to directly compare the results of the cross-validation evaluations to those from the non-learning based approaches, but there does seem to be an advantage gained by the learning methods.
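The two non-learning aggregation methods and the layout of one training row can be sketched as below. This is a minimal illustration under stated assumptions, not the system used in the experiment: the paper does not give the exact weighting function, so raw seconds are used as the vote weight; ties are resolved as `false'; and the true/false ratio is smoothed by adding one to each count to avoid division by zero. All function names and example values are hypothetical.

```python
def majority_vote(votes):
    """votes: list of True, False or None (None = 'I don't know')."""
    n_true = sum(1 for v in votes if v is True)
    n_false = sum(1 for v in votes if v is False)
    return n_true > n_false  # assumption: a tie counts as 'false'


def time_weighted_vote(timed_votes):
    """timed_votes: list of (vote, seconds since the voter's previous vote)."""
    score = 0.0
    for vote, seconds in timed_votes:
        if vote is None:
            continue  # 'I don't know' carries no weight
        score += seconds if vote else -seconds  # assumption: weight = seconds
    return score > 0


def feature_row(votes_by_volunteer, volunteers, label):
    """Build one training/testing row for a single sub-class relation."""
    encode = {True: "true", False: "false", None: "dont_know"}
    row = {}
    for name in volunteers:
        if name in votes_by_volunteer:
            row[name] = encode[votes_by_volunteer[name]]
        else:
            row[name] = "no_vote"  # volunteer never saw this relation
    cast = list(votes_by_volunteer.values())
    n_true = sum(1 for v in cast if v is True)
    n_false = sum(1 for v in cast if v is False)
    row["true_false_ratio"] = (n_true + 1) / (n_false + 1)  # assumed smoothing
    row["label"] = label
    return row


# Invented example: one slow 'true' vote against two quick 'false' votes.
timed = [(True, 30.0), (False, 5.0), (False, 4.0), (None, 10.0)]
mv = majority_vote([vote for vote, _ in timed])
tw = time_weighted_vote(timed)
row = feature_row({"v1": True, "v2": False, "v3": None},
                  ["v1", "v2", "v3", "v4"], label=False)
print(mv, tw, row)
```

In this example the two methods disagree: the plain majority says `false', while the time-weighted vote says `true' because the single `true' voter spent far longer on the question. Rows like the one built by `feature_row` are the kind of input a learner would need in order to discover which volunteers deserve more weight.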

6. ACKNOWLEDGMENTS
Our thanks to Robert Stevens and Andrew Gibson for advice on the design of the experiment and to the volunteers who participated in the study. BMG is supported by an award to the Better Biomarkers in Transplantation project from Genome B.C., in part through Genome Canada. MDW is supported by an award to the iCAPTURE Centre from the Michael Smith Foundation for Health Research. Core laboratory funding provided by the Natural Sciences and Engineering Research Council of Canada. Infrastructure support provided by IBM and SUN.

7. REFERENCES
[1] Cimiano, P., Hotho, A. and Staab, S. Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research, (2005), 24. 305-339.

[2] Good, B.M., Tranfield, E.M., Tan, P.C., Shehata, M., Singhera, G.K., Gosselink, J. and Wilkinson, M.D., Fast, cheap and out of control: A zero curation model for ontology development. in Pacific Symposium on Biocomputing, (Hawaii, USA, 2006), 128-139.

[3] Nelson, Stewart. Relations in Medical Subject Headings (2001), http://www.nlm.nih.gov/mesh/meshrels.html

[4] Van Assem, M., Menken, M., Schreiber, G., Wielemaker, J. and Wielinga, B., A Method for Converting Thesauri to RDF/OWL. in ISWC, (2004), 17-31.
