Morpho Challenge - Evaluation of algorithms for unsupervised learning of morphology in various task s and languages Mik k o K urimo, S ami V irpioja, V ille T urunen, T eemu H irsim¨ k i a Adaptive Informatics Research Centre H elsink i U niversity of T echnolog y F I-0 2 0 15 , T K K , F inland Firstname.Lastname@tkk.fi A b strac t After the release of the open sou rce softw are implementation of M orfessor alg orithm, a series of several open evalu ations has b een org aniz ed for u nsu pervised morpheme analy sis and morpheme-b ased speech recog nition and information retrieval. T he u nsu pervised morpheme analy sis is a particu larly attractive approach for speech and lang u ag e technolog y for the morpholog ically complex lang u ag es. W hen the amou nt of distinct w ord forms b ecomes prohib itive for the constru ction of a su ffi cient lex icon, it is important that the w ords can b e seg mented into smaller meaning fu l lang u ag e modeling u nits. In this presentation w e w ill demonstrate the resu lts of the evalu ations, the b aseline sy stems b u ilt u sing the open sou rce tools, and invite research g rou ps to participate in the nex t evalu ation w here the task is to enhance statistical machine translation b y morpheme analy sis. A proposal for a T y pe I I D emo 1 1 .1 Ex tended A b strac t T he segmentation of w ords into morphemes O ne of the fu ndamental task s in natu ral lang u ag e processing applications, su ch as larg e-vocab u lary speech recog nition (L V CS R), statistical machine translation (S M T ) and information retrieval (IR), is the morpholog ical analy sis of w ords. It is particu larly important for the morpholog ically complex lang u ag es, w here the amou nt of different 13 w ord forms is su b stantially increased b y infl ection, derivation and composition. T he decomposition of w ords is req u ired not only for u nderstanding the sentence, b u t in many lang u ag es also for ju st representing the lang u ag e b y any tractab le and trainab le statistical model and lex icon. T he manu ally composed ru le-b ased morpholog ical analy z ers can solve these prob lems to some ex tent, b u t only a fraction of the ex isting lang u ag es have b een covered so far, and for many the coverag e of the relevant content is insu ffi cient. T he ob jective of the M orpho Challeng e1 is to desig n and evalu ate new u nsu pervised statistical machine learning alg orithms that discover w hich morphemes (smallest individu ally meaning fu l u nits of lang u ag e) w ords consist of. T he g oal is to discover b asic vocab u lary u nits su itab le for different task s, su ch as L V CS R, S M T and IR. In u nsu pervised learning the list of morphemes is not pre-specifi ed for each lang u ag e, b u t the optimal morpheme lex icon and morpheme analy sis of all different w ord forms is statistically optimiz ed from a larg e tex t corpu s in a completely data-driven manner. T he evalu ation of the morpheme analy sis alg orithms is performed b oth b y a ling u istic and an application oriented task . T he analy sis ob tained for a long list of w ords is fi rst compared to the ling u istic g old standard representing a g rammatically correct analy sis b y verify ing that the morphemesharing w ord pairs are the correct ones (K u rimo et al., 2 0 0 7 ). T his is repeated in different lang u ag es and then the ob tained decomposition of w ords is applied in state-of-the-art sy stems ru nning variou s 1 S ee http://w w w .cis.hu t.fi /morphochalleng e2 0 0 9 / Proceedings of NAACL HLT 2009: Demonstrations, pages 13­16, Boulder, Colorado, June 2009. c 2009 Association for Computational Linguistics NLP applications. The suitability of the morphemes is v erifi ed by comparing the performance of the systems to each other and to systems using unprocessed w ord s or conv entional w ord processing alg orithms lik e stemming or rule-based d ecompositions. A s a baseline method in all application, w e hav e built systems by applying the M orfessor alg orithm, w hich is an unsuperv ised w ord d ecomposition alg orithm d ev eloped at our research g roup (C reutz and Lag us, 20 0 2) and released as open source softw are implementation2 . 1.2 Morphemes in Information Retrieval In information retriev al (IR ) from tex t d ocuments a typical task is to look for the most relev ant d ocuments for a g iv en q uery. O ne of the k ey challeng es is to red uce all the infl ected w ord forms to a common root or stem for effectiv e ind ex ing . F rom the morpheme analysis point of v iew this task is to d ecompose all the w ord s in the q uery and tex t d ocuments and fi nd out those common morphemes w hich form the most relev ant link s. In M orpho C halleng e the IR systems built using the unsuperv ised morpheme analysis alg orithms are compared in state-of-the-art C LE F task s in F innish, G erman and E ng lish (K urimo and Turunen, 20 0 8 ) using the mean av erag e precision metric. The results are also compared to those obtained by the g rammatical morphemes as w ell as the stemming and w ord normaliz ation method s conv entionally used in IR . 1.3 Morphemes in S peec h Rec og nition In M orpho C halleng e the unsuperv ised morpheme alg orithms hav e been compared by using the morphemes to train statistical lang uag e mod els and applying the mod els in state-of-the-art LV C S R task s in F innish and Turk ish (K urimo et al., 20 0 6 a). B enchmark s for the same task s w ere obtained by mod els that utiliz e the g rammatical morphemes as w ell as trad itional w ord -based lang uag e mod els. 1.4 Morphemes in Mac hine T ranslation In larg e-v ocabulary continuous speech recog nition (LV C S R ) one k ey part of the process is the statistical lang uag e mod eling w hich d etermines the prior probabilities of all the possible w ord seq uences. A n especially challeng ing task is to cov er all the possible w ord forms w ith suffi cient accuracy, because any out-of-v ocabulary w ord s w ill not only be nev er correctly recog niz ed , but also sev erely d eg rad e the mod eling of the other nearby w ord s. B y d ecomposing the w ord s into meaning ful sub-w ord units, such as morphemes, larg e-v ocabulary lang uag e mod els can be successfully built ev en for the most d iffi cult ag g lutinativ e lang uag es, lik e F innish, E stonian and Turk ish (K urimo et al., 20 0 6 b). 2 The state-of-the-art statistical machine translation (S M T) systems are affected by the morpholog ical v ariation of w ord s at tw o d ifferent stag es (V irpioja et al., 20 0 7 ). In the fi rst stag e, the alig nment of the source and targ et lang uag e w ord s in a parallel training corpus and the training of the translation mod el can benefi t from the d ecomposition of complex w ord s into morphemes. This is particularly important w hen either the targ et or the source lang uag e, or both, are morpholog ically complex . The fi nal stag e w here the targ et lang uag e tex t is g enerated , may also req uire morpheme-based mod els, because the larg e-v ocabulary statistical lang uag e mod els are applied in the same w ay as in LV C S R . In the on-g oing M orpho C halleng e 20 0 9 competition, the morpheme analysis alg orithms are compared in S M T task s, w here the analysis is need ed for the source lang uag e tex ts. The E uropean Parliament parallel corpus (K oehn, 20 0 5 ) is used in the ev aluation. The source lang uag es are F innish and G erman and the targ et in both task s is E ng lish. To obtain a state-of-the-art performance in the task s the morpheme-based S M T w ill be combined w ith a w ord -based S M T using the M inimum B ayes R isk (M B R ) interpolation of the N-best translation hypothesis of both systems (d e G ispert et al., 20 0 9 ). 1.5 Morpho C halleng e 20 0 9 S ee http://w w w .cis.hut.fi /projects/morpho/ A s its pred ecessors, the M orpho C halleng e 20 0 9 competition is open to all and free of charg e. The participants' are ex pected to use their unsuperv ised machine learning alg orithms to analyz e the w ord lists of d ifferent lang uag es prov id ed by the org aniz ers and submit the results of their morpheme analysis. The org aniz ers w ill then run the ling uistic ev aluations and build the IR and S M T systems and prov id e all the results and comparisons of the d ifferent systems. The participated alg orithms and ev aluation 14 results will be presented at the Morpho Challenge work shop that is c urrently planned to tak e plac e within the H L T -N A A CL 2 0 1 0 c onferenc e. Acknowledgments T he Morpho Challenge c om petitions and work shops are part of the E U N etwork of E x c ellenc e P A S CA L Challenge program and organiz ed in c ollaboration with CL E F . W e are grateful to Mathias Creutz , E bru A risoy , S tefan B ordag, N iz ar H abash and Majdi S awalha for c ontributions in proc essing the training data and c reating the gold standards. T he A c adem y of F inland has supported the work in the projec ts Adaptive Informatics and N ew adaptive and learning meth ods in speech recog nition. and the baseline sy stem s for v arious languages dev eloped using the Morfessor algorithm for word dec om position, IR , L V CS R and S MT . T he audienc e will also be welc om e to try their own input for these baseline sy stem s and v iew the results. T he sc ript is presented below for a poster-sty le and try -it-y ourself on laptop dem o, but it will work well as a lec ture-sty le show, too, if needed. In the poster we illustrate the following points: 1 . B asic c harac teristic s of the unsuperv ised learning algorithm s and m orphem e analy sis results in different languages (F innish, T urk ish, G erm an, E nglish, A rabic ) as in T able 1 , dem o: h ttp://w w w .cis.h u t.fi /projects/morph o/. 2 . T he results of the ev aluations against the linguistic gold standard m orphem es in different languages, see e.g. F igure 1 . 3 . T he results of the IR ev aluations and c om parisons to the perform anc e of gram m atic al m orphem es, word-based m ethods and stem m ing in different languages, see e.g. F igure 2 . 4 . T he results of the L V CS R ev aluations with c om parisons to gram m atic al m orphem es and word-based m ethods, see e.g. F igure 3 . 5 . T he c all for partic ipation in the Morpho Challenge 2 0 0 9 c om petition where the new ev aluation task is using m orphem es in S MT . R efer ences M. Creutz and K . L agus. 2 0 0 2 . U nsuperv ised disc ov ery of m orphem es. In W ork sh op on M orph olog ical and P h onolog ical L earning of AC L -0 2 . A . de G ispert, S . V irpioja, M. K urim o, and W . B y rne. 2 0 0 9 . Minim um B ay es risk c om bination of translation hy potheses from alternativ e m orphologic al dec om positions. S ubm itted to H L T -N AAC L . P . K oehn. 2 0 0 5 . E uroparl: A parallel c orpus for statistic al m ac hine translation. In M T S u mmit X . M. K urim o and V . T urunen. 2 0 0 8 . U nsuperv ised m orphem e analy sis ev aluation by IR ex perim ents ­ Morpho Challenge 2 0 0 8 . In C L E F . M. K urim o, M. Creutz , M. V arjok allio, E . A risoy , and M. S arac lar. 2 0 0 6 a. U nsuperv ised segm entation of words into m orphem es - Challenge 2 0 0 5 , an introduc tion and ev aluation report. In P AS C AL C h alleng e W ork sh op on U nsu pervised seg mentation of w ords into morph emes. M. K urim o, A . P uurula, E . A risoy , V . S iiv ola, T . H ir¨ ¨ ¨ sim ak i, J . P y lk k onen, T . A lum ae, and M. S arac lar. 2 0 0 6 b. U nlim ited v oc abulary speec h rec ognition for agglutinativ e languages. In H L T -N AAC L . M. K urim o, M. Creutz , and M. V arjok allio. 2 0 0 7 . Morpho Challenge ev aluation using a linguistic G old S tandard. In C L E F . ¨ S . V irpioja, J . J . V ay ry nen, M. Creutz , and M. S adeniem i. 2 0 0 7 . Morphology -aware statistic al m ac hine translation based on m orphs induc ed in an unsuperv ised m anner. In M T S u mmit X I. D enm ark . F igure 1 : F -m easures for the T urk ish m orphem e analy sis. 2 S cr ip t ou tline for th e demo p r esenta tion In this dem o we will present the ac hiev em ents of the Morpho Challenge 2 0 0 5 -2 0 0 8 c om petition in graphs 15 T he laptop is used to dem onstrate the baseline sy stem s we hav e rec ently dev eloped for different task s that are all based on unsuperv ised m orphem es: Example word Finnish: lin u xiin T u r k ish: popU lerliG in i A r a b ic : A lmtH dp G e r m a n: z u ru ec k z u b eh alten E ng lish: b ab y -s itters M orfes s or an aly s is lin u x + iin pop + U + ler + liG in i A l+ mtH d + p z u ru ec k + z u + b e+ h alten b ab y -+ s itter + s G old S tan dard lin u x N + IL L popU ler + D ER lH g + P popU ler + D ER lH g + P mu t aH idap P O S :P N A mu t aH id P O S :A J A l+ z u ru ec k B z u b e h alt V b ab y N s it V er s + P L O S2S +A C C , O S3 +A C C 3 l+ + S G , +SG + IN F T ab le 1 : M orph eme an aly s is examples in differen t lan g u ag es . F ig u re 2 : P rec is ion performan c es for th e G erman IR . F ig u re 4 : S c reen s h ot of th e morph eme-b as ed s peec h rec og n iz er in ac tion for F in n is h . A n offl in e v ers ion c an b e tried in http://www.cis.hut.fi/projects/speech/. F ig u re 3 : L V C S R error rates for th e T u rk is h tas k . 1 . O n lin e L V C S R s y s tem for h ig h ly ag g lu tin ativ e lan g u ag es , s ee e.g . s c reen s h ot in F ig u re 4 . 2 . O n lin e IR s y s tem for h ig h ly ag g lu tin ativ e lan g u ag es . 3 . O n lin e S M T s y s tem wh ere th e s ou rc e lan g u ag e is a h ig h ly ag g lu tin ativ e lan g u ag e, s ee e.g . s c reen s h ot in F ig u re 5 . F ig u re 5 : S c reen s h ot of th e morph eme-b as ed mac h in e tran s lator in ac tion for F in n is h -En g lis h . A s implifi ed web in terfac e to th e s y s tem is als o av ailab le (pleas e email to th e au th ors for a lin k ). 16