SIGIR 2007 Proceedings, Session 18: Music Retrieval

Towards Musical Query-by-Semantic-Description using the CAL500 Data Set

Douglas Turnbull(1), Luke Barrington(2), David Torres(1), Gert Lanckriet(2)
(1) Department of Computer Science and Engineering, (2) Department of Electrical and Computer Engineering
University of California, San Diego, La Jolla, CA, USA
[dturnbul@cs, lbarrington@, datorres@cs, gert@ece].ucsd.edu

ABSTRACT
Query-by-semantic-description (QBSD) is a natural paradigm for retrieving content from large databases of music. A major impediment to the development of good QBSD systems for music information retrieval has been the lack of a cleanly-labeled, publicly-available, heterogeneous data set of songs and associated annotations. We have collected the Computer Audition Lab 500-song (CAL500) data set by having humans listen to and annotate songs using a survey designed to capture 'semantic associations' between music and words. We adapt the supervised multi-class labeling (SML) model, which has shown good performance on the task of image retrieval, and use the CAL500 data to learn a model for music retrieval. The model parameters are estimated using the weighted mixture hierarchies expectation-maximization algorithm, which has been specifically designed to handle real-valued semantic associations between words and songs, rather than binary class labels. The output of the SML model, a vector of class-conditional probabilities, can be interpreted as a semantic multinomial distribution over a vocabulary. By also representing a semantic query as a query multinomial distribution, we can quickly rank order the songs in a database based on the Kullback-Leibler divergence between the query multinomial and each song's semantic multinomial. Qualitative and quantitative results demonstrate that our SML model can both annotate a novel song with meaningful words and retrieve relevant songs given a multi-word, text-based query.

Categories and Subject Descriptors
I.2.m [Computing Methodologies]: Artificial Intelligence--Miscellaneous; J.5 [Computer Applications]: Arts and Humanities--Music

General Terms
Algorithms, Design, Experimentation

Keywords
Query-by-semantic-description, supervised multi-class classification, content-based music information retrieval

1. INTRODUCTION
An 80-gigabyte personal MP3 player can store about 20,000 songs. Apple iTunes, a popular Internet music store, has a catalogue of over 3.5 million songs (statistics from www.apple.com/itunes, January 2007). Query-by-semantic-description (QBSD) is a natural paradigm for navigating such large databases of music. For example, one may wish to retrieve songs that "have strong folk roots, feature a banjo, and are uplifting." We propose a content-based QBSD music retrieval system that learns a relationship between acoustic features and words using a heterogeneous data set of songs and associated annotations. Our system directly models the relationship between audio content and words and can be used to search for music using text-based queries composed of one or more words from a large vocabulary. While QBSD has been studied in computer vision research for both content-based image and video retrieval [1-4], it has received far less attention within the music information retrieval (MIR) community [5]. One major impediment has been the lack of a cleanly-labeled, publicly-available data set of annotated songs. The first contribution of this paper is such a data set: the Computer Audition Lab 500-Song (CAL500) data set, which can be downloaded at http://cosmal.ucsd.edu/cal. CAL500 consists of 500 popular music songs, each of which has been annotated by a minimum of three listeners.
A subset of the songs is taken from the publicly-available Magnatunes data set [6], while the remaining songs can be downloaded from any number of web-based music retailers (such as Rhapsody or Apple iTunes). For all songs, we also provide various features that have been extracted from the audio. Each annotation was collected by playing music for human listeners and asking them to fill out a survey about their auditory experience. The results of the survey were then converted into annotation vectors over a 159-word vocabulary of musically-relevant semantic concepts.

Our second contribution is showing that the CAL500 data set contains useful information which can be used to train a QBSD music retrieval system. This system is based on a supervised multi-class labeling (SML) probabilistic model [1], which has shown good performance on the task of image retrieval. The SML model estimates a Gaussian Mixture Model (GMM) distribution over an audio feature space conditioned on each word in a semantic vocabulary. Parameter estimation is done using the efficient mixture hierarchies expectation-maximization (MH-EM) algorithm. However, for the task of music retrieval, we have to modify this parameter estimation technique to handle real-valued semantic weights, as opposed to binary class labels. Weights are more appropriate in the context of music since music is inherently subjective; each individual has their own personal experience when listening to a song. For example, we find that three out of four college students annotate Elvis Presley's "Heartbreak Hotel" as being a 'blues' song, while everyone identifies B.B. King's "Sweet Little Angel" as being a blues song. By extending the MH-EM algorithm to handle real-valued semantic weights between songs and words, we can explicitly model their respective strengths of association.

Our third contribution is showing how the SML model can be used to handle multiple-word queries. (See Table 1 for an illustrative example.) When annotating a novel song, the SML model produces a vector of class-conditional probabilities for each word in a vocabulary. Using Bayes' rule, we obtain the set of posterior probabilities that represents a semantic multinomial distribution over the vocabulary. If we formulate a user-specified query as a query multinomial distribution over the same vocabulary, we can efficiently rank-order all the songs in a large database by calculating the Kullback-Leibler (KL) divergence between the query multinomial and each song's semantic multinomial.
The following section discusses how this work fits into the field of music information retrieval and relates to research on semantic retrieval of images and audio. Section 3 formulates the SML model used to solve the related problems of semantic audio annotation and retrieval, explains how to formulate multiple-word semantic queries, and describes how to estimate the parameters of the model using the weighted mixture hierarchies algorithm. Section 4 describes the methods for collecting human semantic annotations of music and the creation of the CAL500 data set. Section 5 reports qualitative and quantitative results for annotation and retrieval of music, including retrieval using multiple-word queries. The final section outlines a number of future directions for this research.

Table 1: Qualitative music retrieval results for our SML model. Results are shown for 1-, 2- and 3-word queries.

  Query: Pop
    The Ronettes - Walking in the Rain; The Go-Gos - Vacation; Spice Girls - Stop; Sylvester - You make me feel mighty real; Boo Radleys - Wake Up Boo!
  Query: Female Lead Vocals
    Alicia Keys - Fallin'; Shakira - The One; Christina Aguilera - Genie in a Bottle; Junior Murvin - Police and Thieves; Britney Spears - I'm a Slave 4 U
  Query: Tender
    Crosby Stills and Nash - Guinnevere; Jewel - Enter from the East; Art Tatum - Willow Weep for Me; John Lennon - Imagine; Tom Waits - Time
  Query: Pop AND Female Lead Vocals
    Britney Spears - I'm a Slave 4 U; Buggles - Video Killed the Radio Star; Christina Aguilera - Genie in a Bottle; The Ronettes - Walking in the Rain; Alicia Keys - Fallin'
  Query: Pop AND Tender
    5th Dimension - One Less Bell to Answer; Coldplay - Clocks; Cat Power - He War; Chantal Kreviazuk - Surrounded; Alicia Keys - Fallin'
  Query: Female Lead Vocals AND Tender
    Jewel - Enter from the East; Evanescence - My Immortal; Cowboy Junkies - Postcard Blues; Everly Brothers - Take a Message to Mary; Sheryl Crow - I Shall Believe
  Query: Pop AND Female Lead Vocals AND Tender
    Shakira - The One; Alicia Keys - Fallin'; Evanescence - My Immortal; Chantal Kreviazuk - Surrounded; Dionne Warwick - Walk on by

2. RELATED WORK
A central goal of the music information retrieval community is to create systems that efficiently store and retrieve songs from large databases of musical content [7]. The most common way to store and retrieve music uses metadata such as the name of the composer or artist, the name of the song, or the release date of the album. We consider a more general definition of musical metadata as any non-acoustic representation of a song. This includes genre, song reviews, ratings according to bipolar adjectives (e.g., happy/sad), and purchase sales records. These representations can be used as input to retrieval systems that help users search for music. The drawback of these systems is that they require a novel song to be manually annotated before it can be retrieved. Another retrieval approach, query-by-similarity, takes an audio-based query and measures the similarity between the query and all of the songs in a database [7]. A limitation of query-by-similarity is that it requires a user to have a useful audio exemplar in order to specify a query. For cases in which no such exemplar is available, researchers have developed query-by-humming [8], -beatboxing [9], and -tapping [10]. However, it can be hard, especially for an untrained user, to emulate the tempo, pitch, melody, and timbre well enough to make these systems viable [8].
A natural alternative is query-by-semantic-description (QBSD): describing music with semantically meaningful words. A good deal of research has focused on content-based classification of music by genre [11], emotion [12], and instrumentation [13]. These classification systems effectively 'annotate' music with class labels (e.g., 'blues', 'sad', 'guitar'). The assumption of a predefined taxonomy and the explicit (i.e., binary) labeling of songs into (often mutually exclusive) classes can give rise to a number of problems [14] due to the fact that music is inherently subjective. A more flexible approach [15] considers similarity between songs in a semantic 'anchor space' where each dimension reflects a strength of association to a musical genre.

The QBSD paradigm has been largely influenced by work on the similar task of image annotation. Our system is based on Carneiro et al.'s SML model [1], the state-of-the-art in image annotation. Their approach views semantic annotation as one multi-class problem rather than a set of binary one-vs-all problems. A comparative summary of alternative supervised one-vs-all [4] and unsupervised [2, 3] models for image annotation is presented in [1]. Despite interest within the computer vision community, there has been relatively little work on developing text queries for content-based music information retrieval. One exception is the work of Whitman et al. [16-18]. Our approach differs from theirs in a number of ways. First, they use a set of web documents associated with an artist whereas we use multiple song-specific annotations for each song in our corpus. Second, they take a one-vs-all approach and learn a discriminative classifier (a support vector machine or a regularized least-squares classifier) for each term in the vocabulary. The disadvantage of the one-vs-all approach is that it results in binary decisions for each class. Our generative multi-class approach outputs a natural ranking of words based on a more interpretable probabilistic model [1].

Other QBSD audition systems [19, 20] have been developed for annotation and retrieval of sound effects. Slaney's Semantic Audio Retrieval system [19, 21] creates separate hierarchical models in the acoustic and text space, and then makes links between the two spaces for either retrieval or annotation. Cano and Koppenberger propose a similar approach based on nearest neighbor classification [20]. The drawback of these non-parametric approaches is that inference requires calculating the similarity between a query and every training example. We propose a parametric approach that requires one model evaluation per semantic concept. In practice, the number of semantic concepts is orders of magnitude smaller than the number of potential training data points, leading to a more scalable solution.

3. SEMANTIC MULTI-CLASS LABELING
This section formalizes the related problems of semantic audio annotation and retrieval as supervised, multi-class labeling tasks where each word in a vocabulary represents a class. We learn a word-level (i.e., class-conditional) distribution over an audio feature space for each word in a vocabulary by training only on the songs that are positively associated with that word. This set of word-level distributions is then used to 'annotate' a novel song, resulting in a semantic multinomial distribution.
We can then retrieve songs by ranking them according to their (dis)similarity to a multinomial that is generated from a text-based query. A schematic overview of our model is presented in Figure 1.

[Figure 1: SML model diagram.]

3.1 Problem formulation
Consider a vocabulary V consisting of |V| unique words. Each 'word' w_i \in V is a semantic concept such as 'happy', 'blues', 'electric guitar', 'falsetto', etc. The goal in annotation is to find a set W = {w_1, ..., w_A} of A semantically meaningful words that describe a query song s_q. Retrieval involves rank ordering a set of songs S = {s_1, ..., s_R} given a query W_q. It will be convenient to represent the text data describing each song as an annotation vector y = (y_1, ..., y_{|V|}) where y_i > 0 if w_i has a positive semantic association with the song and y_i = 0 otherwise. The y_i's are called semantic weights since they are proportional to the strength of the semantic association between a word and a song. If the semantic weights are mapped to {0, 1}, then they can be interpreted as class labels. We represent the audio content of a song s as a set X = {x_1, ..., x_T} of T real-valued feature vectors, where each vector x_t represents features extracted from a short segment of the audio and T depends on the length of the song. Our data set D is a collection of song-annotation pairs D = {(X_1, y_1), ..., (X_{|D|}, y_{|D|})}.

3.2 Annotation
Annotation can be thought of as a multi-class, multi-label classification problem in which each word w_i \in V represents a 'class' and the goal is to 'label' a song with a subset of words. Our approach involves modeling a word-level distribution over audio features, P(x|i), i \in {1, ..., |V|}, for each word w_i \in V. Given a song represented by the set of audio feature vectors X = {x_1, ..., x_T}, we use Bayes' rule to calculate the posterior probability of each word in the vocabulary, given the audio features:

    P(i|X) = \frac{P(X|i)\,P(i)}{P(X)},    (1)

where P(i) is the prior probability that word w_i will appear in an annotation. If we assume that the feature vectors in X are conditionally independent given word w_i, then

    P(i|X) = \frac{\left[\prod_{t=1}^{T} P(x_t|i)\right] P(i)}{P(X)}.    (2)

Note that this naïve Bayes assumption implies that there is no temporal relationship between audio feature vectors, given word i. While this assumption of conditional independence is unrealistic, attempting to model the temporal interaction between feature vectors may be infeasible due to computational complexity and data sparsity. We assume a uniform word prior, P(i) = 1/|V|, for all i = 1, ..., |V| since, in practice, the T factors in the product will dominate the word prior when calculating the numerator of Equation 2. We estimate the song prior P(X) by \sum_{v=1}^{|V|} P(X|v) P(v) and arrive at our final annotation equation:

    P(i|X) = \frac{\prod_{t=1}^{T} P(x_t|i)}{\sum_{v=1}^{|V|} \prod_{t=1}^{T} P(x_t|v)}.    (3)

Note that by assuming a uniform word prior, the 1/|V| factor cancels out of the equation. Using the word-level distributions (P(x|i), i = 1, ..., |V|) and Bayes' rule, we use Equation 3 to calculate the parameters of a semantic multinomial distribution over the vocabulary. That is, each song in our database is compactly represented as a vector of posterior probabilities p = (p_1, ..., p_{|V|}) in a 'semantic space', where p_i = P(i|X) and \sum_i p_i = 1. To annotate a song with the A best words, we use the word-level models to generate the song's semantic distribution and then choose the A peaks of the multinomial distribution, i.e., the A words with maximum posterior probability.
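The annotation step of Equation 3 is straightforward to sketch in code. The following is a minimal illustration (ours, not the authors' implementation), assuming the |V| word-level densities are available as fitted scikit-learn GaussianMixture objects and that a song is a (T, F) array of feature vectors; the names `word_models` and `song_features` are hypothetical. Working in log space avoids the numerical underflow that a literal product of thousands of likelihoods would cause.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.mixture import GaussianMixture

def annotate(song_features, word_models, top_a=8):
    """Compute the semantic multinomial of Equation 3 for one song.

    song_features : (T, F) array of audio feature vectors x_1..x_T.
    word_models   : list of |V| fitted GaussianMixture objects, one per word,
                    giving the class-conditional densities P(x | i).
    Returns (posterior, top_words): the |V|-dim semantic multinomial
    p_i = P(i | X) and the indices of its A largest peaks.
    """
    # log P(X | i) = sum_t log P(x_t | i) under the naive Bayes assumption.
    log_lik = np.array([gmm.score_samples(song_features).sum()
                        for gmm in word_models])
    # The uniform word prior cancels; normalize with log-sum-exp for stability.
    posterior = np.exp(log_lik - logsumexp(log_lik))
    top_words = np.argsort(posterior)[::-1][:top_a]
    return posterior, top_words
```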
3.3 Retrieval
For retrieval, we first annotate our database by calculating a semantic multinomial for each song. When a user enters a query, we construct a 'query multinomial' distribution, parameterized by the vector q = (q_1, ..., q_{|V|}), by assigning q_i = C if word w_i is in the text-based query, and q_i = \epsilon, where 1 \gg \epsilon > 0, otherwise. We then normalize q, making its elements sum to unity so that it correctly parameterizes a multinomial distribution. In practice, we set C = 1 and \epsilon = 10^{-6}. However, we should stress that C need not be a constant; rather, it could be a function of the query string. For example, we may want to give more weight to words that appear earlier in the query string, as is commonly done by Internet search engines for retrieving web documents. Examples of a semantic query multinomial and the retrieved song multinomials are given in Figure 2.

[Figure 2: Multinomial distributions over the 159-word vocabulary. The top distribution represents the query multinomial for the three-word query "Tender, Pop, Female Lead Vocals" presented in Table 1. The next three distributions are the semantic multinomials for the top three retrieved songs: 1. Shakira - The One, 2. Alicia Keys - Fallin', 3. Evanescence - My Immortal.]

Once we have a query multinomial, we rank all the songs in our database by the Kullback-Leibler (KL) divergence between the query multinomial q and each semantic multinomial. The KL divergence between q and a semantic multinomial p is given by [22]:

    KL(q \| p) = \sum_{i=1}^{|V|} q_i \log \frac{q_i}{p_i},    (4)

where the query distribution serves as the 'true' distribution. Since q_i = \epsilon is effectively zero for all words that do not appear in the query string, a one-word query w_i reduces to ranking by the i-th parameter of the semantic multinomials. For a multiple-word query, we only need to calculate one term in Equation 4 per word in the query. This leads to a very efficient and scalable approach for music retrieval in which the majority of the computation involves sorting the |D| scalar KL divergences between the query multinomial and each song in the database.
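A small sketch of the retrieval step described above (our illustration, not code from the paper): build the query multinomial with C = 1 and epsilon = 1e-6, then rank songs by the KL divergence of Equation 4. The array names are hypothetical; `semantic_multinomials` is assumed to hold one semantic multinomial per song, as produced during annotation.

```python
import numpy as np

def query_multinomial(query_word_idx, vocab_size, c=1.0, eps=1e-6):
    """Build the normalized query multinomial q of Section 3.3."""
    q = np.full(vocab_size, eps)
    q[query_word_idx] = c
    return q / q.sum()

def rank_by_kl(q, semantic_multinomials):
    """Rank songs by KL(q || p) (Equation 4); smaller divergence = better match.

    semantic_multinomials : (n_songs, |V|) array of per-song posteriors p.
    """
    kl = np.sum(q * (np.log(q) - np.log(semantic_multinomials)), axis=1)
    return np.argsort(kl)   # song indices, most relevant first

# Hypothetical usage for a three-word query over the 159-word vocabulary:
# ranking = rank_by_kl(query_multinomial([i_tender, i_pop, i_female], 159), P)
```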
3.4 Parameter Estimation
For each word w_i \in V, we learn the parameters of the word-level (i.e., class-conditional) distribution, P(x|i), using the audio features from all songs that have a positive association with word w_i. Each distribution is modeled with an R-component Gaussian Mixture Model (GMM) parameterized by {\pi_r, \mu_r, \Sigma_r} for r = 1, ..., R. The word-level distribution for word w_i is given by:

    P(x|i) = \sum_{r=1}^{R} \pi_r \, \mathcal{N}(x|\mu_r, \Sigma_r),

where the \pi_r are the mixture weights and \mathcal{N}(\cdot|\mu, \Sigma) is a multivariate Gaussian distribution with mean \mu and covariance matrix \Sigma. In this work, we consider only diagonal covariance matrices since using full covariance matrices can cause models to overfit the training data while scalar covariances do not provide adequate generalization. The resulting set of |V| distributions each has O(R \cdot F) parameters, where F is the dimension of feature vector x.

Carneiro et al. [1] consider three parameter estimation techniques for learning an SML model: direct estimation, model averaging estimation, and mixture hierarchies estimation. The techniques are similar in that, for each word-level distribution, they use the expectation-maximization (EM) algorithm for fitting a mixture of Gaussians to training data. They differ in how they break down the problem of parameter estimation into subproblems and then merge these results to produce a final density estimate. Carneiro et al. found that mixture hierarchies estimation was not only the most scalable technique, but it also resulted in the density estimates that produced the best image annotation and retrieval results. We confirmed these findings for music annotation and retrieval during some initial experiments (not reported here).

The formulation in [1] assumes that the semantic information about images is represented by binary annotation vectors. This formulation is natural for images where the majority of words are associated with relatively 'objective' semantic concepts such as 'bear', 'building', and 'sunset'. Music is more 'subjective' in that two listeners may not always agree that a song is representative of a certain genre or generates the same emotional response. Even seemingly objective concepts, such as those related to instrumentation, may result in differences of opinion, when, for example, a digital synthesizer is used to emulate a traditional instrument. To this end, we believe that a real-valued annotation vector of associated 'strengths of agreement' is a more natural semantic representation. We now extend mixture hierarchies estimation to handle real-valued semantic weights, resulting in the weighted mixture hierarchies algorithm.

Consider the set of |D| song-level GMM distributions (each with K mixture components) that are estimated using the feature vectors that are extracted from each song. We can estimate a word-level distribution with R components using an extension of the EM algorithm:

E-step: Compute the responsibilities of each word-level component, r, to a song-level component, k, from song d:

    h_r^{(d),k} = [y_d]_i \cdot \frac{\left[\mathcal{N}(\mu_k^{(d)}|\mu_r, \Sigma_r)\, e^{-\frac{1}{2}\mathrm{Tr}\{(\Sigma_r)^{-1} \Sigma_k^{(d)}\}}\right]^{N \pi_k^{(d)}} \pi_r}{\sum_l \left[\mathcal{N}(\mu_k^{(d)}|\mu_l, \Sigma_l)\, e^{-\frac{1}{2}\mathrm{Tr}\{(\Sigma_l)^{-1} \Sigma_k^{(d)}\}}\right]^{N \pi_k^{(d)}} \pi_l},

where N is a user-defined parameter. In practice, we set N = K so that E[\pi_k^{(d)} N] = 1.

M-step: Update the word-level distribution parameters:

    \pi_r^{new} = \frac{\sum_{(d),k} h_r^{(d),k}}{W \cdot K}, \quad \text{where } W = \sum_{d=1}^{|D|} [y_d]_i,

    \mu_r^{new} = \sum_{(d),k} z_r^{(d),k} \mu_k^{(d)}, \quad \text{where } z_r^{(d),k} = \frac{h_r^{(d),k}}{\sum_{(d),k} h_r^{(d),k}},

    \Sigma_r^{new} = \sum_{(d),k} z_r^{(d),k} \left[\Sigma_k^{(d)} + (\mu_k^{(d)} - \mu_r^{new})(\mu_k^{(d)} - \mu_r^{new})^T\right].

From a generative perspective, a song-level distribution is generated by sampling mixture components from the word-level distribution. The observed audio features are then samples from the song-level distribution. Note that the number of parameters for the word-level distribution is the same as the number of parameters resulting from direct estimation (direct estimation uses the union of the sets of audio feature vectors from each relevant song to estimate a word-level GMM using the standard EM algorithm). We have essentially replaced one computationally expensive (and often impossible) run of the standard EM algorithm with at most |D| computationally inexpensive runs and one run of the mixture hierarchies EM. In practice, mixture hierarchies EM requires about the same computation time as one run of standard EM for a song-level model. However, the main benefit of using the MH-EM algorithm is that it provides a form of regularization by first representing each song by a 'smooth' distribution, rather than a finite set of point estimates, before learning a word-level distribution. Our formulation differs from that derived in [23] in that the responsibility, h_r^{(d),k}, is multiplied by the semantic weight [y_d]_i between word w_i and song s_d. This weighted mixture hierarchies algorithm reduces to the standard formulation when the semantic weights are either 0 or 1. The semantic weights can be interpreted as a relative measure of importance of each training data point. That is, if one data point has a weight of 2 and all others have a weight of 1, it is as though the first data point actually appeared twice in the training set.
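To make the weighted mixture hierarchies updates concrete, here is one possible numpy sketch of a single E/M iteration for diagonal-covariance GMMs. It is our illustration under the stated assumptions (song-level parameters packed into arrays, N = K, no numerical safeguards), not the authors' implementation; all variable names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import softmax

def weighted_mh_em_step(w, m, s, y, pi, mu, sigma):
    """One E/M iteration of weighted mixture hierarchies EM (diagonal covariances).

    Song-level GMMs: w (D, K) mixture weights, m (D, K, F) means, s (D, K, F) diag covs.
    y (D,):          semantic weights [y_d]_i for the current word w_i.
    Word-level GMM:  pi (R,), mu (R, F), sigma (R, F); updated copies are returned.
    """
    D, K, F = m.shape
    R = pi.shape[0]
    N = K                                   # virtual sample size; N = K as in the text
    h = np.zeros((D, K, R))

    # E-step: responsibility of word-level component r for song-level component (d, k).
    for d in range(D):
        for k in range(K):
            log_bracket = np.array([
                multivariate_normal.logpdf(m[d, k], mu[r], np.diag(sigma[r]))
                - 0.5 * np.sum(s[d, k] / sigma[r])   # -1/2 Tr{(Sigma_r)^-1 Sigma_k^(d)}
                for r in range(R)])
            logits = N * w[d, k] * log_bracket + np.log(pi)
            h[d, k] = y[d] * softmax(logits)

    # M-step: update the word-level parameters.
    W = y.sum()
    pi_new = h.sum(axis=(0, 1)) / (W * K)
    z = h / h.sum(axis=(0, 1))                              # z_r^(d),k
    mu_new = np.einsum('dkr,dkf->rf', z, m)
    diff_sq = (m[:, :, None, :] - mu_new[None, None]) ** 2  # (D, K, R, F)
    sigma_new = np.einsum('dkr,dkrf->rf', z, s[:, :, None, :] + diff_sq)
    return pi_new, mu_new, sigma_new
```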
4. THE CAL500 MUSIC DATA SET
Perhaps the easiest way to collect semantic information about a song is to mine text from web pages related to the song, album or artist [18, 24]. Whitman et al. collect a large number of webpages related to the artist when attempting to annotate individual songs [18]. One drawback of this methodology is that it produces the same training annotation vector for all songs by a single artist. This is a problem for many artists, such as Paul Simon and Madonna, who have produced an acoustically diverse set of songs over the course of their careers. In previous work, we take a more song-specific approach by text-mining song reviews written by expert music critics [24]. The drawback of this technique is that critics do not explicitly make decisions about the relevance of a given word when writing about songs and/or artists. In both works, it is evident that the semantic labels are a noisy version of an already problematic 'subjective ground truth'. To address the shortcomings of noisy semantic data mined from text documents, we attempt to collect a 'clean' set of semantic labels by asking human listeners to explicitly label songs with acoustically-relevant words. In an attempt to overcome the problems arising from the inherent subjectivity involved in music annotation, we require that each song be annotated by multiple listeners.

4.1 Semantic Representation
Our goal is to collect training data from human listeners that reflects the strength of association between words and songs. We designed a survey that listeners used to evaluate songs in our music corpus. The corpus is a selection of 500 'western popular' songs composed within the last 50 years by 500 different artists, chosen to cover a large amount of acoustic variation while still representing some familiar genres and popular artists. In the survey, we considered 135 musically-relevant concepts spanning six semantic categories: 29 instruments were annotated as present in the song or not; 22 vocal characteristics were annotated as relevant to the singer or not; 36 genres, a subset of the Codaich genre list [25], were annotated as relevant to the song or not; 18 emotions, found by Skowronek et al. [26] to be both important and easy to identify, were rated on a scale from one to three (e.g., "not happy", "neutral", "happy"); 15 song concepts describing the acoustic qualities of the song, artist and recording (e.g., tempo, energy, sound quality); and 15 usage terms from [27] (e.g., "I would listen to this song while driving, sleeping, etc.").

We paid 66 undergraduate students to annotate the CAL500 corpus with semantic concepts from our vocabulary. Participants were rewarded $10 for a one-hour annotation block spent listening to MP3-encoded music through headphones in a university computer laboratory. The annotation interface was an HTML form loaded in a web browser, requiring participants to simply click on check boxes and radio buttons. The form was not presented during the first 30 seconds of playback to encourage undistracted listening. Listeners could advance and rewind the music, and the song would repeat until all semantic categories were annotated. Each annotation took about 5 minutes, and most participants reported that the listening and annotation experience was enjoyable. We collected at least 3 semantic annotations for each of the 500 songs in our music corpus and a total of 1708 annotations.

We expanded the set of 135 survey concepts to a set of 237 'words' by mapping all bipolar concepts to two individual words. For example, 'Energy Level' gets mapped to 'Low Energy' and 'High Energy'. We are left with a collection of human annotations where each annotation is a vector of numbers expressing the response of a human listener to a semantic keyword. For each word, the annotator has supplied a response of +1 or -1 if the user believes the song is or is not indicative of the word, or 0 if unsure. We take all the human annotations for each song and compact them into a single annotation vector by observing the level of agreement over all annotators. Our final semantic weights are

    [y]_i = \max\left(0, \; \frac{\#(\text{Positive Votes}) - \#(\text{Negative Votes})}{\#(\text{Annotations})}\right).

For example, for a given song, if four listeners have labeled a concept w_i with +1, +1, 0, -1, then [y]_i = 1/4. For evaluation purposes, we also create 'ground truth' binary annotation vectors. We generate binary vectors by labeling a song with a word if a minimum of two people express an opinion and there is at least 80% agreement between all listeners. We prune all concepts that are represented by fewer than eight songs. This reduces our vocabulary from 237 to 159 words.
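The vote-to-weight mapping above is easy to express in code. The sketch below is ours, not the authors': it implements the semantic-weight formula directly, and the binary 'ground truth' rule is one plausible reading of the 80%-agreement criterion, counting agreement only among listeners who expressed an opinion.

```python
import numpy as np

def semantic_weight(votes):
    """Collapse listener votes (+1, 0, -1) for one song/word pair into [y]_i."""
    votes = np.asarray(votes)
    pos, neg = np.sum(votes == 1), np.sum(votes == -1)
    return max(0.0, (pos - neg) / len(votes))

def binary_label(votes, min_opinions=2, agreement=0.8):
    """'Ground truth' label: at least two opinions and >= 80% agreement (one reading)."""
    votes = np.asarray(votes)
    opinions = votes[votes != 0]
    if len(opinions) < min_opinions:
        return 0
    return int(np.mean(opinions == 1) >= agreement)

semantic_weight([1, 1, 0, -1])   # 0.25, matching the worked example in the text
```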
4.2 Musical Representation
We represent the audio with a time series of delta cepstrum feature vectors. A time series of Mel-frequency cepstral coefficient (MFCC) [28] vectors is extracted by sliding a half-overlapping short-time window (12 msec) over the song's digital audio file. A delta cepstrum vector is calculated by appending the instantaneous first and second derivatives of each MFCC to the vector of MFCCs. We use the first 13 MFCCs, resulting in about 10,000 39-dimensional feature vectors per minute of audio content. The reader should note that the SML model (a set of GMMs) ignores the temporal dependencies between adjacent feature vectors within the time series. We find that randomly sub-sampling the set of delta cepstrum feature vectors so that each song is represented by 10,000 feature vectors reduces the computation time for parameter estimation and inference without sacrificing much overall performance.
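As a rough illustration of the feature pipeline in Section 4.2, the following sketch uses librosa to compute 13 MFCCs over roughly 12 ms half-overlapping windows and appends first- and second-order deltas to obtain 39-dimensional vectors. The exact framing, sample rate, and pre-processing used by the authors are not specified here, so treat the parameter choices (and the librosa-based approach itself) as assumptions.

```python
import numpy as np
import librosa

def delta_cepstrum_features(path, sr=22050, n_mfcc=13, max_vectors=10000):
    """Extract 39-D delta-cepstrum vectors (MFCCs plus first and second derivatives)."""
    y, sr = librosa.load(path, sr=sr)
    win = int(0.012 * sr)          # ~12 ms analysis window
    hop = win // 2                 # half-overlapping frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=win, hop_length=hop)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc, order=1),
                       librosa.feature.delta(mfcc, order=2)]).T   # shape (T, 39)
    # Randomly sub-sample to at most 10,000 vectors per song, as in Section 4.2.
    if len(feats) > max_vectors:
        feats = feats[np.random.choice(len(feats), max_vectors, replace=False)]
    return feats
```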
5. MODEL EVALUATION
In this section, we qualitatively and quantitatively evaluate our SML model for music annotation and retrieval. To our knowledge, there has been little previous work on these problems [16-18, 24]. It is hard to compare our performance against the work of Whitman et al. since their work focuses on vocabulary selection, while the results in [24] are calculated using a different model on a different data set of words and songs. Instead, we evaluate our system against two baselines: a 'random' baseline and a 'human' baseline. The random baseline is a system that samples words (without replacement) from a multinomial distribution parameterized by the word prior distribution, P(i) for i = 1, ..., |V|, estimated using the observed word counts from the training set. Intuitively, this prior stochastically generates annotations from a pool of the words used most frequently in the training set. We can also estimate the performance of a human on the annotation task. This is done by holding out a single human annotation from each of the 142 songs in the CAL500 data set that had more than 3 annotations. To evaluate performance, we compare this human's semantic description of a song to the "ground truth" labels obtained from the remaining annotations for that song. We run a large number of simulations by randomly holding out different human annotations.

5.1 Annotation Performance
Given an SML model, we can effectively 'annotate' a novel song by estimating a semantic multinomial using Equation 3. Placing the most likely words into a natural language context demonstrates how our annotation system can be used to generate 'automatic music reviews', as illustrated in Table 2. It should be noted that in order to create these reviews, we made use of the fact that the words in our vocabulary can loosely be organized into semantic categories such as genre, instrumentation, vocal characteristics, emotions, and song usages.

Quantitative annotation performance is measured using mean per-word precision and recall [1, 2]. First, we annotate each test set song with a fixed number of words (e.g., A = 8) from our vocabulary of 159 words. For each word w in our vocabulary, |w_H| is the number of songs that have word w in the "ground truth" annotation, |w_A| is the number of songs that our model annotates with word w, and |w_C| is the number of "correct" songs that have word w both in the ground truth annotation and in the model's annotation. Per-word recall is |w_C|/|w_H| and per-word precision is |w_C|/|w_A|. While trivial models can easily maximize one of these measures (e.g., by labeling all songs with a certain word or, instead, none of them), achieving excellent precision and recall simultaneously requires a truly valid model. Mean per-word recall and precision is the average of these ratios over all the words in our vocabulary. It should be noted that these metrics range between 0.0 and 1.0, but one may be upper bounded by a value less than 1.0 if the number of words that appear in the corpus differs from the number of words that are output by our system. For example, if our system outputs 4000 words to annotate the 500 test songs for which the ground truth contains 6430 words, mean recall will be upper-bounded by a value less than one. The exact upper bounds (denoted "UpperBnd" in Table 3) for recall and precision depend on the relative frequencies of each word in the vocabulary and can be calculated empirically using a simulation where the model output exactly matches the ground truth. It may seem more intuitive to use per-song precision and recall, rather than the per-word metrics. However, per-song metrics can lead to artificially good results if a system is good at predicting the few common words relevant to a large group of songs (e.g., "rock") and bad at predicting the many rare words in the vocabulary. Our goal is to find a system that is good at predicting all the words in our vocabulary. In practice, using the 8 best words to annotate each song, our SML model outputs 143 of the 159 words in the vocabulary at least once.
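Mean per-word precision and recall can be computed directly from binary song-by-word matrices. The sketch below is our illustration of the metric definitions above; how words that the model never outputs are handled (here scored as zero precision) is an assumption, since the text does not say.

```python
import numpy as np

def mean_per_word_precision_recall(annotations, ground_truth):
    """Mean per-word precision and recall (Section 5.1).

    annotations, ground_truth : (n_songs, |V|) binary matrices; a 1 in
    `annotations` means the model used that word for that song, a 1 in
    `ground_truth` means the word is in the song's ground-truth labels.
    """
    wa = annotations.sum(axis=0)                       # |w_A| per word
    wh = ground_truth.sum(axis=0)                      # |w_H| per word
    wc = (annotations * ground_truth).sum(axis=0)      # |w_C| per word
    precision = np.where(wa > 0, wc / np.maximum(wa, 1), 0.0)
    recall = np.where(wh > 0, wc / np.maximum(wh, 1), 0.0)
    return precision.mean(), recall.mean()
```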
Table 3 presents quantitative results for music annotation. The results are generated using ten-fold cross validation. That is, we partition the CAL500 data set into ten sets of fifty songs and estimate the semantic multinomials for the songs in each set with an SML model that has been trained using the songs in the other nine sets. We then calculate the per-word precision and recall for each word and average over the vocabulary. The quantitative results demonstrate that the SML model significantly outperforms the random baseline and is comparable to the human baseline. This does not mean that our model is approaching a 'glass ceiling', but rather, it illustrates the point that music annotation is a subjective task since an individual can produce an annotation that is very different from the annotation derived from a population of listeners. This highlights the need for incorporating semantic weights when designing an automatic music annotation and retrieval system.

Table 3: Music annotation results: SML model learned from K = 8 component song-level GMMs, and composed of R = 16 component word-level GMMs. Each CAL500 song is annotated with A = 8 words from a vocabulary of |V| = 159 words.

  Model      Precision   Recall
  Random     0.169       0.052
  Human      0.342       0.110
  UpperBnd   1.000       0.302
  SML        0.312       0.142

5.2 Retrieval Performance
We evaluate every one-, two-, and three-word text-based query drawn from our vocabulary of 159 words. First, we create query multinomials for each query string as described in Section 3.3. For each query multinomial, we rank order the 500 songs by the KL divergence between the query multinomial and the semantic multinomials generated during annotation. (As described in the previous subsection, the semantic multinomials are generated from a test set using cross-validation and can be considered representative of a novel test song.) Table 1 shows the top 5 songs retrieved for a number of text-based queries. In addition to being (mostly) accurate, the reader should note that queries, such as 'Tender' and 'Female Vocals', return songs that span different genres and are composed using different instruments. As more words are added to the query string, note that the songs returned are representative of all the semantic concepts in each of the queries.

Table 2: Automatically generated music reviews. Words in bold face are output by our system.

White Stripes - Hotel Yorba: This is a brit poppy, alternative song that is not calming and not mellow. It features male vocal, drum set, distorted electric guitar, a nice distorted electric guitar solo, and screaming, strong vocals. It is a song with high energy and with an electric texture that you might like listen to while driving.

Miles Davis - Blue in Green: This is a jazzy, folk song that is calming and not arousing. It features acoustic guitar, saxophone, piano, a nice piano solo, and emotional, low-pitched vocals. It is a song with slow tempo and with low energy that you might like listen to while reading.

Dr. Dre (feat. Snoop Dogg) - Nuthin' but a 'G' thang: This is a dance poppy, hip-hop song that is arousing and exciting. It features drum machine, backing vocals, male vocal, a nice acoustic guitar solo, and rapping, strong vocals. It is a song that is very danceable and with a heavy beat that you might like listen to while at a party.

Depeche Mode - World In My Eyes: This is a funky, dance pop song that is arousing and not tender. It features male vocal, synthesizer, drum machine, a nice male vocal solo, and altered with effects, strong vocals. It is a song with a synthesized texture and that was recorded in studio that you might like listen to while at a party.
Table 4: Music retrieval results for 1-, 2-, and 3-word queries. See Table 3 for SML model parameters.

  Query Length                 Model    MeanAP   MeanAROC
  1-word (159/159)             Random   0.173    0.500
                               SML      0.307    0.705
  2-words (4,658/15,225)       Random   0.076    0.500
                               SML      0.164    0.723
  3-words (50,471/1,756,124)   Random   0.051    0.500
                               SML      0.120    0.730

By considering the "ground truth" target for a multiple-word query as all the songs that are associated with all the words in the query string, we can quantitatively evaluate retrieval performance. We calculate the mean average precision (MeanAP) [2] and the mean area under the receiver operating characteristic (ROC) curve (MeanAROC) for each query for which there is a minimum of 8 songs present in the ground truth. Average precision is found by moving down our ranked list of test songs and averaging the precisions at every point where we correctly identify a new song. An ROC curve is a plot of the true positive rate as a function of the false positive rate as we move down this ranked list of songs. The area under the ROC curve (AROC) is found by integrating the ROC curve and is upper bounded by 1.0. Random guessing in a retrieval task results in an AROC of 0.5. Comparison to human performance is not possible for retrieval since an individual's annotations do not provide a ranking over all retrievable songs.

Columns 3 and 4 of Table 4 show MeanAP and MeanAROC found by averaging each metric over all testable one-, two- and three-word queries. Column 1 of Table 4 indicates the proportion of all possible multiple-word queries that actually have 8 or more songs in the ground truth against which we test our model's performance. As with the annotation results, we see that our model significantly outperforms the random baseline. As expected, MeanAP decreases for multiple-word queries due to the increasingly sparse ground truth annotations (since there are fewer relevant songs per query). However, an interesting finding is that the MeanAROC actually increases with additional query terms, indicating that our model can successfully integrate information from multiple words.

5.3 Comments
The qualitative annotation and retrieval results in Tables 1 and 2 indicate that our system produces sensible semantic annotations of a song and retrieves relevant songs, given a text-based query. Using the explicitly annotated music data set described in Section 4, we demonstrate a significant improvement in performance over similar models trained using weakly-labeled text data mined from the web [24] (e.g., music retrieval MeanAROC increases from 0.61 to 0.71). The CAL500 data set, automatic annotations of all songs, and retrieval results for each word can be found at the UCSD Computer Audition Lab website (http://cosmal.ucsd.edu/cal). Our results are comparable to state-of-the-art content-based image annotation systems [1], which report mean per-word recall and precision scores of about 0.25. However, the relative objectivity of the tasks in the two domains, as well as the vocabulary, the quality of annotations, the features, and the amount of data, differ greatly between our audio annotation system and existing image annotation systems, making any direct comparison dubious at best.
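For completeness, the retrieval metrics of Section 5.2 can be approximated with standard scikit-learn routines, scoring songs by negative KL divergence so that larger scores mean more relevant. This is our sketch, not the authors' evaluation code; sklearn's average_precision_score is a step-wise approximation of the average precision described above, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(query_multinomial, semantic_multinomials, relevant):
    """Average precision and ROC area for one query (Section 5.2).

    relevant : binary vector marking songs associated with all words in the query.
    """
    q = query_multinomial
    scores = -np.sum(q * (np.log(q) - np.log(semantic_multinomials)), axis=1)
    return average_precision_score(relevant, scores), roc_auc_score(relevant, scores)
```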
6. DISCUSSION AND FUTURE WORK
We have collected the CAL500 data set of cleanly annotated songs and offer it to researchers who wish to work on semantic annotation and retrieval of music. This data set may be considered small in comparison to standard data sets that have been developed by the text-mining and computer vision communities. However, by developing a useful and efficient parameter estimation algorithm (weighted mixture hierarchies EM), we have shown how the CAL500 data set can be used to train a query-by-semantic-description system for music information retrieval that significantly outperforms the system presented in [24]. While direct comparison is impossible since different vocabularies and music corpora are used, both qualitative and quantitative results suggest that the end user experience has been greatly improved. We have also shown that compactly representing a song as a semantic multinomial distribution over a vocabulary is useful for both annotation and retrieval. More specifically, by representing a multi-word query string as a multinomial distribution, the KL divergence between this query multinomial and the semantic multinomials provides a natural and computationally inexpensive way to rank order songs in a database. The semantic multinomial representation is also useful for related music information tasks such as 'query-by-semantic-example' [15, 29].

All qualitative and quantitative results reported are based on one SML model (K = 8, R = 16) trained using the weighted mixture hierarchies EM algorithm. Though not reported, we have conducted extensive parameter testing by varying the number of song-level mixture components (K), varying the number of word-level mixture components (R), exploring other parameter estimation techniques (direct estimation, model averaging, standard mixture hierarchies EM [1]), and using alternative audio features (such as dynamic MFCCs [11]). Some of these models show comparable performance for some evaluation metrics. For example, dynamic MFCC features tend to produce better annotations, but worse retrieval results, than those based on the delta cepstrum features reported here. In all cases, it should be noted that we use a very basic frame-based audio feature representation. We can imagine using alternative representations, such as those that attempt to model higher-level notions of harmony, rhythm, melody, and timbre. Similarly, our probabilistic SML model (a set of GMMs) is one of many models that have been developed for image annotation [2, 3]. Future work may involve adapting other models for the task of audio annotation and retrieval. In addition, one drawback of our current model is that, by using GMMs, we ignore all medium-term (> 1 second) and long-term (entire song) information that can be extracted from a song. Future research will involve exploring models, such as hidden Markov models, that explicitly model the longer-term temporal aspects of music. Lastly, we are currently exploring a more scalable data collection approach that involves using web-based games (Listen Game: www.listengame.org) to collect semantic information about music [30]. This technique, referred to as human computation, has been successfully used to collect semantic information about images [31].

Acknowledgements
We would like to thank A. Chan, A. Cont, G. W. Cottrell, S. Dubnov, C. Elkan, O. Lang, L. Saul, and N. Vasconcelos for their helpful comments. This work is supported by NSF IGERT DGE-0333451 and NSF grant DMS-MSPA 0625409.
7. REFERENCES
[1] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos. Supervised learning of semantic classes for image annotation and retrieval. IEEE PAMI, 29(3):394-410, 2007.
[2] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. IEEE CVPR, 2004.
[3] D. M. Blei and M. I. Jordan. Modeling annotated data. ACM SIGIR, 2003.
[4] D. Forsyth and M. Fleck. Body plans. IEEE CVPR, 1997.
[5] International conferences of music information retrieval. http://www.ismir.net/.
[6] MIREX 2005. Music information retrieval evaluation exchange. http://www.music-ir.org/mirex2005.
[7] M. Goto and K. Hirata. Recent studies on music information processing. Acoustical Science and Technology, 25(4):419-425, 2004.
[8] R. B. Dannenberg and N. Hu. Understanding search performance in query-by-humming systems. ISMIR, 2004.
[9] A. Kapur, M. Benning, and G. Tzanetakis. Query by beatboxing: Music information retrieval for the DJ. ISMIR, 2004.
[10] G. Eisenberg, J. M. Batke, and T. Sikora. BeatBank - an MPEG-7 compliant query by tapping system. Audio Engineering Society Convention, 2004.
[11] M. F. McKinney and J. Breebaart. Features for audio and music classification. ISMIR, 2003.
[12] T. Li and G. Tzanetakis. Factors in automatic musical genre classification of audio signals. IEEE WASPAA, 2003.
[13] S. Essid, G. Richard, and B. David. Inferring efficient hierarchical taxonomies for music information retrieval tasks: Application to musical instruments. ISMIR, 2005.
[14] F. Pachet and D. Cazaly. A taxonomy of musical genres. RIAO, 2000.
[15] A. Berenzweig, B. Logan, D. Ellis, and B. Whitman. A large-scale evaluation of acoustic and subjective music-similarity measures. Computer Music Journal, 2004.
[16] B. Whitman. Learning the meaning of music. PhD thesis, Massachusetts Institute of Technology, 2005.
[17] B. Whitman and D. Ellis. Automatic record reviews. ISMIR, 2004.
[18] B. Whitman and R. Rifkin. Musical query-by-description as a multiclass learning problem. IEEE Workshop on Multimedia Signal Processing, 2002.
[19] M. Slaney. Semantic-audio retrieval. IEEE ICASSP, 2002.
[20] P. Cano and M. Koppenberger. Automatic sound annotation. IEEE Workshop on Machine Learning for Signal Processing, 2004.
[21] M. Slaney. Mixtures of probability experts for audio retrieval and indexing. IEEE Multimedia and Expo, 2002.
[22] T. Cover and J. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.
[23] N. Vasconcelos. Image indexing with mixture hierarchies. IEEE CVPR, pages 3-10, 2001.
[24] D. Turnbull, L. Barrington, and G. Lanckriet. Modelling music and words using a multi-class naïve Bayes approach. ISMIR, 2006.
[25] C. McKay, D. McEnnis, and I. Fujinaga. A large publicly accessible prototype audio database for music research. ISMIR, 2006.
[26] J. Skowronek, M. McKinney, and S. van de Par. Ground-truth for automatic music mood classification. ISMIR, 2006.
[27] X. Hu, J. S. Downie, and A. F. Ehmann. Exploiting recommended usage metadata: Exploratory analyses. ISMIR, 2006.
[28] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
[29] L. Barrington, A. Chan, D. Turnbull, and G. Lanckriet. Audio information retrieval using semantic similarity. IEEE ICASSP, 2007.
[30] D. Turnbull, R. Liu, L. Barrington, D. Torres, and G. Lanckriet. Using games to collect semantic information about music. Technical report, 2007.
[31] L. von Ahn. Games with a purpose.
IEEE Computer Magazine, 39(6):92-94, 2006.