Towards Efficient Automated Singer Identification in Large Music Databases

Jialie Shen(1), John Shepherd(1), Bin Cui(2), Kian-Lee Tan(3)
(1) School of Computer Sci. & Eng., The University of New South Wales, Sydney, Australia ({jls, jas}@cse.unsw.edu.au)
(2) Dept. of Computer Science & National Lab on Machine Perception, Peking University, China (bin.cui@pku.edu.cn)
(3) School of Computing, National University of Singapore, 3 Science Drive 2, Singapore (tankl@comp.nus.edu.sg)

ABSTRACT
Automated singer identification is important in organising, browsing and retrieving data in large music databases. In this paper, we propose a novel scheme, called Hybrid Singer Identifier (HSI), for automated singer recognition. Using a hybrid architecture, HSI can effectively exploit multiple low-level features extracted from both vocal and non-vocal music segments to enhance the identification process, and it builds profiles of individual singer characteristics based on statistical mixture models. Extensive experiments conducted on a large music database demonstrate the superiority of our method over state-of-the-art approaches.

Categories and Subject Descriptors
H.3 [Information Systems]: Content Analysis and Indexing, Information Storage and Retrieval; J.5 [Arts and Humanities]: Music

General Terms
Algorithms, Experimentation, Measurement, Evaluation

Keywords
Music Retrieval, Singer Identification

1. INTRODUCTION
With the explosive growth of online music repositories, techniques for content-based music retrieval are receiving increasing attention in the multimedia database community. Consequently, this has become an attractive research topic, and many techniques have been developed to support automatic classification or recognition of music based on instrument, genre and other characteristics [11, 12, 14, 23]. With such techniques, users are provided with powerful functions for browsing and searching musical content in a large music database. (This research was partly supported by the Australian Research Council, ARC Discovery Grant DP0346004.)

Techniques for automatic artist identification are becoming more and more important due to numerous potential applications, including music indexing and retrieval, copyright management and music recommendation systems. The development of singer identification enables the effective management of music databases based on "singer similarity". With this technology, songs performed by a particular singer can be automatically clustered for easy management or exploration. Currently, the most popular (and naive) approach to supporting singer identification is to manually embed artist information into the music database with the assistance of music professionals. The main weakness of this approach is that it requires a large amount of time and deep domain expertise, which is very expensive. Moreover, in some cases, such as music downloaded from the Internet, descriptive information is often lacking and/or inconsistent.
It is clear that songs performed by the same singer share certain audio characteristics: the singer's voice is likely to exhibit similar audio patterns over all the songs (s)he performs. Singers also tend to perform within a single genre, and so the audio characteristics of their work may contain common features (e.g. instrumentation). This suggests the feasibility of automatic singer identification based on audio content. By "automatic singer identification" here we refer to the task of determining which singer is most likely to be singing a given song. The effectiveness of a solution relies heavily on its capability to capture the salient information that separates one signal from another. While traditional speech recognition techniques [2, 15] could be applied to this task, they are unlikely to perform well, because the vocal track is intertwined with the nonstationary background signal played by different instruments. Moreover, it is rare that we can acquire a pure solo voice track without instrumentation (unless we had access to the original multi-track data for the song).

Several approaches have recently been proposed to apply statistical models or machine learning techniques to automatic singer classification/identification [12, 19, 23]. In general, these methods consist of two main steps: singer characteristic modelling based on solo voice, and class label identification. In the singer characteristic modelling step, acoustic signal information is extracted to represent the music. Then specific mechanisms (i.e., statistical models or machine learning algorithms) are constructed to assign songs to one of the pre-defined singer categories based on their extracted acoustic features. Unfortunately, these approaches are unable to achieve acceptable classification accuracy. The main reasons are: 1) like other kinds of multimedia objects such as images and videos, songs with vocal components contain many audio features, so good identification performance cannot be expected when a single type of feature is used to represent music items; 2) no existing work has addressed an underlying multifeature integration model that can be used to exploit the conjunctive effects among different acoustic characteristics for effective singer identification; and 3) most previous work simply ignores the influence of the accompanying music, even though non-vocal components generally carry a large amount of information about music style and genre, which can be significantly useful for identifying songs performed by a certain artist.

Motivated by these concerns, this paper proposes a novel system, called Hybrid Singer Identifier (HSI), for automated singer recognition in large music databases. In summary, the contributions of the paper include:

· A novel singer identification framework using multiple features extracted from both vocal and non-vocal components to improve the system's effectiveness.

· A probabilistic singer characteristic modelling method based on mixture models and logistic regression to bridge the "semantic gap".

· An extensive performance study based on a large dataset that demonstrates the superiority of our method over state-of-the-art approaches in various aspects, including effectiveness, scalability, efficiency, and robustness against audio distortion and various kinds of noise.

The remainder of this paper is organised as follows: Section 2 gives a brief overview of related work. In Section 3, we present our proposed architecture, its component modules and the relevant algorithms.
Section 4 reports the experimental study. Finally, we conclude the paper with directions for future work in Section 5.

2. RELATED WORK
In recent years, there has been increasing interest in the development of frameworks for singer identification and associated topics. Among the earliest of such systems, Minnowmatch mainly focuses on artist rather than singer identification, using mel-frequency cepstral coefficients (MFCCs), a feature adapted from classical speech recognition and speaker identification mechanisms [21]. The best identification accuracy achieved on a small dataset containing a 10-artist set is approximately 70%; with a larger set of 21 artists, the best-case accuracy drops significantly, to 50%. In [4], vocal music is used as input to a speech recognition system, achieving a success rate of up to 80% in isolating vocal regions. In [3], the authors use a neural network trained on radio recordings to similarly segment songs into vocal and non-vocal regions; by focusing on voice regions alone, they improve artist identification by 15%. The system presented here also performs segmentation of vocal regions prior to singer identification; after segmentation, the classifier uses features drawn from voice coding based on Linear Predictive Coding.

In [12], a novel scheme is designed and developed to automatically classify music objects according to their singers. First, the coefficients extracted from the output of polyphase filters are used to compute music features for segmentation. Based on these features, a music object can be decomposed into a sequence of notes (or phonemes). Then, for each phoneme in the training set, its music feature is extracted and used to train a music classifier which can identify the singer of an unknown music object.

Zhang [23] developed a system for automatic singer identification which recognises the singer of a song by analysing the music signal. The proposed scheme follows the framework of common speaker identification systems, but special efforts are made to distinguish the singing voice from instrumental sounds in a song. A statistical model is trained for each singer's voice with typical song(s) of the singer. Then, for a song to be identified, the starting point of the singing voice is detected and a portion of the song is excerpted from that point. Audio features are extracted and matched against the singers' voice models in the database, and the song is assigned to the model with the best match. Accuracy rates of around 80% are achieved on a tiny database of 45 songs. Meanwhile, in [1], a singer identification method is developed based on spectral envelope estimation using a composite transfer function (CTF), which is calculated from the instantaneous amplitude and frequency of the signal's harmonic partials. Unfortunately, this method only examines a very limited case in which the audio samples contain only the singer's voice, without accompaniment. More recently, Tsai et al. proposed a solo voice modelling framework to capture singers' vocal characteristics [19]. The technique first separates vocal from non-vocal regions and then models the singers' vocal characteristics based on stochastic properties of the background music. The main weakness of this method is that it uses only a single type of acoustic feature for vocal portions (20-dimensional Mel-Frequency Cepstral Coefficients) to profile different singers. In their experiment on a small dataset of 230 popular music songs, the identification accuracy is only 71%. Nevertheless, it is the current state-of-the-art in singer identification.

Method      | Multifeature Integration | Size of Test Dataset | Vocal Based | Nonvocal Based
WHI [21]    | No                       | Small                | No          | No
LIU [12]    | No                       | Small                | No          | No
BA [1]      | No                       | Small                | No          | No
BER [3]     | No                       | Small                | Yes         | No
ZHANG [23]  | No                       | Small                | Yes         | No
TSAI [19]   | No                       | Small                | Yes         | Yes
Table 1: Summary of identification methods' properties.

Table 1 summarises the properties of each identification method. As discussed above, these methods either use a single kind of acoustic feature to represent music objects or simply investigate singer identification from a speech-recognition angle. Moreover, all of them have been tested only on very small datasets. Consequently, it is difficult to evaluate their impact on real-life applications.

3. SYSTEM ARCHITECTURE
In this section, we present a new method, called Hybrid Singer Identifier (HSI), to facilitate automated singer recognition in large music databases. The system, as illustrated in Figure 1, uses a layered structure and consists of two major component layers: a preprocessing module and a singer modelling module. The major functionality of the preprocessing module is to separate an incoming song into vocal and non-vocal segments. The second layer comprises multiple statistical singer models, each corresponding to one artist. In particular, the statistical model for one singer consists of a series of Gaussian Mixture Models (GMMs), each constructed using one kind of acoustic feature. To identify a song, different feature vectors are first extracted from its vocal and non-vocal segments. The likelihood values generated by the GMMs in each statistical singer model are combined into an overall relevance score using a novel fusion scheme based on the logistic function, and the song is finally assigned the label of the singer with the highest overall relevance score. In the following subsections, we present the submodules and corresponding algorithms of the system.

[Figure 1: Architecture of the HSI singer identification system. The preprocessing module uses feature extraction and an SVM to split the frames of an incoming music record into vocal and non-vocal frames; these are scored by singer models 1..N, and the label with the maximum likelihood is returned.]

3.1 Music Preprocessing
In the first stage of the identification process, vocal and non-vocal segments are identified and labelled by the preprocessing module. This can be treated as a vocal boundary detection problem, and the steps are illustrated in Algorithm 1. We use a learning-based approach with a support vector machine (SVM), consisting of two processes: feature extraction and classification with the SVM. When a song comes in, it is divided into many short time-frames of a predefined length (line 1.1). We set the frame length to 0.5sec, as this yields the best performance in our experiments. Acoustic features are then calculated from each frame (line 1.2); they include the Mel-Frequency Cepstral Coefficients (MFCC), spectral centroid, spectral flux, zero crossings, and low energy.

    Input: Song s, Frame Length l
    Output: Vocal and non-vocal segments
    1.1 Segment song s into frames with length l;
    1.2 Calculate spectral features for each frame;
    1.3 Classify music frames into two categories, vocal and non-vocal, with SVM;
    1.4 Return vocal and non-vocal segments;
    Algorithm 1: The Algorithm for Preprocessing

After obtaining the features, the SVM is used to separate vocal from non-vocal frames (line 1.3). The use of an SVM is motivated by the fact that SVMs have demonstrated strong effectiveness on various categorisation problems. The basic idea is that the SVM non-linearly maps the input features into a high-dimensional feature space, in which a linear classifier with an optimal hyperplane separates positive from negative patterns with maximum margin. The SVM performs well for binary classification; LIBSVM [5] is used in this work.
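To make the preprocessing step concrete, the following Python sketch implements the spirit of Algorithm 1. It is a minimal illustration under stated assumptions, not HSI's exact implementation: it uses librosa for signal processing and scikit-learn's SVC in place of LIBSVM, and the precise definitions of spectral flux and low energy are our assumptions.

    # A minimal sketch of Algorithm 1, assuming librosa and scikit-learn
    # (the paper uses LIBSVM [5]; SVC is a stand-in). The 0.5s frame length
    # follows Section 3.1; the spectral-flux and low-energy definitions here
    # are illustrative assumptions.
    import numpy as np
    import librosa
    from sklearn.svm import SVC

    FRAME_SEC = 0.5

    def frame_features(y, sr):
        """One row per 0.5s frame: 13 MFCCs, spectral centroid,
        spectral flux, zero-crossing rate and a low-energy ratio."""
        hop = int(FRAME_SEC * sr)
        rows = []
        for start in range(0, len(y) - hop + 1, hop):
            frame = y[start:start + hop]
            mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13).mean(axis=1)
            centroid = librosa.feature.spectral_centroid(y=frame, sr=sr).mean()
            zcr = librosa.feature.zero_crossing_rate(frame).mean()
            spec = np.abs(librosa.stft(frame))
            flux = np.mean(np.sum(np.diff(spec, axis=1) ** 2, axis=0))
            rms = librosa.feature.rms(y=frame)[0]
            low_energy = np.mean(rms < rms.mean())   # fraction of quiet windows
            rows.append(np.concatenate([mfcc, [centroid, flux, zcr, low_energy]]))
        return np.array(rows)

    def train_segmenter(labelled_clips):
        """labelled_clips: list of (waveform, sr, per-frame 0/1 labels),
        where 1 marks a vocal frame."""
        X = np.vstack([frame_features(y, sr) for y, sr, _ in labelled_clips])
        t = np.concatenate([lab for _, _, lab in labelled_clips])
        return SVC(kernel="rbf").fit(X, t)   # binary vocal/non-vocal classifier

    def vocal_mask(clf, y, sr):
        """Boolean mask over frames: True where the SVM predicts 'vocal'."""
        return clf.predict(frame_features(y, sr)) == 1

The per-frame predictions directly give the vocal and non-vocal segments that the second layer consumes.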
3.2 Multifeature Statistical Singer Modelling
As mentioned above, the second layer of our HSI system contains numerous singer characteristic models, one for each singer. Each singer characteristic model is made up of two parts: feature extraction, and multiple mixture models built to capture statistical information from the different acoustic features for a particular singer. Its structure is presented in Figure 2.

[Figure 2: Singer s's characteristic modelling module. Vocal segments of a song by singer s feed vocal timbre and vocal pitch feature extraction into VTF and VPF GMMs; non-vocal segments feed instrument and genre feature extraction into instrument and genre GMMs. The four likelihoods l_1^s, ..., l_4^s are combined by C^s into the overall score L^s.]

3.2.1 Feature Extraction
To effectively represent the complex musical content of a specific artist's song, we extract different features from both the vocal and non-vocal segments. The features consist of four components: vocal timbre feature (VTF), vocal pitch feature (VPF), genre-based feature and instrument feature. (Note that our method can easily be extended to consider more acoustic features.) The VTF and VPF capture information about the vocal segments performed by the singer. The genre-based feature represents the music style. The instrument feature models the characteristics of the typical instrument configuration of songs performed by the artist; this is motivated by perceptual studies indicating that music performed by a certain singer typically has a similar instrument configuration [22]. The details are as follows (an illustrative implementation sketch follows the list):

· Vocal Timbre Feature (VTF) conveys the timbre information of the vocal component. It is important because the voice is a special instrument, generated with flesh and bone, whose timbre is unique to each singer and provides the most expressive characteristic of a particular singer's songs. In this study, we extract LPCC (Linear Prediction-based Cepstral Coefficients) [17] from vocal segments to represent this information. The dimensionality of this feature vector is 16.

· Vocal Pitch Feature (VPF) describes the harmonic structure of a certain singer's voice. We extract it using the algorithm proposed by Tolonen et al. [18]. The raw signal is first processed with a band-pass filter whose lower and upper passband limits lie between 0 and 4500 Hz, since the frequency of the human voice falls inside this range. Then amplitude envelopes are extracted for different frequencies and summed to construct a pitch histogram, which is used to describe prosodic features. An 18-dimensional feature vector incorporates: the amplitudes and periods of the six maximum peaks in the histogram, the pitch intervals between the six most prominent peaks, and the overall sums of the histograms. We use this feature to build the singer's prosodic model.

· Genre Based Feature (GBF) contains information about the music's genre. It has been observed that, in general, the genres of music performed by a certain artist are limited; thus, using a genre-based feature can improve singer identification accuracy. In this study, the Daubechies Wavelet Coefficient Histograms (DWCHs) technique is used to summarise information about music genre [11]. With DWCHs, local and global temporal information inside the music signal can be captured at the same time; it is currently the state-of-the-art feature extraction technique for content-based genre classification. The dimensionality of the feature is 40.

· Instrument Based Feature (IBF) contains instrument configuration information for the songs performed by a certain singer. Recent results on music understanding show a strong association between a singer and the instrument configuration of his or her songs [22]; thus, an instrument-based feature should be helpful for singer identification effectiveness. In this paper, Mel-Frequency Cepstral Coefficients (MFCC) are used to represent information about instrument configuration, because the MFCC feature has been widely used to model timbre for instrument identification [13]. The IBF's dimensionality in our setting is 13.
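For illustration, the sketch below shows one plausible shape for the four extractors. These are rough stand-ins under stated assumptions, not the cited algorithms themselves: librosa's LPC is used in place of full LPCC [17], librosa's YIN pitch tracker replaces the multipitch analysis of [18] (although the resulting vector is, like HSI's VPF, 18-dimensional), and a wavelet-histogram summary only loosely imitates DWCHs [11], with a different dimensionality than the 40-D feature used by HSI. Only the MFCC-based IBF maps directly onto a standard library call.

    # Hedged stand-ins for the four feature extractors of Section 3.2.1.
    # vtf/vpf/gbf are simplifications of [17], [18] and [11] respectively;
    # only ibf corresponds directly to common practice.
    import numpy as np
    import librosa
    import pywt

    def vtf(y, order=16):
        """Vocal Timbre Feature: 16 LPC coefficients of a vocal segment
        (true LPCC would convert these to cepstral coefficients)."""
        return librosa.lpc(y, order=order)[1:]   # drop the leading 1.0

    def vpf(y, sr, bins=64):
        """Vocal Pitch Feature: an 18-D summary of a pitch histogram over
        the voice range, loosely following [18]."""
        f0 = librosa.yin(y, fmin=60, fmax=4500, sr=sr)
        hist, edges = np.histogram(f0[np.isfinite(f0)], bins=bins)
        top = np.argsort(hist)[-6:]              # six most prominent peaks
        peaks = np.sort(edges[top])
        return np.concatenate([hist[top],        # peak amplitudes (6)
                               peaks,            # peak positions (6)
                               np.diff(peaks),   # pitch intervals (5)
                               [hist.sum()]])    # overall histogram sum (1)

    def gbf(y, levels=4):
        """Genre Based Feature: moments of wavelet-coefficient histograms
        per subband, a rough proxy for DWCHs [11]."""
        feats = []
        for c in pywt.wavedec(y, "db8", level=levels):
            feats += [np.mean(np.abs(c)), np.std(c), np.mean(c ** 3)]
        return np.array(feats)

    def ibf(y, sr):
        """Instrument Based Feature: 13 MFCCs averaged over a non-vocal segment."""
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)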
3.2.2 Profiling a Singer with Mixture Models
For effective singer identification, HSI constructs a statistical model for each singer based on multiple features. To achieve this, the individual features of the music signal are extracted, and individual profiling models for each singer are built from each feature. (In this study, since four different features are extracted, the number of profiling models for each singer is four.) In our framework, singer profiling aims to capture the statistical properties of the different features with a finite number of mixture models. Thus, for a certain feature f, the probability of singer label s can be modelled as a random variable drawn from a probability distribution. Given a parameter set \Theta_f^s estimated from feature f, it is represented as a mixture of multivariate component densities:

    P_f^s(V_f | \Theta_f^s) = \prod_{t=1}^{T} \sum_{j=1}^{J} h_{fj}^s \, p_f^s(v_{tf} | \mu_{fj}^s, \Sigma_{fj}^s)    (1)

where V_f = {v_{1f}, v_{2f}, ..., v_{Tf}}. Assuming Gaussian component densities, as in this study, this gives the GMM \Theta_f^s = {h_{fj}^s, \mu_{fj}^s, \Sigma_{fj}^s | 1 <= j <= J}, where h_{fj}^s, \mu_{fj}^s and \Sigma_{fj}^s denote the mixture weights, mean vectors and covariance matrices respectively. Also, p_f^s(v_{tf} | \mu_{fj}^s, \Sigma_{fj}^s) is the probability of singer label s based on feature f extracted from segment t given data v_{tf}, and can easily be calculated using the Gaussian density function with the associated parameters {\mu_{fj}^s, \Sigma_{fj}^s}.

In HSI, an EM algorithm is used to determine the set of model parameters. This estimation process is an iterative hill-climbing procedure whose goal is to derive an optimal parameter set \Theta_f^s via maximum likelihood estimation:

    \hat{\Theta}_f^s = \arg\max_{\Theta_f^s} P_f^s(V_f | \Theta_f^s)    (2)

In the first step, \Theta_f^s is initialised with random values. Then a new model \hat{\Theta}_f^s is obtained with the auxiliary function in the E step:

    Q(\Theta_f^s; \hat{\Theta}_f^s) = \sum_{t=1}^{T} \sum_{j=1}^{J} p_f^s(j | v_{tf}, \Theta_f^s) \log p_f^s(j, v_{tf} | \hat{\Theta}_f^s)    (3)

where

    p_f^s(j, v_{tf} | \hat{\Theta}_f^s) = \hat{h}_{fj}^s \, p_f^s(v_{tf} | \hat{\mu}_{fj}^s, \hat{\Sigma}_{fj}^s)    (4)

and

    p_f^s(j | v_{tf}, \Theta_f^s) = \frac{h_{fj}^s \, p_f^s(v_{tf} | \mu_{fj}^s, \Sigma_{fj}^s)}{\sum_{j'=1}^{J} h_{fj'}^s \, p_f^s(v_{tf} | \mu_{fj'}^s, \Sigma_{fj'}^s)}    (5)

In the M step, the parameters are updated using the estimates below:

    \hat{h}_{fj}^s = \frac{1}{T} \sum_{t=1}^{T} p_f^s(j | v_{tf}, \Theta_f^s)    (6)

    \hat{\mu}_{fj}^s = \frac{\sum_{t=1}^{T} p_f^s(j | v_{tf}, \Theta_f^s) \, v_{tf}}{\sum_{t=1}^{T} p_f^s(j | v_{tf}, \Theta_f^s)}    (7)

    \hat{\Sigma}_{fj}^s = \frac{\sum_{t=1}^{T} p_f^s(j | v_{tf}, \Theta_f^s) \, (v_{tf} - \hat{\mu}_{fj}^s)(v_{tf} - \hat{\mu}_{fj}^s)^T}{\sum_{t=1}^{T} p_f^s(j | v_{tf}, \Theta_f^s)}    (8)

The updating procedure is repeated until the log-likelihood increases by less than a predefined threshold from one iteration to the next. Since HSI considers four different features, the overall training procedure is repeated four times, once for each feature. After the training process is completed, the likelihood value generated from feature f for an input feature vector sequence V_f is given by:

    l_f^s = \log P_f^s(V_f | \Theta_f^s) = \sum_{t=1}^{T} \log \left( \sum_{j=1}^{J} h_{fj}^s \, p_f^s(v_{tf} | \mu_{fj}^s, \Sigma_{fj}^s) \right)    (9)
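The EM recipe of Eqs. (2)-(8) is exactly what off-the-shelf GMM implementations perform, so singer profiling can be sketched compactly. The snippet below uses scikit-learn's GaussianMixture as a stand-in for the hand-rolled EM above (its tol parameter plays the role of the predefined stopping threshold); J = 8 components is an assumed value that the paper does not fix.

    # Singer profiling (Section 3.2.2): one GMM per (singer, feature) pair,
    # fitted by EM. GaussianMixture runs the same E/M updates as Eqs. (3)-(8);
    # n_components J = 8 is an assumption.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    FEATURES = ("vtf", "vpf", "gbf", "ibf")

    def train_singer_model(train_vectors, J=8, seed=0):
        """train_vectors: dict feature name -> (T x d) array of vectors
        extracted from the singer's training songs."""
        return {f: GaussianMixture(n_components=J, covariance_type="full",
                                   random_state=seed).fit(train_vectors[f])
                for f in FEATURES}

    def log_likelihoods(model, song_vectors):
        """Eq. (9): l_f^s = sum_t log p(v_tf | Theta_f^s), one value per feature."""
        return np.array([model[f].score_samples(song_vectors[f]).sum()
                         for f in FEATURES])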
From the per-feature likelihoods we can derive an overall likelihood value for singer s based on the various features, expressed as:

    L^s = C^s(l_f^s, w_f^s)    (10)

where the w_f^s are combination weights and C^s is the likelihood-combination function. L^s can be used to quantify the universal similarity distance between the input song and singer label s.

3.3 Fusion Weight Estimation via Learning
In HSI, for each singer characteristic model, a logistic function is used as the combination function C^s to derive an overall likelihood score. Logistic functions have been widely used in the statistical and machine learning communities, and play an important role in one of the most popular statistical algorithms, Logistic Regression (LR) [9, 20, 8]. The main reasons for using LR to estimate the parameters are that few statistical assumptions are required for its use and that its training cost is low. In addition, the logistic function maps any real value into a probabilistic output in [0,1]. With the logistic function, Eq. 10 can be reformulated as:

    L^s = C^s(l_f^s, w_f^s) = \frac{1}{1 + \exp\left(-y^s \sum_{f=1}^{F} w_f^s \, l_f^s\right)}    (11)

where y^s = 1 if the input object belongs to singer s and y^s = -1 otherwise, and w_f^s is the weight for singer s's likelihood value generated from feature f. F is the number of input scores and equals the number of features extracted in the first layer. L^s denotes the overall relevance score, the conditional probability of singer s. Based on Eq. 11, the likelihood of the learning samples is:

    \prod_{m=1}^{M} \frac{1}{1 + \exp\left(-y_m^s \sum_{f=1}^{F} w_f^s \, l_{mf}^s\right)}    (12)

where M is the number of training examples. It is easy to see that the goal of the training process is to maximise this overall likelihood value; equivalently, the goal is to find W^s minimising the log loss of the model:

    \sum_{m=1}^{M} \ln\left(1 + \exp\left(-y_m^s \sum_{f=1}^{F} w_f^s \, l_{mf}^s\right)\right)    (13)

To achieve this, the sequential-update optimisation algorithm proposed by Collins et al. [6] is applied; the modified version is shown in Algorithm 2. (For the detailed derivation, please refer to [6].) The algorithm is equivalent to AdaBoost [7, 10]. The basic principle is that, on each iteration t, the algorithm updates the distribution q_{t,m} to increase the weights of misclassified training examples (lines 2.14-2.15). To account for the imbalance between positive and negative learning examples, we revise the algorithm to reweight q_{t,m} by class (lines 2.4-2.9).

    Input: Matrix MA in [-1,1]^{M x F}, where MA_{mf} = y_m l_{mf};
           N_p, the number of positive training examples;
           N_n, the number of negative training examples
    Output: Weight vector W^s
    2.1  W^s = (0, 0, ..., 0) and q_{0,m} = 0.5 for all m;
    2.2  for t = 1, 2, ... do
    2.3      q_{t+1,m} = q_{t,m} exp(-sum_{f=1}^{F} delta_{t,f} MA_{mf});
    2.4      for each positive training example m do
    2.5          q_{t,m} = N_n q_{t,m} / (N_p + N_n);
    2.6      end
    2.7      for each negative training example m do
    2.8          q_{t,m} = N_p q_{t,m} / (N_p + N_n);
    2.9      end
    2.10     j_t = arg max_f sum_{m=1}^{M} q_{t,m} MA_{mf};
    2.11     r_t = sum_{m=1}^{M} q_{t,m} MA_{m j_t};
    2.12     z_t = sum_{m=1}^{M} q_{t,m};
    2.13     a_t = (1/2) ln((z_t + r_t) / (z_t - r_t));
    2.14     delta_{t,f} = a_t if f = j_t, and 0 otherwise;
    2.15     update W^s = W^s + delta_t;
    2.16 end
    2.17 Return W^s = {w_1^s, w_2^s, ..., w_F^s};
    Algorithm 2: Logistic regression training algorithm to determine the weights of score fusion for singer s
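For concreteness, a direct transcription of Algorithm 2 follows. The fixed iteration budget and the small epsilon guarding the logarithm are our additions, since the pseudocode leaves the stopping rule open.

    # Sequential-update training of the fusion weights (Algorithm 2, after
    # Collins et al. [6]). MA[m, f] = y_m * l_{mf}, scaled into [-1, 1];
    # the iteration budget and eps guard are assumptions.
    import numpy as np

    def train_fusion_weights(MA, y, iters=100, eps=1e-12):
        """MA: (M x F) score matrix; y: labels in {+1, -1}, one per example."""
        M, F = MA.shape
        n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
        W = np.zeros(F)                                # line 2.1
        q = np.full(M, 0.5)
        delta = np.zeros(F)
        for _ in range(iters):                         # line 2.2
            q = q * np.exp(-(MA @ delta))              # line 2.3
            q[y == 1] *= n_neg / (n_pos + n_neg)       # lines 2.4-2.6
            q[y == -1] *= n_pos / (n_pos + n_neg)      # lines 2.7-2.9
            scores = q @ MA                            # sum_m q_m * MA[m, f]
            j = int(np.argmax(scores))                 # line 2.10
            r, z = scores[j], q.sum()                  # lines 2.11-2.12
            a = 0.5 * np.log((z + r) / (z - r + eps))  # line 2.13
            delta = np.zeros(F)
            delta[j] = a                               # line 2.14
            W += delta                                 # line 2.15
        return W                                       # line 2.17

Note that the class reweighting in lines 2.4-2.9 scales the aggregate mass of the positive and negative examples to be equal, which is what the revision for imbalanced training sets is meant to achieve.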
3.4 Identification Process
The goal of HSI is to identify the singer of a song presented only as audio data. Given the architecture introduced above, we now describe the procedure of the whole system. To accomplish singer identification, we have a music database in which the singer of each song is known. In the training stage, we process the songs in the database and extract the features for each singer; from the characteristics of the singers, the parameters for the multifeature statistical modelling module are generated. After this setup, we can conduct the singer identification task; the basic steps are shown in Algorithm 3. For a given music item, the system first partitions the song into vocal and non-vocal segments (line 3.1). After that, the feature extraction procedure generates the four kinds of features described in Section 3.2.1 (line 3.2). Next, the features are fed into the statistical module for each individual singer in the second layer of HSI: the likelihood score for a particular feature is generated with Eq. 9, and the relevance scores, one from each singer statistical model, are calculated with Eq. 11 (line 3.3). These scores quantify the similarity between the incoming song and each singer label. In the final step, the singer label for the song is assigned based on those relevance scores:

    \hat{s} = \arg\max_{1 \le s \le S} L^s    (14)

    Input: A song with unknown singer
    Output: Singer label of the song
    3.1 Process the song to get vocal and non-vocal segments;
    3.2 Feature extraction;
    3.3 Derive relevance scores using the statistical singer models;
    3.4 Return the singer label with Eq. 14;
    Algorithm 3: Algorithm for automatic singer identification
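Putting the pieces together, Algorithm 3 reduces to a few lines. In this sketch, vocal_mask, log_likelihoods and train_fusion_weights refer to the earlier snippets, while extract_all_features is hypothetical glue that routes vocal frames to the VTF/VPF extractors and non-vocal frames to the GBF/IBF extractors.

    # End-to-end identification (Algorithm 3, Eqs. 11 and 14), reusing the
    # sketches above. extract_all_features is hypothetical glue code.
    import numpy as np

    def identify(y, sr, segmenter, singer_models, fusion_weights):
        mask = vocal_mask(segmenter, y, sr)                      # line 3.1
        vectors = extract_all_features(y, sr, mask)              # line 3.2
        best_label, best_score = None, -np.inf
        for s, model in singer_models.items():                   # line 3.3
            l = log_likelihoods(model, vectors)                  # Eq. (9)
            L = 1.0 / (1.0 + np.exp(-(fusion_weights[s] @ l)))   # Eq. (11), y^s = +1
            if L > best_score:
                best_label, best_score = s, L
        return best_label                                        # line 3.4, Eq. (14)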
3.5 Major Advantages
The HSI architecture presented in the previous sections enjoys several advantages over competing approaches:

· Comprehensiveness: our singer characteristic model takes into consideration various features extracted from vocal and non-vocal segments. The combination of these features summarises the music content more precisely and hence results in more accurate singer modelling.

· Effectiveness: the multifeature statistical singer characteristic model, in conjunction with the probabilistic likelihood fusion scheme based on the logistic function, enables us to capture the effects of the different features more effectively. This results in an identification process with higher accuracy.

· Flexibility: the statistical singer characteristic model for each artist is constructed independently. Thus, the whole system does not need to be rebuilt when music items performed by new singers are integrated into the database, which gives the system naturally superior scalability over other schemes. Moreover, this property allows HSI to handle a larger number of classes and data objects at the same time.

4. AN EXPERIMENTAL STUDY
In this section, we present an experimental study of the proposed method. We demonstrate the effectiveness of our method for large music databases by comparing it with the current best approaches in terms of singer identification accuracy, scalability to different dataset sizes, robustness against various kinds of noise, and efficiency in terms of response time.

4.1 Experimental Configuration
We first present the experimental settings for the performance evaluation, including the competing systems, the test datasets and the performance metrics. All methods were implemented and tested on a Pentium III 450MHz PC running the Linux operating system. A demo of our HSI system is also available [16].

Two datasets are used in our experimental study. The first, Dataset I, contains 230 songs from 13 female and 10 male artists with 10 songs per artist; this dataset was used in [19]. The second, Dataset II, consists of 8000 songs covering 90 different singers. It was constructed from a CD collection of the authors and includes 45 male singers (such as Van Morrison, Michael Jackson and Elton John) and 45 female singers (such as Kylie Minogue, Madonna and Jennifer Lopez). The length of each music item in the two datasets is set to 30 seconds, and there is no overlap between the two datasets. For both datasets, the sound files are converted to 22050Hz, 16-bit, mono audio files. For singer identification, we use 20% of each dataset for training and the remaining songs to evaluate the performance of all the schemes studied. In all experiments presented below, there is no overlap between the training set and the test set.

As discussed above, the main goal of the system is to identify the artist who performs an incoming song; the procedure is similar to clustering with certain criteria. Our evaluation therefore focuses on how accurate the identification process is for each approach on a particular database. We use accuracy as the effectiveness metric: accuracy = N_C / N, where N_C is the number of songs correctly identified and N is the number of songs used for evaluation. In addition, the average response time, the time required to identify a single query song, is used to measure the efficiency of the different techniques.

4.2 Comparison with Other Schemes
In this section, we compare HSI with three existing methods: BER, LIU and TSAI. To ensure a fair comparison, we selected the same set of data for training and used the rest for performance evaluation.

4.2.1 On Accuracy Comparison
We first report comparative results on the accuracy of the various singer identification schemes. Table 2 summarises the results for the two datasets. The bottom two rows show how BER and LIU performed: the experiments verify that both achieve similar identification accuracy, which is the worst. Furthermore, although the TSAI technique achieves better performance than LIU and BER, it still suffers from low accuracy, because TSAI only considers the MFCC-based acoustic characteristics within the vocal segments of a song. The experimental results clearly demonstrate that HSI significantly outperforms the other three approaches. For example, Table 2 shows that, compared to TSAI, HSI improves the identification accuracy from 72.5% to 87.3% on Dataset I and from 62.1% to 76.2% on Dataset II. The improvement over LIU is even more remarkable: on average, around 35% improvement can be observed on both datasets.

HSI has two advantages over the competing schemes. First, HSI extracts both vocal and non-vocal features from the song; using multiple features results in more comprehensive statistical models for the songs of a particular singer, and hence in better identification effectiveness. Second, the weights for fusing likelihood scores, derived via logistic regression, capture the joint effects among the various acoustic characteristics. This naturally raises the question of how much accuracy is improved by the fusion scheme. In the last experiment of this section, the identification effectiveness of HSI with and without the nonlinear likelihood fusion weights is examined. The first two rows of Table 2 illustrate the results. As expected, the nonlinear fusion weight generation scheme presented in Section 3.3 plays an important role in the whole identification procedure and brings a significant improvement: HSI with nonlinear fusion weights offers much higher accuracy (about 12% improvement). With this scheme, we are able to construct a connection between misclassifications and their correction in the second layer of the system via logistic regression.

                 Dataset I                 Dataset II
    Method   Female   Male    Ave.     Female   Male    Ave.
    HSI       86.4    88.2    87.3      77.4    75.0    76.2
    HSI-L     78.2    80.2    79.2      69.7    70.5    70.1
    TSAI      73.3    71.3    72.5      61.1    63.1    62.1
    LIU       66.4    66.0    66.2      55.2    56.4    55.8
    BER       65.4    65.2    65.3      56.0    55.4    56.7
    Table 2: Identification accuracy (%) comparison. HSI-L denotes a linear combination of the likelihood scores with equal weights.

4.2.2 On Efficiency Comparison
The above experiments show that the proposed HSI approach is superior to the other methods in terms of accuracy. For large music databases, response time is another important aspect of system performance. Although the additional features and the extra decision module in HSI improve accuracy, they also introduce overhead; in this experiment, we show how this affects time efficiency. Tests were run with 180 query songs for Dataset I and 6400 query songs for Dataset II.

                 Dataset I                 Dataset II
    Method   Female   Male    Ave.     Female   Male    Ave.
    HSI      0.269    0.241   0.255    1.199    1.112   1.156
    TSAI     0.277    0.213   0.235    1.731    1.631   1.681
    LIU      0.216    0.226   0.221    1.780    1.764   1.772
    BER      0.227    0.235   0.231    1.756    1.804   1.775
    Table 3: Identification efficiency comparison (query time in seconds).

Table 3 shows the query response times of the different singer identification schemes on the two datasets. From the results summarised in the table, we can see that HSI achieves comparable or better query speed than the other methods, especially on the large dataset.
For example, the identification process over Dataset II with BER, LIU and TSAI required 1.775sec, 1.772sec and 1.681sec respectively in the music database containing 8000 songs. By contrast, HSI takes only 1.086sec to complete the whole process, a saving of around 29%. Another interesting observation is that, unlike the other three approaches, the gap in HSI's response time between different data sizes is much smaller, and the time does not increase proportionally to the size (see the scalability experiment in the next section).

4.2.3 On Scalability Comparison
Scalability is particularly important for large music databases, because such systems can potentially contain thousands of songs. As the number of music items increases, the performance of a system may degrade due to noise and a larger number of similar songs in the database. In this experiment, we compare the identification accuracy and response time of the HSI system with those of the other competitors using different sizes of music data. For this set of experiments, we randomly pick 2000, 4000, 6000 and 8000 songs from Dataset II; 20% of the data are used for training and the rest for testing.

[Figure 3: Scalability comparison on Dataset II. (a) Query Accuracy: identification accuracy (%) vs. data size (2k to 8k songs); (b) Query Time: query time (sec) vs. data size; curves for HSI, TSAI, LIU and BER.]

Figure 3 shows the experimental results for the four systems. We can see that the accuracy and query times of both LIU and BER degrade dramatically as the data size increases, because the larger data sizes affect the performance of the machine-learning-based classifiers in those systems. While TSAI performs better, the improvement is very limited. Compared to these three approaches, HSI is very robust to the volume of data: there is no significant decrease in accuracy or increase in query processing time on larger datasets, e.g. accuracy is around 79% for 2000 songs and drops by only about 4% for the same system with 8000 songs. This is because HSI is constructed using multiple features from the songs, which enables it to capture the song characteristics more precisely. At the same time, the conjunctive effects of the multiple features are exploited to enhance system performance via the nonlinear likelihood fusion weight generation scheme.

4.2.4 On Robustness Comparison
Human perception has a superior capability to identify sound or music, even in the presence of moderate amounts of distortion, and real-world applications often require singer identification under similarly constrained conditions. This property can be extremely important in modern music database applications, because real data typically contains noise and distortion. In this section, to study the robustness of the different singer identification techniques, music data items modified with different kinds of distortion are used as query examples, and a series of experiments is carried out to test the performance of the different systems in the presence of moderate amounts of noise and other kinds of distortion. During the evaluation, we ran the same set of tests as in the previous experiments, but each query song is distorted and the results are compared against those obtained with a non-distorted query. We vary the levels and types of distortion to test the robustness of the different schemes.

Figure 4 summarises the accuracy of the methods under various distortions on Dataset II; the performance on Dataset I shows a similar trend and is omitted due to space limitations. The experimental results clearly demonstrate that, compared with the other approaches, HSI emerges as the most robust technique in terms of accuracy. It performs better than the competitors in all distortion cases. For example, HSI is robust, on average, to cropping of 6s, echo with a 7sec delay time, 60% volume amplification, 70% volume deamplification, and background noise at 40dB SNR. (The signal-to-noise ratio is calculated as SNR_dB = 10 log_10(S/N), where S is the signal power and N is the noise power.) In addition, the figures show that HSI is fairly robust across different levels of noise and acoustic distortion.

[Figure 4: Robustness comparison of singer identification methods on Dataset II. Identification accuracy (%) of HSI, TSAI, LIU and BER under: (a) cropping, (b) echo, (c) volume amplification, (d) volume deamplification, (e) Laplace noise addition, (f) Gaussian noise addition, (g) white noise addition, (h) median noise addition, (i) pink noise addition.]
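As an aside, the additive-noise distortions are straightforward to reproduce from the SNR definition above. The sketch below shows the Gaussian case (Laplace, white and pink noise differ only in how the noise is drawn); it is our illustration, not necessarily the exact distortion pipeline used in these experiments.

    # Adding background noise at a target SNR, using SNR_dB = 10*log10(S/N).
    # Gaussian noise is shown; other noise types change only how `noise`
    # is generated. An illustrative sketch, not the paper's exact pipeline.
    import numpy as np

    def add_noise(y, snr_db, rng=None):
        rng = rng or np.random.default_rng()
        signal_power = np.mean(y ** 2)
        noise_power = signal_power / (10.0 ** (snr_db / 10.0))
        noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
        return y + noise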
5. CONCLUSIONS
With music data rapidly increasing across various emerging application domains, efficient and effective music information retrieval is becoming more important than ever. In this paper, we developed a novel technique, called HSI, to facilitate singer identification in large music databases. Extensive experimental results based on large datasets demonstrate the superior effectiveness, scalability, efficiency and robustness of HSI over state-of-the-art systems. In the near future, we plan to evaluate the current framework on a larger dataset, develop advanced acoustic feature extraction methods to further improve the framework's effectiveness, and examine novel indexing structures to reduce the query processing cost. Adapting the system to other kinds of multimedia data is another promising research direction.

6. ACKNOWLEDGEMENTS
We would like to thank Professor Wei-Ho Tsai of Taipei University of Technology for kindly sharing his dataset with us.

7. REFERENCES
[1] M. Bartsch and G. Wakefield. Singing voice identification using spectral envelope estimation. IEEE Trans. on Speech and Audio Processing, 12(2):100-109, 2004.
[2] C. Becchetti and L. Ricotti. Speech Recognition. John Wiley, 1999.
[3] A. Berenzweig, D. P. W. Ellis, and S. Lawrence. Using voice segments to improve artist classification of music. In AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, pages 119-122, 2002.
[4] A. L. Berenzweig and D. P. W. Ellis. Locating singing voice segments within music signals. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. In Proc. of the 13th Annual Conference on Learning Theory (COLT), pages 158-169, 2000.
[7] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.
[9] M. I. Jordan. Why the logistic function? A tutorial discussion on probabilities and neural networks. Technical Report 9503, MIT, 1995.
[10] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models and Bregman distances. In Proc. of the 14th Conf. on Neural Information Processing Systems (NIPS), pages 110-121, 2002.
[11] T. Li, M. Ogihara, and Q. Li. A comparative study on content-based music genre classification. In Proc. of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 282-289, 2003.
[12] C. C. Liu and C. S. Huang. A singer identification technique for content-based classification of MP3 music objects. In Proc. of ACM CIKM Int. Conference on Information and Knowledge Management, pages 438-445, 2002.
[13] A. Livshin and X. Rodet. Musical instrument identification in continuous recordings. In Proc. of the 7th Int. Conference on Digital Audio Effects (DAFX-04), pages 222-226, 2004.
[14] F. Pachet. Content management for electronic music distribution. Communications of the ACM, 46(4):71-75, 2003.
[15] L. Rabiner and B. Juang. Fundamentals of Speech Recognition. Prentice Hall PTR, 1993.
[16] J. Shen, J. Shepherd, B. Cui, and K.-L. Tan. HSI: A novel framework for efficient automated singer identification in large music databases. In Proc. of ICDE'06 (demo paper), page 169, 2006.
[17] J. Sundberg. The Science of the Singing Voice. Illinois University Press, 1987.
[18] T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Trans. on Speech and Audio Processing, 8(4):708-716, 2000.
[19] W. H. Tsai and H. M. Wang. Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals. IEEE Trans. on Speech, Audio and Language Processing, 14(1):330-341, 2006.
[20] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[21] B. Whitman, G. Flake, and S. Lawrence. Artist detection in music with Minnowmatch. In IEEE Workshop on Neural Networks for Signal Processing, 2001.
[22] C. Xu, N. C. Maddage, and X. Shao. Automatic music classification and summarization. IEEE Trans. on Speech, Audio and Language Processing, 13(3):441-450, 2005.
[23] T. Zhang. Automatic singer identification. In Proc. of IEEE Int. Conference on Multimedia and Expo (ICME), pages 33-36, 2003.