
Genre Classification of Compressed Audio Data

Antonello Rizzi, Nicola Maurizio Buccino, Massimo Panella, and Aurelio Uncini
INFO-COM Department, University of Rome "La Sapienza", Via Eudossiana 18, 00184 Rome, Italy
{rizzi,buccino,panella,aurel}@infocom.uniroma1.it

Abstract—This paper deals with the musical genre classification problem, starting from a set of features extracted directly from MPEG-1 Layer III compressed audio data. The automatic classification of compressed audio signals into a small hierarchy of musical genres is explored. More specifically, three feature sets representing timbre, rhythmic content and energy content are proposed for a four-leaf tree genre hierarchy. The adopted features are computed from the spectral information available in the MPEG decoding stage. The performance and relative importance of the proposed approach are investigated by training a classification model on the audio collections used in musical genre contests; an optimization strategy based on genetic algorithms is also employed. The results are comparable to those obtained by PCM-based musical genre classification systems.

I. INTRODUCTION

Genre hierarchies, typically created manually by human experts, are currently one of the main ways used to structure music content on the Web. Automatic musical genre classification can potentially automate this process and provide an important component of a complete music information retrieval system for audio signals. In this paper, the problem of automatically classifying MPEG-1 Layer III (MP3) compressed audio signals into a hierarchy of musical genres is addressed. Although music on the Web is usually available in compressed form, in particular MP3, most of the known techniques use features calculated from PCM or MIDI audio data [1]. The technique proposed in this paper allows the direct classification of partially decompressed data, thus avoiding the complete decompression to PCM. For the datasets considered herein, we adopt a four-leaf taxonomy made of classical, electronic, pop and world music. This taxonomy seems to be a good compromise between the physical and perceptual features that characterize a musical genre.

As concerns research related to the classification of MPEG compressed audio data, very little has been done, mostly regarding the music/speech classification task [2]-[5]. An interesting work on musical genre classification in the compressed domain is illustrated in [6]. In that paper, the author classifies audio into six different genres: blues, easy listening, classical, opera, dance (techno) and indie rock. The features used for this task are energy-related features, in particular the cepstrum coefficients, and two different classification strategies are compared: a Gaussian Mixture Model (GMM) and a Vector Tree Quantization (VTQ). Results show a 90.9% accuracy for GMM and 85.1% for VTQ.


Comparing these results with the ones obtained using PCM audio data, there is an accuracy deterioration of about 4%. In this paper, three sets of features representing timbre, rhythmic content and energy content are proposed. Although timbral and energy features are also used for speech and general sound classification, the rhythmic feature set is novel and specifically designed to represent aspects of musical content. Moreover, a psychoacoustic preprocessing is performed in order to enhance the perceptual relevance of the proposed feature set. Its performance and relative importance are evaluated by training a classification model and by an automatic feature selection procedure based on a genetic optimization technique. The data used for the evaluation of the proposed classification system are audio collections available in well-known repositories on the Web.

II. PSYCHOACOUSTIC REMARKS

In our system, some psychoacoustic considerations are taken into account before the actual feature extraction stage, in order to extract the significant information related to the subbands. As can be seen in Fig. 1, the process starts as a normal MP3 decompression, including bitstream parsing and frequency sample de-quantization. Once subband data become available, they are used as a source for further computations rather than for synthesizing actual PCM samples with the synthesis filterbank.

We consider in this paper MP3 compressed audio data [7]. In the first step of the MPEG encoding process, the audio signal is converted into 32 equally spaced spectral components using an analysis filterbank. For every 32 consecutive PCM input audio samples (corresponding to T = 32 · Tc seconds of audio signal sampled with a period of Tc seconds), the filterbank provides 32 subband samples si[k] = si(kT), one sample per subband indexed by i = 1...32. The Layer III algorithm groups the input signal into frames of 1152 PCM samples. Each MP3 frame consists of two granules, each of 576 PCM samples. With a standard 44.1 kHz sampling rate, a granule occurs approximately every 13 ms. Each granule contains 18 consecutive subband samples, where each subband sample is a vector of 32 frequency band amplitudes, each related to a subband of about 689 Hz. The first observation is that the 32 constant filterbank bandwidths do not accurately reflect the human ear critical bands; in particular, each bandwidth is too wide for the lower frequencies and too narrow for the higher ones.
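As a quick orientation, the following sketch (our own illustration, not part of the original system; the variable names are hypothetical) reproduces the timing and layout arithmetic described above for a 44.1 kHz MP3 stream.

```python
import numpy as np

FS = 44100                 # sampling rate (Hz)
N_SUBBANDS = 32            # MPEG-1 analysis filterbank channels
FRAME_PCM = 1152           # PCM samples per MP3 frame
GRANULE_PCM = 576          # PCM samples per granule (2 granules per frame)

subband_width_hz = FS / 2 / N_SUBBANDS          # ~689 Hz per subband
granule_duration_ms = 1000 * GRANULE_PCM / FS   # ~13.06 ms
samples_per_granule = GRANULE_PCM // N_SUBBANDS # 18 subband samples per granule

print(subband_width_hz, granule_duration_ms, samples_per_granule)

# A granule of partially decoded data can thus be viewed as an 18 x 32 array:
# 18 consecutive time samples for each of the 32 subbands.
granule = np.zeros((samples_per_granule, N_SUBBANDS))
```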


Based on these considerations, a subband of 689 Hz is too wide to represent the lowest critical bands; this is true especially for the first critical band. In order to face this limitation, we apply a Discrete Wavelet Transform (DWT) to the first subband, thus obtaining two further subbands, each covering a 345 Hz frequency range. The DWT is performed using a standard first-order biorthogonal filter.

The second important perceptual consideration regards empirical results showing that the ear has a limited frequency selectivity, which varies in acuity from less than 100 Hz for the lowest audible frequencies to more than 4 kHz for the highest ones. Thus, not all the audible spectrum is useful for a perceptual analysis, and some frequencies are perceptually more meaningful than others. This observation reflects the resolving power of the ear as a function of frequency, which is usually approximated by the Fletcher curve, representing the ear's sensitivity thresholds with respect to the sound pressure level. Looking at the Fletcher curve, it is evident that frequencies above 17 kHz are not perceptually relevant. So, in our analysis the subbands from 26 to 32 (ranging from 16.882 kHz to more than 20 kHz) are not taken into account.
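As an illustration, a Haar-like first-order biorthogonal split of the first subband could look as follows (a sketch under our own assumptions; the paper does not specify the filter coefficients, and 'bior1.1' in PyWavelets would give an equivalent one-level decomposition):

```python
import numpy as np

def split_first_subband(s0):
    """One-level DWT of the first (0-689 Hz) subband signal s0 with a
    Haar-like first-order biorthogonal pair, yielding an approximation
    (~0-345 Hz) and a detail (~345-689 Hz) half-band signal."""
    s0 = np.asarray(s0, dtype=float)
    if len(s0) % 2:                       # pad to an even length
        s0 = np.append(s0, s0[-1])
    even, odd = s0[0::2], s0[1::2]
    low = (even + odd) / np.sqrt(2.0)     # approximation coefficients
    high = (even - odd) / np.sqrt(2.0)    # detail coefficients
    return low, high
```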

Fig. 1: Psychoacoustic analysis and feature extraction data.

The last psychoacoustic observation derives once again from the Fletcher curve. In particular, due to the high sensitivity of the ear around 4 kHz, the frequency range from 700 Hz to 7.5 kHz has been emphasized by amplifying the subbands from 2 to 11 by a factor of 3. In this way, after these perceptually motivated computations, the actual data consist of 25 subband samples. These data will be used for the actual feature extraction.

III. FEATURE EXTRACTION PROCEDURE

The first step of our analysis system extracts a set of features from the audio data, in order to work with more meaningful information and to reduce the subsequent processing load. The features used here describe timbre, rhythm and energy. We used for this experiment MP3, 44.1 kHz, 128 kbps, stereo files [7]. Stereo channels have been processed separately and the resulting features have been averaged in order to represent the whole stereo file.
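Putting the three psychoacoustic steps together, a per-granule preprocessing sketch could look as follows (our own illustration; the exact indices of the retained bands are an assumption, since the paper only states that the first subband is split, subbands 2 to 11 are amplified by a factor of 3, the bands above roughly 17 kHz are dropped, and 25 subband values per sample remain):

```python
import numpy as np

EMPHASIS = 3.0     # gain applied to subbands 2..11 (roughly 700 Hz - 7.5 kHz)

def preprocess_granule(granule):
    """granule: (18, 32) array of de-quantized subband samples for one granule.
    Returns an (18, 25) array of perceptually weighted subband values.

    Assumptions: the first band is split with a non-decimated Haar-like pair so
    that both half-bands keep the 18-sample granule resolution; the highest
    bands are simply dropped so that 25 values per subband sample remain."""
    g = np.asarray(granule, dtype=float)
    first = g[:, 0]
    neighbour = np.roll(first, 1)                 # previous sample (circular)
    low = (first + neighbour) / np.sqrt(2.0)      # ~0-345 Hz half-band
    high = (first - neighbour) / np.sqrt(2.0)     # ~345-689 Hz half-band
    mid = g[:, 1:11] * EMPHASIS                   # subbands 2..11 emphasized
    rest = g[:, 11:]                              # remaining higher bands
    out = np.column_stack([low, high, mid, rest]) # 33 columns in total
    return out[:, :25]                            # keep the 25 lowest bands
```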

First of all, a root mean squared subband granule vector RMSGV_{T_G} is calculated as follows:

RMSGV_{T_G}[i] = \sqrt{ \frac{1}{18} \sum_{k=1}^{18} s_i[k]^2 },   (1)

where the index i, i = 1...25, denotes the i-th element of the vector, each related to a subband; s_i[k] is the k-th sample of the i-th subband; T_G stands for "Time of Granule" and indicates that the calculation refers to T_G ≈ 13 ms of audio signal. In the same way, the i-th element of a root mean squared subband frame vector RMSFV_{T_F} is calculated:

RMSFV_{T_F}[i] = \sqrt{ \frac{1}{36} \sum_{k=1}^{36} s_i[k]^2 },   (2)

where i = 1...25; T_F stands for "Time of Frame", indicating that the calculation refers to T_F ≈ 26 ms of audio signal.

A. Timbral Features

Timbre is currently defined in the literature as the perceptual feature that makes two sounds with the same pitch and loudness different. Features characterizing timbre can be found in [2]-[5]. These features analyze the spectral distribution of the signal and are global, in the sense that they integrate the information of all sources and instruments at the same time. Most of these descriptors are computed at regular time intervals over short windows of typical length between 10 and 60 ms. In the context of classification, timbre descriptors are often summarized by evaluating low-order statistics of their distribution over larger windows, commonly called texture windows [1], [2], [5]. Modeling timbre on a higher time scale not only further reduces computation, but is also perceptually more meaningful, since the short frames of signal used to evaluate the features are not long enough for human perception. Consequently, the timbral features used in our system are the following.

1) Spectral centroid: the spectral centroid SC is the balancing point of the RMSGV_{T_G} vector and is defined as follows:

SC = \frac{ \sum_{i=1}^{25} i \cdot RMSGV_{T_G}[i] }{ \sum_{i=1}^{25} RMSGV_{T_G}[i] }.   (3)

2) Spectral flux: the spectral flux SF represents the spectral difference between two temporally successive normalized granule vectors NGV_{T_G}. The i-th element, i = 1...25, is defined as:

NGV_{T_G}[i] = \frac{ RMSGV_{T_G}[i] }{ \| RMSGV_{T_G} \| },   (4)

SF = \sum_{i=1}^{25} \left| NGV_{T_G}[i] - NGV_{T_G - 1}[i] \right|^2,   (5)

where the subscript 'T_G − 1' refers to the previous granule.

3) Spectral roll-off: the spectral roll-off is defined as the number SR of subbands in which 85% of the whole granule energy lies; SR is the largest index such that:

\sum_{i=1}^{SR} RMSGV_{T_G}[i] \leq 0.85 \sum_{i=1}^{25} RMSGV_{T_G}[i].   (6)
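To make the feature definitions concrete, here is a small sketch (our own, with hypothetical helper names) that computes the RMS granule vector of Eq. (1) and the three timbral descriptors of Eqs. (3)-(6) for one granule:

```python
import numpy as np

def rms_granule_vector(granule_25):
    """granule_25: (18, 25) preprocessed subband samples of one granule.
    Returns RMSGV, the 25-element RMS vector of Eq. (1)."""
    g = np.asarray(granule_25, dtype=float)
    return np.sqrt(np.mean(g ** 2, axis=0))

def spectral_centroid(rmsgv):
    """Balancing point of the RMS granule vector, Eq. (3)."""
    idx = np.arange(1, len(rmsgv) + 1)
    return np.sum(idx * rmsgv) / np.sum(rmsgv)

def spectral_flux(rmsgv, rmsgv_prev):
    """Squared difference between successive normalized granule vectors,
    Eqs. (4)-(5)."""
    ngv = rmsgv / np.linalg.norm(rmsgv)
    ngv_prev = rmsgv_prev / np.linalg.norm(rmsgv_prev)
    return np.sum((ngv - ngv_prev) ** 2)

def spectral_rolloff(rmsgv, fraction=0.85):
    """Largest subband count SR whose cumulative energy stays below
    `fraction` of the total, Eq. (6)."""
    cum = np.cumsum(rmsgv)
    return int(np.searchsorted(cum, fraction * cum[-1], side='right'))
```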



Once all the previous scalar features have been computed, we consider their mean and variance over a larger window (77 granules) of about 1 s. This larger window is used to capture significant perceptual characteristics of audio signals.
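A sketch of this texture-window aggregation (hypothetical names; 77 granules amount to roughly 1 s at about 13 ms per granule):

```python
import numpy as np

TEXTURE_WINDOW = 77   # granules, roughly 1 s of audio

def texture_statistics(per_granule_features):
    """per_granule_features: (N, F) array, one row of F scalar features per
    granule. Returns a list of (mean, variance) pairs, one per ~1 s window."""
    x = np.asarray(per_granule_features, dtype=float)
    stats = []
    for start in range(0, len(x) - TEXTURE_WINDOW + 1, TEXTURE_WINDOW):
        win = x[start:start + TEXTURE_WINDOW]
        stats.append((win.mean(axis=0), win.var(axis=0)))
    return stats
```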


B. Energy Features


Energy features are the least perceptual ones; they represent physical aspects of the audio signal such as its energy content. These features are widely used in speech analysis and music/speech classification systems [1], [2], [4]-[6].

1) RMS: this feature is a measure of the granule loudness and is defined as:

RMS = \sqrt{ \frac{1}{25} \sum_{i=1}^{25} RMSGV_{T_G}[i]^2 }.   (7)

2) Low energy: this feature, denoted as LE, represents the percentage of granules in a 1 s window (77 granules) whose RMS power is below the average RMS power of the window.

3) Pseudo-cepstral coefficients: cepstral coefficients are widely used in speech analysis. They are defined as the Discrete Cosine Transform (DCT) of the log-transformed Fourier coefficients of the signal. The availability of the 25 filterbank spectral coefficients allows us to bypass the Fourier transform step. In this way, 16 pseudo-cepstral coefficients are obtained as the 16 DCT coefficients of the first 16 subband samples of a frame.

Once again, excluding LE, which already refers to a 1 s window, the actual features used for the analysis are the mean and variance of those 16 pseudo-cepstral coefficients and of the RMS, over a larger window (77 granules for RMS and 38 frames for the cepstral coefficients) of about 1 s. In addition, for each 1 s window the overall sums of the means and of the variances of the 16 cepstral coefficients are computed.
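A sketch of these three descriptors, assuming the pseudo-cepstrum is taken as the DCT of the log of the first 16 subband values of a frame (the paper bypasses the Fourier transform in this way; details such as the use of the logarithm are our assumption):

```python
import numpy as np
from scipy.fftpack import dct

def rms_feature(rmsgv):
    """Granule loudness, Eq. (7): RMS of the 25-element RMS granule vector."""
    v = np.asarray(rmsgv, dtype=float)
    return np.sqrt(np.mean(v ** 2))

def low_energy(rms_per_granule):
    """Percentage of granules in a ~1 s window whose RMS is below the
    window average."""
    r = np.asarray(rms_per_granule, dtype=float)
    return 100.0 * np.mean(r < r.mean())

def pseudo_cepstrum(rmsfv, n_coeffs=16):
    """16 pseudo-cepstral coefficients: DCT of the log of the first 16
    subband values of a frame (assumed interpretation)."""
    v = np.asarray(rmsfv, dtype=float)[:n_coeffs]
    return dct(np.log(v + 1e-12), type=2, norm='ortho')[:n_coeffs]
```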


C. Rhythm Features

Automatic rhythm description may be oriented toward different applications: tempo induction, beat tracking, meter induction, quantization of performed rhythm, or characterization of intentional timing deviations. They all work with PCM or MIDI audio data. A typical PCM automatic beat detector consists of a filterbank decomposition, followed by an envelope extraction step, and finally a periodicity detection algorithm that is used to detect the lag at which the signal envelope is most similar to itself [1]. In our case, the 25 filterbank coefficients of the MPEG encoder/decoder allow the direct extraction of the envelope of the signal. The feature set for representing rhythm structure is based on detecting the most salient periodicities of the signal. Fig. 2 shows the flow diagram of the proposed beat analysis algorithm.

Fig. 2: Flow diagram of the beat histogram calculation.

1) Envelope extraction and autocorrelation: using the 25 filterbank coefficients, the time-domain amplitude envelope of the i-th subband, i = 1...25, is extracted separately. This is achieved by the following processing steps (a code sketch of the whole envelope-and-autocorrelation chain is given below):
• Full Wave Rectification (FWR): x_i[k] = |s_i[k]|. This is applied in order to extract the temporal envelope of the signal rather than the time-domain signal itself.
• Low-Pass Filtering (LPF): y_i[k] = (1 − α) x_i[k] + α y_i[k − 1]. A one-pole filter with α = 0.99 is used to smooth the envelope. Full wave rectification followed by low-pass filtering is a standard envelope extraction technique.
• Mean Removal (MR): f_i[k] = y_i[k] − E{y_i[k]}, where E{·} denotes the expected value. MR is applied in order to center the signal around zero for the autocorrelation stage.

After mean removal, the envelopes of all bands are summed together: S[k] = \sum_{i=1}^{25} f_i[k]. The autocorrelation of the resulting sum envelope is then computed:

R[h] = \frac{1}{L} \sum_{m=1}^{L} S[m] S[m + h],

where, as specified below, L is the length of a 1 s sequence. The autocorrelation function is further manipulated in order to reduce the effect of integer multiples of the basic periodicities: the original autocorrelation R[h] is clipped to positive values, downsampled by a factor of 2, and subtracted from the original clipped function; the same process is repeated with other integer factors (i.e., 3, 4 and 5), so that the repetitive peaks at integer multiples are removed. The dominant peaks of the resulting autocorrelation function correspond to the various periodicities of the signal envelope. This analysis refers to a window of 231 granules, which at a 44.1 kHz sampling rate corresponds to approximately 3 s, with an overlap of 154 granules corresponding to about 2 s. This larger window is necessary to capture the signal repetitions at the beat and sub-beat levels; in this way the above-mentioned autocorrelation is computed every 1 s of signal.
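A compact sketch of this beat-analysis chain, under the assumptions stated in the comments (hypothetical helper names; the peak picker is deliberately simple and only illustrative):

```python
import numpy as np

ALPHA = 0.99          # one-pole smoothing coefficient
FACTORS = (2, 3, 4, 5)

def band_envelope(s):
    """Full-wave rectification, one-pole low-pass filtering and mean removal
    of one subband signal s (1-D array)."""
    x = np.abs(np.asarray(s, dtype=float))
    y = np.empty_like(x)
    y[0] = (1.0 - ALPHA) * x[0]
    for k in range(1, len(x)):
        y[k] = (1.0 - ALPHA) * x[k] + ALPHA * y[k - 1]
    return y - y.mean()

def enhanced_autocorrelation(subbands):
    """subbands: (K, 25) array of subband samples over a ~3 s window.
    Returns the enhanced autocorrelation of the summed band envelopes."""
    S = np.sum([band_envelope(subbands[:, i]) for i in range(subbands.shape[1])],
               axis=0)
    L = len(S)
    R = np.array([np.dot(S[: L - h], S[h:]) / L for h in range(L)])
    R = np.clip(R, 0.0, None)                 # keep positive values only
    enhanced = R.copy()
    for f in FACTORS:                         # as described in the text
        stretched = np.zeros_like(R)
        stretched[: len(R[::f])] = R[::f]     # autocorrelation downsampled by f
        enhanced = np.clip(enhanced - stretched, 0.0, None)
    return enhanced

def first_peaks(R, n=3):
    """Lags of the n largest local maxima of R, ignoring lag 0."""
    is_peak = (R[1:-1] > R[:-2]) & (R[1:-1] > R[2:])
    lags = np.where(is_peak)[0] + 1
    return lags[np.argsort(R[lags])[::-1][:n]]
```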


2) Peak detection and beat histogram computation: the first three peaks of each autocorrelation function (one for every 1 s of signal) are selected and accumulated over the whole sound file into a Beat Histogram (BH), where each bin corresponds to a peak lag, i.e., a beat period in beats-per-minute (bpm). The bins of the histogram span a range from 40 to 200 bpm (corresponding to periodicities from 1.5 s to 0.3 s, respectively). For each peak of the autocorrelation functions, the peak amplitude is added to the corresponding histogram bin. In this way, peaks having high amplitude (where the signal is highly self-similar) are weighted more strongly in the histogram computation than weaker peaks, so when the signal is very similar to itself (strong beat) the histogram peaks will be higher.

3) Beat histogram features: the BH representation captures detailed information about the rhythmic content of the piece, which can be used to effectively guess the musical genre of a song. Starting from this observation, a set of features based on the BH is calculated in order to represent the rhythmic content:
• P1, P2, P3: period of the first, second and third peak (in bpm);
• S: overall sum of the histogram (an indication of beat strength);
• R1, R2, R3: first, second and third histogram peak amplitude divided by S (relative amplitude);
• P2r: ratio of the period (in bpm) of the second peak to the period (in bpm) of the first peak;
• P3r: ratio of the period (in bpm) of the third peak to the period (in bpm) of the first peak.

D. Feature Vector

The feature calculation described so far allows us to represent each piece of music as a 52-dimensional vector. It is important to notice that the first nine (rhythm) features are "global" features, in the sense that they are computed over the whole sound file. On the contrary, the other ones are computed every 1 s of music. So, the actual feature vector used to represent an audio file is obtained as the union of the 9 global rhythm features and the means of the other 43 local features over the whole duration of the audio clip.

IV. CLASSIFICATION MODEL

The feature vectors obtained by the extraction procedure described in Sect. III feed a classification system. The choice of the classification system is independent of the genre classification task; in this paper we adopt the classification system of [8], whose performance has been ascertained in many real-world problems with respect to well-known classification benchmarks. The classification model is a classical fuzzy Min-Max neural network. The classification strategy consists of covering the patterns of the training set with hyperboxes, whose boundary hyperplanes are parallel to the axes of the main reference system. The hyperbox can be considered as a crisp frame on which different types of membership functions can be adapted. The neuro-fuzzy classification model is trained by the 'Pruning Adaptive Resolution Classifier' (PARC) algorithm. It is a constructive procedure able to generate a suitable succession of Min-Max classifiers, which are characterized by a decreasing complexity (number of neurons in the hidden layer).


The regularized network is automatically selected according to learning theory, following an Ockham's Razor criterion. For this reason, PARC is characterized by a good generalization capability as well as a high degree of automation. Consequently, it is particularly suited to be used as the core modeling system of a wrapper-like feature optimization procedure, such as the one described in the following Section.

The implemented system consists of a classification tree where each node is a Min-Max classifier; each node is trained to discriminate one genre from the remaining ones. As illustrated in Fig. 3, the overall system discriminates between the four aforementioned genres (classical, electronic, pop and world) in a sequential manner. The first node, classical/other, decides whether the audio clip belongs to classical or not (other). All the patterns classified as other feed the second node, electronic/other, of the tree. Again, all the patterns classified as other feed the third node, pop/world, of the tree.

Fig. 3: Proposed classification tree.

V. AUTOMATIC FEATURE SELECTION

In order to evaluate the effectiveness of the proposed features, an optimization strategy has been implemented. The technique is based on a genetic algorithm that selects the optimal subset of features for the assigned task; this reduces the input space dimension and improves the classification accuracy. Genetic algorithms are designed to manage a population of individuals, i.e., a set of potential solutions of the optimization problem at hand. Each individual is univocally represented by a genetic code, which is typically a string of binary digits. The fitness of a particular individual coincides with the corresponding value assumed by the objective function to be optimized. In our application, the adopted fitness function is a convex linear combination of two terms:

F_j = (1 − λ) E_j + λ C_j,

where E_j is a performance measure on the training set S_tr and C_j is a complexity measure of the j-th classifier; λ ∈ [0, 1] is a meta-parameter by which the user can control the training procedure, taking into account that small values of λ yield more complex models and consequently higher accuracy on S_tr (the opposite situation is characterized by large values of λ). For a feature selection problem, the genetic code represents a subset of the original feature set.
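A minimal sketch of the chromosome encoding and fitness evaluation used by such a wrapper (our own illustration; train_parc_classifier, the error and complexity attributes, and the value of λ are hypothetical, and the standard selection/mutation/crossover loop with elitism described below is only hinted at):

```python
import numpy as np

N_FEATURES = 52
LAMBDA = 0.2     # trade-off meta-parameter (value chosen arbitrarily here)

def fitness(chromosome, X_train, y_train, train_parc_classifier):
    """chromosome: binary vector of length 52; a 1 selects the feature.
    Returns F = (1 - lambda) * E + lambda * C, to be minimized."""
    mask = np.asarray(chromosome, dtype=bool)
    if not mask.any():
        return np.inf                         # empty feature sets are invalid
    model = train_parc_classifier(X_train[:, mask], y_train)
    E = model.training_error                  # performance term on S_tr
    C = model.complexity                      # e.g. number of hyperboxes
    return (1.0 - LAMBDA) * E + LAMBDA * C

def random_population(size, rng=np.random.default_rng(0)):
    """Initial population of random binary feature masks."""
    return rng.integers(0, 2, size=(size, N_FEATURES))
```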


In our application, each genetic code is made of 52 binary digits, where the n-th digit, n = 1...52, is equal to one if the corresponding feature feeds the classification system, and zero otherwise. The evolution starts from a population of P randomly selected individuals. At the k-th generation G_k, with k = 0,...,M_gen, the next generation G_{k+1} is determined by applying standard selection, mutation and crossover operators. The behavior of the whole algorithm depends on the values of P and M_gen, as well as on the mutation rate MR and on the crossover rate CR, which are two probability thresholds controlling the related operators. The convergence of the genetic algorithm is assured by elitism, i.e., by copying the best individual into the next generation.

VI. PERFORMANCE EVALUATION

The system's performance has been evaluated using different Web repositories (in particular "all-music.com" and "mp3.com") and different classification procedures (GMM, k-NN, SVM, Min-Max). In spite of the several tests we carried out, for the sake of brevity and objectivity we will show in the following only the results obtained on the ISMIR2004 genre classification contest data [9]. The contest results on that database were obtained working with PCM audio data and using a taxonomy of 6 genres: classical, electronic, jazz blues, metal punk, rock pop and world. On the basis of the considerations made in Sect. I, we use a four-genre taxonomy: classical, electronic, pop (including the jazz blues, metal punk and rock pop sub-genres) and world.

TABLE I: Data Composition

Genre       | S_tr(n) | S_ts(n) | S_ts
classical   |   317   |   317   |  318
electronic  |   115   |   113   |  112
pop         |   172   |   173   |  173
world       |   122   |   122   |  117
Total       |   726   |   725   |  720

It is important to notice that the classification systems of ISMIR2004 were tested on a Test_Tracks collection (made up of 700 tracks) that was supplied during the contest. This collection is not available, so our system has been trained with the Training_Tracks collection (like all the systems of the contest participants) and tested using the tracks of the Development_Tracks collection (which was supplied for the development of the participants' classification systems and not for the test). From this observation it is evident that our system has been trained using a quantity of data smaller than the one used by the contest participants.

The data set used in our experiment is available for download at the ISMIR2004 website; it is distributed and copyrighted by "Magnatune.com" and is made up of two different collections. The first collection, called Training_Tracks, consists of 729 MP3, 128 kbps stereo files sampled at 44.1 kHz, 16 bit, divided into 6 genres: classical, electronic, jazz blues, metal punk, rock pop and world.

The files of this collection have been used for training each Min-Max neural network that makes up the decision tree. The second collection, called Development_Tracks, consists of 729 MP3, 128 kbps stereo files sampled at 44.1 kHz, 16 bit. The files of this collection are used for the validation of each Min-Max neural network that makes up the decision tree and for the test of the overall system.

A. Training

In order to train each node, every clip of the Training_Tracks collection has been represented by a unique mean feature vector referred to the 40 s at its center. For practical reasons, the tracks with a duration of less than 10 s have been deleted. In this way, the 'node' training set S_tr(n), shown in the first column of Table I, consists of 726 segments of 40 s, for a total duration of about 8 hours of music. An intermediate performance evaluation has been performed in order to check the reliability of each node of the classification tree. In order to test each node, every clip of the Development_Tracks collection has been represented by a unique mean feature vector referred to the 40 s at its center. Once again, the tracks with a duration of less than 10 s have been deleted. In this way, the 'node' test set S_ts(n), indicated in the second column of Table I, consists of 725 segments of 40 s, for a total duration of about 8 hours of music.

The first node, classical/other, has been trained using all four node training sets, where patterns belonging to the electronic, pop and world classes have been relabeled as other. Thus, the whole training set for the first node is made up of all 726 tracks. The node has been tested with all four node test sets, made up of all 725 tracks and labeled as for the training set. The second node, electronic/other, has been trained using the three node training sets related to the electronic, pop and world classes; patterns belonging to the pop and world classes have been relabeled as other. Consequently, the whole training set of the second node is made up of 726 − 317 = 409 files. The test set for this node is built in the same way, using the corresponding node test sets, and is made up of 725 − 317 = 408 files. Finally, for the third node, pop/world, only the pop and world node training sets are used. Thus, the whole training set consists of 409 − 115 = 294 files, while the test set, built in the same way, consists of 408 − 113 = 295 files (a code sketch of this node-by-node relabeling is given at the end of this subsection).

In order to improve the classification results and to test the effectiveness of the proposed set of features, the genetic algorithm and the feature selection technique described in Sect. V were applied to the available data. In particular, for each node of the decision tree a genetic algorithm with P = 100, M_gen = 100, MR = 0.3 and CR = 1 was run. The genetic codes that minimize the classification error, according to the criterion proposed in the previous Section, are used for the implementation of an optimized classification tree. We obtained the node performances shown in Table II; the actual number of selected features Nf is also reported.
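A sketch of how the three node-level training sets can be derived from the genre-labelled data (hypothetical names; labels follow the taxonomy of Fig. 3):

```python
def node_datasets(tracks):
    """tracks: list of (feature_vector, genre) pairs with genre in
    {'classical', 'electronic', 'pop', 'world'}.
    Returns the training data for the three binary nodes of the tree."""
    node1 = [(x, 'classical' if g == 'classical' else 'other')
             for x, g in tracks]
    node2 = [(x, 'electronic' if g == 'electronic' else 'other')
             for x, g in tracks if g != 'classical']
    node3 = [(x, g) for x, g in tracks if g in ('pop', 'world')]
    return node1, node2, node3
```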


B. Test of the Whole Classification Tree

The overall performance of the implemented decision tree was tested on the Development_Tracks collection. Tracks with a duration of less than 10 s or greater than 10 min have been deleted. In this way, the test set S_ts, illustrated in the third column of Table I, consists of 720 tracks for a total duration of about 47 hours of music. In particular, the classification is made in the following way (a sketch of this voting scheme is given after Table II):
• each track is split into consecutive 40 s segments;
• a unique mean feature vector (with 52 components) is calculated for each segment;
• each segment flows through the decision tree and is assigned a particular class;
• each track is classified on the basis of the most frequent class assigned to its segments;
• if two or more classes have the same frequency among the segments of the track, the track is classified as indeterminate.

TABLE II: Accuracy and Number of Selected Features Nf

Node    | Accuracy on S_tr(n) | Accuracy on S_ts(n) | Nf
Node 1  |       87.88%        |       88.97%        | 24
Node 2  |       98.04%        |       80.88%        | 26
Node 3  |       100%          |       83.73%        | 31
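A sketch of the track-level decision rule described above (hypothetical names; node1, node2 and node3 stand for the three trained binary Min-Max classifiers):

```python
from collections import Counter

def classify_segment(features, node1, node2, node3):
    """Route one 40 s segment through the three-node decision tree."""
    if node1.predict(features) == 'classical':
        return 'classical'
    if node2.predict(features) == 'electronic':
        return 'electronic'
    return node3.predict(features)          # 'pop' or 'world'

def classify_track(segment_features, node1, node2, node3):
    """Majority vote over the segments; ties are marked as indeterminate."""
    votes = Counter(classify_segment(f, node1, node2, node3)
                    for f in segment_features)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return 'indeterminate'
    return ranked[0][0]
```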

TABLE III: Confusion Matrix and Relative Accuracy

(a) Confusion Matrix (rows: output; columns: target)

Output \ Target | classical | electronic | pop | world
classical       |    292    |      2     |  11 |   46
electronic      |      0    |     48     |   7 |    2
pop             |      6    |     36     | 139 |   27
world           |     12    |     17     |   7 |   33

(b) Relative Accuracy

Genre       | Total patterns | Correctly classified patterns | Accuracy
classical   |      318       |              292              |  91.82%
electronic  |      112       |               48              |  42.86%
pop         |      173       |              139              |  80.35%
world       |      117       |               33              |  28.21%

The performance of the optimized classification tree, using S_ts, is reported in Table III. These results show a classification accuracy of 71.12% (512 correctly classified tracks) and an indetermination rate of 4.86% (35 tracks).

C. Performance Comparison

To make a comparison possible, our results have been compared with the best and worst ISMIR2004 contest results. The best result was achieved by [10], the worst by [11]. Both systems work with PCM audio data and use Mel Frequency Cepstral Coefficients with GMM and clustering techniques. These results show an 86.14% accuracy for the best system and a 67.9% accuracy for the worst one. Our system places in the middle with a 71.12% accuracy. In all cases, the best discriminated class is classical; this result can be justified by two different kinds of observations: the large number of patterns belonging to this class in the training set, and the fact that classical is an intrinsically well-defined musical genre from a perceptual point of view. On the other side, the worst classified genre is world, and a considerable overlap with classical is evident. This result can be physically justified by a real similarity, in terms of timbre and rhythm structure, of the tracks belonging to these two classes. As concerns pop and electronic, the results show a noticeable overlap between these two classes as well; this can be ascribed to a significant track similarity (this is especially true for the files belonging to electronic and to the rock and pop sub-genres of the pop class).

VII. CONCLUSION

In conclusion it can be said that, despite the fuzzy nature of genre boundaries, musical genre classification can be performed automatically and directly in the compressed domain, with results and performance comparable to PCM-based genre classification. The classification system proposed in this paper, along with the psychoacoustic-based preprocessing of MP3 files, achieves a good accuracy. Moreover, this is obtained automatically, reducing to a minimum the intervention of human experts, as commonly required by Web and other multimedia applications.

REFERENCES

[1] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, July 2002.
[2] G. Tzanetakis and P. Cook, "Sound analysis using MPEG compressed audio," in Proceedings of the 2000 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2000), vol. 2, Istanbul, Turkey, June 5-9, 2000, pp. 761-764.
[3] R. Jarina, N. Murphy, N. O'Connor, and S. Marlow, "An experiment in audio classification from compressed data," in International Workshop on Systems, Signals and Image Processing, Poznan, Poland, September 13-15, 2004.
[4] S. Kiranyaz, M. Aubazac, and M. Gabbouj, "Unsupervised segmentation and classification over MP3 and AAC audio bitstreams," in Proceedings of the European Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2003), London, UK, April 2003.
[5] A. Rizzi, N. M. Buccino, M. Panella, and A. Uncini, "Optimal short-time features for music/speech classification of compressed audio data," in Proceedings of the IEEE International Conference on Computational Intelligence for Modelling, Control and Automation (CIMCA 2006), Sydney, Australia, November 2006.
[6] D. Pye, "Content-based methods for managing electronic music," in Proceedings of the 2000 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2000), vol. 4, Istanbul, Turkey, June 5-9, 2000, pp. 2437-2440.
[7] Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, Part 3: Audio, ISO/IEC International Standard IS 11172-3, Information Technology Std.
[8] A. Rizzi, M. Panella, and F. M. F. Mascioli, "Adaptive resolution min-max classifiers," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 402-414, March 2002.
[9] Available at: http://ismir2004.ismir.net.
[10] E. Pampalk, "A Matlab toolbox to compute music similarity from audio," in Proceedings of the International Symposium on Music Information Retrieval (ISMIR 2004), Barcelona, Spain, October 10-15, 2004.
[11] D. Ellis and B. Whitman, "Automatic record reviews," in Proceedings of the International Symposium on Music Information Retrieval (ISMIR 2004), Barcelona, Spain, October 10-15, 2004, pp. 470-477.
