A Query-by-Humming Music Information Retrieval from Audio Signals based on Multiple F0 Candidates

Akinori Ito, Yu Kosugi, Shozo Makino
Graduate School of Engineering, Tohoku University, Sendai, Japan
Email: {aito,kosugi,makino}@makino.ecei.tohoku.ac.jp

Masashi Ito
Faculty of Engineering, Tohoku Institute of Technology, Sendai, Japan
Email: [email protected]

Abstract: In this paper, we propose a query-by-humming (QbH) system that retrieves musical pieces given as audio signals. Most conventional QbH systems assume that symbolic melody information is given a priori, which is not always the case. In our system, the database for retrieval is generated from a single-channel audio signal that contains many sounds. We generate the database by estimating the fundamental frequency (F0) of the audio signals frame by frame. To improve the retrieval accuracy, we exploit multiple F0 candidates to absorb the impact of F0 estimation errors. In the experiment, we obtained about 15 points of improvement by using multiple F0 candidates, compared with a QbH system that uses only one F0 candidate.

1. Introduction

With the development of network-based music delivery such as the iTunes Store, demand for searching musical pieces has been growing continuously [1]. Current music search mainly uses text information such as titles, composers or artists; in addition, content-based music search systems have been developed. These systems use the melodies or lyrics of the pieces as retrieval keys. A query-by-humming (QbH) system is one such system: it uses the user's singing voice (humming) as a key for retrieving pieces by their melody information. Many QbH-based music information retrieval systems have been developed [2, 3, 4, 5]. Most of these systems assume that information on the melody line of a musical piece is available as symbolic information such as MIDI data. However, this is not always true; many CDs have no corresponding MIDI data. To realize a QbH system for real data, we need a method that searches musical pieces without symbolic melody information. One way to achieve this is to use automatic music transcription [6] to generate symbolic melody information from the audio signal. This requires fundamental frequency (F0) estimation from a mixed signal, such as PreFEst [7] or HTC [8]. However, the performance of these F0 estimation methods is still poor for polyphonic signals that involve various instruments, including percussion. Since the objective of music transcription in this work is to realize a QbH system from audio signals, we do not necessarily need a perfect transcription. Considering that a melody is redundant, even an incomplete F0 estimation may be useful for QbH. Assuming that the generated F0 estimation results contain errors, we can construct a robust QbH method by absorbing the F0 estimation errors in the search process.

In this paper, we propose a query-by-humming method for the case where only audio signals such as CD sources are available as a database. To realize this system, we solved the following two problems.

Recovering from F0 estimation errors: We extract multiple F0 candidates from each analysis frame, considering that the top F0 estimate can be incorrect.

Melody matching for multiple F0 candidates: We develop a new matching algorithm that fully exploits a melody database with multiple F0 candidates.

In section II, we propose a database generation method using multiple F0 candidates and evaluate the quality of the database. In section III, a new melody matching algorithm is proposed for the database with multiple F0 candidates. The results of evaluation experiments are presented in section IV.

978-1-4244-5857-8/10/$26.00 ©2010 IEEE


ICALIP2010

2. Generation of Melody Database with Multiple F0 Candidates

A. Overview of the method

The database with multiple F0 candidates is generated as follows:
1. Frequency analysis of the audio signal
2. Estimation of the F0 probability density function
3. Selection of F0 candidates
4. Database generation

We use PreFEst-core [7] as the F0 estimation method for audio signals. This method calculates, frame by frame, a probability density function p_F0(F), which is the probability density of the F0 being at log frequency F. After calculating p_F0(F), several local peaks of p_F0(F) with high probability densities are selected as F0 candidates for that frame. The selected F0 candidates are gathered to generate the F0 database of a musical piece.

B. Estimation of F0 probability density

Next, we briefly describe the calculation of the F0 probability density p_F0(F) [7]. First, the audio signal of a musical piece is analyzed by the short-time Fourier transform, and power spectra are calculated frame by frame. The power spectrum of a frame contains a peak that corresponds to the F0 frequency; however, it also contains many peaks derived from harmonics. The PreFEst-core algorithm approximates the power spectrum of a frame by a mixture of Gaussian mixture distributions. One Gaussian mixture represents the harmonic structure of one instrument or vocal, and a mixture of multiple sounds is approximated by mixing the Gaussian mixture distributions. The weight of each Gaussian mixture is estimated automatically using the Expectation-Maximization (EM) algorithm. In this way, a power spectrum is decomposed into several harmonic structures, and the weight of the harmonic structure with fundamental frequency F is directly regarded as p_F0(F).

C. Selection of F0 candidates

Let p^t_F0(F) be p_F0(F) at frame t. When determining the F0 of frame t, the most straightforward way is to choose the largest peak of p^t_F0(F):

F0(t) = argmax_F p^t_F0(F).   (1)

However, since a frame may contain more than one sound, the maximum peak does not always belong to the desired signal, such as the vocal. Moreover, when there are sounds from instruments without harmonic structure, such as drums and cymbals, the F0 estimation by PreFEst-core does not work correctly, and the peak value at the desired F0 can become small. In addition, we cannot avoid F0 estimation errors that incorrectly capture double pitch or half pitch.

It is very difficult to avoid these estimation errors completely. However, when multiple F0 candidates are considered, the probability that the correct F0 is among the candidates is reasonably high. Therefore, we select multiple F0 candidates to avoid missing the correct F0 frequency. Let db(t) = (db_1(t), ..., db_n(t)) be local peaks of p^t_F0(F) such that

p^t_F0(db_1(t)) >= ... >= p^t_F0(db_n(t)).   (2)

Then the sequence of candidates db(0), db(1), ... becomes the database of a musical piece.

Figure 1. The estimated and correct F0 frequencies

Figure 1 shows an example of the generated database. The blue crosses denote the first candidates (db_1(t)), the green crosses the second ones, and the red marks the correct F0 frequencies. We can see that the first candidates capture double pitches around 3000 ms, while the second candidates are estimated at the correct frequencies.

D. Evaluation of F0 estimation

We conducted an experiment to assess the performance of the F0 estimation and the correctness of the multiple F0 candidates. Eight musical pieces (pop songs) were used, for which the vocal and accompaniment parts were prepared separately. The vocal parts of the pieces were generated using Vocaloid (Hatsune Miku) [9]. The correct F0 frequencies were estimated from the vocal tracks using the Praat software [10] and then checked manually. The accompaniment tracks were generated by the software synthesizer provided with SONAR HOME STUDIO XL. Then the vocal tracks and the accompaniment tracks were mixed into single-channel signals. The other conditions are shown in Table I.

TABLE I. Experimental conditions.
Sampling frequency: 44.1 kHz
Vocal frequency band: 3000-12000 cent
F0 frequency band: 2600-8400 cent
F0 frequency quantization step: 20 cent
Frame shift: 40 ms
Number of total frames: 3657

In the evaluation, an estimated F0 frequency was regarded as correct when the difference between the estimated and correct frequencies was under 50 cent. Figure 2 shows the experimental result. We can confirm that more than 99% of the correct F0 frequencies were included in the candidates when we chose four candidates per frame. When five candidates were chosen, the accuracy was 99.95%.

Figure 2. Accuracy of the estimated F0 frequencies with respect to the number of F0 candidates

E. Likelihood-based candidate selection

In the previous experiment, we chose a fixed number of candidates from each frame. However, the appropriate number of candidates can differ from frame to frame. Figure 3 shows such examples. In the frame shown in Fig. 3 (a), there are three peaks with similar likelihood, while in the case of Fig. 3 (b), only one peak has a large likelihood and the other peaks are much smaller. In the former case, choosing three peaks seems reasonable; in the latter case, however, three peaks are too many to describe the frame. Therefore, changing the number of selected peaks according to the likelihood of the peaks seems more reasonable than using a fixed number of candidates for all frames.

Figure 3. Examples of p_F0(F) for different frames

To select a different number of candidates frame by frame, we used a likelihood threshold. We prepare a threshold r (0 < r <= 1), and those peaks with likelihood larger than r * p_F0(db_1(t)) are chosen as the candidates.

3. Melody Matching Method

A. Overview

After creating the database of musical pieces, we match a hummed query against the pieces in the database. In the matching, we need to consider the following three differences between the sung query and the database: the tempo, the key and the singing part.

B. Tempo normalization

In general, the tempo of the query differs from that of the original piece. Most QbH systems absorb the tempo difference between the query and the database using a time-warping matching algorithm such as continuous DP matching [4]. In this work, we wanted to avoid a DP-based matching algorithm because of its computational load; instead, we assumed that the tempos of the pieces in the database and that of the query are known, and normalized the tempo of all pieces in the database and of all queries. To make the hummed query tempo-synchronous, we presented the users with click sounds of a constant tempo while they hummed, and instructed them to sing in synchrony with the clicks.

C. Matching for keys and singing parts

Let q(t) be the logarithm of the F0 frequency of the input query. Then q(t) is shifted in the time and frequency domains to find the best match with a piece in the database.
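As an illustration of the database side described in Section II, the per-frame candidate selection (local peaks of p_F0 ordered by likelihood, pruned with the threshold r of Section II-E) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the density is assumed to be given as a list sampled on the quantized log-frequency grid of Table I, and all function and parameter names are illustrative.

```python
# Sketch of per-frame F0 candidate selection (Sections II-C and II-E).
# p_f0 is a frame's F0 probability density sampled on a quantized
# log-frequency grid; indices stand in for quantized frequencies.

def local_peaks(p_f0):
    """Indices of local maxima of the density."""
    return [i for i in range(1, len(p_f0) - 1)
            if p_f0[i - 1] < p_f0[i] >= p_f0[i + 1]]

def select_candidates(p_f0, r=0.5, n_max=5):
    """Keep peaks whose likelihood is at least r times that of the
    largest peak, ordered by decreasing likelihood (db_1, db_2, ...),
    with at most n_max candidates per frame."""
    peaks = sorted(local_peaks(p_f0), key=lambda i: p_f0[i], reverse=True)
    if not peaks:
        return []
    threshold = r * p_f0[peaks[0]]
    return [i for i in peaks if p_f0[i] >= threshold][:n_max]
```

With r close to 1, only peaks nearly as strong as the best one survive (the single-dominant-peak case of Fig. 3 (b)); with small r, the selection approaches a fixed top-n_max choice.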


The matching score of Q = (q(0), ..., q(t_in - 1)) with respect to a piece in the database P = (db(0), ..., db(t_db - 1)) is

S(P, Q) = max_{f,tau} sum_{t=0}^{t_in - 1} s(q(t) + f, db(t + tau)),   (3)

where s(q, db) is a scoring function that evaluates how well q is represented in the database frame. A query in a different key can be matched by shifting the query in the frequency domain, and the singing part can be found by shifting the query in the time domain.

D. Scoring function

The scoring function determines how well the query matches the current frame of the database. The scoring is based on a 50-cent (half of a halftone) threshold:

delta_h(x, y) = 1 if |x - y| < 50, 0 otherwise.   (4)

If two frequencies are within 50 cent of each other, they are within a halftone of each other, and we regard them as the same pitch. We apply this decision function to all candidates in a frame, weighting each candidate by its rank, as follows:

s(x, db(t)) = max_{1<=i<=n} (n - i + 1) delta_h(x, db_i(t)).   (5)

This scoring function gives a higher score when a higher-ranked candidate matches the query. A similar property could be obtained by simply using the probability density as the score:

s'(x, db(t)) = max_{1<=i<=n} p_F0(db_i(t)) delta_h(x, db_i(t)).   (6)

However, in a preliminary experiment we found that this scoring function does not work well, because its dynamic range differs from frame to frame. As p_F0 is a probability density function, its integral over all frequencies is one, i.e.,

int_{F_lo}^{F_hi} p_F0(F) dF = 1.   (7)

Therefore, when many sounds exist in a frame, the values of p_F0 in that frame tend to be small; conversely, when only one sound exists in a frame, p_F0 becomes nearly one. The scoring function of Eq. (5) is a simple way of normalizing this difference in dynamic range. Another advantage of Eq. (5) is that we do not need to store probability values in the database: all we need are the frequencies (integer values, because of the quantization) and the ranks of the probability peaks.

4. Evaluation Experiments

We conducted experiments to evaluate the retrieval performance of the proposed method. The experimental conditions are shown in Table II. The other conditions, including the analysis conditions of the signals, are the same as in Table I.

TABLE II. Experimental conditions.
Musical pieces in the database: 108 songs (100 pop songs from the RWC music DB, 8 from commercial CDs)
Queries: 70 queries by 10 subjects (8 males, 2 females)
Length of queries: about four measures

The retrieval accuracy using a fixed number of candidates in the database is shown in Figure 4. In this figure, the x-axis is the number of candidates and the y-axis is the ratio of correct pieces included in the retrieved list when 1, 5 and 10 pieces were listed as the result of the retrieval.

Figure 4. Accuracy of music retrieval when a fixed number of candidates was used

From this result, we can confirm that the use of multiple candidates did improve the retrieval accuracy. When only one musical piece was retrieved per query, the databases with 2 or 5 candidates gave the best result; using multiple candidates, the top-1 accuracy improved from 55.7% to 64.3%. On the other hand, when 5 or 10 pieces were retrieved, the 3-candidate database was the best: with it, nearly 90% of the correct pieces were included in the top-10 list of retrieved pieces.

Next, we conducted an experiment using the likelihood-based candidate selection. Figure 5 shows the experimental result. In this figure, the line "top-1 (multiple)" shows the best top-1 result from Figure 4. We can see that the top-1 accuracy was further improved by the likelihood-based candidate selection (from 64.3% to 70.0%). The top-5 result was slightly improved, but the top-10 result was not.

Figure 5. Accuracy of music retrieval when likelihood-based candidate selection was applied
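The matching of Eqs. (3)-(5) can be sketched as follows. This is a minimal illustration under the paper's assumptions (frequencies in cents, tempo-normalized frames, per-frame candidate lists ordered by decreasing likelihood); the function names and the search ranges for the key shift f and the offset tau are illustrative, not taken from the paper.

```python
# Sketch of the matching score of Eqs. (3)-(5): rank-weighted candidate
# hits, maximized over a key shift f (in cents) and a time offset tau.

def delta_h(x, y):
    """Eq. (4): 1 if the two frequencies are within 50 cents."""
    return 1 if abs(x - y) < 50 else 0

def frame_score(x, cands):
    """Eq. (5): (n - i + 1) * delta_h for the best-matching candidate,
    with cands ordered by decreasing likelihood (0-based index here)."""
    n = len(cands)
    if n == 0:
        return 0
    return max((n - i) * delta_h(x, c) for i, c in enumerate(cands))

def match_score(query, db, shifts=range(-1200, 1300, 100)):
    """Eq. (3): best total score over key shifts f and offsets tau.
    query is a list of F0 values in cents; db is a list of per-frame
    candidate lists."""
    best = 0
    for tau in range(len(db) - len(query) + 1):
        for f in shifts:
            s = sum(frame_score(q + f, db[tau + t])
                    for t, q in enumerate(query))
            best = max(best, s)
    return best
```

Restricting the shifts to multiples of 100 cents assumes the singer is off by whole semitones only; a finer grid would also absorb continuous detuning at a higher computational cost.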

5. Summary

In this paper, we proposed a query-by-humming method that retrieves musical pieces from audio signals. The basic idea is to select multiple F0 candidates from the audio signals to build a database, and then to match an input query against this database. We used PreFEst-core to calculate the F0 probability density, and then selected multiple F0 candidates based on the likelihood values. As the matching method, we used a simple method that absorbs key differences and finds the part of a piece corresponding to the query. In the evaluation experiment, we achieved about 15 points of improvement by using multiple F0 candidates.

The QbH system presented here does not consider the computational complexity of retrieval, so the retrieval is slow (several minutes per query). To realize quick retrieval over a huge amount of audio signals, we need to develop a faster method while keeping the retrieval accuracy.

References
[1] J. S. Downie, "Music Information Retrieval," Annual Review of Information Science and Technology, vol. 37, pp. 295-340, 2003.
[2] S. Pauws, "CubyHum: A Fully Operational Query by Humming System," Proc. ISMIR, pp. 187-196, 2002.
[3] J.-S. R. Jang, H. Lee and J. Chen, "Super MBox: An Efficient/Effective Content-based Music Retrieval System," The 9th ACM Multimedia Conference (demo paper), pp. 636-637, 2001.
[4] S.-P. Heo, M. Suzuki, A. Ito and S. Makino, "An Effective Music Information Retrieval Method Using Three-Dimensional Continuous DP," IEEE Trans. Multimedia, vol. 8, no. 3, pp. 633-639, 2006.
[5] M. Suzuki, T. Ichikawa, A. Ito and S. Makino, "Novel Tonal Feature and Statistical User Modeling for Query-by-Humming," IPSJ Journal, vol. 50, no. 3, pp. 1100-1110, 2009.
[6] M. D. Plumbley, S. A. Abdallah, J. P. Bello, M. E. Davies, G. Monti and M. B. Sandler, "Automatic Music Transcription and Audio Source Separation," Cybernetics and Systems, vol. 33, no. 6, pp. 603-627, 2002.
[7] M. Goto, "A Real-time Music Scene Description System: Predominant-F0 Estimation for Detecting Melody and Bass Lines in Real-world Audio Signals," Speech Communication, vol. 43, no. 4, pp. 311-329, 2004.
[8] H. Kameoka, T. Nishimoto and S. Sagayama, "A Multipitch Analyzer Based on Harmonic Temporal Structured Clustering," IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 3, pp. 982-994, 2007.
[9] H. Kenmochi and H. Ohshita, "VOCALOID: Commercial Singing Synthesizer Based on Sample Concatenation," Proc. Interspeech, Antwerp, 2007.
[10] P. Boersma and D. Weenink, "Praat: Doing Phonetics by Computer (version 5.0.30)," 2008.