Efficient Music Representation with Content Adaptive Dictionaries

Namgook Cho, Yu Shiu and C.-C. Jay Kuo
Ming Hsieh Department of Electrical Engineering and Signal and Image Processing Institute
University of Southern California, Los Angeles, CA 90089-2564
E-mail: {namgookc, yshiu}@usc.edu, [email protected]

Abstract—An efficient music representation based on the matching pursuit (MP) technique with content-adaptive dictionaries (CADs) is investigated in this work. Gabor atoms are commonly adopted in MP-based signal representation due to their excellent time-frequency localization property. However, the Gabor dictionary may not yield a concise representation for music signals, whose special characteristics are specified by the pitches and durations of music notes. In this work, we exploit these characteristics to create a set of content-adaptive elementary functions, called atoms, that efficiently capture the inherent structure of music signals. As a result, we are able to project musical signals onto a subspace spanned by atoms from the CAD for a concise and efficient representation. The proposed CAD representation technique is applied to music enhancement in a noisy background to demonstrate its power.

I. INTRODUCTION

Finding an efficient representation for audio signals is an emerging research topic in audio signal processing due to its rich applications in audio structure analysis, automatic music transcription, and audio source separation. Humans are able to listen to and understand individual sounds in a complex acoustic environment. The human auditory system might have evolved a highly efficient coding mechanism that organizes a meaningful structure for perception and compactly processes the information conveyed to the brain. In this work, we investigate a systematic approach to efficient music representation.

Methods to represent real-valued audio signals can be roughly classified into two categories: orthogonal basis expansion and overcomplete representation. Orthogonal bases such as Fourier and wavelet bases can provide a complete representation of signals of finite energy, but they may not provide a concise representation for some signals. An alternative approach is overcomplete representation, where the number of representative units (called atoms) can be greater than the dimension of the signal space. For example, the matching pursuit (MP) algorithm approximates a signal in a successive refinement framework using parameterized time-frequency atoms from redundant dictionaries [1]. More recently, Lewicki and Sejnowski [2] presented an algorithm to learn overcomplete representations of sensory data using a probabilistic framework. Although the overcomplete approach allows for a more concise representation of audio signals, it has several drawbacks. One is its high computational complexity.

Second, given an acoustic signal from a complex mixture of sounds, e.g., music with background speech or environmental sounds, an overcomplete representation may not capture the inherent characteristics of a specific audio type effectively. That is, an overcomplete representation (such as the Gabor dictionary) is not designed for a specific audio type but for the approximation of a large class of signals. Since it is content-independent, its efficiency is sacrificed for any particular content type. In this work, we propose a systematic way to find an efficient representation of music signals using overcomplete representations. For a given music instrument type, we provide a set of basic representative units to construct the content-adaptive dictionary (CAD). With the CAD, we can represent musical signals of the same nature efficiently. To demonstrate the power of the proposed CAD representation technique, we apply it to music enhancement in a noisy background.

The rest of this paper is organized as follows. A generic framework for signal representation using a set of overcomplete functions is presented in Sec. II. The generation of the CAD for musical signals is described in Sec. III. Experimental results are shown in Sec. IV. Finally, concluding remarks and future research directions are given in Sec. V.

II. RESEARCH MOTIVATION AND PROBLEM FORMULATION

To analyze a signal x(t), we can decompose it into a weighted sum of basis functions ϕk(t) as

  x(t) = Σ_k a_k ϕk(t),   (1)

where a_k is the coefficient of ϕk(t). We say the signal model is efficient (or sparse) if most of the energy of x(t) is concentrated on a small set of basis functions. This is often achieved with overcomplete representative functions [2], formed by a set of non-orthogonal, linearly dependent vectors that span the signal space while including more functions than necessary. An efficient representation of a signal using a set of overcomplete functions can be explained by a simple 2-D example, as given in Fig. 1. When using the orthonormal basis ϕ1 and ϕ2, signal x can be represented as

  x = ⟨x, ϕ1⟩ ϕ1 + ⟨x, ϕ2⟩ ϕ2 = a1 ϕ1 + a2 ϕ2,

where the angle bracket denotes the Hermitian inner product. Given overcomplete functions ϕ1, ϕ2, and ϕ3 as shown

in Fig. 1, we can have a more concise representation as

  x ≈ ⟨x, ϕ3⟩ ϕ3 = a3 ϕ3.

We can project the residual signal, r = x − a3 ϕ3, onto the other two functions ϕ1 and ϕ2; the resulting coefficients, |⟨r, ϕ1⟩| and |⟨r, ϕ2⟩|, are much smaller than the coefficient |a3|.

Fig. 1. An efficient signal representation using an overcomplete set of unit-norm functions, where ϕ1 ⊥ ϕ2.

The above example shows the power of the overcomplete set. One of the key questions under this framework is how to find ϕ3. In the next section, we will focus on musical signals and seek their concise representation.

III. MUSIC REPRESENTATION WITH CONTENT-ADAPTIVE DICTIONARIES

The inherent structure of music signals is analyzed and learned from their basic components, i.e., individual notes in pitched sounds. The learning of musical notes can be accomplished by a matching pursuit (MP) mechanism as depicted in Fig. 2. It is detailed below.

Fig. 2. Finding music content-adaptive atoms and their corresponding dictionary from musical notes: a musical note signal sk(t) from the music database is decomposed with matching pursuit over the Gabor dictionary Dg, and the chosen time-domain atoms are re-organized with the signal model into the music content-adaptive dictionary Dm.

A. Music Decomposition by Matching Pursuit with Gabor Atoms

The Gabor dictionary consists of a family of atoms obtained by dilation, translation, and modulation of a mother window g(t) [1]:

  g_{s,u,ξ}(t) = (1/√s) g((t − u)/s) e^{jξ(t−u)},   (2)

which is concentrated in the neighborhood of time u and frequency ξ, with spreads proportional to s in time and 1/s in frequency. The Gabor dictionary can be expressed by Dg = {g_γ}_{γ∈Γg}, where Γg denotes an index set.

We use the kth musical note sk(t) of a specific music instrument as the training signal. With the Gabor dictionary, the musical note signal is analyzed. Mathematically, we have the following iteration:

  sk(t) = Σ_{m=1}^{M} ⟨R_{m−1}, g_{γm}⟩ g_{γm}(t) + R_M(t),   (3)

where R_0(t) = sk(t) is used as the initialization and g_{γm} is chosen to maximize the correlation with the residual R_{m−1}(t) at the mth step, i.e.,

  g_{γm} = arg max_{g_γ ∈ Dg} |⟨R_{m−1}, g_γ⟩|.   (4)

Atoms in the Gabor dictionary have the time-frequency localization property. Even though they are overcomplete, they may not provide an efficient representation for musical signals. In the next subsection, we consider content-adaptive dictionaries that are tailored to musical signals.

B. Content-Adaptive Dictionaries

After projecting the kth musical note of a specific music instrument, we obtain a set of Gabor atoms chosen from the Gabor dictionary, which can be represented by

  D_k = {g_{γm}, γm = (s_m, u_m, ξ_m) ∈ Γk},  Γk ⊂ Γg,   (5)

for some index set Γk and Gabor atoms at scale s_m > 0, time position u_m, and frequency ξ_m, m = 1, ..., M. Next, we define a signal model for musical signals as

  h_{γk}(t) = Σ_{γm ∈ Γk} c_m g_{γm}(t),   (6)

where c_m is a normalization constant such that ‖h_{γk}(t)‖₂ = 1. In other words, the set of chosen Gabor atoms is re-grouped using the signal model to create a subspace that effectively represents the strong harmonic structure of the musical note. The subspace S_{hγk} obtained by the grouping procedure is given by

  S_{hγk} = span{g_{γm}, γm ∈ Γk}.   (7)

Note that each subspace corresponds to one musical note. Consequently, the new atoms h_{γk}, 1 ≤ k ≤ K, form a music content-adaptive dictionary, denoted by Dm, where K is the number of notes in the music database. Fig. 3 presents two atoms in Dm and their time-frequency representations. We can observe the strong harmonic structure in their time-frequency representations, which often exists in musical signals.

Fig. 3. Music content-adaptive atoms that correspond to notes C4 (f0 = 261 Hz) and C6 (f0 = 1046 Hz) of trumpet, where f0 is the fundamental frequency of the notes.
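As a concrete illustration of Secs. III-A and III-B, the decomposition (3)-(4) and the re-grouping (6) can be sketched with a toy dictionary. The Gaussian window, frequency grid, and synthetic harmonic "note" below are illustrative assumptions, not the paper's actual training settings:

```python
import numpy as np

def gabor_atom(n, s, u, xi):
    """Real Gabor atom: Gaussian window of scale s at position u, frequency xi (rad/sample)."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - u) / s) ** 2) * np.cos(xi * (t - u))
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, n_iter):
    """Greedy MP, Eqs. (3)-(4): repeatedly pick the atom best correlated with the residual."""
    residual = x.copy()
    chosen = []
    for _ in range(n_iter):
        corrs = atoms @ residual                 # inner products <R_{m-1}, g_gamma>
        m = int(np.argmax(np.abs(corrs)))
        chosen.append((m, corrs[m]))
        residual = residual - corrs[m] * atoms[m]
    return chosen, residual

# Toy dictionary: one scale, a grid of 200 frequencies (assumption for illustration).
N = 1024
freqs = np.linspace(0.02, 1.0, 200)
D = np.stack([gabor_atom(N, s=256, u=N // 2, xi=f * np.pi) for f in freqs])

# A synthetic harmonic "note": fundamental plus two harmonics under a Gaussian envelope.
t = np.arange(N)
f0 = 0.1
note = sum(np.exp(-np.pi * ((t - N // 2) / 256) ** 2) * np.cos(k * f0 * np.pi * (t - N // 2))
           for k in (1, 2, 3))
note /= np.linalg.norm(note)

chosen, residual = matching_pursuit(note, D, n_iter=10)

# Eq. (6): re-group the chosen Gabor atoms into one content-adaptive atom h.
h = np.zeros(N)
for m, c in chosen:
    h += c * D[m]
h /= np.linalg.norm(h)                           # normalization so that ||h||_2 = 1

print(np.linalg.norm(residual) < np.linalg.norm(note))  # → True
```

With a real note from a training database, the chosen atoms would land on the note's harmonics, so h inherits the harmonic comb structure visible in Fig. 3.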

C. Computational Complexity

At each iteration, the proposed algorithm needs to compute all scalar products between the M atoms of Dm and a residual R_m(t). Thus, for an N-dimensional signal, the complexity of finding one atom is O(MN). A good MP implementation uses a fast Fourier transform (FFT) to compute all scalar products with shifted atoms [1], [4]. Our algorithm employs the FFT to find the best position of the atom, i.e., the one that produces a maximal scalar product with the signal. Thus, the complexity of the proposed algorithm to find the best atom is O(MN log N), which is the same as the complexity of MP. However, there is a great difference between them in the dictionary size, i.e., the value of M. For instance, the Gabor dictionary used in MP usually has about 10³ to 10⁵ atoms to handle audio signals from a clarinet, whereas its CAD contains only about M = 40 atoms. The clarinet sounds can be represented by these 40 atoms effectively.

D. Application to Musical Signal Enhancement

Suppose that an observed musical signal x(t) consists of a pure music signal m(t) and a background sound b(t), i.e., x(t) = m(t) + b(t). The observed signal is projected onto the subspace spanned by atoms from dictionary Dm. That is, after initialization by setting R_0(t) = x(t) in the MP algorithm, we compute the following at the mth step:

  P_{hγm} R_{m−1}(t) = ⟨R_{m−1}, h_{γm}⟩ h_{γm}(t),   (8)

where P_{hγm} represents the orthogonal projection onto subspace S_{hγm}, and h_{γm} is chosen as

  h_{γm} = arg_{h_γ ∈ Dm} (|⟨R_{m−1}, h_γ⟩|² > η_e),   (9)

where η_e is a pre-defined threshold based on an energy criterion. The new residual is computed as R_m(t) = R_{m−1}(t) − ⟨R_{m−1}, h_{γm}⟩ h_{γm}(t). Musical notes can be easily identified by (9), since they yield large projection values. Then, the enhanced signal can be reconstructed as a weighted sum of content-adaptive atoms chosen from Dm by

  m̂(t) ≈ Σ_{j=1}^{J} Σ_{m=1}^{M} ⟨R^j_{m−1}, h_{γm}⟩ h_{γm}(t),   (10)

where J is the number of non-overlapping window frames and ⟨R^j_{m−1}, h_{γm}⟩ is the gain of h_{γm} at the mth step in frame j. The residual background signal can be obtained simply by b̂(t) = x(t) − m̂(t).

IV. EXPERIMENTAL RESULTS

Experiments were conducted to evaluate the performance of the proposed concise music representation. We considered a Gabor dictionary built on real Gabor atoms of length 92.8 ms, with dyadic scales and 800 different frequencies uniformly spread over the interval of normalized frequencies [0, 0.5]. Thus, the size of the Gabor dictionary is |Dg| = 8000, without taking all possible time shifts into account. However, the time shift is considered when we compute the correlation between an atom and the current residual signal. Next, we created music CADs as discussed in Sec. III. Instrument sounds from the RWC Musical Instrument Sound Database were used to build the music CADs. Test signals were chosen from several excerpts of real audio signals, e.g., recordings of musical instrument sounds, speech signals, and environmental sounds. All sounds in the experiments were downsampled to 11,025 Hz. To measure the quality of the reconstructed audio with respect to the original, we used the source-to-distortion ratio (SDR) as proposed in [5].

A. Learning Musical Structures

We compare the performance of the proposed music representation with that of the data-driven approach proposed in [6]. The latter is based on a sparse coding technique using a representation in the short-term Fourier transform magnitude (STFTM) domain. Fig. 4(a) shows the resulting dictionary of 40 atoms, where the x-axis is the atom index and the y-axis shows the frequency components of each atom. We see that the atoms contain incorrect note spectra. For comparison, the same 40 clarinet notes were used to generate a music CAD. Then, all atoms in the CAD were transformed to the STFTM domain, i.e., Φk = |FFT{ϕk}|. We see an excellent harmonic structure of musical instrument sounds, as shown in Fig. 4(b).
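The STFTM mapping Φk = |FFT{ϕk}| used for the comparison in Sec. IV-A can be sketched as follows. The frame length and the decaying harmonic "atom" are illustrative assumptions; a real CAD atom would come from the trained dictionary:

```python
import numpy as np

def stftm_signature(atom, n_fft=512):
    """STFT-magnitude signature of one atom: average |FFT| over non-overlapping frames."""
    n_frames = len(atom) // n_fft
    frames = atom[:n_frames * n_fft].reshape(n_frames, n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

# Toy "CAD atom": a decaying tone with three harmonics placed on exact FFT bins.
n_fft = 512
t = np.arange(4 * n_fft)
f0 = 25 / n_fft                                  # fundamental sits exactly on bin 25
atom = np.exp(-t / 1024.0) * sum(np.sin(2 * np.pi * k * f0 * t) for k in (1, 2, 3))
atom /= np.linalg.norm(atom)

sig = stftm_signature(atom, n_fft)
peaks = sorted(np.argsort(sig)[-3:].tolist())    # three strongest frequency bins
print(peaks)
```

Stacking such signatures column by column yields a display like Fig. 4(b), where the harmonic comb of each note appears as evenly spaced peaks along the frequency axis.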

Fig. 4. Spectral decompositions of (a) atoms in the dictionary obtained with sparse coding, and (b) atoms in the CAD for clarinet, where each column represents an atom.

It is worthwhile to point out that a similar approach has been proposed in [3] for creating the so-called harmonic dictionary. However, their signal model and goals are different from ours. To approximate musical signals, they estimated the best harmonic subspace by choosing sub-dictionaries iteratively, where subspaces whose correlation is below some threshold are removed from the current sub-dictionaries. Their harmonic dictionary is an extended Gabor dictionary constructed by adding synthesized harmonic atoms to Gabor atoms.
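A minimal sketch of the enhancement loop of Sec. III-D, Eqs. (8)-(10), restricted to a single frame (J = 1). The two sinusoidal "CAD atoms", the mixture, and the threshold η_e below are illustrative assumptions:

```python
import numpy as np

def enhance(x, cad, eta, max_iter=50):
    """Keep projections onto CAD atoms whose energy exceeds eta; Eqs. (8)-(10), one frame."""
    residual = x.copy()
    music = np.zeros_like(x)
    for _ in range(max_iter):
        corrs = cad @ residual                   # <R_{m-1}, h_gamma> for all atoms
        m = int(np.argmax(np.abs(corrs)))
        if np.abs(corrs[m]) ** 2 <= eta:         # energy criterion of Eq. (9)
            break
        music += corrs[m] * cad[m]               # accumulate the enhanced signal, Eq. (10)
        residual -= corrs[m] * cad[m]            # remove the projection of Eq. (8)
    background = x - music                       # b_hat(t) = x(t) - m_hat(t)
    return music, background

# Toy setup: two orthonormal "CAD atoms" (unit-norm sinusoids) plus white noise.
rng = np.random.default_rng(0)
N = 1024
t = np.arange(N)
a1 = np.sin(2 * np.pi * 8 * t / N);  a1 /= np.linalg.norm(a1)
a2 = np.sin(2 * np.pi * 16 * t / N); a2 /= np.linalg.norm(a2)
cad = np.stack([a1, a2])

clean = 3.0 * a1 + 2.0 * a2
noise = 0.1 * rng.standard_normal(N)
m_hat, b_hat = enhance(clean + noise, cad, eta=0.5)

# The estimate should be much closer to the clean music than the noisy mixture is.
err_mix = np.linalg.norm(clean + noise - clean)
err_hat = np.linalg.norm(m_hat - clean)
print(err_hat < err_mix)  # → True
```

The threshold stops the pursuit once only noise-level projections remain, which is what separates m̂(t) from the background estimate b̂(t).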

B. Computational Complexity

To understand the relationship between the complexity and the dictionary size, we compare the average time to find the best atom in MP [1], TBP [4], and the proposed CAD representation technique. Two factors affect the dictionary size [4]: the number of frequency bins F_n and the atom size, which determines the number of scales N_s. That is, |D| = N_s × F_n. Despite the different scales of Gabor atoms, the CAD always has the same number of atoms (e.g., |Dm| = 40) due to the fixed number of musical notes in the music database (e.g., a total of 40 clarinet notes was adopted as training sounds). Fig. 5 shows that the computational time depends mostly on the dictionary size, and that CAD performs better than MP. The complexity of TBP depends highly on the number of children of the root node, denoted by |c_0| [4]. Note that the performance of TBP is comparable with that of the proposed algorithm. The experiment was performed on a Pentium Centrino at 1.5 GHz with 1.25 GB of memory.

Fig. 5. Comparison of the mean time to find the best atom with respect to the dictionary size, where (a) a different number of scales (i.e., atom sizes) with fixed F_n = 800, and (b) a different number of frequency bins with fixed N_s = 9 are compared.
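The FFT-based search mentioned in Sec. III-C, which evaluates the scalar products of one atom at all time shifts in O(N log N) instead of O(N²), can be sketched as follows; the random atom and its hidden position are illustrative:

```python
import numpy as np

def all_shift_correlations(x, atom):
    """Inner products <x, atom shifted by u> for every shift u, via FFT cross-correlation."""
    n = len(x)
    X = np.fft.rfft(x, 2 * n)                    # zero-pad to avoid circular wrap-around
    A = np.fft.rfft(atom, 2 * n)
    corr = np.fft.irfft(X * np.conj(A), 2 * n)
    return corr[:n]                              # corr[u] = sum_t x[t + u] * atom[t]

rng = np.random.default_rng(1)
N = 2048
atom = rng.standard_normal(64)
atom /= np.linalg.norm(atom)

# Hide the atom at a known offset inside low-level noise.
x = 0.05 * rng.standard_normal(N)
x[700:764] += 2.0 * atom

corr = all_shift_correlations(x, atom)
print(int(np.argmax(np.abs(corr))))              # best shift, here the hidden offset 700
```

Repeating this for each of the M dictionary atoms gives the O(MN log N) best-atom search quoted in Sec. III-C; the saving of CAD comes purely from its small M.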

C. Approximation Capability

To evaluate the performance of music sound approximation, a real clarinet audio signal that consists of seven different notes was approximated using the CAD, and the SDR value as a function of the number of cumulative atoms is shown in Fig. 6. After seven atoms from the CAD are used, the SDR of the reconstructed signal saturates around 19 dB. In other words, seven atoms are enough to capture most of the energy of the clarinet signal.

Fig. 6. The approximation capability of the proposed CAD method: the original signal (top), the reconstructed signal using seven atoms (middle), and the SDR value as a function of the number of approximating atoms (bottom).

D. Music Enhancement

To evaluate the performance of the proposed CAD representation for the music enhancement application, we tested a wider range of audio sounds, including musical instrument sounds, speech, and environmental sounds. The results are shown in Table I.

TABLE I. RESULTS IN MUSIC ENHANCEMENT

mixture                              | music m̂(t), SDR (dB) | background b̂(t), SDR (dB)
clarinet + white noise (SNR = 5 dB)  | 14.03                 | 7.92
clarinet + street                    | 12.96                 | 12.18
clarinet + train (inside)            | 12.33                 | 10.54
clarinet + speech                    | 10.77                 | 6.79
piano + speech                       | 6.90                  | 2.62
violin + speech                      | 5.18                  | —

Generally speaking, the CAD-based algorithm successfully extracts the music content, resulting in good enhancement performance except for polyphonic notes (i.e., piano) and vibrato (i.e., violin). The poorer performance could be explained by the fact that piano sounds are more complicated than those of solo instruments, and the violin is usually played with vibrato, which has a frequency- and amplitude-modulation effect.

E. Other Applications

For audio source separation, we compared the performance of CAD with several algorithms based on spectral decomposition of audio signals and clustering of their components in [7]. It was shown that the CAD representation is more efficient in source separation. For details, we refer to [7].

V. CONCLUSION AND FUTURE WORK

An efficient representation of musical signals was studied in this work. The proposed representation was shown to be effective in music enhancement. Further research is needed to find CADs associated with various special music effects such as vibrato, irregular pitch sweeps, and inharmonicity. One possible direction could be the use of chirp atoms [8]. Due to the chirp rate, the instantaneous frequency of a chirp atom varies linearly with time, which can represent the pitch sweeps of vibrato sounds efficiently.

REFERENCES

[1] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, pp. 3397-3415, Dec. 1993.
[2] M. S. Lewicki and T. J. Sejnowski, "Learning overcomplete representations," Neural Comput., vol. 12, pp. 337-365, 2000.
[3] R. Gribonval and E. Bacry, "Harmonic decomposition of audio signals with matching pursuit," IEEE Trans. Signal Process., vol. 51, pp. 101-111, Jan. 2003.
[4] P. Jost, P. Vandergheynst, and P. Frossard, "Tree-based pursuit: Algorithm and properties," IEEE Trans. Signal Process., vol. 54, pp. 4685-4697, Dec. 2006.
[5] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, pp. 1462-1469, Jul. 2006.
[6] S. A. Abdallah and M. D. Plumbley, "Unsupervised analysis of polyphonic music using sparse coding," IEEE Trans. Neural Netw., vol. 17, pp. 179-196, 2006.
[7] N. Cho, Y. Shiu, and C.-C. J. Kuo, "Audio source separation with matching pursuit and content-adaptive dictionaries (MP-CAD)," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, 2007.
[8] R. Gribonval, "Fast matching pursuit with a multiscale dictionary of Gaussian chirps," IEEE Trans. Signal Process., vol. 49, pp. 994-1001, May 2001.