Online Noisy Single-Channel Source Separation Using Adaptive Spectrum Amplitude Estimator and Masking
N. Tengtrairat, W.L. Woo*, Senior Member, IEEE, S.S. Dlay and B. Gao, Member, IEEE

Abstract: A novel single-channel source separation method is presented to recover the original signals given only a single observed mixture in a noisy environment. The proposed separation method is an online adaptive process and is independent of parameter initialization. In this paper, a noisy pseudo-stereo mixing model is developed by formulating an artificial mixture from the observed mixture, where the signals are modeled by the autoregressive process. The proposed demixing process consists of two steps. Firstly, the noisy mixing model is enhanced by selecting the time-frequency (TF) units of signal presence and computing the mixture spectral amplitude. Secondly, an adaptive estimate of the parameters associated with each source is computed frame-by-frame, which is then used to construct a TF mask for the separation process. To assess the performance of the proposed method, experiments on noisy mixtures of real audio sources with nonstationary noise have been conducted under various SNRs. Experiments show that the proposed algorithm yields superior separation performance, especially at low input SNR, compared with existing methods.

Index Terms: Blind source separation, underdetermined mixture, single-channel separation, noise reduction, masking.
I. INTRODUCTION
Single-channel blind source separation (SCBSS) is the process of recovering underlying source signals from an unknown mixing given only a single sensor without any prior information of source signals. SCBSS has interested many researchers during the last decade. In the field of biomedical signal processing, SCBSS is used in several different areas. Applications of ECG/EEG recordings given by the electromyography (EMG) signal have been developed to distinguish the heart-beat signal from an observed recording based on diverse approaches, e.g. independent component analysis (ICA), nonnegative matrix factorization (NMF), and singular spectrum analysis (SSA) [1-3]. The conventional ICA approach cannot be directly applied to single-channel source separation. Thus, modified ICA methods were proposed. The single-channel independent component analysis (SCICA) approach in [4] applies the standard ICA to separate the independent signals from a single mixture. The special structure is induced by mapping the observed mixture into a multi-channel model. The algorithm has certain limitations. Firstly, signals are assumed to be statistically independent. Secondly, mixtures are composed of signals with nonoverlapping spectral densities. SCBSS of EEG recordings based on singular spectrum analysis (SSA) was proposed in [5].

N. Tengtrairat is with Department of Software Engineering, Payap University, Chiang Mai, Thailand. W.L. Woo and S.S. Dlay are with School of Electrical and Electronic Engineering, Newcastle University, England, UK. B. Gao is with School of Automation, University of Electronic Science and Technology of China, Chengdu, China. * (Corresponding author e-mail: [email protected]) This paper appeared in the IEEE Transactions on Signal Processing, vol. 64, no. 7, pp. 1881-1895, 2016.

SSA decomposes a time series into a number of interpretable components with distinct subspaces and selects the subgroup of eigenvalues to reconstruct the original source. Another recent application of SCBSS is image separation in the field of nondestructive testing and evaluation (NDT&E) [6-8]. In NDT&E, researchers are interested in the study of defects. An imaging technique is usually used to image the target object when excited by an external signal. The captured image is the result of a superposition of several independent events, where each event is associated with a particular physical phenomenon. The aim is to estimate these independent events and monitor the associated physical features in order to detect and monitor defects. In general, SCBSS can be categorized into two groups, i.e. model-based and data-driven methodologies. In this study, we focus on data-driven SCBSS. A popular method is computational auditory scene analysis (CASA). CASA has been proposed for the isolation of speech from noise by using the ideal binary mask (IBM) in the time-frequency domain. A binary masking approach has been introduced to suppress noise from the noisy input while maintaining speech intelligibility. In [9], this method consists of two phases. Firstly, a training phase estimates the IBM by using a Gaussian mixture model (GMM) to label each TF unit as either speech-dominant or noise-dominant. Secondly, an enhancement phase constructs a binary mask by using the IBM. Later in [10], a new binary-masking algorithm trained using deep neural networks (DNNs) with unsupervised restricted Boltzmann machines (RBMs) was proposed to improve speech intelligibility for hearing-impaired listeners by separating speech from noise through IBM estimation. An extension of the GMM with a user-generated exemplar source is proposed in [11]. This work uses an exemplar source provided by an external user to estimate the sources.
Data-driven methods such as the sparse non-negative matrix factorization (SNMF) [12-13] determine a set of bases for each speaker, and a mixture is mapped onto the joint bases of the speakers. SNMF requires no assumption on the sources such as statistical independence or a grammatical model. However, the SNMF method does not model the temporal structure [14] and it requires a large amount of computation to determine the speaker-independent basis. The SNMF2D [15] was proposed, which uses a double convolution to model both the spreading of the spectral basis and the variation of the temporal structure inherent in the sources. Some successes have already been reported in recent literature [16-19] to show the validity of SNMF2D in separating single-channel mixtures. The SNMF has regained interest recently where the domain of interest lies in the complex spectrogram, which gives rise to the complex NMF (CNMF). Some promising results have recently been reported in [20] with adaptive sparseness. On the other hand, binaural source separation generally delivers better separation performance than a single recorder in the underdetermined scenario. The Degenerate Unmixing Estimation Technique (DUET) [21] and its variants [22, 23] have been proposed as separation methods using binary time-frequency (TF) masks. A major advantage of DUET is that the estimates from two channels are combined inherently as part of the clustering process. The DUET algorithm has been demonstrated to recover the underlying sparse sources given two anechoic mixtures in the TF domain. Recently, DUET has been extended to the single-channel mixture and the algorithm was termed the Single Observation Likelihood estimatiOn (SOLO) [24, 25]. SOLO constructs an artificial stereo mixture which is then used to form a binary mask for separation. All of the above SCBSS algorithms are derived for the noise-free condition and lack the robustness needed to solve the problem in noisy environments. Since the presence of noise seriously degrades the performance, many algorithms for handling background noise have been developed. In a realistic situation of audio applications, desired signals will be corrupted by an additive background noise. Mathematically, noisy single-channel blind source separation (NSCBSS) can be expressed as:

$$x(t) = s_1(t) + s_2(t) + \cdots + s_N(t) + n(t) \tag{1}$$

where $t = 1, 2, \ldots, T$ denotes the time index, $n(t)$ is the unknown noise signal, and the goal is to estimate the sources $s_j(t)$, $\forall j \in N$, of length $T$ when only the observation signal $x(t)$ is available. A well-known approach to improve the intelligibility and perceptual quality of degraded speech is speech enhancement, which removes background noise from noisy speech. Most of the common enhancement techniques operate in the frequency domain, which can generally be expressed as

$$Y(\tau,\omega) = S(\tau,\omega) + N(\tau,\omega) \tag{2}$$

where $Y(\tau,\omega)$ is the observed noisy mixture at the $\omega$-th frequency bin of the $\tau$-th frame, $S(\tau,\omega) = \sum_{j=1}^{N} S_j(\tau,\omega)$ is the sum of the source signals (i.e. the mixture signal without noise), and $N(\tau,\omega)$ denotes the noise. An enhanced spectrum of the mixture signal $\hat{S}(\tau,\omega)$ is given as $\hat{S}(\tau,\omega) = G(\tau,\omega)Y(\tau,\omega)$ where $G(\tau,\omega)$ is a spectral gain. Hence, speech-enhancement performance depends solely on the spectral gain: a frequency-dependent gain function is applied to the spectral components of the noisy speech in an effort to suppress the noise components and raise the quality of the speech components. Many approaches have been established in recent decades, for example the spectral subtraction method, minimum mean-square error (MMSE) estimation, and maximum a posteriori (MAP) estimation. The spectral subtraction method [26] achieves noise reduction by subtracting the estimated noise spectral amplitude from the observed spectral amplitude without regard to the speech spectral components. Secondly, the MMSE estimator [27] and its more recent versions [28] apply a frequency-dependent gain function to the spectral components of the noisy speech. Its solution is characterized by the noise variance, the a priori SNR, and the a posteriori SNR, where the noise variance is known or can be estimated. Lastly, the speech enhancement method using MAP estimation [29, 30] models the speech probability density function (PDF) by a parametric super-Gaussian function developed from a histogram. This method has an effective noise reduction capability, especially in low-SNR environments, where it is superior among the three methods.

In this paper, we consider the NSCBSS problem as one noisy mixture of $N$ unknown source signals. The contributions of the paper are summarized below:
1) It is an online adaptive separation method where the observed mixture is segmented into small frames. The separation process is executed adaptively frame-by-frame. Hence, the proposed algorithm is suited to real-time signal processing applications.
2) It is an adaptive parameter estimation method. The parameters are adaptively estimated from two consecutive frames. The self-adaptive property is preferred for time-varying signals, especially speech and highly nonstationary noise.
3) It is independent of parameter initialization, i.e. there is no need for random initial inputs or any predetermined structure on the sensors. This renders robustness to the proposed method.
4) It has computational simplicity and does not exploit high-order statistics. Hence it is easy to implement.

To achieve the above, the proposed method requires the following assumptions: the source signals are characterized as AR processes, and the sources satisfy the windowed-disjoint orthogonality (WDO) and the local stationarity of the time-frequency representation.

[Figure 1: block diagram. Blocks: Formulate Noisy Pseudo-Stereo Mixture ($x_1(t)$, $x_2(t)$); STFT ($X_1(\tau,\omega)$, $X_2(\tau,\omega)$); Enhancement stage (Audio Activity Detection (AAD), Spectral Amplitude Estimator); Separation stage (Mixing Attenuation Estimator, Construct Mask, Demixing Mixture).]
Fig.1. Overview of the proposed algorithm.
The overview of the proposed method is illustrated in Fig.1. The paper is organized as follows: Section II introduces the noisy pseudo-stereo mixture model. Section III proposes an online demixing method, i.e. the mixture enhancement and the separation process. Section IV presents the separability of the pseudo-stereo model. Experimental results with a series of performance comparisons with other SCBSS methods are presented and discussed in Section V. Finally, Section VI concludes the paper.

II. PROPOSED SINGLE-CHANNEL NOISY MIXING MODEL

A. Proposed Pseudo-Stereo Noisy Mixture Model

In this paper, for simplicity we consider the case of a single-channel noisy mixture of two sources and a noise in the time domain as

$$x_1(t) = s_1(t) + s_2(t) + n_1(t) \tag{3}$$

where $x_1(t)$ is the single-channel mixture, $n_1(t)$ is an additive uncorrelated noise that can be stationary or nonstationary, and $s_1(t)$ and $s_2(t)$ are the original source signals, which are assumed to be modeled by the autoregressive (AR) process [31]:

$$s_j(t) = -\sum_{m=1}^{D_j} a_{s_j}(m;t)\, s_j(t-m) + e_j(t) \tag{4}$$

where $a_{s_j}(m;t)$ denotes the $m$-th order AR coefficient of the $j$-th source at time $t$, $D_j$ is the maximum AR order, and $e_j(t)$ is an independent identically distributed (i.i.d.) random signal with zero mean and variance $\sigma_{e_j}^2$. This model enables us to formulate a virtual mixture by weighting and time-shifting the single-channel mixture $x_1(t)$ as

$$x_2(t) = \frac{x_1(t) + \gamma x_1(t-\delta)}{1+|\gamma|} \tag{5}$$

where $\gamma \in \mathbb{R}$ is the weight parameter, and $\delta \in \mathbb{Z}$ is the time-delay. The noisy mixture in (3) and (5) is termed "pseudo-stereo" because it has an artificial resemblance to a stereo signal, except that it is given by one location, which results in the same time-delay but different attenuation of the source signals. To show this, we can express (5) in terms of the source signals, AR coefficients and time-delay as

$$x_2(t) = \frac{x_1(t) + \gamma x_1(t-\delta)}{1+|\gamma|} = \frac{-a_{s_1}(\delta;t)+\gamma}{1+|\gamma|}\, s_1(t-\delta) + \frac{-a_{s_2}(\delta;t)+\gamma}{1+|\gamma|}\, s_2(t-\delta) + \frac{e_1(t) - \sum_{m=1, m\neq\delta}^{D_1} a_{s_1}(m;t)\, s_1(t-m)}{1+|\gamma|} + \frac{e_2(t) - \sum_{m=1, m\neq\delta}^{D_2} a_{s_2}(m;t)\, s_2(t-m)}{1+|\gamma|} + \frac{n_1(t) + \gamma n_1(t-\delta)}{1+|\gamma|} \tag{6}$$

Defining the following:

$$a_j(t;\delta,\gamma) = \frac{-a_{s_j}(\delta;t)+\gamma}{1+|\gamma|} \tag{7}$$

$$r_j(t;\delta,\gamma) = \frac{e_j(t) - \sum_{m=1, m\neq\delta}^{D_j} a_{s_j}(m;t)\, s_j(t-m)}{1+|\gamma|} \tag{8}$$

$$n_2(t;\delta,\gamma) = \frac{n_1(t) + \gamma n_1(t-\delta)}{1+|\gamma|} \tag{9}$$

where $a_j(t;\delta,\gamma)$ and $r_j(t;\delta,\gamma)$ represent the mixing attenuation and the residue of the $j$-th source, respectively, and $n_2(t;\delta,\gamma)$ denotes the noise obtained by weighting and time-shifting the additive noise $n_1(t)$. Using (7)-(9), the overall proposed noisy mixing model can now be formulated in terms of the sources and the noise as

$$x_1(t) = s_1(t) + s_2(t) + n_1(t)$$
$$x_2(t) = a_1(t;\delta,\gamma)\, s_1(t-\delta) + a_2(t;\delta,\gamma)\, s_2(t-\delta) + r_1(t;\delta,\gamma) + r_2(t;\delta,\gamma) + n_2(t;\delta,\gamma) \tag{10}$$
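As an illustration of how the artificial second mixture in (5) can be formed from the single observed mixture, the following minimal sketch weights and delays a short sequence. The function name `pseudo_stereo`, the zero-padding of samples before the start of the signal, and the restriction to non-negative delays are assumptions made for this example only; the values $\delta = 2$ and $\gamma = 4$ are the ones quoted in the experimental setup.

```python
def pseudo_stereo(x1, delta, gamma):
    """Return x2(t) = (x1(t) + gamma * x1(t - delta)) / (1 + |gamma|).

    Samples before the start of the signal are taken as zero, which is an
    assumption of this sketch and not something the paper specifies.
    """
    if delta < 0:
        raise ValueError("this sketch only handles non-negative delays")
    scale = 1.0 + abs(gamma)
    x2 = []
    for t in range(len(x1)):
        # Delayed sample of the observed mixture (zero before the signal starts).
        delayed = x1[t - delta] if t - delta >= 0 else 0.0
        x2.append((x1[t] + gamma * delayed) / scale)
    return x2


# Toy usage: a five-sample "mixture" with delta = 2 and gamma = 4.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = pseudo_stereo(x1, delta=2, gamma=4.0)
```

The division by $1+|\gamma|$ keeps the second mixture on a scale comparable to the first, whatever weight is chosen.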
B. Time-Frequency Representation

The TF representation of the noisy mixing model is obtained using the Short-Time Fourier Transform (STFT) of $x_j(t)$, $j = 1, 2$, as

$$X_1(\tau,\omega) = S_1(\tau,\omega) + S_2(\tau,\omega) + N_1(\tau,\omega)$$
$$X_2(\tau,\omega) \approx a_1(\tau)e^{-i\omega\delta}S_1(\tau-\delta,\omega) + a_2(\tau)e^{-i\omega\delta}S_2(\tau-\delta,\omega) - \left(\sum_{m=1, m\neq\delta}^{D_1} \frac{a_{s_1}(m;\tau)}{1+|\gamma|}e^{-i\omega m}S_1(\tau-m,\omega) + \sum_{m=1, m\neq\delta}^{D_2} \frac{a_{s_2}(m;\tau)}{1+|\gamma|}e^{-i\omega m}S_2(\tau-m,\omega)\right) + N_2(\tau,\omega) \tag{11}$$

for $\forall \tau, \omega$. In (11), we have used the fact that $|e_j(t)| \ll |s_j(t)|$, thus the TF of $r_j(t)$ in (13) can be simplified to

$$R_j(\tau,\omega) = -\sum_{m=1, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau)}{1+|\gamma|}e^{-i\omega m}S_j(\tau-m,\omega)$$

which forms a part of $R_j(\tau,\omega)$ without the contribution of the source $S_j(\tau,\omega)$. Notice that the factor $e^{-i\omega\delta}$ is only uniquely specified if $|\omega\delta| < \pi$, otherwise this would cause phase-wrap [32]. Selecting an improper time-delay $\delta$ will lead to phase-wrap if the maximum frequency of the source is exceeded. In order to avoid phase ambiguity, we must satisfy

$$|\omega_{max}\delta_{max}| < \pi \tag{14}$$

where $\omega_{max} = 2\pi f_{max}/f_s$, $\delta_{max}$ is the maximum time delay, $f_{max}$ is the maximum frequency present in the sources and $f_s$ is the sampling frequency. Hence, $\delta_{max}$ can be determined from (14) according to

$$\delta_{max} < \frac{f_s}{2 f_{max}} \tag{15}$$

As long as the delay parameter is less than $\delta_{max}$, there will not be any phase ambiguity. This condition will be used to determine the range of $\delta$ in formulating the pseudo-stereo mixture.

III. PROPOSED ONLINE SINGLE-CHANNEL NOISY DEMIXING METHOD

The proposed online single-channel noisy demixing method comprises two steps. The first step is mixture enhancement, which aims to reduce the additive noise and extract the source information. The second step is the separation process, which isolates the original signals by applying a mask to the noise-reduced mixture. The mask is constructed by evaluating the cost function given by each source-signature estimator.

A. Proposed Single-Channel Mixture Enhancement

1) Audio Activity Detection

The audio activity detection (AAD) method enhances the noisy mixture by selecting the TF units that contain source signals and removing TF units without source signals. To begin, two statistical hypotheses $H_0(\tau,\omega)$ and $H_1(\tau,\omega)$ are set, which denote source absence and source presence at the $\omega$-th frequency bin of the $\tau$-th frame, respectively:

$$H_0(\tau,\omega): \text{Source absence: } X(\tau,\omega) = N(\tau,\omega)$$
$$H_1(\tau,\omega): \text{Source presence: } X(\tau,\omega) = S(\tau,\omega) + N(\tau,\omega) \tag{16}$$

where $X(\tau,\omega)$ is a mixture given by $X_1(\tau,\omega)$ or $X_2(\tau,\omega)$, $S(\tau,\omega)$ is the sum of the source signals, i.e. $S(\tau,\omega) = S_1(\tau,\omega) + S_2(\tau,\omega)$, and $N(\tau,\omega)$ is the additive noise. The terms $S(\tau,\omega)$ and $N(\tau,\omega)$ are assumed to be complex Gaussian distributed. Source presence at a particular $(\tau,\omega)$ unit is detected by computing a local source absence probability (LSAP) and selecting the $(\tau,\omega)$ units whose LSAP is less than a local threshold, which can be set by the user. The LSAP can be expressed as

$$P(H_0(\tau,\omega)|X(\tau,\omega)) = 1 - P(H_1(\tau,\omega)|X(\tau,\omega)).$$

The term $P(H_1(\tau,\omega)|X(\tau,\omega))$ denotes a SPP given by Bayes' theorem:

$$P(H_1(\tau,\omega)|X(\tau,\omega)) = \frac{P(X(\tau,\omega)|H_1(\tau,\omega))P(H_1)}{P(X(\tau,\omega)|H_0(\tau,\omega))P(H_0) + P(X(\tau,\omega)|H_1(\tau,\omega))P(H_1)}$$

which, for complex Gaussian likelihoods and equal priors $P(H_0) = P(H_1)$, evaluates to

$$P(H_1(\tau,\omega)|X(\tau,\omega)) = \left(1 + (1+\xi)\exp\left\{-\frac{|X(\tau,\omega)|^2}{\hat{\sigma}_N^2(\tau,\omega)}\,\frac{\xi}{1+\xi}\right\}\right)^{-1}$$

where $\xi$ denotes the a priori SNR and $\hat{\sigma}_N^2(\tau,\omega)$ is the estimated noise variance.
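The detection rule described above can be sketched as follows. This is a minimal illustration, assuming the standard closed form of the presence posterior for complex Gaussian source and noise components; the function names, the per-unit scalar interface, and the example value of $\xi$ (15 dB) are hypothetical choices for this sketch and are not taken from the paper.

```python
import math


def source_presence_probability(power, noise_var, xi, p_h0=0.5, p_h1=0.5):
    """Posterior probability of source presence at one TF unit.

    power     : |X(tau, omega)|^2 of the observed mixture
    noise_var : estimated noise variance at that unit
    xi        : a priori SNR on a linear scale (assumed fixed here)
    """
    prior_ratio = p_h0 / p_h1
    exponent = -(power / noise_var) * xi / (1.0 + xi)
    return 1.0 / (1.0 + prior_ratio * (1.0 + xi) * math.exp(exponent))


def is_source_present(power, noise_var, xi, threshold):
    """Keep the unit when the local source absence probability is below the threshold."""
    lsap = 1.0 - source_presence_probability(power, noise_var, xi)
    return lsap < threshold


# Toy usage with a hypothetical a priori SNR of 15 dB.
xi = 10 ** (15 / 10)
strong_unit = is_source_present(power=100.0, noise_var=1.0, xi=xi, threshold=0.5)
weak_unit = is_source_present(power=0.0, noise_var=1.0, xi=xi, threshold=0.5)
```

Units with energy well above the noise variance are retained, while units at or below the noise floor are rejected; the threshold remains a user-set parameter, as in the text.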
Using the subadditivity properties of the absolute value, we obtain $E[\,\cdots\,]$

Solving the SPP expression for the a posteriori SNR gives

$$\hat{\gamma}_{SNR}(\tau,\omega) = \frac{1+\xi}{\xi}\,\ln\left(\frac{1+\xi}{P(H_1(\tau,\omega)|X(\tau,\omega))^{-1} - 1}\right) \tag{26}$$

where $\xi$ is the a priori SNR. Using this $\xi$ and $P(H_1(\tau,\omega)|X(\tau,\omega)) > 0.08$, the a posteriori SNR then satisfies $\hat{\gamma}_{SNR}(\tau,\omega) > 1$. Hence, the term $\hat{A}(\tau,\omega)$ can be obtained by computing both estimators of the previous and current frames. Therefore, to extract source information even when source components are weak at low input SNR, the proposed iMMSE-STSA first estimates the a posteriori SNR using (26) and then uses this estimate for computing the spectral amplitude. Finally, the estimated spectra of the mixture can be formulated as

$$\hat{S}(\tau,\omega) = \hat{A}(\tau,\omega)e^{i\varphi_X} \tag{27}$$

In conclusion, the proposed mixture enhancement method will benefit the source separation by providing a greater degree of source information, by attempting to select the TF units of source presence and reject the TF units of solely noise. The noise-reduced mixture can now be modeled as $X(\tau,\omega) = \hat{A}(\tau,\omega)e^{i\varphi_X} + \hat{N}(\tau,\omega)$, which will then be separated by a binary TF mask.

B. Proposed Single-Channel Source Separation

1) Adaptive Mixing Parameter Estimator

The sources are assumed to satisfy the local stationarity of the time-frequency representation. This refers to the approximation $S_j(\tau-\phi,\omega) \approx S_j(\tau,\omega)$ where $\phi$ is the maximum time-delay (shift) associated with the Short-Time Fourier Transform (STFT) $F^W(\cdot)$ with an appropriate window function $W(\cdot)$. If $\phi$ is small compared with the length of $W(\cdot)$ then $W(\cdot-\phi) \approx W(\cdot)$. Hence, the Fourier transform of a windowed function with shift $\phi$ yields approximately the same Fourier transform as without $\phi$. For the proposed method, the pseudo-stereo mixture is shifted by $\delta$, and invoking the local stationarity leads to

$$s_j(t-\delta) \xrightarrow{STFT} e^{-i\omega\delta}S_j(\tau-\delta,\omega) \approx e^{-i\omega\delta}S_j(\tau,\omega), \quad \forall\delta, |\delta| \leq \phi \tag{28}$$

Thus, the STFT of $s_j(t-\delta)$ where $|\delta| \leq \phi$ is approximately $e^{-i\omega\delta}S_j(\tau,\omega)$ according to the local stationarity. Secondly, the sources are assumed to satisfy the windowed-disjoint orthogonality (WDO) condition:

$$S_j(\tau,\omega)S_k(\tau,\omega) \approx 0, \quad \forall j \neq k, \ \forall\tau,\omega \tag{29}$$

where $S_j(\tau,\omega)$ and $S_k(\tau,\omega)$ are the STFT of $s_j(t)$ and $s_k(t)$. Hence, if the $j$-th source is dominant at a particular $(\tau,\omega)$ unit, the noise-reduced mixture can be more specifically expressed as:

$$X_1(\tau,\omega) = \hat{S}_j(\tau,\omega) + \hat{N}_1(\tau,\omega)$$
$$X_2(\tau,\omega) = a_j(\tau)e^{-i\omega\delta}\hat{S}_j(\tau-\delta,\omega) - \sum_{m=1, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau)}{1+|\gamma|}e^{-i\omega m}\hat{S}_j(\tau-m,\omega) + \hat{N}_2(\tau,\omega) \approx [a_j(\tau) - C_j(\tau,\omega)]e^{-i\omega\delta}\hat{S}_j(\tau,\omega) + \hat{N}_2(\tau,\omega), \quad \forall(\tau,\omega) \in \Omega_j \tag{30}$$

for $\delta$ and $m \leq \phi$. The term $C_j(\tau,\omega) = \frac{1}{1+|\gamma|}\sum_{m=1, m\neq\delta}^{D_j} a_{s_j}(m;\tau)e^{-i\omega(m-\delta)}$ is given by (13) and $\Omega_j$ is the $j$-th source presence area defined as $\Omega_j := \{(\tau,\omega): \hat{S}_j(\tau,\omega) \neq 0, \forall k \neq j\}$. The estimate $\tilde{a}_j(\tau,\omega) = a_j(\tau) - C_j(\tau,\omega)$ associated with the $j$-th source can be determined as
$$\tilde{a}_j(\tau,\omega) = \frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}e^{i\omega\delta} = \frac{[a_j(\tau) - C_j(\tau,\omega)]e^{-i\omega\delta}\hat{S}_j(\tau,\omega) + \hat{N}_2(\tau,\omega)}{\hat{S}_j(\tau,\omega) + \hat{N}_1(\tau,\omega)}e^{i\omega\delta}$$

$\hat{N}_1(\tau,\omega)$ and $\hat{N}_2(\tau,\omega)$ can be assumed to be small after the mixture enhancement step (as shown in Section V.B). In this case, we can express $\tilde{a}_j(\tau,\omega)$ as

$$\tilde{a}_j(\tau,\omega) = \frac{[a_j(\tau) - C_j(\tau,\omega)]e^{-i\omega\delta}\hat{S}_j(\tau,\omega)}{\hat{S}_j(\tau,\omega)}e^{i\omega\delta} = a_j(\tau) - C_j(\tau,\omega) = \tilde{a}_j^{(R)}(\tau,\omega) + i\,\tilde{a}_j^{(I)}(\tau,\omega), \quad \forall(\tau,\omega) \in \Omega_j \tag{31}$$

where $\tilde{a}_j^{(R)}(\tau,\omega) = \mathrm{Re}\left[\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}e^{i\omega\delta}\right]$ and $\tilde{a}_j^{(I)}(\tau,\omega) = \mathrm{Im}\left[\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}e^{i\omega\delta}\right]$ are the real and imaginary parts of $\tilde{a}_j(\tau,\omega)$, respectively, and $i = \sqrt{-1}$. We propose to adaptively estimate $\tilde{a}_j(\tau,\omega)$ frame-by-frame. Firstly, a power weighted TF histogram is used to estimate $\tilde{a}_j(\tau,\omega)$ for each frame, and the TF units are then clustered into a number of groups corresponding to the number of sources in the mixture. The power weighted histogram is a function of $(\tau,\omega)$ with the weight $|X_1(\tau,\omega)X_2(\tau,\omega)|$; therefore the real and imaginary parts of $\tilde{a}_j(\tau,\omega)$ for each frame can be estimated as

$$\tilde{a}_j^{(R)}(\tau) = \frac{\sum_\omega |X_1(\tau,\omega)X_2(\tau,\omega)|\,\mathrm{Re}\left[\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}e^{i\omega\delta}\right]}{\sum_\omega |X_1(\tau,\omega)X_2(\tau,\omega)|}, \qquad \tilde{a}_j^{(I)}(\tau) = \frac{\sum_\omega |X_1(\tau,\omega)X_2(\tau,\omega)|\,\mathrm{Im}\left[\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}e^{i\omega\delta}\right]}{\sum_\omega |X_1(\tau,\omega)X_2(\tau,\omega)|} \tag{32}$$

The above can then be combined to form the estimate of (32) as

$$\tilde{a}_j(\tau) = \tilde{a}_j^{(R)}(\tau) + i\,\tilde{a}_j^{(I)}(\tau) \tag{33}$$

Relating (33) with (31), we can use a similar idea to express $\tilde{a}_j(\tau) = \hat{a}_j(\tau) - \hat{C}_j(\tau)$ where $\hat{a}_j(\tau)$ and $\hat{C}_j(\tau)$ are the power weighted estimates of $a_j(\tau)$ and $C_j(\tau,\omega)$, respectively. Secondly, the adaptive mixing attenuation estimator $\bar{a}_j(\tau)$ is obtained by smoothing $\bar{a}_j(\tau-1)$ and $\tilde{a}_j(\tau)$:

$$\bar{a}_j(\tau) = \lambda_a\bar{a}_j(\tau-1) + (1-\lambda_a)\tilde{a}_j(\tau) \tag{34}$$

where $0 < \lambda_a < 1$ is a smoothing parameter of the adaptive mixing attenuation estimator.

2) Construction of Masks

The binary TF masks can be constructed by labeling each TF unit with the $k$ argument through maximizing the instantaneous likelihood function. The instantaneous likelihood function is derived from the maximum likelihood (ML) method by first formulating the Gaussian likelihood function $p(X_1(\tau,\omega), X_2(\tau,\omega)|\hat{S}_j(\tau,\omega), \bar{a}_j, \sigma_{\hat{N}}^2)$ using (30), maximizing the likelihood function with respect to $\hat{S}_j(\tau,\omega)$ and then substituting the obtained result into the Gaussian likelihood function. The resulting instantaneous likelihood function assumes the following form:

$$L_j(\tau,\omega) := p(X_1(\tau,\omega), X_2(\tau,\omega)|\hat{S}_j(\tau,\omega), \bar{a}_j, \sigma_{\hat{N}}^2) \propto \exp\left(-\frac{1}{2}\,\frac{|\bar{a}_j(\tau)e^{-i\omega\delta}X_1(\tau,\omega) - X_2(\tau,\omega)|^2}{\bar{a}_j^2(\tau)\hat{\sigma}_{\hat{N}_1}^2(\tau,\omega) + \hat{\sigma}_{\hat{N}_2}^2(\tau,\omega)}\right) \tag{35}$$

The function $L_j(\tau,\omega)$ clusters every $(\tau,\omega)$ unit to the $j$-th dominating source for $L_j(\tau,\omega) \geq L_k(\tau,\omega)$, $\forall k \neq j$. This process is equivalent to the following minimization problem:

$$F(\tau,\omega) = \arg\min_k \frac{|\bar{a}_k(\tau)e^{-i\omega\delta}X_1(\tau,\omega) - X_2(\tau,\omega)|^2}{\bar{a}_k^2(\tau)\hat{\sigma}_{\hat{N}_1}^2(\tau,\omega) + \hat{\sigma}_{\hat{N}_2}^2(\tau,\omega)} \tag{36}$$

Using (30), the term $X_2(\tau,\omega)$ can be expressed as:

$$X_2(\tau,\omega) = a_j(\tau)e^{-i\omega\delta}\hat{S}_j(\tau-\delta,\omega) - \sum_{m=1, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau)}{1+|\gamma|}e^{-i\omega m}\hat{S}_j(\tau-m,\omega) + \hat{N}_2(\tau,\omega) = \frac{\hat{S}_j(\tau,\omega) - E_j(\tau,\omega)}{1+|\gamma|} + \frac{\gamma}{1+|\gamma|}e^{-i\omega\delta}\hat{S}_j(\tau-\delta,\omega) + \frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|}\hat{N}_1(\tau,\omega) \tag{37}$$

By invoking the local stationarity, we then obtain

$$X_2(\tau,\omega) \approx \frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|}\left(\hat{S}_j(\tau,\omega) + \hat{N}_1(\tau,\omega)\right) - \frac{E_j(\tau,\omega)}{1+|\gamma|}$$

for $\delta \leq \phi$. The derivation of $X_2(\tau,\omega)$ in the source domain in (37) allows us to express $X_2(\tau,\omega)$ in the mixture domain as:

$$X_2(\tau,\omega) \approx \left(\frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|}\right)X_1(\tau,\omega) - \frac{E_j(\tau,\omega)}{1+|\gamma|} \tag{38}$$

In this light, the proposed cost function $G_k(\tau,\omega)$ can be formulated based on the single mixture $X_1(\tau,\omega)$ by substituting this expression into (36), which leads to

$$J(\tau,\omega) = \arg\min_k G_k(\tau,\omega) \tag{39}$$

$$G_k(\tau,\omega) = \frac{\left|\bar{a}_k(\tau)e^{-i\omega\delta}X_1(\tau,\omega) - \left(\frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|}\right)X_1(\tau,\omega)\right|^2}{\bar{a}_k^2(\tau)\hat{\sigma}_{\hat{N}_1}^2(\tau,\omega) + \hat{\sigma}_{\hat{N}_2}^2(\tau,\omega)} \tag{40}$$

Since $e_j(t) \ll s_j(t)$, the term $E_j(\tau,\omega)/(1+|\gamma|)$ is negligible. Hence, $X_2(\tau,\omega) \approx \left(\frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|}\right)X_1(\tau,\omega)$. Using (39) and (40), in the instance when the $j$-th source dominates at $(\tau,\omega) \in \Omega_j$, the function $J(\tau,\omega)$ will correctly identify the source if and only if $G_{k=j}(\tau,\omega) < G_{k\neq j}(\tau,\omega)$. To elucidate this condition, firstly, the case when $k = j$ is considered by setting $\lambda_a = 0$:
πΊπ=π (π, π) = 1+πΎπ βπππΏ
Μ1 (π, π)) β ( Μ1 (π, π))| |ππ (π)π βπππΏ (πΜπ (π, π) + π ) (πΜπ (π, π) + π 1+|πΎ| βπππΏ βπππΏ βπππΏ Μ Μ Μ = |πΜπ (π)π ππ (π, π) β πΆπ (π)π ππ (π, π) + ππ (π)π π1 (π, π) β (
1+πΎπ βπππΏ 1+|πΎ|
1+πΎπ ) πΜπ (π, π) β (
= |β (πΆΜπ (π) + πΜπ (π,π) 1+|πΎ|
βπππΏ
1+|πΎ|
Μ1 (π, π)| )π
2
2
πππ (πΏ;π)
+(
1+|πΎ|
Μ1 (π, π) β ) π βπππΏ πΜπ (π, π) + ππ (π)π βπππΏ π
1+πΎπ βπππΏ 1+|πΎ|
Μ1 (π, π)| )π
2
When π β π, following the above step leads to
(41)
7 πΜπ (π,π) 1+|πΎ|
+(
πππ (πΏ;π)
Μ (π, π) + π Μ (π, π) β Μ Μ
π(π)πβπππΏ π ) πβπππΏ π π 1 1+|πΎ| 2 1+πΎπ βπππΏ
Μ Μ
π(π) β ππ (π) β πΊπβ π (π, π) = |(π
1+|πΎ|
Μ1 (π, π)| )π
Finally, convert the estimated sources from TF domain into time domain i.e. π ΜΜπ (π‘).
(42)
To guarantee that $G_{k=j}(\tau,\omega) < G_{k\neq j}(\tau,\omega)$ is always satisfied, we must specify a condition for $\hat{C}_j$. Starting with (41) and (42), we have

IV. ANALYSIS OF SEPARABILITY OF THE PROPOSED PSEUDO-STEREO MIXTURE MODEL
The separability of the noise-free mixing model can be examined from the noise-free pseudo-stereo mixture by 2 π (πΏ;π) Μ (π,π) 1+πΎπ Μ1(π, π) β (π Μ1(π, π))| considering ππ (π‘; πΏ, πΎ) and ππ (π‘; πΏ, πΎ) in the following three |(ππ (π) β ππ (π) β ) π βπππΏ πΜπ (π, π) + ππ (π)π βπππΏ π +( )π 1+|πΎ| 1+|πΎ| 1+|πΎ| cases. Case 1 refers to identical sources mixed in the single (43) channel, Case 2 represents different sources but setting πΎ and πΏ Eq. (43) is bounded by for the pseudo-stereo mixture such that π1 (π‘; πΏ, πΎ) = π2 (π‘; πΏ, πΎ), ππ (πΏ;π) πΜ (π,π) 1+πΎπ βπππΏ π and Case 3 corresponds to the most general case where the Μ 1 (π, π)| β | π Μ 1 (π, π)| < |πΆΜπ (π)πΜπ (π, π)| β | )π πΜπ (π, π) β ππ (π)π +( 1+|πΎ| 1+|πΎ| 1+|πΎ| π (πΏ;π) sources are distinct, and πΎ and πΏ are selected arbitrarily such that πΜ (π,π) 1+πΎπ Μ1(π, π)| + | Μ1(π, π)| |(ππ (π) β ππ (π) β ) πΜπ (π, π) + ππ (π)π +( )π 1+|πΎ| 1+|πΎ| 1+|πΎ| the mixing attenuations and residues are also different. The and therefore we obtain above cases are demonstrated by using the functions π½(π, π) and πππ (πΏ;π) Μ1 (π,π) πΊπ (π, π) from Section III.B.2). These function are recapped here π |πΆΜπ (π)| < |(ππ (π) β ππ (π) β ) + ππ (π) Μ (π,π) | + 1+|πΎ| ππ as: |β (πΆΜπ (π) +
πππ (πΏ;π)
πΜπ (π,π)
1+|πΎ|
1+|πΎ|
Μ1(π, π) β ( ) π βπππΏ πΜπ (π, π) + ππ (π)π βπππΏ π ππ
πππ (πΏ;π) 1+|πΎ|
1+πΎπ βπππΏ 1+|πΎ|
2
Μ1(π, π))| < )π βπππΏ
π
ππ
|
+(
βπππΏ
π
β ππ (π)
Μ1 (π,π) π | πΜπ (π,π)
+
2 1+|πΎ|
|1 + (1 + πΎπ βπππΏ )
Μ1 (π,π) π | πΜπ (π,π)
π½(π, π) = ππππππ πΊπ (π, π)
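for $\forall k \neq j$. As $\hat{N}_1(\tau,\omega)$ has small energy compared with the source energy, it can be treated as negligible. Hence, Eq. (44) can be simplified to

$$|\hat{C}_j(\tau)| < \left|a_k(\tau) - a_j(\tau) - \frac{a_{s_j}(\delta;\tau)}{1+|\gamma|}\right| + \left|\frac{a_{s_j}(\delta;\tau)}{1+|\gamma|}\right| + \frac{2}{1+|\gamma|} \tag{45}$$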
If the condition in (45) is satisfied across $\Omega_j$, the function (39)-(40) will then correctly assign the $(\tau,\omega)$ unit to the $j$-th source. Once the TF plane of the mixtures is assigned into $N$ groups of $(\tau,\omega)$ units, the binary TF mask for the $j$-th source can then be constructed as

$$M_j(\tau,\omega) := \begin{cases} 1, & J(\tau,\omega) = j \\ 0, & \text{otherwise} \end{cases} \tag{46}$$
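The frame-wise separation step (smoothing the attenuation estimate, labelling each TF unit by the minimum cost, and forming the binary masks) can be sketched as below. This is a simplified illustration under stated assumptions: it operates on one frame represented as a plain list of complex STFT values, it uses the cost without the noise-variance normalisation, and all function names are hypothetical rather than taken from the paper.

```python
import cmath


def smooth_attenuation(previous, current, lam):
    """Recursive smoothing of the mixing attenuation estimate, with 0 < lam < 1."""
    return lam * previous + (1.0 - lam) * current


def label_units(X1, a_bar, delta, gamma, omegas):
    """Assign each frequency bin of one frame to the source with minimum cost.

    X1     : complex STFT values of the noise-reduced mixture for one frame
    a_bar  : smoothed attenuation estimates, one per source
    omegas : angular frequency (radians per sample) of each bin
    """
    labels = []
    for x, w in zip(X1, omegas):
        phase = cmath.exp(-1j * w * delta)
        # Second mixture expressed through the first one (residue term neglected).
        reference = (1.0 + gamma * phase) / (1.0 + abs(gamma)) * x
        costs = [abs(a * phase * x - reference) ** 2 for a in a_bar]
        labels.append(min(range(len(costs)), key=costs.__getitem__))
    return labels


def binary_masks(labels, n_sources):
    """Mask j is 1 where the label equals j and 0 otherwise."""
    return [[1 if lab == j else 0 for lab in labels] for j in range(n_sources)]


def apply_mask(mask, X1):
    """Recover one source in the TF domain by masking the mixture."""
    return [m * x for m, x in zip(mask, X1)]


# Toy usage: two bins, two candidate sources.
frame = [2 + 0j, 3 + 0j]
labels = label_units(frame, a_bar=[1.0, 0.1], delta=1, gamma=1.0,
                     omegas=[0.0, cmath.pi])
masks = binary_masks(labels, n_sources=2)
source_0 = apply_mask(masks[0], frame)
```

Because each unit is given to exactly one source, the masks partition the TF plane, which is what the WDO assumption relies on.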
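The $j$-th source is then recovered in the TF domain by applying the mask to the mixture:

$$\hat{S}_j(\tau,\omega) = M_j(\tau,\omega)X_1(\tau,\omega) \tag{47}$$

The proposed algorithm is summarized in Table I.

Table I: Overview of the proposed algorithm
1. Pseudo-stereo mixture step: Formulate the pseudo-stereo mixture $x_2(t)$ using (5).
2. Transform step: Transform the two mixtures $x_1(t)$ and $x_2(t)$ into the TF domain by using the STFT.
3. Online single-channel demixing:
   A. Single-channel source enhancement step:
      1) Audio activity detection: Compute the local SAP at the $\tau$-th frame and the $\omega$-th frequency bin of the two mixtures using (17) and the global SAP for the $\tau$-th frame using (18). If the global SAP exceeds the global threshold, update the noise variance estimate $\hat{\sigma}_N^2(\tau,\omega)$ using (19).
      2) iMMSE-STSA estimator: Compute the iMMSE estimator of the source spectral amplitude using (24) and formulate the estimated spectra $\hat{S}(\tau,\omega)$ using (27) for both mixtures.
   B. Separation step:
      1) Compute the mixing attenuation estimators $(\tilde{a}_j^{(R)}(\tau), \tilde{a}_j^{(I)}(\tau))$ at the $\tau$-th frame using (32) and (34).
      2) Label the $(\tau,\omega)$ units using (39) and (40), and form the binary TF mask $M_j(\tau,\omega)$. Recover the original sources by $\hat{S}_j(\tau,\omega) = M_j(\tau,\omega)X_1(\tau,\omega)$.

For the separability analysis, the functions $J(\tau,\omega)$ and $G_k(\tau,\omega)$ in the noise-free case are:

$$J(\tau,\omega) = \arg\min_k G_k(\tau,\omega) \tag{48}$$

$$G_k(\tau,\omega) = \left|\tilde{a}_k(\tau,\omega)e^{-i\omega\delta}X_1(\tau,\omega) - \left(\frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|}\right)X_1(\tau,\omega)\right|^2 \tag{49}$$

For each TF unit, the $k$-th argument that gives the minimum cost will be assigned to the $k$-th source. We may analyze (49) further by assuming that the $j$-th source dominates at a particular TF unit. In this case, the observed mixture in the TF domain reduces to $X_1(\tau,\omega) = S_j(\tau,\omega)$ and therefore (49) becomes

$$G_k(\tau,\omega) = \left|\tilde{a}_k(\tau,\omega)e^{-i\omega\delta}S_j(\tau,\omega) - \left(\frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|}\right)S_j(\tau,\omega)\right|^2 = \left|\tilde{a}_k(\tau,\omega)e^{-i\omega\delta}S_j(\tau,\omega) - \frac{S_j(\tau,\omega)}{1+|\gamma|} - \frac{\gamma e^{-i\omega\delta}}{1+|\gamma|}S_j(\tau,\omega)\right|^2$$
$$= \left|a_k(\tau)e^{-i\omega\delta}S_j(\tau,\omega) - C_k(\tau,\omega)e^{-i\omega\delta}S_j(\tau,\omega) + \sum_{m=1, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau)e^{-i\omega m}}{1+|\gamma|}S_j(\tau-m,\omega) - a_j(\tau)e^{-i\omega\delta}S_j(\tau,\omega)\right|^2 \tag{50}$$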
We consider the following three cases:

Case 1: If $a_1(t;\delta,\gamma) = a_2(t;\delta,\gamma) = a(t;\delta,\gamma)$ and $r_1(t;\delta,\gamma) = r_2(t;\delta,\gamma) = r(t;\delta,\gamma)$, then

$$x_2(t;\delta,\gamma) = \left(\frac{-a_s(\delta;t)+\gamma}{1+|\gamma|}\right)x_1(t-\delta) + 2r(t;\delta,\gamma).$$

In this case, there is no benefit achieved at all. The second mixture is simply a time-delayed version of the first mixture multiplied by a scalar, plus the redundant residue. The separability of this case is examined by substituting the pseudo-stereo mixture of Case 1 into the cost function. Since both residues are equal, $C_1(\tau,\omega) = C_2(\tau,\omega) = C(\tau,\omega) = \frac{1}{1+|\gamma|}\sum_{m=1, m\neq\delta}^{D} a_s(m;\tau)e^{-i\omega(m-\delta)}$. For Case 1, the function $J(\tau,\omega)$ given by (50) becomes:

$$J(\tau,\omega) = \arg\min_k \left|a(\tau)e^{-i\omega\delta}S_j(\tau,\omega) - C(\tau,\omega)e^{-i\omega\delta}S_j(\tau,\omega) + \sum_{m=1, m\neq\delta}^{D} \frac{a_s(m;\tau)e^{-i\omega m}}{1+|\gamma|}S_j(\tau-m,\omega) - a(\tau)e^{-i\omega\delta}S_j(\tau,\omega)\right|^2$$

Invoking the local stationarity of the sources, $S_j(\tau-D_j,\omega) = S_j(\tau,\omega)$ for $|D_j| \leq \phi$, the above leads to

$$J(\tau,\omega) = \arg\min_k \left|\sum_{m=1, m\neq\delta}^{D} \frac{a_s(m;\tau)e^{-i\omega m} - a_s(m;\tau)e^{-i\omega m}}{1+|\gamma|}\right|^2 |S_j(\tau,\omega)|^2 = 0$$

for $\forall k$. As a result, the cost is zero for all $k$ arguments, i.e. $J_1 = J_2 = 0$. In this case, the function $J(\tau,\omega)$ cannot distinguish the $k$ arguments, and the mixture is not separable.

Case 2: If $a_1(t;\delta,\gamma) = a_2(t;\delta,\gamma) = a(t;\delta,\gamma)$ and $r_1(t;\delta,\gamma) \neq r_2(t;\delta,\gamma)$, then

$$x_2(t;\delta,\gamma) = \left(\frac{-a_s(\delta;t)+\gamma}{1+|\gamma|}\right)x_1(t-\delta) + r_1(t;\delta,\gamma) + r_2(t;\delta,\gamma).$$

This case is similar to the previous case and differs only in that $r_1(t;\delta,\gamma) \neq r_2(t;\delta,\gamma)$. As each residue $r_j(t;\delta,\gamma)$ is related to the $j$-th source via $C_j(\tau,\omega)$, the separability of this mixture can be analyzed using $J(\tau,\omega)$ and (50) as

$$J(\tau,\omega) = \arg\min_k \left|a(\tau)e^{-i\omega\delta}S_j(\tau,\omega) - C_k(\tau,\omega)e^{-i\omega\delta}S_j(\tau,\omega) + \sum_{m=1, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau)e^{-i\omega m}}{1+|\gamma|}S_j(\tau-m,\omega) - a(\tau)e^{-i\omega\delta}S_j(\tau,\omega)\right|^2$$
$$= \arg\min_k \left|\sum_{m=1, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau) - a_{s_k}(m;\tau)}{1+|\gamma|}e^{-i\omega m}\right|^2 |S_j(\tau,\omega)|^2$$

It can be deduced from the above that the cost function yields a zero value for $k = j$, and a nonzero value for $k \neq j$. Although the mixing attenuations of both sources are identical, the function $J(\tau,\omega)$ is still able to distinguish the $k$ arguments by using only the difference of the residues. Therefore, the mixture of Case 2 is separable.

Case 3: If $a_1(t;\delta,\gamma) \neq a_2(t;\delta,\gamma)$ and $r_1(t;\delta,\gamma) \neq r_2(t;\delta,\gamma)$ (or $r_1(t;\delta,\gamma) = r_2(t;\delta,\gamma)$), then

$$x_2(t;\delta,\gamma) = \left(\frac{-a_{s_1}(\delta;t)+\gamma}{1+|\gamma|}\right)s_1(t-\delta) + \left(\frac{-a_{s_2}(\delta;t)+\gamma}{1+|\gamma|}\right)s_2(t-\delta) + r_1(t;\delta,\gamma) + r_2(t;\delta,\gamma)$$

We first treat the situation of $r_1(t;\delta,\gamma) = r_2(t;\delta,\gamma)$. Since the mixing attenuations $a_1(\tau)$ and $a_2(\tau)$ correspond respectively to $s_1(t)$ and $s_2(t)$, the function $J(\tau,\omega)$ given by (50) can be expressed as

$$J(\tau,\omega) = \arg\min_k \left|a_k(\tau)e^{-i\omega\delta}S_j(\tau,\omega) - C(\tau,\omega)e^{-i\omega\delta}S_j(\tau,\omega) + \sum_{m=1, m\neq\delta}^{D} \frac{a_s(m;\tau)e^{-i\omega m}}{1+|\gamma|}S_j(\tau-m,\omega) - a_j(\tau)e^{-i\omega\delta}S_j(\tau,\omega)\right|^2$$
$$= \arg\min_k \left|(a_k(\tau) - a_j(\tau))e^{-i\omega\delta}\right|^2 |S_j(\tau,\omega)|^2$$

V. RESULTS AND ANALYSIS
A noisy mixture is generated by adding two sources and an uncorrelated nonstationary noise at various input SNRs. Twenty speech signals, twenty music signals, and the noise signals are selected from the TIMIT, RWC, and NOISEX databases, respectively. Additionally, we have conducted experiments to determine the optimal $\xi_f$ and the choice of ππ. All experiments are conducted under the same conditions, as follows. The sources are mixed with normalized power over the duration of the signals. All mixed signals are sampled at 16 kHz. The TF representation is computed using the STFT with a 1024-point Hamming window and 50% overlap. The parameters are set as follows: for the pseudo-stereo noisy mixture, $\delta = 2$ and $\gamma = 4$; for the smoothing parameters of the noise power and the a priori SNR estimates, ππ = 0.95 and ππ = 0.98, respectively; and $P(H_0) = P(H_1) = 0.5$. The separation performance is evaluated by measuring the distortion between the original source and the estimated one according to the signal-to-distortion ratio (SDR) [35], defined as

$$\mathrm{SDR} = 10 \log_{10} \left( \frac{\| s_{target} \|^2}{\| e_{interf} + e_{noise} + e_{artif} \|^2} \right)$$

where $e_{interf}$, $e_{noise}$, and $e_{artif}$ represent the interference from other sources, the noise, and the artifact signals, respectively. MATLAB is used as the programming platform. All simulations and analyses are performed on a PC with an Intel Core 2 CPU at 3 GHz and 3 GB of RAM.

A. Determination of Optimal $\xi_f$ for Mixture Enhancement

The optimal $\xi_f$ is determined by minimizing the proposed integrated probability of error in (21) and (22) in Section III.A.1). The term $\xi$ varies from 0 dB to 30 dB in 5 dB increments. The candidate $\xi_f$ is converted from linear scale to dB (i.e., $10 \log_{10} \xi_f = \xi_f^{dB}$ dB), with $\xi_f^{dB}$ ranging from 0 dB to 50 dB in 5 dB increments. The left-hand side of Fig.2 plots $P_e(\xi, \xi_f)$ for various $\xi$ values. For each individual $\xi$, the minimum $P_e(\xi, \xi_f)$ is obtained at $\xi_f = \hat{\xi}_f = \xi$. Therefore, the optimal $\xi_f$ is set to $\xi$.
However, in a realistic scenario, the term $\xi$ is unknown. Thus, the optimal $\xi_f$ is determined by approximating the integral in (22): the integrand is evaluated at discrete $\xi$ values and the results are averaged. The result is shown on the right-hand side of Fig.2. It can be seen that the range of $\hat{\xi}_f$ that yields the minimum error lies between 10 dB and 15 dB. Based on this result, the optimal $\xi_f$ is set to $10 \log_{10} \xi_f = 12.5$ dB for all experiments.
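The discrete averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the error model of (21) is not reproduced in this section, so `toy_pe` is a hypothetical stand-in whose minimum sits at the point where the threshold equals the SNR, and the names `integrated_error` and `select_threshold` are introduced here for the example.

```python
def integrated_error(pe, xi_db_grid, xif_db):
    """Average pe(xi, xi_f) over a grid of xi values (discrete stand-in for the integral)."""
    return sum(pe(xi_db, xif_db) for xi_db in xi_db_grid) / len(xi_db_grid)


def select_threshold(pe, xi_db_grid, xif_db_grid):
    """Return the candidate threshold (in dB) with the smallest averaged error."""
    return min(xif_db_grid,
               key=lambda xif_db: integrated_error(pe, xi_db_grid, xif_db))


def toy_pe(xi_db, xif_db):
    """Hypothetical error curve, NOT equation (21): smallest when xi_f equals xi."""
    return abs(xi_db - xif_db) / 50.0


# Grids quoted in the text: xi from 0 to 30 dB and xi_f from 0 to 50 dB, both in 5 dB steps.
xi_db_grid = list(range(0, 31, 5))
xif_db_grid = list(range(0, 51, 5))

best_xif_db = select_threshold(toy_pe, xi_db_grid, xif_db_grid)
print("threshold selected for the toy error model:", best_xif_db, "dB")
```

Because the toy curve is not the error model of the paper, the threshold it selects says nothing about the 12.5 dB value reported above; only the averaging and minimization steps are meant to carry over.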
2
π½(π, π) = ππππππ [|(ππ (π) β ππ (π))π βπππΏ
+
π
π·π (ππ π (π;π)βππ π (π;π)) βπππ βπ=1 π | 1+|πΎ| πβ πΏ
This cost function yields a nonzero value only for π β π. In this case, the function π½(π, π) can separate the π arguments due to the difference of ππ and ππ . The case of π1 (π‘; πΏ, πΎ) β π2 (π‘; πΏ, πΎ) follows a similar line of argument as above, where the function π½(π, π) becomes
π
2
|ππ (π, π)| ]
This cost function yields a nonzero value only for π β π; thus, the function π½(π, π) is able to distinguish the π arguments. In summary, considering ππ (π‘; πΏ, πΎ) and ππ (π‘; πΏ, πΎ) with respect to the above three cases, only Case 2 and Case 3 are separable.
Fig.2. Probability of error $P_e(\xi, \xi_f)$ for individual $\xi$ values (left) and integrated probability of error for various $\xi_f$ (right). (Axes: $\xi_f$ in dB versus probability of error; the left panel shows one curve per $\xi$ value from 0 dB to 30 dB in 5 dB steps.)
B. Mixture Enhancement Performance

To verify the proposed mixture enhancement method, a test has been conducted to compare the mixture enhancement
Fig.4. Comparison of average SIG for the noisy mixture, standard MMSE, modified MMSE, and proposed mixture enhancement. (Axes: input SNR [dB] versus SIG.)
The proposed mixture enhancement method renders the best quality and intelligibility of the enhanced mixture among the three MMSE methods across the whole range of input SNRs. A visual test has also been conducted using mixed real-audio sources (speech + music) and an uncorrelated additive noise. A clean mixture of speech and musical sources is shown in Fig.5 (a). The noisy mixture consists of the two audio sources and a white Gaussian noise at 5 dB SNR. The enhanced mixture is obtained by applying the proposed enhancement method to the noisy mixture. Visually, the enhanced mixture in Fig.5 (c) has efficiently extracted the spectra of the sources compared with the noisy mixture in Fig.5 (b).
(Fig.5 panels: (a) clean mixture, (b) noisy mixture. Axes: Time from 0 to 2 versus Frequency from 0 to 8000.)
distortion (SIG) [43] has been used as the opinion test of intelligibility. A five-category rating scale is used for each aspect of the evaluation. For SIG, the scale is: 1) very unnatural, very degraded; 2) fairly unnatural, fairly degraded; 3) somewhat natural, somewhat degraded; 4) fairly natural, little degradation; 5) very natural, no degradation. The SIG results are shown in Fig. 4.
method with the original MMSE and the recent modified MMSE [36], using the segmental SNR (SegSNR, in dB) and the perceptual evaluation of speech quality (PESQ) measures [38]. The experiments have been carried out on three types of mixtures, i.e., music + music, speech + music, and speech + speech. For the standard MMSE, the smoothing parameter ππ was set to 0.98 according to [27], which implies a strong correlation between πΜ(π, π) and the corresponding previous enhanced spectral amplitudes. As such, the term πΜ (π, π) is smooth across time, a property that suits stationary signals. Thus, the current frame estimate πΜ(π, π) tends to be smaller than its previous estimate πΜ(π β 1, π). Consequently, the smoothed πΜ(π, π) is underestimated. This leads to over-suppression, which removes not only the noise components but also the source signals, and to a sensitive spectral amplitude estimator π΄Μ(π, π). The modified MMSE, where ππ = 0.5 for low input SNR and ππ = 0.8 for high input SNR, gives better noise suppression and better quality of the reconstructed signals than the standard MMSE method, as shown in Fig.3. However, the modified MMSE requires more computation time for the training step and still removes more source components than the proposed mixture enhancement method. In Fig. 3, the modified MMSE gives lower perceptual intelligibility and quality of the estimated signals than the proposed mixture enhancement method, even though the modified MMSE yields a better SegSNR. Therefore, our proposed mixture enhancement method retains the perceptual quality of the sources and maintains a comparably high SegSNR while being able to reduce noise. The proposed mixture enhancement method yields the best PESQ performance, with average PESQ improvements of 27% and 19% over the standard and modified MMSE methods, respectively.
In the interval of [0, 20] dB input SNR, the proposed mixture enhancement method is able to significantly remove noise from the noisy mixture while retaining the intelligibility of the noise-reduced mixture. As evidenced in Fig.3, the proposed enhancement method achieves average improvements over the noisy mixture of 3.0 dB (76%) in SegSNR and 0.4 (12%) in PESQ.
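For readers unfamiliar with the SegSNR measure used above, a minimal sketch is given below. The frame length and the clamping limits of -10 dB and 35 dB are common conventions assumed here, not values stated in this paper, and the function name `segmental_snr` is introduced only for this example.

```python
import math


def segmental_snr(clean, enhanced, frame_len=256, min_db=-10.0, max_db=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [min_db, max_db].

    frame_len, min_db and max_db are assumed conventions, not taken from the paper.
    """
    if len(clean) != len(enhanced):
        raise ValueError("signals must have the same length")
    eps = 1e-12  # guards against division by zero and log of zero
    frame_values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        est = enhanced[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        error_energy = sum((r - e) ** 2 for r, e in zip(ref, est))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
        frame_values.append(min(max(snr_db, min_db), max_db))
    if not frame_values:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_values) / len(frame_values)


# Example: an estimate that is 10% too quiet gives an error 20 dB below the signal.
clean = [math.sin(0.1 * n) for n in range(1024)]
enhanced = [0.9 * c for c in clean]
print(round(segmental_snr(clean, enhanced), 3))
```

Averaging per-frame values rather than computing one global ratio keeps loud passages from dominating the score, which is why the measure is often paired with a perceptual metric such as PESQ.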
Fig.5. Spectrograms of (a) the original clean mixture, (b) the clean mixture with additive white noise, and (c) the noisy mixture enhanced using the proposed iMMSE-STSA estimator.
C. Choice of ππ for estimating ππ (π)
The adaptive mixing attenuation estimator in (34), i.e., ππ (π) = ππ ππ (π β 1) + (1 β ππ )ππ (π), weights every two consecutive frames of ππ through ππ . To determine ππ , 100 experiments have been conducted on 100 noise-free mixtures by running the proposed algorithm without the enhancement step. Each noise-free mixture is simulated by adding two synthetic nonstationary AR sources. Each nonstationary AR source is synthesized by using the model (3)
Fig.3. SegSNR (top) and PESQ (bottom) on mixtures of two sources and additive noises at different input SNRs. (Axes: input SNR [dB] versus SegSNR [dB] and PESQ; curves: noisy mixture, standard MMSE, modified MMSE, proposed mixture enhancement.)
The subjective testing of signal quality and intelligibility has been conducted based on the ITU-T P.835 standard. The signal
with a length of 2.56 s, which is divided into five sections, i.e., π1 = [0 , 0.51s], π2 = (0.51s , 1.03s], π3 = (1.03s , 1.54s], π4 = (1.54s , 2.05s], and π5 = (2.05s , 2.56s], respectively. The terms ππ π and ππ (π‘) of π π (π‘) are changed section by section. Samples of the synthetic source signals are shown in the top row of Fig.6.
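The construction of such a piecewise-stationary AR source can be sketched as follows. This is an illustration under assumptions, not the exact model (3): an AR order of 2, the pole radii, resonance frequencies, and noise levels are all hypothetical, and the 16 kHz rate is taken from the experimental conditions stated earlier, so that each 0.51 s section spans 8192 samples.

```python
import math
import random

FS = 16000            # sampling rate in Hz (from the stated experimental conditions)
SECTION_LEN = 8192    # 8192 / 16000 = 0.512 s per section, 5 sections = 2.56 s


def ar2_coeffs(radius, freq_hz, fs):
    """AR(2) coefficients for a pole pair at the given radius and frequency.

    A radius below 1 keeps the recursion stable.
    """
    theta = 2.0 * math.pi * freq_hz / fs
    return -2.0 * radius * math.cos(theta), radius * radius


def synth_nonstationary_ar(section_params, section_len, fs, seed=0):
    """Generate s(t) = -a1*s(t-1) - a2*s(t-2) + e(t), switching (a1, a2) and the
    driving-noise level from one section to the next."""
    rng = random.Random(seed)
    samples = []
    prev1 = prev2 = 0.0
    for radius, freq_hz, noise_std in section_params:
        a1, a2 = ar2_coeffs(radius, freq_hz, fs)
        for _ in range(section_len):
            current = -a1 * prev1 - a2 * prev2 + rng.gauss(0.0, noise_std)
            samples.append(current)
            prev2, prev1 = prev1, current
    return samples


# Hypothetical per-section settings: (pole radius, resonance in Hz, noise std).
params = [(0.95, 400.0, 0.10), (0.90, 900.0, 0.20), (0.97, 600.0, 0.05),
          (0.92, 1500.0, 0.15), (0.96, 300.0, 0.10)]
source = synth_nonstationary_ar(params, SECTION_LEN, FS, seed=1)
print(len(source) / FS, "seconds")
```

The filter state is carried across section boundaries so that the waveform stays continuous while its statistics change, which is the property the separation algorithm has to track.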
πΊπβ π (π, π) has been satisfied when |πΆΜπ (π)| < |(ππ (π) β ππ (π) β
single-channel mixture
ππ (πΏ;π) π
0
1+|πΎ| 1
2
Time [s]
Fig.6. Two original sources, noise-free mixture and two estimated sources with ππ = 0.95. (Axes: Time [s] versus Amplitude.)
First, the term ππ is tested over the range from 0.05 to 0.95 in increments of 0.1. From ππ = 0.05 to ππ = 0.85, the average SDR increases only slightly. For 0.85 β€ ππ β€ 0.95, the average SDR rises sharply, with an average improvement of 3 dB per source. The term ππ is then further tested on [0.86, 0.99] in increments of 0.01, and the results are illustrated in Fig.7. The highest average SDR is obtained for ππ between 0.91 and 0.98. Hence, the optimal choice of ππ lies within [0.91, 0.98].
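The recursion in (34) has the form of a first-order recursive average, in which the weight controls how strongly each new frame estimate is pulled toward the previous one. A minimal sketch of that form is given below; the function name `smooth_track`, the initialization, and the noisy test sequence are assumptions made for illustration only.

```python
import random


def smooth_track(raw_estimates, weight, init=None):
    """Recursive average: out[n] = weight * out[n-1] + (1 - weight) * raw[n].

    `init` is the value used before the first frame (defaults to the first raw estimate).
    """
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must lie in [0, 1)")
    previous = raw_estimates[0] if init is None else init
    smoothed = []
    for value in raw_estimates:
        previous = weight * previous + (1.0 - weight) * value
        smoothed.append(previous)
    return smoothed


# Hypothetical frame-wise raw estimates: a slow drift plus estimation noise.
rng = random.Random(0)
raw = [1.0 + 0.001 * n + rng.gauss(0.0, 0.05) for n in range(200)]

# The three weights compared in the text.
for weight in (0.7, 0.95, 0.99):
    track = smooth_track(raw, weight)
    spread = max(track) - min(track)
    print(f"weight={weight}: spread of smoothed track = {spread:.3f}")
```

A small weight lets the track follow every fluctuation of the raw estimate, while a weight close to 1 makes each frame depend almost entirely on its predecessor, which matches the trade-off examined in this subsection.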
according
2
(45).
We
0.7 0.95 0.99 true
|+ 1+|πΎ| 2, the |πΆΜ2 | condition is also true. Therefore, the cost function has correctly assigned all (π, π) units to their respective original sources. This is clearly evident by the same SDR results between the ππ (π) and the ππ (π). Therefore, we selected ππ around 0.95 for all experiments. 1+|πΎ|
Fig.9. |πΆΜπ (π)| condition of π = 1 on the left plot and π = 2 on the plot where the dot-dash line refers to |πΆΜπ (π)| and the continuous line refers to |(ππ (π) β ππ (π) β
πππ (πΏ;π)
πππ (πΏ;π)
2
1+|πΎ|
1+|πΎ|
1+|Ξ³|
)| + |
|+
, π β π.
D. Separation Performance

The separation performance of the proposed method has been assessed using 150 mixtures. The noises have been randomly selected from the NOISEX database, namely pink.wav, destroyerops.wav, and factory2.wav. These noises represent stationary, nonstationary, and highly nonstationary noises, respectively. The proposed approach is compared with the single-channel nonnegative matrix 2-dimensional factorization (SNMF2D) and the single-channel independent component analysis (SCICA) [4]. The SNMF2D parameters are set as follows [4]: the number of factors is 2, the sparsity weight is 1.1, and the numbers of phase shifts and time shifts are 31 and 7, respectively, for music. For speech, both shifts are set to 4. The cost function of SNMF2D is based on the Kullback-Leibler divergence. For SCICA, the number of blocks is 10 with unity time delay.
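The SDR used to score these comparisons is the energy ratio, in dB, between the target component of an estimated source and the sum of its interference, noise, and artifact components [35]. A minimal sketch follows, assuming the four components have already been obtained by the decomposition of [35], which is not reproduced here; the function name `sdr_db` is introduced only for this example.

```python
import math


def sdr_db(s_target, e_interf, e_noise, e_artif):
    """SDR = 10 log10( ||s_target||^2 / ||e_interf + e_noise + e_artif||^2 ).

    All four inputs are equal-length sample sequences that are assumed to come
    from a prior decomposition of the estimated source.
    """
    lengths = {len(s_target), len(e_interf), len(e_noise), len(e_artif)}
    if len(lengths) != 1:
        raise ValueError("all components must have the same length")
    target_energy = sum(v * v for v in s_target)
    distortion_energy = sum((i + n + a) ** 2
                            for i, n, a in zip(e_interf, e_noise, e_artif))
    if distortion_energy == 0.0:
        return math.inf  # no distortion at all
    return 10.0 * math.log10(target_energy / distortion_energy)


# Example: distortion amplitude at one tenth of the target amplitude.
value = sdr_db([1.0] * 4, [0.1] * 4, [0.0] * 4, [0.0] * 4)
print(round(value, 3))
```

The three error terms are summed sample by sample before the energy is taken, so distortions that partially cancel are not counted twice.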
0.5
1.005
0.4 0.3 0.5
have
, thus the |πΆΜ1 (π)| condition is satisfied. For π =
0.6
0.2 0
to
1+|πΎ|
ππ1 (πΏ;π)
coefficient
Coefficient
2 1+|πΎ|
coefficient
Fig.7. Average SDR on the noise-free mixture of two synthetic AR sources with various ππ. (Axes: ππ from 0.86 to 0.99 versus SDR [dB]; curves: estimated s1, estimated s2.)
0.7
|+
estimated s2
15.0
0.8
π
1+|πΎ|
computed the |πΆΜπ (π)| condition for π = 1 and 2 as shown in ππ (πΏ;π) Fig.9. For π = 1, |πΆΜ1 (π)| < |(π2 (π) β π1 (π) β 1 )| + |
0
ππ (πΏ;π)
Fig.8. Mixing coefficients of π1 (π) (true) and π1 (π) for ππ = 0.7, 0.95, 0.99. (Axes: Time [s] versus coefficient.)
Fig.10. Estimated coefficients of π1 (π) (left) and π2 (π) (right). (Axes: Time [s] versus coefficient.)
)| + |
We have plotted an example of π1 (π) against π1 (π) with different ππ values in Fig.8. With ππ = 0.7, the term π1 (π) oscillates strongly. Conversely, when ππ = 0.99, ππ (π) varies slowly and resembles a straight line, because ππ (π) at the π π‘β frame depends 99% on its previous value. When ππ = 0.95, ππ (π) tracks the true ππ (π) very closely. Hence, ππ plays a crucial role in tracking the behavior of ππ (π). Although ππ (π) is an estimate of ππ (π), separation using ππ (π) yields the same SDR as separation using ππ (π), namely 14.7 dB and 14.9 dB for π ΜΜ1 (π‘) and π ΜΜ2 (π‘), respectively. This is because the condition πΊπ=π (π, π)