Online Noisy Single-Channel Source Separation

Online Noisy Single-Channel Source Separation Using Adaptive Spectrum Amplitude Estimator and Masking


N. Tengtrairat, W.L. Woo*, Senior Member, IEEE, S.S. Dlay and B. Gao, Member, IEEE

Abstract: A novel single-channel source separation method is presented to recover the original signals given only a single observed mixture in a noisy environment. The proposed separation method is an online adaptive process and is independent of parameter initialization. In this paper, a noisy pseudo-stereo mixing model is developed by formulating an artificial mixture from the observed mixture, where the signals are modeled by the autoregressive process. The proposed demixing process consists of two steps. First, the noisy mixture is enhanced by selecting the time-frequency (TF) units where a signal is present and computing the mixture spectral amplitude. Second, the parameters associated with each source are estimated adaptively frame by frame and are then used to construct a TF mask for the separation process. To assess the performance of the proposed method, experiments have been conducted on noisy mixtures of real audio sources with nonstationary noise under various SNRs. The experiments show that the proposed algorithm yields superior separation performance compared with existing methods, especially at low input SNR.

Index Terms: Blind source separation, underdetermined mixture, single-channel separation, noise reduction, masking.

I. INTRODUCTION

SINGLE-channel blind source separation (SCBSS) is the process of recovering underlying source signals from an unknown mixture given only a single sensor, without any prior information about the source signals. SCBSS has interested many researchers during the last decade. In the field of biomedical signal processing, SCBSS is used in several different areas. Applications to ECG/EEG recordings given by the electromyography (EMG) signal have been developed to distinguish the heartbeat signal from an observed recording based on diverse approaches, e.g., independent component analysis (ICA), nonnegative matrix factorization (NMF), and singular spectrum analysis (SSA) [1-3]. The conventional ICA approach cannot be directly applied to single-channel source separation. Thus, modified ICA methods were proposed. The single-channel independent component analysis (SCICA) approach in [4] applies the standard ICA to separate the independent signals from a single mixture. The special structure is induced by mapping the observed mixture into a multi-channel model. The algorithm has certain limitations. First, the signals are assumed to be statistically independent. Second, the mixture is assumed to be composed of signals with nonoverlapping spectral densities. SCBSS of EEG recordings based on singular spectrum analysis (SSA) was proposed in [5].

N. Tengtrairat is with Department of Software Engineering, Payap University, Chiang Mai, Thailand. W.L. Woo and S.S. Dlay are with School of Electrical and Electronic Engineering, Newcastle University, England, UK. B. Gao is with School of Automation, University of Electronic Science and Technology of China, Chengdu, China. * (Corresponding author e-mail: [email protected]) This paper appeared in the IEEE Transactions on Signal Processing, vol. 64, no. 7, pp. 1881-1895, 2016.

SSA decomposes a time series into a number of interpretable components with distinct subspaces and selects a subgroup of eigenvalues to reconstruct the original source. Another recent application of SCBSS is image separation in the field of nondestructive test and evaluation (NDT&E) [6-8]. In NDT&E, researchers are interested in the study of defects. An imaging technique is usually used to image the target object when it is excited by an external signal. The captured image is the result of a superposition of several independent events, where each event is associated with a particular physical phenomenon. The aim is to estimate these independent events and monitor the associated physical features in order to detect and monitor defects. In general, SCBSS can be categorized into two groups, i.e., model-based and data-driven methodologies. In this study, we focus on data-driven SCBSS. A popular method is computational auditory scene analysis (CASA). CASA has been proposed for the isolation of speech from noise by using the ideal binary mask (IBM) in the time-frequency domain. A binary masking approach has been introduced to suppress noise in the noisy input while maintaining speech intelligibility. In [9], this method consists of two phases. First, a training phase estimates the IBM by using a Gaussian mixture model (GMM) to label each TF unit as either speech-dominant or noise-dominant. Second, an enhancement phase constructs a binary mask by using the estimated IBM. Later, in [10], a new binary-masking algorithm trained using deep neural networks (DNNs) with unsupervised restricted Boltzmann machines (RBMs) was proposed to improve intelligibility for hearing-impaired listeners by separating speech from noise through IBM estimation. An extension of the GMM with a user-generated exemplar source was proposed in [11]. This work uses an exemplar source provided by an external user to estimate the sources.
Data-driven methods such as the sparse non-negative matrix factorization (SNMF) [12-13] determine a set of bases for each speaker, and a mixture is mapped onto the joint bases of the speakers. SNMF requires no assumption on the sources such as statistical independence or a grammatical model. However, the SNMF method does not model the temporal structure [14] and it requires a large amount of computation to determine the speaker-independent bases. The SNMF2D [15] was proposed, which uses a double convolution to model both the spreading of the spectral bases and the variation of the temporal structure inherent in the sources. Some successes have already been reported in recent literature [16-19] to show the validity of SNMF2D in separating single-channel mixtures. The SNMF has regained interest recently where the domain of interest lies in the complex spectrogram, which gives rise to the complex NMF (CNMF). Some promising results have recently been reported in [20] with adaptive sparseness. On the other hand, binaural source separation generally delivers better separation performance than a single recorder in the underdetermined scenario. The Degenerate Unmixing Estimation Technique (DUET) [21] and its variants [22, 23] have been proposed as separation methods using binary time-frequency (TF) masks. A major advantage of DUET is that the estimates from two channels are combined inherently as part of the clustering process. The DUET algorithm has been demonstrated to recover the underlying sparse sources given two anechoic mixtures in the TF domain. Recently, DUET has been extended to the single-channel mixture and the algorithm was termed the Single Observation Likelihood estimatiOn (SOLO) [24, 25]. SOLO constructs an artificial stereo mixture which is then used to form a binary mask for separation.

All of the above SCBSS algorithms are derived for the noise-free condition and thus lack the robustness needed to solve the problem in noisy environments. Since the presence of noise seriously degrades the performance, many algorithms for handling background noise have been developed. In realistic audio applications, the desired signals are corrupted by additive background noise. Mathematically, noisy single-channel blind source separation (NSCBSS) can be expressed as:

$$x(t) = s_1(t) + s_2(t) + \cdots + s_N(t) + n(t) \tag{1}$$

where $t = 1, 2, \ldots, T$ denotes the time index, $n(t)$ is an unknown noise signal, and the goal is to estimate the sources $s_n(t)$, $n = 1, \ldots, N$, of length $T$ when only the observation signal $x(t)$ is available. A well-known approach to improving the intelligibility and perceptual quality of degraded speech is speech enhancement, which removes background noise from noisy speech. Most of the common enhancement techniques operate in the frequency domain, which can generally be expressed as

$$X(\tau,\omega) = S(\tau,\omega) + N(\tau,\omega) \tag{2}$$

where $X(\tau,\omega)$ is the observed noisy mixture at the $\omega$-th frequency bin of the $\tau$-th frame, $S(\tau,\omega) = \sum_{i=1}^{N} S_i(\tau,\omega)$ is the sum of the source signals (i.e., the mixture signal without noise), and $N(\tau,\omega)$ denotes the noise. An enhanced spectrum of the mixture signal $\tilde{S}(\tau,\omega)$ is given as $\tilde{S}(\tau,\omega) = G(\tau,\omega)X(\tau,\omega)$, where $G(\tau,\omega)$ is a spectral gain. Hence, speech-enhancement performance depends solely on the spectral gain: a frequency-dependent gain function is applied to the spectral components of the noisy speech in an effort to suppress the noise components and improve the quality of the speech components. Many approaches have been established in recent decades, for example the spectral subtraction method, minimum mean-square error (MMSE) estimation, and maximum a posteriori (MAP) estimation. The spectral subtraction method [26] achieves noise reduction by subtracting the estimated noise spectral amplitude from the observed spectral amplitude without regard to the speech spectral components. Second, the MMSE estimator [27] and its more recent versions [28] apply a frequency-dependent gain function to the spectral components of the noisy speech. Its solution is characterized by the noise variance, the a priori SNR, and the a posteriori SNR, where the noise variance is known or can be estimated. Lastly, the speech enhancement method using MAP estimation [29, 30] models the speech probability density function (PDF) by a parametric super-Gaussian function developed from a histogram. This method has an effective noise reduction capability, especially in low-SNR environments, and is superior among the three methods.

In this paper, we consider the NSCBSS problem as one noisy mixture of $N$ unknown source signals. The contributions of the paper are summarized below:

1) It is an online adaptive separation method where the observed mixture is segmented into small frames. The separation process is executed adaptively frame by frame. Hence, the proposed algorithm is suited to real-time signal processing applications.

2) It is an adaptive parameter estimation method. The parameters are adaptively estimated from two consecutive frames. The self-adaptive property is preferred for time-varying signals, especially speech and highly nonstationary noise.

3) It is independent of parameter initialization, i.e., there is no need for random initial inputs or any predetermined structure on the sensors. This renders the proposed method robust.

4) It is computationally simple and does not exploit high-order statistics. Hence it is easy to implement.

To achieve the above, the proposed method requires the following assumptions: the source signals are characterized as AR processes, and the sources satisfy the windowed-disjoint orthogonality (WDO) and the local stationarity of the time-frequency representation.

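Fig. 1. Overview of the proposed algorithm. [Block diagram with the blocks: Formulate Noisy Pseudo-Stereo Mixture ($x_1(t)$, $x_2(t)$); STFT ($X_1(\tau,\omega)$, $X_2(\tau,\omega)$); Enhancement stage: Audio Activity Detection (AAD), Spectral Amplitude Estimator; Separation stage: Mixing Attenuation Estimator ($\bar{a}_j(\tau)$), Construct Mask ($M_j(\tau,\omega)$), Demixing Mixture.]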
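The first block of Fig. 1 forms the pseudo-stereo mixture defined in (5), with the delay kept below the bound in (15). The following is a minimal pure-Python sketch of that block, not the authors' implementation. The function names `max_delay` and `pseudo_stereo` are hypothetical, and two choices are assumptions made here for illustration: samples before the start of the recording are treated as zero, and only non-negative integer delays are accepted.

```python
import math


def max_delay(fs, f_max):
    """Largest integer delay strictly below fs / (2 * f_max), cf. (15)."""
    bound = fs / (2.0 * f_max)
    # ceil(bound) - 1 is the largest integer that is strictly smaller than bound.
    return max(math.ceil(bound) - 1, 0)


def pseudo_stereo(x1, gamma, delta):
    """Form x2(t) = (x1(t) + gamma * x1(t - delta)) / (1 + |gamma|), cf. (5).

    Assumption: x1(t) is taken as zero for t < 0, and delta must be >= 0.
    """
    if delta < 0:
        raise ValueError("this sketch only handles non-negative delays")
    scale = 1.0 + abs(gamma)
    x2 = []
    for t in range(len(x1)):
        delayed = x1[t - delta] if t - delta >= 0 else 0.0
        x2.append((x1[t] + gamma * delayed) / scale)
    return x2
```

For example, with a sampling frequency of 16 kHz and sources band-limited to 4 kHz, the bound in (15) is 2 samples, so the largest admissible integer delay in this sketch is 1.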

An overview of the proposed method is illustrated in Fig. 1. The paper is organized as follows: Section II introduces the noisy pseudo-stereo mixture model. Section III proposes an online demixing method, i.e., the mixture enhancement and the separation process. Section IV presents the separability of the pseudo-stereo model. Experimental results with a series of performance comparisons with other SCBSS methods are presented and discussed in Section V. Finally, Section VI concludes the paper.

II. PROPOSED SINGLE-CHANNEL NOISY MIXING MODEL

A. Proposed Pseudo-Stereo Noisy Mixture Model

In this paper, for simplicity we consider the case of a single-channel noisy mixture of two sources and a noise in the time domain as

$$x_1(t) = s_1(t) + s_2(t) + n_1(t) \tag{3}$$

where $x_1(t)$ is the single-channel mixture, $n_1(t)$ is an additive uncorrelated noise that can be stationary or nonstationary, and $s_1(t)$ and $s_2(t)$ are the original source signals, which are assumed to be modeled by the autoregressive (AR) process [31]:

$$s_j(t) = -\sum_{m=1}^{D_j} a_{s_j}(m;t)\, s_j(t-m) + e_j(t) \tag{4}$$

where $a_{s_j}(m;t)$ denotes the $m$-th order AR coefficient of the $j$-th source at time $t$, $D_j$ is the maximum AR order, and $e_j(t)$ is an independent identically distributed (i.i.d.) random signal with zero mean and variance $\sigma_{e_j}^2$. This model enables us to formulate a virtual mixture by weighting and time-shifting the single-channel mixture $x_1(t)$ as

$$x_2(t) = \frac{x_1(t) + \gamma x_1(t-\delta)}{1+|\gamma|} \tag{5}$$

where $\gamma \in \mathbb{R}$ is the weight parameter and $\delta \in \mathbb{Z}$ is the time-delay. The noisy mixture in (3) and (5) is termed "pseudo-stereo" because it artificially resembles a stereo signal, except that it is given by one location, which results in the same time-delay but different attenuations of the source signals. To show this, we can express (5) in terms of the source signals, AR coefficients and time-delay as

$$\begin{aligned}
x_2(t) &= \frac{x_1(t) + \gamma x_1(t-\delta)}{1+|\gamma|} \\
&= \frac{-a_{s_1}(\delta;t)+\gamma}{1+|\gamma|}\, s_1(t-\delta) + \frac{-a_{s_2}(\delta;t)+\gamma}{1+|\gamma|}\, s_2(t-\delta) \\
&\quad + \frac{e_1(t) - \sum_{m=1,\, m\neq\delta}^{D_1} a_{s_1}(m;t)\, s_1(t-m)}{1+|\gamma|} + \frac{e_2(t) - \sum_{m=1,\, m\neq\delta}^{D_2} a_{s_2}(m;t)\, s_2(t-m)}{1+|\gamma|} \\
&\quad + \frac{n_1(t) + \gamma n_1(t-\delta)}{1+|\gamma|}, \qquad \delta \in \mathbb{Z}
\end{aligned} \tag{6}$$

Defining the following:

$$a_j(t;\delta,\gamma) = \frac{-a_{s_j}(\delta;t)+\gamma}{1+|\gamma|} \tag{7}$$

$$r_j(t;\delta,\gamma) = \frac{e_j(t) - \sum_{m=1,\, m\neq\delta}^{D_j} a_{s_j}(m;t)\, s_j(t-m)}{1+|\gamma|} \tag{8}$$

$$n_2(t;\delta,\gamma) = \frac{n_1(t) + \gamma n_1(t-\delta)}{1+|\gamma|} \tag{9}$$

where $a_j(t;\delta,\gamma)$ and $r_j(t;\delta,\gamma)$ represent the mixing attenuation and the residue of the $j$-th source, respectively, and $n_2(t;\delta,\gamma)$ denotes the noise obtained by weighting and time-shifting the additive noise $n_1(t)$. Using (7)-(9), the overall proposed noisy mixing model can now be formulated in terms of the sources and the noise as

$$\begin{aligned}
x_1(t) &= s_1(t) + s_2(t) + n_1(t) \\
x_2(t) &= a_1(t;\delta,\gamma)\, s_1(t-\delta) + a_2(t;\delta,\gamma)\, s_2(t-\delta) + r_1(t;\delta,\gamma) + r_2(t;\delta,\gamma) + n_2(t;\delta,\gamma)
\end{aligned} \tag{10}$$

B. Time-Frequency Representation

The TF representation of the noisy mixing model is obtained using the Short-Time Fourier Transform (STFT) of $x_j(t)$, $j = 1, 2$, as

$$\begin{aligned}
X_1(\tau,\omega) &= S_1(\tau,\omega) + S_2(\tau,\omega) + N_1(\tau,\omega) \\
X_2(\tau,\omega) &\approx a_1(\tau)e^{-i\omega\delta} S_1(\tau-\delta,\omega) + a_2(\tau)e^{-i\omega\delta} S_2(\tau-\delta,\omega) \\
&\quad - \left( \sum_{m=1,\, m\neq\delta}^{D_1} \frac{a_{s_1}(m;\tau)}{1+|\gamma|} e^{-i\omega m} S_1(\tau-m,\omega) + \sum_{m=1,\, m\neq\delta}^{D_2} \frac{a_{s_2}(m;\tau)}{1+|\gamma|} e^{-i\omega m} S_2(\tau-m,\omega) \right) + N_2(\tau,\omega)
\end{aligned} \tag{11}$$

for all $\tau, \omega$. In (11), we have used the fact that $|e_j(t)| \ll |s_j(t)|$; thus the TF representation of the residue $r_j(t)$ can be simplified to

$$R_j(\tau,\omega) = -\sum_{m=1,\, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau)}{1+|\gamma|} e^{-i\omega m} S_j(\tau-m,\omega) \tag{12}$$

We also define

$$C_j(\tau,\omega) = \frac{1}{1+|\gamma|} \sum_{m=1,\, m\neq\delta}^{D_j} a_{s_j}(m;\tau)\, e^{-i\omega(m-\delta)} \tag{13}$$

which forms a part of $R_j(\tau,\omega)$ without the contribution of the source $S_j(\tau,\omega)$. Notice that the factor $e^{-i\omega\delta}$ is only uniquely specified if $|\omega\delta| < \pi$; otherwise this would cause phase-wrap [32]. Selecting an improper time-delay $\delta$ will lead to phase-wrap if the maximum frequency of the source is exceeded. In order to avoid phase ambiguity, we must satisfy

$$|\omega_{max}\,\delta_{max}| < \pi \tag{14}$$

where $\omega_{max} = 2\pi f_{max}/f_s$, $\delta_{max}$ is the maximum time delay, $f_{max}$ is the maximum frequency present in the sources and $f_s$ is the sampling frequency. Hence, $\delta_{max}$ can be determined from (14) according to

$$\delta_{max} < \frac{f_s}{2 f_{max}} \tag{15}$$

As long as the delay parameter is less than $\delta_{max}$, there will not be any phase ambiguity. This condition will be used to determine the range of $\delta$ in formulating the pseudo-stereo mixture.

III. PROPOSED ONLINE SINGLE-CHANNEL NOISY DEMIXING METHOD

The proposed online single-channel noisy demixing method comprises two main steps. The first step is mixture enhancement, which aims to reduce the additive noise and extract the source information. The second step is the separation process, which isolates the original signals by applying a mask to the noise-reduced mixture. The mask is constructed by evaluating the cost function given by each source-signature estimator.

A. Proposed Single-Channel Mixture Enhancement

1) Audio Activity Detection

The audio activity detection (AAD) method enhances the noisy mixture by selecting the TF units that contain source signals and removing the TF units without source signals. To begin, two statistical hypotheses are set, i.e., $H_0(\tau,\omega)$ and $H_1(\tau,\omega)$, which denote source absence and presence at the $\omega$-th frequency bin of the $\tau$-th frame, respectively:

$$\begin{aligned}
H_0(\tau,\omega) &: \text{Source absence: } X(\tau,\omega) = N(\tau,\omega) \\
H_1(\tau,\omega) &: \text{Source presence: } X(\tau,\omega) = S(\tau,\omega) + N(\tau,\omega)
\end{aligned} \tag{16}$$

where $X(\tau,\omega)$ is a mixture given by $X_1(\tau,\omega)$ or $X_2(\tau,\omega)$, $S(\tau,\omega)$ is the sum of the source signals, i.e., $S(\tau,\omega) = S_1(\tau,\omega) + S_2(\tau,\omega)$, and $N(\tau,\omega)$ is the additive noise. The terms $S(\tau,\omega)$ and $N(\tau,\omega)$ are assumed to be complex Gaussian distributed. Source presence at a particular $(\tau,\omega)$ unit is detected by computing a local source absence probability (LSAP) and selecting the $(\tau,\omega)$ units whose LSAP is less than a local threshold $T_L$, where $T_L$ can be set by the user. The LSAP can be expressed as $p(H_0(\tau,\omega) \mid X(\tau,\omega)) = \ldots$
The term $p(H_1(\tau,\omega) \mid X(\tau,\omega))$ denotes the source presence probability (SPP) given by Bayes' theorem:

$$p(H_1(\tau,\omega) \mid X(\tau,\omega)) = \frac{p(\tilde{X}(\tau,\omega) \mid H_1(\tau,\omega))\, p(H_1)}{p(\tilde{X}(\tau,\omega) \mid H_0(\tau,\omega))\, p(H_0) + p(\tilde{X}(\tau,\omega) \mid H_1(\tau,\omega))\, p(H_1)} = \left( \frac{1+\xi_f}{q_\omega} \exp\left\{ -E\, \frac{|\tilde{X}(\tau,\omega)|^2}{\hat{\sigma}_N^2(\tau,\omega)} \right\} + 1 \right)^{-1} \tag{25}$$

Using the subadditivity property of the absolute value, we obtain $E[\,\ldots$

Eqn. (31) is solved for the a posteriori $\ldots\ \frac{1+\xi_f}{q_\omega}\left( \frac{1}{p(H_1(\tau,\omega) \mid \tilde{X}(\tau,\omega))} - 1 \right)^{-1} \ldots$ (26)

Using $\xi_f$ and $p(H_1(\tau,\omega) \mid X(\tau,\omega)) > 0.08$, the a posteriori SNR then satisfies $\hat{\gamma}_{SNR}(\tau,\omega) > 1$. Hence, the term $\hat{\xi}(\tau,\omega)$ can be obtained by computing both estimators of the previous and current frames. Therefore, to extract source information even when the source components are weak at low input SNR, the proposed iMMSE-STSA first estimates the a posteriori SNR using (26) and then uses this estimate to compute the spectral amplitude. Finally, the estimated spectrum of the mixture can be formulated as

$$\hat{S}(\tau,\omega) = \hat{A}(\tau,\omega)\, e^{i\theta_\omega} \tag{27}$$

In conclusion, the proposed mixture enhancement method benefits the source separation by providing a greater degree of source information: it attempts to select the TF units where a source is present and to reject the TF units that contain only noise. The noise-reduced mixture can now be modeled as $X(\tau,\omega) = \hat{A}(\tau,\omega)e^{i\theta_\omega} + \tilde{N}(\tau,\omega)$, which will then be separated by a binary TF mask.

B. Proposed Single-Channel Source Separation

1) Adaptive Mixing Parameter Estimator

The sources are assumed to satisfy the local stationarity of the time-frequency representation. This refers to the approximation $S_j(\tau-\phi,\omega) \approx S_j(\tau,\omega)$, where $\phi$ is the maximum time-delay (shift) associated with the Short-Time Fourier Transform (STFT) $F^W(\cdot)$ with an appropriate window function $W(\cdot)$. If $\phi$ is small compared with the length of $W(\cdot)$, then $W(\cdot - \phi) \approx W(\cdot)$. Hence, the Fourier transform of a windowed function with shift $\phi$ is approximately the same as the Fourier transform without the shift. For the proposed method, the pseudo-stereo mixture is shifted by $\delta$, and invoking the local stationarity leads to

$$s_j(t-\delta) \xrightarrow{\;STFT\;} e^{-i\omega\delta} S_j(\tau-\delta,\omega) \approx e^{-i\omega\delta} S_j(\tau,\omega), \qquad \forall\delta,\ |\delta| \le \phi \tag{28}$$

Thus, the STFT of $s_j(t-\delta)$ where $|\delta| \le \phi$ is approximately $e^{-i\omega\delta} S_j(\tau,\omega)$ according to the local stationarity. Second, the sources are assumed to satisfy the windowed-disjoint orthogonality (WDO) condition:

$$S_i(\tau,\omega)\, S_j(\tau,\omega) \approx 0, \qquad \forall i \neq j,\ \forall \tau, \omega \tag{29}$$

where $S_i(\tau,\omega)$ and $S_j(\tau,\omega)$ are the STFTs of $s_i(t)$ and $s_j(t)$. Hence, when the $j$-th source is dominant at a particular $(\tau,\omega)$ unit, the noise-reduced mixture can be more specifically expressed as:

$$\begin{aligned}
X_1(\tau,\omega) &= \hat{S}_j(\tau,\omega) + \tilde{N}_1(\tau,\omega) \\
X_2(\tau,\omega) &= a_j(\tau)e^{-i\omega\delta} \hat{S}_j(\tau-\delta,\omega) - \sum_{m=1,\, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau)}{1+|\gamma|} e^{-i\omega m} \hat{S}_j(\tau-m,\omega) + \tilde{N}_2(\tau,\omega) \\
&\approx [a_j(\tau) - C_j(\tau,\omega)]\, e^{-i\omega\delta} \hat{S}_j(\tau,\omega) + \tilde{N}_2(\tau,\omega), \qquad (\tau,\omega) \in \Omega_j
\end{aligned} \tag{30}$$

for $\delta$ and $m \le \phi$. The term $C_j(\tau,\omega) = \frac{1}{1+|\gamma|} \sum_{m=1,\, m\neq\delta}^{D_j} a_{s_j}(m;\tau)\, e^{-i\omega(m-\delta)}$ is given by (13), and $\Omega_j$ is the $j$-th source presence area defined as $\Omega_j := \{(\tau,\omega) : \hat{S}_j(\tau,\omega) \neq 0,\ \hat{S}_k(\tau,\omega) = 0,\ \forall k \neq j\}$. The estimate $\bar{a}_j(\tau,\omega) = a_j(\tau) - C_j(\tau,\omega)$ associated with the $j$-th source can be determined as

$$\bar{a}_j(\tau,\omega) = \frac{\hat{\tilde{X}}_2(\tau,\omega)}{\hat{\tilde{X}}_1(\tau,\omega)}\, e^{i\omega\delta} = \frac{[a_j(\tau) - C_j(\tau,\omega)]\, e^{-i\omega\delta} \hat{S}_j(\tau,\omega) + \tilde{N}_2(\tau,\omega)}{\hat{S}_j(\tau,\omega) + \tilde{N}_1(\tau,\omega)}\, e^{i\omega\delta}$$

$\tilde{N}_1(\tau,\omega)$ and $\tilde{N}_2(\tau,\omega)$ can be assumed to be small after the mixture enhancement step (as shown in Section V.B). In this case, we can express $\bar{a}_j(\tau,\omega)$ as

$$\begin{aligned}
\bar{a}_j(\tau,\omega) &= \frac{[a_j(\tau) - C_j(\tau,\omega)]\, e^{-i\omega\delta} \hat{S}_j(\tau,\omega)}{\hat{S}_j(\tau,\omega)}\, e^{i\omega\delta} \\
&= a_j(\tau) - C_j(\tau,\omega) \\
&= \bar{a}_j^{(r)}(\tau,\omega) + i\,\bar{a}_j^{(i)}(\tau,\omega), \qquad \forall (\tau,\omega) \in \Omega_j
\end{aligned} \tag{31}$$

where $\bar{a}_j^{(r)}(\tau,\omega) = Re\left[\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)} e^{i\omega\delta}\right]$ and $\bar{a}_j^{(i)}(\tau,\omega) = Im\left[\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)} e^{i\omega\delta}\right]$ are the real and imaginary parts of $\bar{a}_j(\tau,\omega)$, respectively, and $i = \sqrt{-1}$. We propose to adaptively estimate $\bar{a}_j(\tau,\omega)$ frame by frame. First, a power-weighted TF histogram is used to estimate $\bar{a}_j(\tau,\omega)$ for each frame, and the TF units are then clustered into a number of groups corresponding to the number of sources in the mixture. The power-weighted histogram is a function of $(\tau,\omega)$ with the weight $\sum |X_1(\tau,\omega) X_2(\tau,\omega)|$; therefore the real and imaginary parts of $\bar{a}_j(\tau,\omega)$ for each frame can be estimated as

$$\begin{aligned}
\bar{a}_j^{(r)}(\tau) &= \frac{\sum_\omega |\hat{\tilde{X}}_1(\tau,\omega)\hat{\tilde{X}}_2(\tau,\omega)|\; Re\left[\frac{\hat{\tilde{X}}_2(\tau,\omega)}{\hat{\tilde{X}}_1(\tau,\omega)} e^{i\omega\delta}\right]}{\sum_\omega |\hat{\tilde{X}}_1(\tau,\omega)\hat{\tilde{X}}_2(\tau,\omega)|} \\
\bar{a}_j^{(i)}(\tau) &= \frac{\sum_\omega |\hat{\tilde{X}}_1(\tau,\omega)\hat{\tilde{X}}_2(\tau,\omega)|\; Im\left[\frac{\hat{\tilde{X}}_2(\tau,\omega)}{\hat{\tilde{X}}_1(\tau,\omega)} e^{i\omega\delta}\right]}{\sum_\omega |\hat{\tilde{X}}_1(\tau,\omega)\hat{\tilde{X}}_2(\tau,\omega)|}
\end{aligned} \tag{32}$$

The above can then be combined to form the estimate of (32) as

$$\bar{a}_j(\tau) = \bar{a}_j^{(r)}(\tau) + i\,\bar{a}_j^{(i)}(\tau) \tag{33}$$

Relating (33) with (31), we can use a similar idea to express $\bar{a}_j(\tau) = \hat{a}_j(\tau) - \hat{C}_j(\tau)$, where $\hat{a}_j(\tau)$ and $\hat{C}_j(\tau)$ are the power-weighted estimates of $a_j(\tau)$ and $C_j(\tau,\omega)$, respectively. Second, the adaptive mixing attenuation estimator $\tilde{\bar{a}}_j(\tau)$ is obtained by smoothing $\tilde{\bar{a}}_j(\tau-1)$ and $\bar{a}_j(\tau)$:

$$\tilde{\bar{a}}_j(\tau) = \zeta_M\, \tilde{\bar{a}}_j(\tau-1) + (1-\zeta_M)\, \bar{a}_j(\tau) \tag{34}$$

where $0 < \zeta_M < 1$ is a smoothing parameter of the adaptive mixing attenuation estimator.

2) Construction of Masks

The binary TF masks can be constructed by labeling each TF unit with the $k$ argument through maximizing the instantaneous likelihood function. The instantaneous likelihood function is derived from the maximum likelihood (ML) method by first formulating the Gaussian likelihood function $p(X_1(\tau,\omega), X_2(\tau,\omega) \mid S_j(\tau,\omega), \bar{a}_j, \sigma^2_{\tilde{N}})$ using (30), maximizing the likelihood function with respect to $S_j(\tau,\omega)$ and then substituting the obtained result into the Gaussian likelihood function. The resulting instantaneous likelihood function assumes the following form:

$$L_j(\tau,\omega) := p(X_1(\tau,\omega), X_2(\tau,\omega) \mid S_j(\tau,\omega), \bar{a}_j, \sigma^2_{\tilde{N}}) = \frac{1}{2\pi} \exp\left( -\frac{1}{2}\, \frac{|\tilde{\bar{a}}_j(\tau) e^{-i\omega\delta} \hat{\tilde{X}}_1(\tau,\omega) - \hat{\tilde{X}}_2(\tau,\omega)|^2}{\tilde{\bar{a}}_j^2(\tau)\, \hat{\sigma}^2_{\tilde{N}_1}(\tau,\omega) + \hat{\sigma}^2_{\tilde{N}_2}(\tau,\omega)} \right) \tag{35}$$

The function $L_j(\tau,\omega)$ clusters every $(\tau,\omega)$ unit to the $j$-th dominating source for $L_j(\tau,\omega) \ge L_k(\tau,\omega)$, $\forall k \neq j$. This process is equivalent to the following minimization problem:

$$J(\tau,\omega) = \underset{k}{\arg\min}\; \frac{|\tilde{\bar{a}}_k(\tau) e^{-i\omega\delta} \hat{\tilde{X}}_1(\tau,\omega) - \hat{\tilde{X}}_2(\tau,\omega)|^2}{\tilde{\bar{a}}_k^2(\tau)\, \hat{\sigma}^2_{\tilde{N}_1}(\tau,\omega) + \hat{\sigma}^2_{\tilde{N}_2}(\tau,\omega)} \tag{36}$$

Using (30), the term $X_2(\tau,\omega)$ can be expressed as:

$$\begin{aligned}
X_2(\tau,\omega) &= a_j(\tau)e^{-i\omega\delta} \hat{S}_j(\tau-\delta,\omega) - \sum_{m=1,\, m\neq\delta}^{D_j} \frac{a_{s_j}(m;\tau)}{1+|\gamma|} e^{-i\omega m} \hat{S}_j(\tau-m,\omega) + \tilde{N}_2(\tau,\omega) \\
&= \frac{\hat{S}_j(\tau,\omega) - E_j(\tau,\omega)}{1+|\gamma|} + \frac{\gamma}{1+|\gamma|} e^{-i\omega\delta} \hat{S}_j(\tau-\delta,\omega) + \frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|} \tilde{N}_1(\tau,\omega)
\end{aligned} \tag{37}$$

By invoking the local stationarity, we then obtain

$$X_2(\tau,\omega) \approx \frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|} \left( \hat{S}_j(\tau,\omega) + \tilde{N}_1(\tau,\omega) \right) - \frac{E_j(\tau,\omega)}{1+|\gamma|}$$

for $\delta \le \phi$. The derivation of $X_2(\tau,\omega)$ in the source domain in (37) allows us to express $X_2(\tau,\omega)$ in the mixture domain as:

$$X_2(\tau,\omega) \approx \left( \frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|} \right) X_1(\tau,\omega) - \frac{E_j(\tau,\omega)}{1+|\gamma|} \tag{38}$$

In this light, the proposed cost function $G_k(\tau,\omega)$ can be formulated based on the single mixture $X_1(\tau,\omega)$ by substituting this expression into (36), which leads to

$$J(\tau,\omega) = \underset{k}{\arg\min}\; G_k(\tau,\omega) \tag{39}$$

$$G_k(\tau,\omega) = \frac{\left| \tilde{\bar{a}}_k(\tau) e^{-i\omega\delta} \hat{\tilde{X}}_1(\tau,\omega) - \left( \frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|} \right) \hat{\tilde{X}}_1(\tau,\omega) \right|^2}{\tilde{\bar{a}}_k^2(\tau)\, \hat{\sigma}^2_{\tilde{N}}(\tau,\omega) + \hat{\sigma}^2_{\tilde{N}}(\tau,\omega)} \tag{40}$$
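The frame-wise estimator in (32)-(34) and the labeling rule in (39)-(40) can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `frame_attenuation`, `smooth` and `label_unit` are hypothetical, the real and imaginary parts of (32) are accumulated together as one complex weighted mean (which is algebraically the same as (33)), a single noise variance `var_n` is used for both denominator terms of (40), and $\tilde{\bar{a}}_k^2$ is interpreted as $|\tilde{\bar{a}}_k|^2$.

```python
import cmath


def frame_attenuation(X1, X2, omegas, delta):
    """Power-weighted attenuation estimate for one frame, cf. (32)-(33).

    X1, X2: complex STFT values over the frequency bins of one frame.
    omegas: normalized angular frequency of each bin.
    """
    num, den = 0j, 0.0
    for x1, x2, w in zip(X1, X2, omegas):
        if x1 == 0:
            continue  # ratio X2/X1 undefined; the weight |X1*X2| is zero anyway
        weight = abs(x1 * x2)
        num += weight * (x2 / x1) * cmath.exp(1j * w * delta)
        den += weight
    return num / den if den > 0 else 0j


def smooth(previous, current, zeta):
    """Recursive smoothing of (34): zeta * previous + (1 - zeta) * current."""
    return zeta * previous + (1.0 - zeta) * current


def label_unit(x1, w, delta, gamma, a_candidates, var_n=1.0):
    """Return the index k that minimizes G_k of (40) for one TF unit, cf. (39)."""
    rot = cmath.exp(-1j * w * delta)
    target = (1.0 + gamma * rot) / (1.0 + abs(gamma))
    costs = []
    for a in a_candidates:
        numerator = abs(a * rot * x1 - target * x1) ** 2
        denominator = (abs(a) ** 2 + 1.0) * var_n  # assumption: equal variances
        costs.append(numerator / denominator)
    return min(range(len(costs)), key=costs.__getitem__)
```

In use, `frame_attenuation` would be evaluated once per cluster and frame, its output passed through `smooth` together with the previous frame's value, and the smoothed values supplied to `label_unit` as `a_candidates`.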

Since 𝑒𝑗 (𝑑) β‰ͺ 𝑠𝑗 (𝑑), the term 𝐸𝑗 (𝜏, πœ”)/(1 + |𝛾|) is negligible. Hence, 𝑋2 (𝜏, πœ”) β‰ˆ (

1+𝛾𝑒 βˆ’π‘–πœ”π›Ώ 1+|𝛾|

) 𝑋1 (𝜏, πœ”). Using (39) and (40), in the

instance when the 𝑗 source dominates at (𝜏, πœ”) ∈ 𝑗 , the function 𝐽(𝜏, πœ”) will correctly identify the source if and only if πΊπ‘˜=𝑗 (𝜏, πœ”) < πΊπ‘˜β‰ π‘— (𝜏, πœ”). To elucidate this condition, firstly, the case when π‘˜ = 𝑗 is considered by setting πœπ‘€ = 0: π‘‘β„Ž

πΊπ‘˜=𝑗 (𝜏, πœ”) = 1+𝛾𝑒 βˆ’π‘–πœ”π›Ώ

Μƒ1 (𝜏, πœ”)) βˆ’ ( Μƒ1 (𝜏, πœ”))| |π‘Žπ‘— (𝜏)𝑒 βˆ’π‘–πœ”π›Ώ (𝑆̂𝑗 (𝜏, πœ”) + 𝑁 ) (𝑆̂𝑗 (𝜏, πœ”) + 𝑁 1+|𝛾| βˆ’π‘–πœ”π›Ώ βˆ’π‘–πœ”π›Ώ βˆ’π‘–πœ”π›Ώ Μ‚ Μ‚ Μƒ = |π‘ŽΜ‚π‘— (𝜏)𝑒 𝑆𝑗 (𝜏, πœ”) βˆ’ 𝐢𝑗 (𝜏)𝑒 𝑆𝑗 (𝜏, πœ”) + π‘Žπ‘— (𝜏)𝑒 𝑁1 (𝜏, πœ”) βˆ’ (

1+𝛾𝑒 βˆ’π‘–πœ”π›Ώ 1+|𝛾|

1+𝛾𝑒 ) 𝑆̂𝑗 (𝜏, πœ”) βˆ’ (

= |βˆ’ (𝐢̂𝑗 (𝜏) + 𝑆̂𝑗 (𝜏,πœ”) 1+|𝛾|

βˆ’π‘–πœ”π›Ώ

1+|𝛾|

Μƒ1 (𝜏, πœ”)| )𝑁

2

2

π‘Žπ‘†π‘— (𝛿;𝜏)

+(

1+|𝛾|

Μƒ1 (𝜏, πœ”) βˆ’ ) 𝑒 βˆ’π‘–πœ”π›Ώ 𝑆̂𝑗 (𝜏, πœ”) + π‘Žπ‘— (𝜏)𝑒 βˆ’π‘–πœ”π›Ώ 𝑁

1+𝛾𝑒 βˆ’π‘–πœ”π›Ώ 1+|𝛾|

Μƒ1 (𝜏, πœ”)| )𝑁

2

When π‘˜ β‰  𝑗, following the above step leads to

(41)

7 𝑆̂𝑗 (𝜏,πœ”) 1+|𝛾|

+(

π‘Žπ‘†π‘— (𝛿;𝜏)

Μ‚ (𝜏, πœ”) + π‘Ž Μƒ (𝜏, πœ”) βˆ’ Μƒ Μ… π‘˜(𝜏)π‘’βˆ’π‘–πœ”π›Ώ 𝑁 ) π‘’βˆ’π‘–πœ”π›Ώ 𝑆 𝑗 1 1+|𝛾| 2 1+𝛾𝑒 βˆ’π‘–πœ”π›Ώ

Μƒ Μ… π‘˜(𝜏) βˆ’ π‘Žπ‘— (𝜏) βˆ’ πΊπ‘˜β‰ π‘— (𝜏, πœ”) = |(π‘Ž

1+|𝛾|

Μƒ1 (𝜏, πœ”)| )𝑁

Finally, convert the estimated sources from TF domain into time domain i.e. 𝑠̂̂𝑗 (𝑑).

(42)

To guarantee that πΊπ‘˜=𝑗 (𝜏, πœ”) < πΊπ‘˜β‰ π‘— (𝜏, πœ”) is always satisfied, then we must specified a condition for 𝐢̂𝑗 . Starting with (41) and (42), we have

IV. ANALYSIS OF SEPARABILITY OF THE PROPOSED PSEUDOSTEREO MIXTURE MODEL

The separability of the noise-free mixing model can be examined from the noise-free pseudo-stereo mixture by 2 π‘Ž (𝛿;𝜏) Μ‚ (𝜏,πœ”) 1+𝛾𝑒 Μƒ1(𝜏, πœ”) βˆ’ (𝑆 Μƒ1(𝜏, πœ”))| considering π‘Žπ‘— (𝑑; 𝛿, 𝛾) and π‘Ÿπ‘— (𝑑; 𝛿, 𝛾) in the following three |(π‘Žπ‘˜ (𝜏) βˆ’ π‘Žπ‘— (𝜏) βˆ’ ) 𝑒 βˆ’π‘–πœ”π›Ώ 𝑆̂𝑗 (𝜏, πœ”) + π‘Žπ‘˜ (𝜏)𝑒 βˆ’π‘–πœ”π›Ώ 𝑁 +( )𝑁 1+|𝛾| 1+|𝛾| 1+|𝛾| cases. Case 1 refers to identical sources mixed in the single (43) channel, Case 2 represents different sources but setting 𝛾 and 𝛿 Eq. (43) is bounded by for the pseudo-stereo mixture such that π‘Ž1 (𝑑; 𝛿, 𝛾) = π‘Ž2 (𝑑; 𝛿, 𝛾), π‘Žπ‘† (𝛿;𝜏) 𝑆̂ (𝜏,πœ”) 1+𝛾𝑒 βˆ’π‘–πœ”π›Ώ 𝑗 and Case 3 corresponds to the most general case where the Μƒ 1 (𝜏, πœ”)| – | 𝑗 Μƒ 1 (𝜏, πœ”)| < |𝐢̂𝑗 (𝜏)𝑆̂𝑗 (𝜏, πœ”)| βˆ’ | )𝑁 𝑆̂𝑗 (𝜏, πœ”) βˆ’ π‘Žπ‘— (𝜏)𝑁 +( 1+|𝛾| 1+|𝛾| 1+|𝛾| π‘Ž (𝛿;𝜏) sources are distinct, and 𝛾 and 𝛿 are selected arbitrarily such that 𝑆̂ (𝜏,πœ”) 1+𝛾𝑒 Μƒ1(𝜏, πœ”)| + | Μƒ1(𝜏, πœ”)| |(π‘Žπ‘˜ (𝜏) βˆ’ π‘Žπ‘— (𝜏) βˆ’ ) 𝑆̂𝑗 (𝜏, πœ”) + π‘Žπ‘˜ (𝜏)𝑁 +( )𝑁 1+|𝛾| 1+|𝛾| 1+|𝛾| the mixing attenuations and residues are also different. The and therefore we obtain above cases are demonstrated by using the functions 𝐽(𝜏, πœ”) and π‘Žπ‘†π‘— (𝛿;𝜏) Μƒ1 (𝜏,πœ”) πΊπ‘˜ (𝜏, πœ”) from Section III.B.2). These function are recapped here 𝑁 |𝐢̂𝑗 (𝜏)| < |(π‘Žπ‘˜ (𝜏) βˆ’ π‘Žπ‘— (𝜏) βˆ’ ) + π‘Žπ‘˜ (𝜏) Μ‚ (𝜏,πœ”) | + 1+|𝛾| 𝑆𝑗 as: |βˆ’ (𝐢̂𝑗 (𝜏) +

π‘Žπ‘†π‘— (𝛿;𝜏)

𝑆̂𝑗 (𝜏,πœ”)

1+|𝛾|

1+|𝛾|

Μƒ1(𝜏, πœ”) βˆ’ ( ) 𝑒 βˆ’π‘–πœ”π›Ώ 𝑆̂𝑗 (𝜏, πœ”) + π‘Žπ‘— (𝜏)𝑒 βˆ’π‘–πœ”π›Ώ 𝑁 𝑆𝑗

π‘Žπ‘†π‘— (𝛿;𝜏) 1+|𝛾|

1+𝛾𝑒 βˆ’π‘–πœ”π›Ώ 1+|𝛾|

2

Μƒ1(𝜏, πœ”))| < )𝑁 βˆ’π‘–πœ”π›Ώ

𝑗

𝑆𝑗

|

+(

βˆ’π‘–πœ”π›Ώ

𝑗

βˆ’ π‘Žπ‘— (𝜏)

Μƒ1 (𝜏,πœ”) 𝑁 | 𝑆̂𝑗 (𝜏,πœ”)

+

2 1+|𝛾|

|1 + (1 + 𝛾𝑒 βˆ’π‘–πœ”π›Ώ )

Μƒ1 (𝜏,πœ”) 𝑁 | 𝑆̂𝑗 (𝜏,πœ”)

𝐽(𝜏, πœ”) = π‘Žπ‘Ÿπ‘”π‘šπ‘–π‘› πΊπ‘˜ (𝜏, πœ”)

Μƒ1 (𝜏, πœ”) has small energy compared with source for βˆ€π‘— β‰  π‘˜. As 𝑁 energy they can be treated as negligible. Hence, Eq.(44) can be simplified to |𝐢̂𝑗 (𝜏)| < |(π‘Žπ‘˜ (𝜏) βˆ’ π‘Žπ‘— (𝜏) βˆ’

π‘Žπ‘†π‘— (𝛿;𝜏) 1+|𝛾|

π‘Žπ‘†π‘— (𝛿;𝜏)

)| + |

1+|𝛾|

|+

2 1+|𝛾|

(45)

If the condition in (45) is satisfied across 𝑗 , the function (39)(40) will then correctly assign the (𝜏, πœ”) unit to the π‘—π‘‘β„Ž source. Once the TF plane of the mixtures are assigned into π‘˜ groups of (𝜏, πœ”) units, the binary TF mask for the π‘—π‘‘β„Ž source can then be constructed as 𝑀𝑗 (𝜏, πœ”) ∢= {

1 if $J(\tau,\omega) = j$, and 0 otherwise. (46)

The proposed algorithm is summarized in Table I.

Table I: Overview of the proposed algorithm
1. Pseudo-stereo mixture step: Formulate the pseudo-stereo mixture $x_2(t)$ using (5).
2. Transform step: Transform the two mixtures $x_1(t)$ and $x_2(t)$ into the TF domain using the STFT.
3. Online single-channel demixing:
   A. Single-channel source enhancement step:
      1) Audio activity detection: Compute the local SAP at the $\tau$th frame and the $\omega$th frequency bin of the two mixtures using (17), and the global SAP for the $\tau$th frame using (18). If the global SAP $> T_G$, update $\hat{\sigma}^2_{\tilde{N}_j}(\tau,\omega)$ using (19).
      2) iMMSE-STSA estimator: Compute the iMMSE estimate of the source spectral amplitude using (24) and form the estimated spectra of the $j$th sources $\tilde{S}(\tau,\omega)$ using (27) for both mixtures.
   B. Separation step:
      1) Compute the mixing attenuation estimators $(a_j^{(r)}(\tau), a_j^{(i)}(\tau))$ at the $\tau$th frame using (32) and (34).
      2) Label the $(\tau,\omega)$ units using (39) and (40), and form the binary TF mask $M_j(\tau,\omega)$. Recover the original sources by $\hat{S}_j(\tau,\omega) = M_j(\tau,\omega) X_1(\tau,\omega)$.

The cost associated with the $k$th argument is

$$G_k(\tau,\omega) = \left| a_k(\tau,\omega) e^{-i\omega\delta} X_1(\tau,\omega) - \left( \frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|} \right) X_1(\tau,\omega) \right|^2 \qquad (49)$$

For each TF unit, the argument $k$ that gives the minimum cost $G_k(\tau,\omega)$ determines the source to which the unit is assigned. We may analyze (49) further by assuming that the $j$th source dominates at a particular TF unit. In this case, the observed mixture in the TF domain reduces to $X_1(\tau,\omega) = S_j(\tau,\omega)$ and therefore (49) becomes

$$
\begin{aligned}
G_k(\tau,\omega) &= \left| a_k(\tau,\omega) e^{-i\omega\delta} S_j(\tau,\omega) - \left( \frac{1+\gamma e^{-i\omega\delta}}{1+|\gamma|} \right) S_j(\tau,\omega) \right|^2 \\
&= \left| a_k(\tau,\omega) e^{-i\omega\delta} S_j(\tau,\omega) - \frac{S_j(\tau,\omega)}{1+|\gamma|} - \frac{\gamma e^{-i\omega\delta}}{1+|\gamma|} S_j(\tau,\omega) \right|^2 \\
&= \left| a_k(\tau) e^{-i\omega\delta} S_j(\tau,\omega) - C_k(\tau,\omega) e^{-i\omega\delta} S_j(\tau,\omega) + \sum_{\substack{m=1 \\ m\neq\delta}}^{D_j} \frac{a_{s_j}(m;\tau) e^{-i\omega m}}{1+|\gamma|} S_j(\tau-m,\omega) - a_j(\tau) e^{-i\omega\delta} S_j(\tau,\omega) \right|^2 \qquad (50)
\end{aligned}
$$
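The labelling and masking steps (choose the argument with the smallest cost for each TF unit, then keep the mixture value only where the unit carries that label) can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names `label_tf_units` and `apply_binary_mask` and the nested-list layout `cost[k][t][f]` are assumptions made here, and the cost values are taken as already computed.

```python
def label_tf_units(cost):
    """Return labels[t][f] = index k of the smallest cost[k][t][f].

    cost is a nested list indexed as cost[k][t][f] (source, frame, bin).
    Ties go to the lowest index, so equal costs carry no information.
    """
    n_src = len(cost)
    n_frames = len(cost[0])
    n_bins = len(cost[0][0])
    return [[min(range(n_src), key=lambda k: cost[k][t][f])
             for f in range(n_bins)]
            for t in range(n_frames)]


def apply_binary_mask(labels, x1, j):
    """Keep x1[t][f] where the unit is labelled j and zero it elsewhere."""
    return [[x if lab == j else 0j for x, lab in zip(x_row, lab_row)]
            for x_row, lab_row in zip(x1, labels)]
```

Because every TF unit receives exactly one label, the masked spectra of all sources sum back to the mixture spectrum. When all costs are equal, the tie-breaking rule alone decides the label, which is the non-separable situation.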

We consider the following three cases:

Case 1: If $a_1(t;\delta,\gamma) = a_2(t;\delta,\gamma) = a(t;\delta,\gamma)$ and $r_1(t;\delta,\gamma) = r_2(t;\delta,\gamma) = r(t;\delta,\gamma)$, then

$$x_2(t;\delta,\gamma) = \left( \frac{-a(\delta;t)+\gamma}{1+|\gamma|} \right) x_1(t-\delta) + 2r(t;\delta,\gamma).$$

In this case, no benefit is achieved at all. The second mixture is simply a time-delayed version of the first mixture multiplied by a scalar, plus the redundant residue. The separability of this case is examined by substituting the pseudo-stereo mixture of Case 1 into the cost function. Since both residues are equal, $C_1(\tau,\omega) = C_2(\tau,\omega) = C(\tau,\omega) = \frac{1}{1+|\gamma|} \sum_{m=1, m\neq\delta}^{D} a_s(m;\tau) e^{-i\omega(m-\delta)}$. For Case 1, the function $J(\tau,\omega)$ given by (50) becomes

$$J(\tau,\omega) = \arg\min_k \left| a(\tau) e^{-i\omega\delta} S_j(\tau,\omega) - C(\tau,\omega) e^{-i\omega\delta} S_j(\tau,\omega) + \sum_{\substack{m=1 \\ m\neq\delta}}^{D} \frac{a_s(m;\tau) e^{-i\omega m}}{1+|\gamma|} S_j(\tau-m,\omega) - a(\tau) e^{-i\omega\delta} S_j(\tau,\omega) \right|^2$$

Invoking the local stationarity of the sources, $S_j(\tau - D_j,\omega) = S_j(\tau,\omega)$ for $|D_j| \leq$ [...], the above leads to

$$J(\tau,\omega) = \arg\min_k \left| \sum_{\substack{m=1 \\ m\neq\delta}}^{D} \frac{a_s(m;\tau) e^{-i\omega m} - a_s(m;\tau) e^{-i\omega m}}{1+|\gamma|} \right|^2 |S_j(\tau,\omega)|^2 = 0 \quad \forall k.$$

As a result, the function $J(\tau,\omega)$ is zero for all $k$ arguments, i.e. $J_1 = J_2 = 0$. In this case, the function $J(\tau,\omega)$ cannot distinguish the $k$ arguments, so the mixture is not separable.

Case 2: If $a_1(t;\delta,\gamma) = a_2(t;\delta,\gamma) = a(t;\delta,\gamma)$ and $r_1(t;\delta,\gamma) \neq r_2(t;\delta,\gamma)$, then

$$x_2(t;\delta,\gamma) = \left( \frac{-a(\delta;t)+\gamma}{1+|\gamma|} \right) x_1(t-\delta) + r_1(t;\delta,\gamma) + r_2(t;\delta,\gamma).$$

This case is almost the same as the previous one and differs only in that $r_1(t;\delta,\gamma) \neq r_2(t;\delta,\gamma)$. As each residue $r_j(t;\delta,\gamma)$ is related to the $j$th source via $C_j(\tau,\omega)$, the separability of this mixture can be analyzed using $J(\tau,\omega)$ and (50) as

$$
\begin{aligned}
J(\tau,\omega) &= \arg\min_k \left| a(\tau) e^{-i\omega\delta} S_j(\tau,\omega) - C_k(\tau,\omega) e^{-i\omega\delta} S_j(\tau,\omega) + \sum_{\substack{m=1 \\ m\neq\delta}}^{D_j} \frac{a_{s_j}(m;\tau) e^{-i\omega m}}{1+|\gamma|} S_j(\tau-m,\omega) - a(\tau) e^{-i\omega\delta} S_j(\tau,\omega) \right|^2 \\
&= \arg\min_k \left| \sum_{\substack{m=1 \\ m\neq\delta}}^{D_j} \frac{a_{s_j}(m;\tau) - a_{s_k}(m;\tau)}{1+|\gamma|} e^{-i\omega m} \right|^2 |S_j(\tau,\omega)|^2
\end{aligned}
$$

It can be deduced from the above that the cost function yields a zero value for $k = j$ and a nonzero value for $k \neq j$. Although the mixing attenuations of both sources are identical, the function $J(\tau,\omega)$ is still able to distinguish the $k$ arguments by using only the difference of the residues. Therefore, the mixture of Case 2 is separable.

Case 3: If $a_1(t;\delta,\gamma) \neq a_2(t;\delta,\gamma)$ and $r_1(t;\delta,\gamma) \neq r_2(t;\delta,\gamma)$ (or $r_1(t;\delta,\gamma) = r_2(t;\delta,\gamma)$), then

$$x_2(t;\delta,\gamma) = \left( \frac{-a_{s_1}(\delta;t)+\gamma}{1+|\gamma|} \right) s_1(t-\delta) + \left( \frac{-a_{s_2}(\delta;t)+\gamma}{1+|\gamma|} \right) s_2(t-\delta) + r_1(t;\delta,\gamma) + r_2(t;\delta,\gamma)$$

We first treat the situation $r_1(t;\delta,\gamma) = r_2(t;\delta,\gamma)$. Since the mixing attenuations $a_1(\tau)$ and $a_2(\tau)$ correspond respectively to $s_1(t)$ and $s_2(t)$, the function $J(\tau,\omega)$ given by (50) can be expressed as

$$
\begin{aligned}
J(\tau,\omega) &= \arg\min_k \left| a_k(\tau) e^{-i\omega\delta} S_j(\tau,\omega) - C(\tau,\omega) e^{-i\omega\delta} S_j(\tau,\omega) + \sum_{\substack{m=1 \\ m\neq\delta}}^{D} \frac{a_s(m;\tau) e^{-i\omega m}}{1+|\gamma|} S_j(\tau-m,\omega) - a_j(\tau) e^{-i\omega\delta} S_j(\tau,\omega) \right|^2 \\
&= \arg\min_k \left| \left( a_k(\tau) - a_j(\tau) \right) e^{-i\omega\delta} \right|^2 |S_j(\tau,\omega)|^2
\end{aligned}
$$

This cost function yields a nonzero value only for $k \neq j$. In this case, the function $J(\tau,\omega)$ can separate the $k$ arguments owing to the difference between $a_k$ and $a_j$. The case $r_1(t;\delta,\gamma) \neq r_2(t;\delta,\gamma)$ follows a similar line of argument, where the function $J(\tau,\omega)$ becomes

$$J(\tau,\omega) = \arg\min_k \left[ \left| \left( a_k(\tau) - a_j(\tau) \right) e^{-i\omega\delta} + \sum_{\substack{m=1 \\ m\neq\delta}}^{D_j} \frac{a_{s_j}(m;\tau) - a_{s_k}(m;\tau)}{1+|\gamma|} e^{-i\omega m} \right|^2 |S_j(\tau,\omega)|^2 \right]$$

This cost function yields a nonzero value only for $k \neq j$; thus the function $J(\tau,\omega)$ is able to distinguish the $k$ arguments. In summary, considering $a_j(t;\delta,\gamma)$ and $r_j(t;\delta,\gamma)$ with respect to the above three cases, only Case 2 and Case 3 are separable.

V. RESULTS AND ANALYSIS

A noisy mixture is generated by adding two sources and an uncorrelated nonstationary noise at various input SNRs. 20 speech signals, 20 music signals and the noise signals are selected from the TIMIT, RWC and Noisex databases, respectively. Additionally, we have conducted experiments to determine the optimal $\xi_f$ and the choice of $\zeta_M$. All experiments are conducted under the same conditions, as follows. The sources are mixed with normalized power over the duration of the signals. All mixed signals are sampled at 16 kHz. The TF representation is computed using the STFT with a 1024-point Hamming window and 50% overlap. The parameters are set as follows: for the pseudo-stereo noisy mixture, $\delta = 2$ and $\gamma = 4$; the smoothing parameters of the noise power and of the a priori SNR estimate are 0.95 and $\zeta_\xi = 0.98$, respectively; and $p(H_0) = p(H_1) = 0.5$. The separation performance is evaluated by measuring the distortion between the original source and the estimated one according to the signal-to-distortion ratio (SDR) [35], defined as

$$\mathrm{SDR} = 10 \log_{10} \left( \frac{\| s_{target} \|^2}{\| e_{interf} + e_{noise} + e_{artif} \|^2} \right)$$

where $e_{interf}$, $e_{noise}$ and $e_{artif}$ represent the interference from other sources, the noise and the artifact signals, respectively. MATLAB is used as the programming platform. All simulations and analyses are performed on a PC with an Intel Core 2 CPU at 3 GHz and 3 GB RAM.

A. Determination of Optimal $\xi_f$ for Mixture Enhancement

The optimal $\xi_f$ is determined by minimizing the proposed integrated probability of error in (21) and (22) in Section III.A.1). The term $\xi$ varies from 0 dB to 30 dB in 5 dB increments. The candidate $\xi_f$ is converted from linear scale to dB (i.e. $10 \log_{10} \xi_f = \xi_f^{dB}$ dB), with $\xi_f^{dB}$ ranging from 0 dB to 50 dB in 5 dB increments.

The left-hand side of Fig. 2 shows the plot of $p_e(\xi,\xi_f)$ for various $\xi$ values. For each individual $\xi$, the minimum $p_e(\xi,\xi_f)$ is obtained at $\xi_f = \hat{\xi}_f = \xi$. Therefore, the optimal $\xi_f$ is set to $\xi$. In a realistic scenario, however, the term $\xi$ is unknown. Thus, the optimal $\xi_f$ in (22) is determined by approximating the integral in (22), discretely evaluating the term at various $\xi$ values and taking the average. The result is shown on the right-hand side of Fig. 2. It can be seen that the range of $\tilde{\xi}_f$ that yields the minimum error is between 10 dB and 15 dB. Based on this result, the optimal $\xi_f$ is set at $10 \log_{10} \xi_f = 12.5$ dB for all experiments.

[Figure 2 panels: probability of error versus $\xi_f^{dB}$ [dB]; the left panel has one curve for each $\xi$ = 0, 5, 10, 15, 20, 25 and 30 dB.]

Fig. 2. Probability of error $p_e(\xi,\xi_f)$ for individual $\xi$ values (left) and integrated probability of error for various $\xi_f$ (right).
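The SDR definition used in the experimental setup can be evaluated directly once the estimated source has been decomposed into its components. The sketch below is a minimal plain-Python illustration, assuming the four components are already available as equal-length sequences; the decomposition itself (described in [35]) is not reproduced, and the function name `sdr_db` is chosen here for illustration.

```python
import math


def sdr_db(s_target, e_interf, e_noise, e_artif):
    """Signal-to-distortion ratio in dB for pre-decomposed components.

    All four arguments are equal-length sequences of samples; the three
    error components are summed sample by sample before taking the energy.
    """
    signal_energy = sum(abs(s) ** 2 for s in s_target)
    error_energy = sum(abs(a + b + c) ** 2
                       for a, b, c in zip(e_interf, e_noise, e_artif))
    return 10.0 * math.log10(signal_energy / error_energy)
```

For example, a target with energy 100 and a total error with energy 1 gives 20 dB. The function raises an error when the total error energy is zero, since the ratio is then undefined.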

B. Mixture Enhancement Performance

To verify the proposed mixture enhancement method, a test has been conducted to compare it with the original MMSE and the recent modified MMSE [36] using the segmental SNR (SegSNR, in dB) and the perceptual evaluation of speech quality (PESQ) measures [38]. The experiments have been assessed on three types of mixtures, i.e. music + music, speech + music, and speech + speech. For the standard MMSE, the smoothing parameter $\zeta_\xi$ was set at 0.98 according to [27], which gives a strong correlation between $\hat{\xi}(\tau,\omega)$ and the corresponding previous enhanced spectral amplitudes. As such, $\hat{\xi}(\tau,\omega)$ is smooth across time, a property that suits stationary signals. Thus, the current frame estimate $\hat{\xi}(\tau,\omega)$ tends to be smaller than its previous estimate $\hat{\xi}(\tau-1,\omega)$. Consequently, the smoothed $\hat{\xi}(\tau,\omega)$ is underestimated. This leads to over-suppression of not only the noise components but also the source signals, and to a sensitive spectral amplitude estimator $\tilde{A}(\tau,\omega)$. The modified MMSE, where $\zeta_\xi = 0.5$ for low input SNR and $\zeta_\xi = 0.8$ for high input SNR, gives better noise suppression and quality of the reconstructed signals than the standard MMSE method, as shown in Fig. 3. However, the modified MMSE demands more computation time for the training step and still removes more source components than the proposed mixture enhancement method. In Fig. 3, the modified MMSE gives lower perceptual intelligibility and quality of the estimated signals than the proposed mixture enhancement method, even though the modified MMSE yields better SegSNR. Therefore, our proposed mixture enhancement method retains the perceptual quality of the sources and maintains a comparably high SegSNR while being able to reduce noise. The proposed mixture enhancement method yields the best PESQ performance, with average PESQ improvements of 27% and 19% over the standard and modified MMSE methods, respectively.

Over the [0, 20] dB input SNR interval, the proposed mixture enhancement method is able to significantly remove noise from the noisy mixture and also retain intelligible perception of the noise-reduced mixture. As evidenced in Fig. 3, the proposed enhancement method gains an average improvement over the noisy mixture of 3.0 dB (76%) for SegSNR and 0.4 (12%) for PESQ.

[Figure 3 panels: SegSNR [dB] and PESQ versus input SNR [dB] for the noisy mixture, standard MMSE, modified MMSE and proposed mixture enhancement.]

Fig. 3. SegSNR (top) and PESQ (bottom) on mixtures of two sources and additive noises at different input SNRs.

The subjective testing of signal quality and intelligibility has been conducted based on the ITU-T standard (P.835). The signal distortion (SIG) [43] has been used as the opinion test of intelligibility. A five-category rating scale is used for each aspect of the evaluation. For SIG, the corresponding scales are: 1) Very unnatural, very degraded; 2) Fairly unnatural, fairly degraded; 3) Somewhat natural, somewhat degraded; 4) Fairly natural, little degradation; 5) Very natural, no degradation. The SIG results are shown in Fig. 4.

[Figure 4 panel: SIG versus input SNR [dB] for the noisy mixture, standard MMSE, modified MMSE and proposed mixture enhancement.]

Fig. 4. Comparison of average SIG testing for the noisy mixture, standard MMSE, modified MMSE, and proposed mixture enhancement.

The proposed mixture enhancement method renders the best quality and intelligibility of the enhanced mixture among the three MMSE methods across the range of input SNRs. A visual test has also been conducted using mixed real-audio sources (speech + music) and an uncorrelated additive noise. A clean mixture of speech and musical sources is shown in Fig. 5(a). The noisy mixture consists of the two audio sources and white Gaussian noise at 5 dB SNR. The enhanced mixture is obtained by applying the proposed enhancement method to the noisy mixture. Visually, the enhanced mixture in Fig. 5(c) has efficiently extracted the source spectra compared with the noisy mixture in Fig. 5(b).

[Figure 5 panels: (a) clean mixture, (b) noisy mixture, (c) enhanced mixture; axes: Frequency (0 to 8000) versus Time (0 to 2).]

Fig. 5. Spectrograms of the original clean mixture, the clean mixture with additive white noise, and the noisy mixture enhanced using the proposed iMMSE-STSA estimator.

C. Choice of $\zeta_M$ for estimating $a_j(\tau)$

The adaptive mixing attenuation estimator in (34), i.e. $\hat{a}_j(\tau) = \zeta_M \hat{a}_j(\tau-1) + (1-\zeta_M) a_j(\tau)$, where $a_j(\tau)$ on the right-hand side denotes the current-frame value, weights every two consecutive frames of $a_j$ through $\zeta_M$. To determine $\zeta_M$, 100 experiments have been conducted on 100 noise-free mixtures by implementing the proposed algorithm but excluding the enhancement step. Each noise-free mixture is simulated by adding two synthetic nonstationary AR sources. Each nonstationary AR source is synthesized using the model (3) with a length of 2.56 s, divided into five sections, i.e. $T_1 = [0, 0.51\,\mathrm{s}]$, $T_2 = (0.51\,\mathrm{s}, 1.03\,\mathrm{s}]$, $T_3 = (1.03\,\mathrm{s}, 1.54\,\mathrm{s}]$, $T_4 = (1.54\,\mathrm{s}, 2.05\,\mathrm{s}]$ and $T_5 = (2.05\,\mathrm{s}, 2.56\,\mathrm{s}]$. The terms $a_{s_j}$ and $e_j(t)$ of $s_j(t)$ are changed section by section. Samples of the synthetic source signals are shown in the top row of Fig. 6.
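The recursive weighting in (34) behaves as a first-order exponential smoother over frames. The following plain-Python sketch illustrates how $\zeta_M$ trades tracking speed against smoothness. It assumes a hypothetical list `a_frames` of per-frame attenuation values and starts the recursion from the first frame; both the name `smooth_attenuation` and this initialization are assumptions of the sketch rather than details taken from the paper.

```python
def smooth_attenuation(a_frames, zeta_m):
    """Recursively smooth per-frame values with weight zeta_m on the past.

    Each output is zeta_m times the previous output plus (1 - zeta_m)
    times the current-frame value. The first output is simply the first
    input value (an assumption made for this sketch).
    """
    if not 0.0 <= zeta_m < 1.0:
        raise ValueError("zeta_m must lie in [0, 1)")
    smoothed = [a_frames[0]]
    for a_now in a_frames[1:]:
        smoothed.append(zeta_m * smoothed[-1] + (1.0 - zeta_m) * a_now)
    return smoothed
```

With `zeta_m = 0.99` each output depends 99% on the previous output, so the sequence changes very slowly, whereas a small `zeta_m` follows every fluctuation of the per-frame values.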

[Figure 6 panels: original source 1, original source 2, single-channel mixture, estimated source 1 and estimated source 2; axes: Amplitude versus Time [s].]

Fig. 6. Two original sources, noise-free mixture and two estimated sources with $\zeta_M = 0.95$.

Firstly, $\zeta_M$ is tested over the range from 0.05 to 0.95 in 0.1 increments. From $\zeta_M = 0.05$ to $\zeta_M = 0.85$, the average SDR increases slightly. Between $0.85 \leq \zeta_M \leq 0.95$, the average SDR rises sharply, with an average improvement of 3 dB per source. The term $\zeta_M$ is then further tested on [0.86, 0.99] in 0.01 increments and the results are illustrated in Fig. 7. The highest average SDR is within the interval of $\zeta_M$ from 0.91 to 0.98. Hence the optimal choice of $\zeta_M$ lies within [0.91, 0.98].

[Figure 7 panel: SDR [dB] of estimated $s_1$ and estimated $s_2$ versus $\zeta_M$ from 0.86 to 0.99.]

Fig. 7. Average SDR on the noise-free mixture of two synthetic AR sources with various $\zeta_M$.

We have plotted an example of the estimated $\hat{a}_1(\tau)$ against the true $a_1(\tau)$ with different $\zeta_M$ values in Fig. 8. With $\zeta_M = 0.7$, $\hat{a}_1(\tau)$ has highly oscillatory values. Conversely, $\hat{a}_j(\tau)$ varies slowly and resembles a straight line when $\zeta_M = 0.99$, because $\hat{a}_j(\tau)$ at the $\tau$th frame depends 99% on its previous value. When $\zeta_M = 0.95$, $\hat{a}_j(\tau)$ tracks the true $a_j(\tau)$ very closely. Hence, $\zeta_M$ has a crucial role in tracking the behavior of $a_j(\tau)$. Although $\hat{a}_j(\tau)$ is an estimate of $a_j(\tau)$, separation using $\hat{a}_j(\tau)$ yields the same SDR as separation using $a_j(\tau)$, at 14.7 dB and 14.9 dB for $\hat{\hat{s}}_1(t)$ and $\hat{\hat{s}}_2(t)$, respectively. This is because the condition $G_{k=j}(\tau,\omega) < G_{k\neq j}(\tau,\omega)$ has been satisfied when

$$|\hat{C}_j(\tau)| < \left| \left( a_k(\tau) - a_j(\tau) - \frac{a_{S_j}(\delta;\tau)}{1+|\gamma|} \right) \right| + \left| \frac{a_{S_j}(\delta;\tau)}{1+|\gamma|} \right| + \frac{2}{1+|\gamma|}, \quad j \neq k,$$

according to (45). We have computed the $|\hat{C}_j(\tau)|$ condition for $j = 1$ and 2 as shown in Fig. 9. For $j = 1$,

$$|\hat{C}_1(\tau)| < \left| \left( a_2(\tau) - a_1(\tau) - \frac{a_{S_1}(\delta;\tau)}{1+|\gamma|} \right) \right| + \left| \frac{a_{S_1}(\delta;\tau)}{1+|\gamma|} \right| + \frac{2}{1+|\gamma|},$$

thus the $|\hat{C}_1(\tau)|$ condition is satisfied. For $j = 2$, the $|\hat{C}_2|$ condition is also true. Therefore, the cost function has correctly assigned all $(\tau,\omega)$ units to their respective original sources. This is clearly evident from the identical SDR results obtained with $\hat{a}_j(\tau)$ and $a_j(\tau)$. Therefore, we selected $\zeta_M$ around 0.95 for all experiments.

[Figure 8 panel: coefficient versus Time [s], with curves for $\zeta_M$ = 0.7, 0.95, 0.99 and the true value.]

Fig. 8. Mixing coefficients $a_1(\tau)$ (true) and $\hat{a}_1(\tau)$ for $\zeta_M$ = 0.7, 0.95 and 0.99.

Fig. 9. $|\hat{C}_j(\tau)|$ condition for $j = 1$ (left plot) and $j = 2$ (right plot), where the dot-dash line refers to $|\hat{C}_j(\tau)|$ and the continuous line refers to $\left| \left( a_k(\tau) - a_j(\tau) - \frac{a_{S_j}(\delta;\tau)}{1+|\gamma|} \right) \right| + \left| \frac{a_{S_j}(\delta;\tau)}{1+|\gamma|} \right| + \frac{2}{1+|\gamma|}$, $j \neq k$.

D. Separation Performance

The separation performance of the proposed method has been assessed using 150 mixtures. The noises have been randomly selected from the NOISEX database: pink.wav, destroyerops.wav and factory2.wav. These noises represent stationary, nonstationary and highly nonstationary noises, respectively. The proposed approach is compared with the single-channel nonnegative matrix 2-dimensional factorization (SNMF2D) and the single-channel independent component analysis (SCICA) [4]. The SNMF2D parameters are set as follows [4]: the number of factors is 2, the sparsity weight is 1.1, and the numbers of phase shifts and time shifts are 31 and 7, respectively, for music. As for speech, both shifts are set to 4. The cost function of SNMF2D is based on the Kullback-Leibler divergence. As for the SCICA, the number of blocks is 10 with unity time delay.

[Figure 10 panels: coefficient versus Time [s].]

Fig. 10. Estimated coefficients of $a_1(\tau)$ (left) and $a_2(\tau)$ (right).