Underdetermined Source Separation Using Speaker Subspace Models
Thesis Defense
Ron Weiss
May 4, 2009
Outline
1 Introduction
2 Speaker subspace model
3 Monaural speech separation
4 Binaural separation
5 Conclusions
Audio source separation
Many real-world signals contain contributions from multiple sources, e.g. the cocktail party problem.
Want to infer the original sources from the mixture, for applications such as robust speech recognition and hearing aids.
Previous work
Instantaneous mixing system: y(t) = A x(t), where y(t) = [y_1(t), ..., y_C(t)]^T collects the C observed channels, x(t) = [x_1(t), ..., x_I(t)]^T collects the I source signals, and A = [a_ci] is the C x I mixing matrix.
Simplest case: more channels than sources (overdetermined); perfect separation is possible.
Use constraints on the source signals to guide separation: independence constraints (e.g. independent component analysis) or spatial constraints (e.g. beamforming).
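The overdetermined case can be sketched numerically. This is not from the thesis, just a minimal numpy illustration with made-up random signals and mixing weights, showing that when the C x I mixing matrix is known and C >= I, a least-squares inverse recovers the sources:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two source signals (I = 2) observed over three channels (C = 3):
# the overdetermined case.
I, C, T = 2, 3, 1000
x = rng.standard_normal((I, T))    # sources, shape (I, T)
A = rng.standard_normal((C, I))    # mixing matrix, shape (C, I)
y = A @ x                          # observed channels, shape (C, T)

# With A known and more channels than sources, the pseudoinverse
# inverts the mixing (up to numerical precision).
x_hat = np.linalg.pinv(A) @ y
print(np.allclose(x_hat, x))       # True
```

The underdetermined case (C < I) has no such inverse, which is why the rest of the talk turns to constraints from source models.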
Underdetermined source separation
More sources than channels, so stronger constraints are needed.
CASA: use perceptual cues similar to the human auditory system. Segment the STFT into short glimpses of each source (by harmonicity, common onset, etc.), group them with sequential grouping heuristics, and create a time-frequency mask for each source.
Alternatively, base inference on prior source models.
Time-frequency masking
[Figure: spectrograms of the mixture, the clean source, the estimated masks, and the reconstructed source (8.2 dB SNR).]
Natural sounds tend to be sparse in time and frequency: 10% of spectrogram cells contain 78% of the energy.
They are also redundant: a source is still intelligible when 22% of its energy is masked.
Model-based separation
Use constraints from prior source models to guide separation, leveraging differences in the spectral characteristics of the sources.
Hidden Markov models over log-spectral features with factorial-model inference, e.g. the IBM Iroquois system [Kristjansson et al., 2006]: speaker-dependent models plus acoustic dynamics and grammar constraints achieve superhuman performance under some conditions.
Model-based separation – Limitations
Relying on speaker-dependent models to disambiguate sources raises a question: what if the task isn't so well defined, with no prior knowledge of speaker identities or grammar?
Using a speaker-independent (SI) model for all sources requires strong temporal constraints, or the sources will permute: "place white by t 4 now" mixed with "lay green with p 9 again" yields the separated source "place white by t p 9 again".
Solution: adapt the speaker-independent model to compensate.
Outline
1 Introduction
2 Speaker subspace model: Model adaptation; Eigenvoices
3 Monaural speech separation
4 Binaural separation
5 Conclusions
Model selection vs. adaptation
[Figure: speaker models in the subspace, with the mean voice, speaker subspace bases, and quantization boundaries.]
Model selection (e.g. [Kristjansson et al., 2006]): given a set of speaker-dependent (SD) models, (1) identify the sources in the mixture, (2) use the corresponding models for separation.
How to generalize to speakers outside the training set? Selection: choose the closest model. Adaptation: interpolate.
Model adaptation
Adjust model parameters to better match the observations.
[Figure: original distribution, observations, and adapted distribution in a two-dimensional feature space.]
Caveats:
1 Want to adapt to a single utterance; not enough data for MLLR or MAP. Need an adaptation framework with few parameters.
2 Observations are a mixture of multiple sources. Use an iterative separation/adaptation algorithm.
Eigenvoice adaptation [Kuhn et al., 2000]
Train a set of SD models and pack each model's parameters into a speaker supervector. The supervectors are samples from the space of speaker variation.
Principal component analysis finds orthonormal bases for the speaker subspace; an adapted model is a linear combination of the bases:
µ = µ̄ + U w + B h
where µ is the adapted model, µ̄ is the mean voice, U are the eigenvoice bases with weights w, and B are the channel bases with weights h.
[Figure: speaker models, speaker subspace bases, and other models in the subspace.]
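The eigenvoice construction above can be sketched in a few lines of numpy. This is not the thesis implementation; it is a toy with random "supervectors" standing in for packed model parameters, and it omits the channel term B h. It shows the PCA step and the adaptation µ = µ̄ + U w:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: 50 speaker-dependent models, each packed into a
# 200-dimensional supervector (here just random data for illustration).
n_speakers, dim, n_bases = 50, 200, 3
supervectors = rng.standard_normal((n_speakers, dim))

# PCA on the supervectors: mean voice plus orthonormal eigenvoice bases.
mean_voice = supervectors.mean(axis=0)
_, _, Vt = np.linalg.svd(supervectors - mean_voice, full_matrices=False)
U = Vt[:n_bases].T                     # (dim, n_bases), orthonormal columns

# Adapting to a new speaker amounts to projecting onto the subspace:
# mu = mean_voice + U w  (channel term B h omitted in this sketch).
new_speaker = rng.standard_normal(dim)
w = U.T @ (new_speaker - mean_voice)   # least-squares eigenvoice weights
mu_adapted = mean_voice + U @ w

# A handful of weights w describes the whole adapted model.
print(w.shape)                         # (3,)
```

The point of the construction is the dimensionality reduction: only n_bases weights per source must be estimated from the mixture, instead of the full supervector.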
Eigenvoice bases
[Figure: mean voice and eigenvoice dimensions 1-3, plotted as spectra (0-8 kHz) for each phone from b through ax.]
Mean voice = speaker-independent model.
Eigenvoices shift formant frequencies and add pitch.
Independent bases capture channel variation.
Outline
1 Introduction
2 Speaker subspace model
3 Monaural speech separation: Mixed signal model; Adaptation algorithm; Experiments
4 Binaural separation
5 Conclusions
Eigenvoice factorial HMM
Model the mixture with a combination of source HMMs. The adaptation parameters w_i are needed to estimate the source signals x_i(t), and vice versa.
Adaptation algorithm
Adaptation example
[Figure: spectrograms of the mixture t32_swil2a_m18_sbar9n, the separated source after adaptation iterations 1, 3, and 5, and SD model separation.]
2006 Speech separation challenge
[Cooke and Lee, 2006]
Single-channel mixtures of utterances from 34 different speakers. Constrained grammar: command(4) color(4) preposition(4) letter(25) digit(10) adverb(4).
Separation/recognition task: determine the letter and digit spoken by the source that said "white".
Performance – Adapted vs. speaker-dependent models
[Figure: separation/recognition performance of adapted models compared to speaker-dependent models across target-to-masker ratios.]
Experiments – Switchboard
[Figure: spectrogram of an example Switchboard mixture.]
What about previously unseen speakers? Switchboard is a corpus of conversational telephone speech: 200+ hours, 500+ speakers.
The task is significantly more difficult than the Speech Separation Challenge: spontaneous speech, a large vocabulary, and significant channel variation across calls.
Switchboard – Results
Adaptation outperforms SD model selection; model selection errors stem from channel variation.
SD performance drops off under mismatched conditions, while speaker-adapted (SA) performance improves as the number of training speakers increases.
Outline
1 Introduction
2 Speaker subspace model
3 Monaural speech separation
4 Binaural separation: Mixed signal model; Parameter estimation and source separation; Experiments
5 Conclusions
Binaural audition
y^ℓ(t) = Σ_i x_i(t − τ_i^ℓ) ∗ h_i^ℓ(t)
y^r(t) = Σ_i x_i(t − τ_i^r) ∗ h_i^r(t)
i.e. each ear's signal is a sum over sources i of the source delayed by an ear-specific time τ_i and convolved with an ear-specific impulse response h_i.
Given a stereo recording of multiple sound sources, utilize spatial cues to aid separation: the interaural time difference (ITD) and the interaural level difference (ILD).
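The two interaural cues can be sketched on a simulated scene. This is not from the thesis; it is a toy numpy example with made-up parameters (an 8-sample delay and a 6 dB level difference, applied with a circular shift for simplicity), estimating ITD from the cross-correlation peak and ILD from the RMS level ratio:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical binaural scene: one broadband source; the right ear hears
# it 8 samples later and 6 dB quieter than the left (values made up).
src = rng.standard_normal(4096)
delay, gain = 8, 10 ** (-6 / 20)
y_l = src
y_r = gain * np.roll(src, delay)   # circular delay keeps the sketch simple

# ITD estimate: lag of the cross-correlation peak between the two ears.
lags = np.arange(-32, 33)
xcorr = [np.dot(y_l, np.roll(y_r, -k)) for k in lags]
itd_samples = int(lags[np.argmax(xcorr)])

# ILD estimate: RMS level ratio between the ears, in dB.
ild_db = 20 * np.log10(np.sqrt(np.mean(y_l**2)) / np.sqrt(np.mean(y_r**2)))

print(itd_samples, round(ild_db))   # → 8 6
```

With a single source both estimates are trivial; the interesting case is multiple sources, where MESSL (next slide) assigns each time-frequency cell its own cue estimates.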
MESSL: Interaural model
[Mandel and Ellis, 2007]
Model-based EM Source Separation and Localization: a probabilistic model of the interaural spectrogram that is independent of the underlying source signals.
Assume each time-frequency cell is dominated by a single source. An EM algorithm learns the model parameters for each source, from which probabilistic time-frequency masks are derived for separation.
MESSL-SP: Source prior
Extend MESSL to include a prior source model: a pre-trained GMM for the speech signals in the mixture, plus a channel model to compensate for the HRTF and reverberation. Eigenvoice adaptation can also be incorporated (MESSL-EV).
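The benefit of adding a source prior to the interaural cues can be illustrated with a toy calculation. This is not the MESSL-SP model itself; it is a hypothetical single time-frequency cell with made-up per-source log-likelihoods, showing how combining independent cues (summing log-likelihoods before normalizing) sharpens the probabilistic mask:

```python
import numpy as np

# Toy per-source log-likelihoods for one time-frequency cell under two
# independent cues (values made up for illustration).
log_interaural = np.array([-1.0, -1.2])   # spatial cue: nearly ambiguous
log_source     = np.array([-0.5, -3.0])   # source prior: favors source 1

def mask(logl):
    """Posterior probability of each source from per-source log-likelihoods."""
    p = np.exp(logl - logl.max())
    return p / p.sum()

m_spatial = mask(log_interaural)
m_joint = mask(log_interaural + log_source)  # combine independent cues

print(m_spatial[0], m_joint[0])  # joint mask is more confident in source 1
```

The spatial cue alone barely prefers source 1, but adding the source-model evidence turns a near-ambiguous cell into a confident mask value, which is the intuition behind MESSL-SP's gains in reverberation.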
Parameter estimation and source separation
Experiments
[Figure: time-frequency masks and reconstruction SNRs on an SSC mixture and a TIMIT mixture: ground truth (12.04 dB), DUET (3.84 dB), 2S-FD-BSS (5.41 dB), MESSL (5.66 dB), MESSL-SP (10.01 dB), MESSL-EV (10.37 dB).]
Mixtures of 2 and 3 speech sources, anechoic and reverberant, evaluated on TIMIT and SSC test data; source models were trained on SSC data (32 components).
The MESSL systems are compared to: DUET, clustering on an ILD/ITD histogram [Yilmaz and Rickard, 2004]; and 2S-FD-BSS, frequency-domain ICA [Sawada et al., 2007].
Experiments – Performance as a function of distractor angle
[Figure: SNR improvement (dB) vs. spatial separation (0-80 degrees) for 2 and 3 sources, anechoic and reverberant, comparing ground truth, MESSL-EV, MESSL-SP, MESSL, 2S-FD-BSS, and DUET.]
Experiments – Matched vs. mismatched
[Figure: average SNR improvement (dB) on GRID and TIMIT mixtures for ground truth, MESSL-EV, MESSL-SP, MESSL, 2S-FD-BSS, and DUET.]
SSC (matched train/test speakers): MESSL-EV and MESSL-SP beat the MESSL baseline by ∼3 dB in reverb, and MESSL-EV beats MESSL-SP by ∼1 dB on anechoic mixtures.
TIMIT (mismatched train/test speakers): only a small difference between MESSL-EV and MESSL-SP.
Summary
Prior signal models for underdetermined source separation: a subspace model for source adaptation that adapts Gaussian means and covariances using a single utterance, with a natural extension to compensate for source-independent channel effects.
Monaural separation: speaker-dependent > speaker-adapted > speaker-independent. Adaptation helps generalize to held-out speakers and improves as the number of training speakers increases.
Binaural separation: extended the MESSL framework to use source models (joint work with M. Mandel). Incorporating even a simple SI model improved performance, with a smaller additional improvement from adaptation.
Contributions
Model-based source separation making minimal assumptions, using subspace adaptation. Extended the model-based approach to binaural separation.
Publications:
Ellis, D. P. W. and Weiss, R. J. (2006). Model-based monaural source separation using a vector-quantized phase-vocoder representation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages V-957–960.
Weiss, R. J. and Ellis, D. P. W. (2006). Estimating single-channel source separation masks: Relevance vector machine classifiers vs. pitch-based masking. In Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition (SAPA), pages 31–36.
Weiss, R. J. and Ellis, D. P. W. (2007). Monaural speech separation using source-adapted models. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 114–117.
Weiss, R. J. and Ellis, D. P. W. (2008). Speech separation using speaker-adapted eigenvoice speech models. Computer Speech and Language, in press.
Weiss, R. J., Mandel, M. I., and Ellis, D. P. W. (2008). Source separation based on binaural cues and source model constraints. In Proc. Interspeech, pages 419–422.
Weiss, R. J. and Ellis, D. P. W. (2009). A variational EM algorithm for learning eigenvoice parameters in mixed signals. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
References
Cooke, M. and Lee, T.-W. (2006). The speech separation challenge.
Kristjansson, T., Hershey, J., Olsen, P., Rennie, S., and Gopinath, R. (2006). Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system. In Proc. Interspeech, pages 97–100.
Kuhn, R., Junqua, J., Nguyen, P., and Niedzielski, N. (2000). Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 8(6):695–707.
Mandel, M. I. and Ellis, D. P. W. (2007). EM localization and separation using interaural level and phase cues. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
Sawada, H., Araki, S., and Makino, S. (2007). A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
Yilmaz, O. and Rickard, S. (2004). Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 52(7):1830–1847.
Extra slides
Factorial HMM separation
Each source signal is characterized by a state sequence through its HMM. Use the Viterbi algorithm to find the maximum-likelihood path through the combined factorial HMM, reconstruct the source signals from the Viterbi path, and aggressively prune unlikely paths to speed up separation.
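The factorial Viterbi search can be sketched on a tiny example. This is not the thesis system; it is a toy with two 2-state HMMs, scalar Gaussian emissions on the summed state means, and made-up numbers, without the pruning mentioned above:

```python
import numpy as np

# Toy factorial Viterbi: two 2-state HMMs, observations are noisy sums of
# the per-state means. The joint chain has 2 x 2 = 4 states; all numbers
# here are illustrative.
means = np.array([0.0, 4.0])            # per-state emission means, both HMMs
trans = np.log(np.array([[0.9, 0.1],
                         [0.1, 0.9]]))  # per-source log transition matrix
obs = np.array([0.2, 3.9, 8.1, 4.2, 0.1])   # mixture observations

states = [(i, j) for i in range(2) for j in range(2)]   # joint state space

def loglik(o, s):
    # Gaussian log-likelihood of the mixed observation given both states.
    return -0.5 * (o - (means[s[0]] + means[s[1]])) ** 2

# Viterbi over the joint (factorial) state space.
T, S = len(obs), len(states)
delta = np.full((T, S), -np.inf)
back = np.zeros((T, S), dtype=int)
delta[0] = [loglik(obs[0], s) for s in states]
for t in range(1, T):
    for j, sj in enumerate(states):
        scores = [delta[t-1, i] + trans[si[0], sj[0]] + trans[si[1], sj[1]]
                  for i, si in enumerate(states)]
        back[t, j] = int(np.argmax(scores))
        delta[t, j] = max(scores) + loglik(obs[t], sj)

# Backtrace the maximum-likelihood joint path.
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path = [states[k] for k in reversed(path)]
print(path)
```

The cost of the joint search is the motivation for pruning: with K states per source and N sources, the naive factorial state space has K^N entries per frame.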
Adaptation algorithm initialization
[Figure: eigenvoice weight space (w1 vs. w2) showing male and female speakers and the coarse quantization of each dimension.]
Fast convergence needs good initialization. We want to differentiate the source models to get the best initial separation: treat each eigenvoice dimension independently, coarsely quantize the weights, and find the most likely combination in the mixture.
Adaptation performance
[Figure: average letter-digit accuracy vs. adaptation iteration (1-15) for different-gender, same-gender, and same-talker mixtures.]
Letter-digit accuracy is averaged across all TMRs. Adaptation clearly improves separation. The same-talker case is hard due to source permutations.
Variational learning
Approximate EM algorithm to estimate the adaptation parameters: treat each source HMM independently, introducing variational parameters to couple them.
Performance – Learning algorithm comparison
Adapting Gaussian covariances as well as means significantly improves performance. The hierarchical algorithm outperforms variational EM, but the variational algorithm is significantly (∼4x) faster; at the same speed, variational EM performs better.
Performance – Comparison to other participants
MESSL-EV: Putting it all together
One big mixture of Gaussians.
Interaural model: ITD, a Gaussian for each source and time delay; ILD, a single Gaussian for each source.
Source model: separate channel responses for each source at each ear; both channels share the eigenvoice adaptation parameters.
Explain each point in the spectrogram by a particular source, time delay, and source-model mixture component.
MESSL-EV example
[Figure: masks and reconstruction SNRs from the IPD alone (0.73 dB), ILD (8.54 dB), source prior (SP, 7.93 dB), and the full combined mask (10.37 dB).]
IPD is informative in low frequencies but not in high frequencies; ILD primarily adds information about high frequencies. The source model introduces correlations across frequency and emphasizes reliable time-frequency regions, which helps resolve ambiguities in the interaural parameters caused by reverberation and spatial aliasing.
Just for fun...