Underdetermined Source Separation Using Speaker Subspace Models
Thesis Defense
Ron Weiss
May 4, 2009
Outline
1 Introduction
2 Speaker subspace model
3 Monaural speech separation
4 Binaural separation
5 Conclusions
Audio source separation
Many real-world signals contain contributions from multiple sources, e.g. the cocktail party problem.
Want to infer the original sources from the mixture, for applications such as robust speech recognition and hearing aids.
Previous work
Instantaneous mixing system: y(t) = A x(t), where y(t) = [y_1(t), ..., y_C(t)]^T collects the C observed channels, x(t) = [x_1(t), ..., x_I(t)]^T collects the I source signals, and A = [a_ci] is the C x I mixing matrix.
Simplest case: more channels than sources (overdetermined); perfect separation is possible.
Use constraints on the source signals to guide separation: independence constraints (e.g. independent component analysis) or spatial constraints (e.g. beamforming).
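The overdetermined case can be sketched numerically. This is not from the thesis, just a minimal numpy illustration with made-up random signals and mixing weights, showing that when the C x I mixing matrix is known and C >= I, a least-squares inverse recovers the sources:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two source signals (I = 2) observed over three channels (C = 3):
# the overdetermined case.
I, C, T = 2, 3, 1000
x = rng.standard_normal((I, T))    # sources, shape (I, T)
A = rng.standard_normal((C, I))    # mixing matrix, shape (C, I)
y = A @ x                          # observed channels, shape (C, T)

# With A known and more channels than sources, the pseudoinverse
# inverts the mixing (up to numerical precision).
x_hat = np.linalg.pinv(A) @ y
print(np.allclose(x_hat, x))       # True
```

The underdetermined case (C < I) has no such inverse, which is why the rest of the talk turns to constraints from source models.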
Underdetermined source separation
More sources than channels, so stronger constraints are needed.
CASA: use perceptual cues similar to the human auditory system. Segment the STFT into short glimpses of each source (by harmonicity, common onset, etc.), group them with sequential grouping heuristics, and create a time-frequency mask for each source.
Alternatively, base inference on prior source models.
Time-frequency masking
[Figure: spectrograms of the mixture, the clean source, the estimated masks, and the reconstructed source (8.2 dB SNR).]
Natural sounds tend to be sparse in time and frequency: 10% of spectrogram cells contain 78% of the energy.
They are also redundant: a source is still intelligible when 22% of its energy is masked.
Model-based separation
Use constraints from prior source models to guide separation, leveraging differences in the spectral characteristics of the sources.
Hidden Markov models over log-spectral features with factorial-model inference, e.g. the IBM Iroquois system [Kristjansson et al., 2006]: speaker-dependent models plus acoustic dynamics and grammar constraints achieve superhuman performance under some conditions.
Model-based separation – Limitations
Relying on speaker-dependent models to disambiguate sources raises a question: what if the task isn't so well defined, with no prior knowledge of speaker identities or grammar?
Using a speaker-independent (SI) model for all sources requires strong temporal constraints, or the sources will permute: "place white by t 4 now" mixed with "lay green with p 9 again" yields the separated source "place white by t p 9 again".
Solution: adapt the speaker-independent model to compensate.
Outline
1 Introduction
2 Speaker subspace model: Model adaptation; Eigenvoices
3 Monaural speech separation
4 Binaural separation
5 Conclusions
Model selection vs. adaptation
[Figure: speaker models in the subspace, with the mean voice, speaker subspace bases, and quantization boundaries.]
Model selection (e.g. [Kristjansson et al., 2006]): given a set of speaker-dependent (SD) models, (1) identify the sources in the mixture, (2) use the corresponding models for separation.
How to generalize to speakers outside the training set? Selection: choose the closest model. Adaptation: interpolate.
Model adaptation
Adjust model parameters to better match the observations.
[Figure: original distribution, observations, and adapted distribution in a two-dimensional feature space.]
Caveats:
1 Want to adapt to a single utterance; not enough data for MLLR or MAP. Need an adaptation framework with few parameters.
2 Observations are a mixture of multiple sources. Use an iterative separation/adaptation algorithm.
Eigenvoice adaptation [Kuhn et al., 2000]
Train a set of SD models and pack each model's parameters into a speaker supervector. The supervectors are samples from the space of speaker variation.
Principal component analysis finds orthonormal bases for the speaker subspace; an adapted model is a linear combination of the bases:
µ = µ̄ + U w + B h
where µ is the adapted model, µ̄ is the mean voice, U are the eigenvoice bases with weights w, and B are the channel bases with weights h.
[Figure: speaker models, speaker subspace bases, and other models in the subspace.]
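The eigenvoice construction above can be sketched in a few lines of numpy. This is not the thesis implementation; it is a toy with random "supervectors" standing in for packed model parameters, and it omits the channel term B h. It shows the PCA step and the adaptation µ = µ̄ + U w:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: 50 speaker-dependent models, each packed into a
# 200-dimensional supervector (here just random data for illustration).
n_speakers, dim, n_bases = 50, 200, 3
supervectors = rng.standard_normal((n_speakers, dim))

# PCA on the supervectors: mean voice plus orthonormal eigenvoice bases.
mean_voice = supervectors.mean(axis=0)
_, _, Vt = np.linalg.svd(supervectors - mean_voice, full_matrices=False)
U = Vt[:n_bases].T                     # (dim, n_bases), orthonormal columns

# Adapting to a new speaker amounts to projecting onto the subspace:
# mu = mean_voice + U w  (channel term B h omitted in this sketch).
new_speaker = rng.standard_normal(dim)
w = U.T @ (new_speaker - mean_voice)   # least-squares eigenvoice weights
mu_adapted = mean_voice + U @ w

# A handful of weights w describes the whole adapted model.
print(w.shape)                         # (3,)
```

The point of the construction is the dimensionality reduction: only n_bases weights per source must be estimated from the mixture, instead of the full supervector.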
Eigenvoice bases
[Figure: mean voice and eigenvoice dimensions 1-3, plotted as spectra (0-8 kHz) for each phone from b through ax.]
Mean voice = speaker-independent model.
Eigenvoices shift formant frequencies and add pitch.
Independent bases capture channel variation.
Outline
1 Introduction
2 Speaker subspace model
3 Monaural speech separation: Mixed signal model; Adaptation algorithm; Experiments
4 Binaural separation
5 Conclusions
Eigenvoice factorial HMM
Model the mixture with a combination of source HMMs. The adaptation parameters w_i are needed to estimate the source signals x_i(t), and vice versa.
Adaptation algorithm
Adaptation example
[Figure: spectrograms of the mixture t32_swil2a_m18_sbar9n, the separated source after adaptation iterations 1, 3, and 5, and SD model separation.]
2006 Speech separation challenge
[Cooke and Lee, 2006]
Single-channel mixtures of utterances from 34 different speakers. Constrained grammar: command(4) color(4) preposition(4) letter(25) digit(10) adverb(4).
Separation/recognition task: determine the letter and digit spoken by the source that said "white".
Performance – Adapted vs. speaker-dependent models
[Figure: separation/recognition performance of adapted models compared to speaker-dependent models across target-to-masker ratios.]
Experiments – Switchboard
[Figure: spectrogram of an example Switchboard mixture.]
What about previously unseen speakers? Switchboard is a corpus of conversational telephone speech: 200+ hours, 500+ speakers.
The task is significantly more difficult than the Speech Separation Challenge: spontaneous speech, a large vocabulary, and significant channel variation across calls.
Switchboard – Results
Adaptation outperforms SD model selection; model selection errors stem from channel variation.
SD performance drops off under mismatched conditions, while speaker-adapted (SA) performance improves as the number of training speakers increases.
Outline
1 Introduction
2 Speaker subspace model
3 Monaural speech separation
4 Binaural separation: Mixed signal model; Parameter estimation and source separation; Experiments
5 Conclusions
Binaural audition
y^ℓ(t) = Σ_i x_i(t − τ_i^ℓ) ∗ h_i^ℓ(t)
y^r(t) = Σ_i x_i(t − τ_i^r) ∗ h_i^r(t)
i.e. each ear's signal is a sum over sources i of the source delayed by an ear-specific time τ_i and convolved with an ear-specific impulse response h_i.
Given a stereo recording of multiple sound sources, utilize spatial cues to aid separation: the interaural time difference (ITD) and the interaural level difference (ILD).
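The two interaural cues can be sketched on a simulated scene. This is not from the thesis; it is a toy numpy example with made-up parameters (an 8-sample delay and a 6 dB level difference, applied with a circular shift for simplicity), estimating ITD from the cross-correlation peak and ILD from the RMS level ratio:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical binaural scene: one broadband source; the right ear hears
# it 8 samples later and 6 dB quieter than the left (values made up).
src = rng.standard_normal(4096)
delay, gain = 8, 10 ** (-6 / 20)
y_l = src
y_r = gain * np.roll(src, delay)   # circular delay keeps the sketch simple

# ITD estimate: lag of the cross-correlation peak between the two ears.
lags = np.arange(-32, 33)
xcorr = [np.dot(y_l, np.roll(y_r, -k)) for k in lags]
itd_samples = int(lags[np.argmax(xcorr)])

# ILD estimate: RMS level ratio between the ears, in dB.
ild_db = 20 * np.log10(np.sqrt(np.mean(y_l**2)) / np.sqrt(np.mean(y_r**2)))

print(itd_samples, round(ild_db))   # → 8 6
```

With a single source both estimates are trivial; the interesting case is multiple sources, where MESSL (next slide) assigns each time-frequency cell its own cue estimates.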
MESSL: Interaural model
[Mandel and Ellis, 2007]
Model-based EM Source Separation and Localization: a probabilistic model of the interaural spectrogram that is independent of the underlying source signals.
Assume each time-frequency cell is dominated by a single source. An EM algorithm learns the model parameters for each source, from which probabilistic time-frequency masks are derived for separation.
MESSL-SP: Source prior
Extend MESSL to include a prior source model: a pre-trained GMM for the speech signals in the mixture, plus a channel model to compensate for the HRTF and reverberation. Eigenvoice adaptation can also be incorporated (MESSL-EV).
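The benefit of adding a source prior to the interaural cues can be illustrated with a toy calculation. This is not the MESSL-SP model itself; it is a hypothetical single time-frequency cell with made-up per-source log-likelihoods, showing how combining independent cues (summing log-likelihoods before normalizing) sharpens the probabilistic mask:

```python
import numpy as np

# Toy per-source log-likelihoods for one time-frequency cell under two
# independent cues (values made up for illustration).
log_interaural = np.array([-1.0, -1.2])   # spatial cue: nearly ambiguous
log_source     = np.array([-0.5, -3.0])   # source prior: favors source 1

def mask(logl):
    """Posterior probability of each source from per-source log-likelihoods."""
    p = np.exp(logl - logl.max())
    return p / p.sum()

m_spatial = mask(log_interaural)
m_joint = mask(log_interaural + log_source)  # combine independent cues

print(m_spatial[0], m_joint[0])  # joint mask is more confident in source 1
```

The spatial cue alone barely prefers source 1, but adding the source-model evidence turns a near-ambiguous cell into a confident mask value, which is the intuition behind MESSL-SP's gains in reverberation.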
Parameter estimation and source separation
Experiments
[Figure: time-frequency masks and reconstruction SNRs on an SSC mixture and a TIMIT mixture: ground truth (12.04 dB), DUET (3.84 dB), 2S-FD-BSS (5.41 dB), MESSL (5.66 dB), MESSL-SP (10.01 dB), MESSL-EV (10.37 dB).]
Mixtures of 2 and 3 speech sources, anechoic and reverberant, evaluated on TIMIT and SSC test data; source models were trained on SSC data (32 components).
The MESSL systems are compared to: DUET, clustering on an ILD/ITD histogram [Yilmaz and Rickard, 2004]; and 2S-FD-BSS, frequency-domain ICA [Sawada et al., 2007].
Experiments – Performance as a function of distractor angle
[Figure: SNR improvement (dB) vs. spatial separation (0-80 degrees) for 2 and 3 sources, anechoic and reverberant, comparing ground truth, MESSL-EV, MESSL-SP, MESSL, 2S-FD-BSS, and DUET.]
Experiments – Matched vs. mismatched
[Figure: average SNR improvement (dB) on GRID and TIMIT mixtures for ground truth, MESSL-EV, MESSL-SP, MESSL, 2S-FD-BSS, and DUET.]
SSC (matched train/test speakers): MESSL-EV and MESSL-SP beat the MESSL baseline by ∼3 dB in reverb, and MESSL-EV beats MESSL-SP by ∼1 dB on anechoic mixtures.
TIMIT (mismatched train/test speakers): only a small difference between MESSL-EV and MESSL-SP.
Summary
Prior signal models for underdetermined source separation: a subspace model for source adaptation that adapts Gaussian means and covariances using a single utterance, with a natural extension to compensate for source-independent channel effects.
Monaural separation: speaker-dependent > speaker-adapted > speaker-independent. Adaptation helps generalize to held-out speakers and improves as the number of training speakers increases.
Binaural separation: extended the MESSL framework to use source models (joint work with M. Mandel). Incorporating even a simple SI model improved performance, with a smaller additional improvement from adaptation.
Contributions
Model-based source separation making minimal assumptions, using subspace adaptation. Extended the model-based approach to binaural separation.
Publications:
Ellis, D. P. W. and Weiss, R. J. (2006). Model-based monaural source separation using a vector-quantized phase-vocoder representation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages V-957–960.
Weiss, R. J. and Ellis, D. P. W. (2006). Estimating single-channel source separation masks: Relevance vector machine classifiers vs. pitch-based masking. In Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition (SAPA), pages 31–36.
Weiss, R. J. and Ellis, D. P. W. (2007). Monaural speech separation using source-adapted models. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 114–117.
Weiss, R. J. and Ellis, D. P. W. (2008). Speech separation using speaker-adapted eigenvoice speech models. Computer Speech and Language, in press.
Weiss, R. J., Mandel, M. I., and Ellis, D. P. W. (2008). Source separation based on binaural cues and source model constraints. In Proc. Interspeech, pages 419–422.
Weiss, R. J. and Ellis, D. P. W. (2009). A variational EM algorithm for learning eigenvoice parameters in mixed signals. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
References
Cooke, M. and Lee, T.-W. (2006). The speech separation challenge.
Kristjansson, T., Hershey, J., Olsen, P., Rennie, S., and Gopinath, R. (2006). Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system. In Proc. Interspeech, pages 97–100.
Kuhn, R., Junqua, J., Nguyen, P., and Niedzielski, N. (2000). Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 8(6):695–707.
Mandel, M. I. and Ellis, D. P. W. (2007). EM localization and separation using interaural level and phase cues. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
Sawada, H., Araki, S., and Makino, S. (2007). A two-stage frequency-domain blind source separation method for underdetermined convolutive mixtures. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
Yilmaz, O. and Rickard, S. (2004). Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing, 52(7):1830–1847.
Extra slides
Factorial HMM separation
Each source signal is characterized by a state sequence through its HMM. Use the Viterbi algorithm to find the maximum-likelihood path through the combined factorial HMM, reconstruct the source signals from the Viterbi path, and aggressively prune unlikely paths to speed up separation.
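The factorial Viterbi search can be sketched on a tiny example. This is not the thesis system; it is a toy with two 2-state HMMs, scalar Gaussian emissions on the summed state means, and made-up numbers, without the pruning mentioned above:

```python
import numpy as np

# Toy factorial Viterbi: two 2-state HMMs, observations are noisy sums of
# the per-state means. The joint chain has 2 x 2 = 4 states; all numbers
# here are illustrative.
means = np.array([0.0, 4.0])            # per-state emission means, both HMMs
trans = np.log(np.array([[0.9, 0.1],
                         [0.1, 0.9]]))  # per-source log transition matrix
obs = np.array([0.2, 3.9, 8.1, 4.2, 0.1])   # mixture observations

states = [(i, j) for i in range(2) for j in range(2)]   # joint state space

def loglik(o, s):
    # Gaussian log-likelihood of the mixed observation given both states.
    return -0.5 * (o - (means[s[0]] + means[s[1]])) ** 2

# Viterbi over the joint (factorial) state space.
T, S = len(obs), len(states)
delta = np.full((T, S), -np.inf)
back = np.zeros((T, S), dtype=int)
delta[0] = [loglik(obs[0], s) for s in states]
for t in range(1, T):
    for j, sj in enumerate(states):
        scores = [delta[t-1, i] + trans[si[0], sj[0]] + trans[si[1], sj[1]]
                  for i, si in enumerate(states)]
        back[t, j] = int(np.argmax(scores))
        delta[t, j] = max(scores) + loglik(obs[t], sj)

# Backtrace the maximum-likelihood joint path.
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path = [states[k] for k in reversed(path)]
print(path)
```

The cost of the joint search is the motivation for pruning: with K states per source and N sources, the naive factorial state space has K^N entries per frame.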
Adaptation algorithm initialization
[Figure: eigenvoice weight space (w1 vs. w2) showing male and female speakers and the coarse quantization of each dimension.]
Fast convergence needs good initialization. We want to differentiate the source models to get the best initial separation: treat each eigenvoice dimension independently, coarsely quantize the weights, and find the most likely combination in the mixture.
Adaptation performance
[Figure: average letter-digit accuracy vs. adaptation iteration (1-15) for different-gender, same-gender, and same-talker mixtures.]
Letter-digit accuracy is averaged across all TMRs. Adaptation clearly improves separation. The same-talker case is hard due to source permutations.
Variational learning
Approximate EM algorithm to estimate the adaptation parameters: treat each source HMM independently, introducing variational parameters to couple them.
Performance – Learning algorithm comparison
Adapting Gaussian covariances as well as means significantly improves performance. The hierarchical algorithm outperforms variational EM, but the variational algorithm is significantly (∼4x) faster; at the same speed, variational EM performs better.
Performance – Comparison to other participants
MESSL-EV: Putting it all together
One big mixture of Gaussians.
Interaural model: ITD, a Gaussian for each source and time delay; ILD, a single Gaussian for each source.
Source model: separate channel responses for each source at each ear; both channels share the eigenvoice adaptation parameters.
Explain each point in the spectrogram by a particular source, time delay, and source-model mixture component.
MESSL-EV example
[Figure: masks and reconstruction SNRs from the IPD alone (0.73 dB), ILD (8.54 dB), source prior (SP, 7.93 dB), and the full combined mask (10.37 dB).]
IPD is informative in low frequencies but not in high frequencies; ILD primarily adds information about high frequencies. The source model introduces correlations across frequency and emphasizes reliable time-frequency regions, which helps resolve ambiguities in the interaural parameters caused by reverberation and spatial aliasing.
Just for fun...