Adaptive Noise Cancelation in Speech Signals

Brno University of Technology
Faculty of Electrical Engineering and Communication
Department of Telecommunications

DOCTORAL THESIS

Adaptive Noise Cancelation in Speech Signals

Vladimír Malenovský

October 2006

BRNO UNIVERSITY OF TECHNOLOGY FACULTY OF ELECTRICAL ENGINEERING AND COMMUNICATION

Adaptive Noise Cancelation in Speech Signals by

Vladimír Malenovský, Brno University of Technology, 2006

Submitted to the Faculty of Electrical Engineering and Communication as an integral part of the application for the degree Doctor of Philosophy

Submitted: October 30, 2006

Supervisor: Ing. Zdeněk Smékal, Ph.D., Professor of Digital Signal Processing, Department of Telecommunications, Faculty of Electrical Engineering and Communication, Brno University of Technology

This thesis is available in the library of the Department of Telecommunications of Brno University of Technology.

Acknowledgements

I would like to express my great respect and appreciation to Prof. Zdeněk Smékal, who accepted me as a Ph.D. student and let me do research under his guidance. Prof. Smékal was the first person in my life who taught me what professional research is and how interesting it might be. I would also like to thank him for his endless support during the whole period of my studies and for the opportunities to participate in several projects and teaching activities.

My thanks also go to Prof. Kjeld Hermansen, who agreed to become my supervisor during my internship at Aalborg University in Denmark. Without his support I would never even have found out what adaptive filtering is. His ability to "look at the problem through different glasses" helped me many times in overcoming the most demanding problems. I am also grateful to Søren Skovgård Christensen, Tarik Vaizović, Ulrich Bjørn and Karsten Nybrø Johansen. These students from Denmark persuaded me that successful research goes hand in hand with the ability to communicate with others and to analyze critically.

I am grateful to my longtime friend and colleague Kamil Švancara for discussions about signal processing, adaptive filtering and the impact of "good" literature on research. I also wish to thank my colleagues at the Department of Telecommunications in Brno: Václav Eksler, Michal Soumar, Radek Zezula and Martin Vítek. They always had time to share their knowledge and helped me in situations when I started to lose my head.

Last, but by no means least, I am deeply grateful to my wife Jana, my parents Eduard and Alice, my sister Mirka and my wife's parents Jaroslav and Naděžda. Without their love, patience, understanding and support my life would not be that fruitful and this thesis would probably never have been finished.

My research has been supported by the Faculty of Electrical Engineering and Computer Science through the programme Erasmus/Socrates, partially by the FRVŠ Grant Agency of the Czech Republic under projects No. 1642/2004/G1 and No. 1064/2005/G1, and by the project COST OC277 (EU programme). Some projects were also carried out in cooperation with STROM Telecom Prague, s.r.o.


Abstract

Today, adaptive algorithms represent one of the most frequently used computational tools for the processing of digital speech signals. This work investigates and analyzes the properties of adaptive algorithms in speech communication applications where rigorous conditions apply, such as noise and echo cancelation. Like other theses in this field, it tries to tackle the ever-lasting problem of "computational complexity vs. rate of convergence". It introduces some new adaptive methods that stem from the existing algorithms, as well as a novel concept which has been entitled Optimal Step-Size (OSS).

In the first part of the thesis we investigate some well-known, widely used adaptive techniques, such as the Normalized Least Mean Squares (NLMS) and the Recursive Least Squares (RLS). In spite of the fact that the NLMS and the RLS belong to the "simplest" principles as far as complexity is concerned, they are widely adopted across the spectrum of telecommunication products. It appears, however, that with the development of voice-operated devices, future systems will require more powerful noise and echo suppression than can be achieved with the conventional methods.

In the second part of the thesis we will show that, by modifying these methods, better results can be achieved. In particular, we will introduce the Fast Block Least Mean Squares (FBLMS) method and the Simultaneous Perturbation Stochastic Approximation (SPSA) method as examples of methods that are based on the conventional LMS algorithm but, thanks to some sophisticated modifications, yield better results. For the purpose of comparing the developed methods with the existing ones we conduct several experiments in three different applications: system identification, background noise suppression and inverse filtering.

In the last part of the thesis we describe a novel approach to adaptive filtering that we developed and named OSS. The proposed method possesses some features that make it robust even in a non-stationary environment. Experimental evaluations will show that its performance is comparable with that of the conventional methods and in some cases even exhibits an improvement.

At the end of the thesis we discuss some issues pertaining to the testing of echo cancelers. In particular, we describe the software solution that we developed according to the ITU-T G.168 recommendation. We have designed a stand-alone application in the Matlab Graphical User Interface Development Environment (GUIDE) that can be used to test the performance of Adaptive Echo Cancelers (AECs). By carrying out a graphical analysis, users are able to reveal the weak parts of their methods and estimate whether or not they meet the G.168 conditions.


Abstrakt

Adaptive algorithms are nowadays one of the most widely used computational tools for digital signal processing. This thesis deals with the research and analysis of the properties of adaptive algorithms that are used for the suppression of noise and echo in speech signals. As in many other research works in this field, the aim is to find a solution to the problem of "computational complexity vs. rate of convergence". We have focused above all on a detailed analysis of the existing methods that are used most frequently in practice and on improving their properties. In addition to the modified methods, we have also succeeded in proposing our own concept of adaptive filtering, which is based on the OSS (Optimal Step-Size) principle.

In the first part of this work we focus on the analysis of the adaptive methods for noise suppression in speech signals that are currently used the most, i.e. the NLMS (Normalized Least-Mean Square) and the RLS (Recursive Least Square). Despite the fact that these adaptive methods belong among the "simpler" ones, they are very popular and can be found in almost every telephone set. It is becoming evident, however, that in connection with the development of voice-operated devices the suppression of noise and echo will need to be even better and more effective. Owing to their simplicity, the existing methods cannot achieve better properties, and it is therefore necessary to look for alternative methods that would satisfy even demanding requirements.

In the second part of this work we therefore attempt to modify these methods and show that, if a higher computational complexity is accepted, an improvement of some properties can be achieved. Of the methods investigated we focus in particular on the FBLMS (Frequency-domain Block LMS) and the SPSA-LMS (Simultaneous Perturbation Stochastic Approximation LMS) as examples of modifications of the classical NLMS method. For the purpose of comparing the individual algorithms with each other we have designed three experimental test applications, namely identification of an unknown system, suppression of background noise, and inverse filtering.

In the pivotal part of this doctoral thesis we deal with a new concept of adaptive filtering, which we have developed and named OSS (Optimal Step-Size). It is a gradient adaptive algorithm in which the step size is determined so that the error function is minimized in the given direction. The experimental results that form part of this work show that the method is sufficiently robust even in environments with an abrupt change of parameters and that it achieves properties comparable with (and sometimes better than) the existing methods. A disadvantage, however, is the increase in computational complexity.

In conclusion we also mention the possibilities of testing echo cancelers according to the ITU-T G.168 recommendation. We have designed our own application, which serves for testing AEC (Adaptive Echo Canceler) systems intended for echo suppression. The graphical interface of the application allows the user to observe the properties of a given method, to reveal its weak points and to determine whether or not it satisfies the requirements of the ITU-T G.168 recommendation.


Contents

Acknowledgements
Abstract
Abstrakt
Introduction

1 Basic features of the speech enhancement system
  1.1 Speech quality and intelligibility
  1.2 Additive noise and echoes
    1.2.1 Speech communication system
    1.2.2 Background noise
    1.2.3 Acoustic echo
    1.2.4 Electric (line) echo
    1.2.5 Amplifier distortion
    1.2.6 Multiplicative distortion
    1.2.7 Combined distortion
  1.3 Models of the speech signal
    1.3.1 Autoregressive (AR) model
    1.3.2 Probabilistic model
    1.3.3 Damped complex sinusoidal model
    1.3.4 Hidden Markov model (HMM)
  1.4 Estimation criteria and cost functions

2 Overview of the speech enhancement methods
  2.1 Spectral subtraction
  2.2 Wiener filtering
  2.3 Estimation-Maximization (EM) approach
  2.4 Signal subspace decomposition
  2.5 HMM-based strategies in speech enhancement

3 Introduction to adaptive algorithms
  3.1 Applications
  3.2 The basic adaptive system
  3.3 Objective function
  3.4 Properties of adaptive algorithms
    3.4.1 Convergence rate
    3.4.2 Computational complexity
    3.4.3 Misadjustment (steady-state error)
    3.4.4 Robustness
    3.4.5 Structural efficiency
  3.5 Experimental evaluation used throughout this thesis
    3.5.1 Input signals
    3.5.2 Models of unknown systems
    3.5.3 EXPEV1 - Identification of an impulse response of acoustic environment
    3.5.4 EXPEV2 - Dual-channel suppression of background noise from the speech signal
    3.5.5 EXPEV3 - Inverse filtering of a distorted speech signal
  3.6 Reference Algorithm 1 - The Normalized Least Mean Square (NLMS)
    3.6.1 Quadratic nature of the MSE function
    3.6.2 Experimental evaluation of the convergence performance
    3.6.3 Computational complexity analysis
    3.6.4 Convergence rate deterioration due to correlated input data
    3.6.5 Convergence rate deterioration due to high number of filter coefficients
  3.7 Reference Algorithm 2 - The Recursive Least Squares (RLS)
    3.7.1 Experimental evaluation of the convergence performance
    3.7.2 Computational complexity analysis
  3.8 Conclusion

4 Alternative stochastic gradient adaptive algorithms
  4.1 Methods derived from the LMS (NLMS) algorithm
    4.1.1 Description of algorithms
    4.1.2 Experimental evaluation of the convergence rate
    4.1.3 Computational complexity analysis
    4.1.4 Conclusion
  4.2 Fast Block-LMS (FBLMS) algorithm
    4.2.1 Calculation of the linear convolution term y(k) = A(k)w^T(k)
    4.2.2 Calculation of the linear correlation term φ(k) = A^T(k)e(k)
    4.2.3 Normalization of the tap-weight correction term
    4.2.4 Experimental evaluation of the convergence performance
    4.2.5 Computational complexity analysis
    4.2.6 Conclusion
  4.3 Simultaneous perturbation stochastic approximation (SPSA) algorithm
    4.3.1 Robbins-Monro stochastic approximation (R-M SA)
    4.3.2 Kiefer-Wolfowitz finite-difference SA algorithm
    4.3.3 Simultaneous perturbation
    4.3.4 The Bernoulli movement
    4.3.5 Experimental evaluation of the convergence performance

5 Optimal step-size strategy in adaptive filtering
  5.1 Affine projection algorithm (APA)
    5.1.1 The zero-forcing principle
    5.1.2 The basic APA algorithm
  5.2 Optimal step-size strategy
    5.2.1 Variable step algorithms
    5.2.2 The mathematical description of the OSS principle
    5.2.3 Exponential averaging of the correlation matrix and the cross-correlation vector
    5.2.4 Regularization of the optimal-step correction term
  5.3 Experimental evaluation of the OSS-LMS algorithm
    5.3.1 Results of the EXPEV1 experiment
    5.3.2 Results of the EXPEV2 experiment
    5.3.3 Results of the EXPEV3 experiment
    5.3.4 Computational complexity analysis
  5.4 Summary of the experimental evaluation

6 ITU-T G.168 echo canceler testing
  6.1 Structure of line-echo canceling systems
  6.2 Echo delay
  6.3 Testing signals
  6.4 Signal level measurement
  6.5 Description of the testing software

7 Conclusions and future work
  7.1 Conclusion
  7.2 Future work

A Graphical user interface of the application for echo canceler testing

Biography

List of Figures

1.1 Stages of the speech communication system, which are subject to noise corruption and/or distortion
1.2 Common types of noise signals in speech communication systems (left - waveform, right - spectrogram), a) clean speech, b) white noise, c) pink noise, d) machine gun, e) F16 airplane, f) babble, g) standing car (Volvo), h) a car passing by on the street
1.3 Speech modeling using AR system driven by a random noise, a) block diagram, b) signal flow graph
2.1 Schematic structure of the spectral subtraction method
2.2 The principle of iterative Wiener filtering using speech modeling
2.3 Block diagram of the signal subspace method for speech enhancement with time-domain constraint; SVD denotes singular value decomposition
2.4 Schematic description of the GMM
3.1 Schematic diagram of an adaptive system, parametrized by θ and driven by a reference signal
3.2 Block diagram of an adaptive filter (tapped-delay line)
3.3 Scattered plots of the input vectors x(n) = [x_n, x_{n-1}], n = 1, . . . , 3000, corresponding to a) white noise, b) low-pass filtered white noise with fc = 0.6 fs/2 and c) low-pass filtered white noise with fc = 0.3 fs/2
3.4 A snapshot of the testing speech signal and its grammatical segmentation; Fs = 8000 Hz, 16 bits/sample, duration 1.8125 s
3.5 (a) Car interior impulse response estimated using FIR filter (M = 256) and MLS sequence; (b) Magnitude spectrum
3.6 (a) Impulse response of the 20th order low-pass FIR filter with fc = 0.3; (b) Magnitude spectrum
3.7 (a) Impulse response of a raised cosine filter with fc = 0.25 and ∆b = 0.3; (b) Magnitude spectrum
3.8 (a) Exponentially decaying random impulse response; (b) Magnitude spectrum
3.9 Schematic diagram of the EXPEV1 system identification scenario
3.10 Schematic diagram of the EXPEV2 dual-channel system for background noise suppression
3.11 Description of the EXPEV3 experimental system used for inverse filtering of a distorted speech signal
3.12 MSE performance of the filter with two tap-weights a) quadratic function J(w0, w1) with single global minimum, b) elliptic character of the loci of points with equal J(w)
3.13 Convergence of the MSE function of the NLMS algorithm for different step sizes
3.14 Contour plot of the 2-dimensional tap-weight adjustment process of the NLMS algorithm
3.15 Deterioration of the NLMS convergence performance due to correlated input data (speech); a) Mean-square value of the error signal (MSE); b) Mean-square deviation of the tap-weight vector (MSD)
3.16 Deterioration of the NLMS convergence performance due to high number of filter coefficients; a) Mean-square value of the error signal (MSE); b) Mean-square deviation of the tap-weight vector (MSD)
3.17 Convergence rate analysis of the RLS algorithm under EXPEV1 conditions and comparison with the NLMS; the mean values were obtained by averaging over 100 independent trials
3.18 Experimental comparison of the convergence rate of the NLMS and RLS algorithms under EXPEV1 conditions with speech input; the mean values were obtained by averaging over 100 independent trials; the parameters of the algorithms were tuned to achieve the highest rate of convergence
3.19 Discrepancy between the computational complexity and the convergence time of NLMS and RLS algorithms; the symbol "?" indicates an algorithm that would be ideal in terms of low complexity and high convergence rate
4.1 Regularization of the input correlation matrix; (a) before regularization, χ(R) = 1.72; (b) after regularization with γ = 0.1, χ(R) = 12.07
4.2 Nonlinear function of the type "dead-zone" used in DZ-LMS algorithm
4.3 Convergence performance of the signed algorithms - mutual comparison
4.4 Results of the EXPEV1 experiment for the alternative stochastic gradient algorithms; (a) Mean-square error; (b) Mean-square deviation
4.5 Schematic flow graph of the FBLMS algorithm; "the correlation constraint" involves the operations of DFT, replacement of the last L samples by zeros and inverse DFT, in that order
4.6 Experimental convergence analysis of the FBLMS algorithm under EXPEV1 scenario; colored input (K = 20, fc = 0.8), random exp. decaying impulse response, sign inversion of the parameters at n = 1000; (a) Mean-square deviation, (b) comparison of the final values of tap-weight estimates with true values in both parts of experiment, (c) MSD for decreased SNR (5 dB), (d) MSD for more colored input data (fc = 0.3)
4.7 Comparison of the number of multiplications and additions of the LMS, NLMS and FBLMS algorithms depending on the block length N (for K = N in table 4.2)
4.8 Influence of various parameters on sequences a(n) and c(n); a) Influence of the parameter α on the sequence a(n) (a = 0.16, A = 100); b) Influence of the parameter γ on the sequence c(n) (c = √0.3); c) Influence of the gain a on the sequence a(n) (α = 0.602, A = 100); d) Influence of the stabilization constant A on the sequence a(n) (a = 0.16, α = 0.602); e) Influence of the gain c on the sequence c(n) (γ = 0.101)
4.9 Properties of the Bernoulli random process for a 2-dimensional case; a) Bernoulli movement in the parameter space (starting point [0, 0], transition of each wi at approx. [+1,-1] steps, subject to white noise perturbation with the variance of 0.2; b) Histogram of [+1] successes in 100 Bernoulli trials c) Probability distribution of Bernoulli differences [+1,+1],[+1,-1],[-1,+1] and [-1,-1] generated by a pseudo-random generator with normal distribution
4.10 Convergence analysis of the SPSA algorithm and comparison with the NLMS; SPSA parameters: N = 2, α = 0.602, γ = 0.101, c = 1e-3, A = 200, a = 0.66, NLMS parameters: N = 2, µN = 0.01; a) loci of w0(n) versus w1(n); b),c) squared-error performance
5.1 Geometrical interpretation of the zero-forcing principle in the NLMS algorithm. The estimate of the weight vector is projected on the line which is perpendicular to the input vector and contains the optimal filter vector w0. Taken from [58]
5.2 Optimal step-size weight adjustment; every line of the movement is calculated using averaged values of R and p
5.3 Impulse response of the exponential averaging filter for various values of λ
5.4 MSD function of the proposed method, compared with the NLMS and the RLS. In the middle of the experiment (i.e. when n = 1000) the coefficients of the unknown FIR filter are reversed.
5.5 MSE function of the proposed method, compared with the NLMS and the RLS. In the middle of the experiment (i.e. when n = 1000) the coefficients of the unknown FIR filter are reversed.
5.6 (a) Impulse response of an unknown system (20th order low-pass FIR filter with fc = 0.3), (b) the magnitude spectrum, (c) evolution of the model coefficients
5.7 Convergence of the OSS method on a speech signal under EXPEV2 experiment
5.8 Speech signal corrupted by additive noise (blue) and its version cleaned by the OSS algorithm (yellow)
5.9 Results of the EXPEV3 experiment - modeling the inverse function. MSD plot of the OSS, NLMS and RLS algorithms
5.10 EXPEV3 - time evolution of the impulse response of the adaptive filter (OSS method)
6.1 Schematic diagram of the line-echo canceling system. DTD - double talk detector, NLP - nonlinear processor, CNG - comfort noise generator
6.2 An impulse response of a hybrid circuit incl. network elements measured in a real north American network
6.3 (a) CSS signal, 44.1 kHz (b) PSD of the CSS, 8192 pt. FFT using Hanning window
6.4 Schematic diagram of a short-term (instantaneous) level measurement device
6.5 Testing scenario: left column - operations performed, right column - devices/tools used. The Network Echo Canceler (NEC) is a network echo canceler and Graphical User Interface (GUI) is a graphical user interface. Grey field represents software application and orange field represents hardware.
A.1 Snapshot of the G.168 Echo Canceler Testing application

List of Tables

3.1 Analysis of the number of operations in the NLMS algorithm for the frame length K = 256 and various number of filter coefficients
3.2 Analysis of the number of operations in the RLS algorithm for the frame length K = 256 and various number of filter coefficients
4.1 Summary of the computational complexity analysis of the adaptive algorithms discussed in section 4; N denotes the filter order and K denotes the number of output samples; † - 2KN for SD-LMS, K(2N + 1) for SS-LMS; ‡ - the med(.) operation is not included
4.2 Computational complexity analysis of the FBLMS algorithm for the frame length K = 256 and various number of filter coefficients
4.3 Analysis of the number of operations in the SPSA algorithm for the frame length K = 256 and various number of filter coefficients
5.1 Computational complexity analysis of the OSS algorithm in its basic form for the frame length K = 256 and various number of filter coefficients

List of abbreviations and acronyms

APA      Affine Projection Algorithm
AR       Auto-Regressive
CNG      Comfort Noise Generator
CSS      Composite Source Signal
DFT      Discrete Fourier Transform
DSP      Digital Signal Processor
DTD      Double-Talk Detector
DZ-LMS   Dead-Zone Least Mean Squares
EM       Estimation-Maximization
ERL      Echo Return Loss
FBLMS    Fast Block Least Mean Squares
FDSA     Finite Difference Stochastic Approximation
FIR      Finite Impulse Response
FFT      Fast Fourier Transform
GMM      Gaussian Mixture Model
GUI      Graphical User Interface
GUIDE    Matlab Graphical User Interface Development Environment
HMM      Hidden Markov Model
IIR      Infinite Impulse Response
IS       Itakura-Saito
ISI      Inter-Symbol Interference
KLT      Karhunen-Loève Transform
LP       Linear Prediction
LPC      Linear Predictive Coding
LMF      Least Mean Fourth
LS       Least Squares
MAC      Multiply and Accumulate
MAP      Maximum A-Posteriori
ML       Maximum Likelihood
MLS      Maximum-Length Sequence
MMSE     Minimum Mean-Square Error
MOS      Mean Opinion Score
MSD      Mean Square Deviation
MSE      Mean-Square Error
MUSHRA   MUlti Stimulus test with Hidden Reference and Anchors
MV       Minimum Variance
NEC      Network Echo Canceler
NLMS     Normalized Least Mean Squares
NLP      Non-Linear Processor
OSS      Optimal Step-Size
PCM      Pulse Code Modulation
PDF      Probability Density Function
PESQ     Perceptual Evaluation of Speech Quality
PS       Power Spectrum
PSD      Power Spectral Density
RLS      Recursive Least Squares
R-M SA   Robbins-Monro Stochastic Approximation
RMS      Root Mean Square
SD-LMS   Signed Data Least Mean Squares
SE-LMS   Signed Error Least Mean Squares
SNR      Signal to Noise Ratio
SPSA     Simultaneous Perturbation Stochastic Approximation
SS-LMS   Signed Signed Least Mean Squares
TDC      Time-Domain Constraint
VAD      Voice Activity Detector
VS-NLMS  Variable-Step Normalized Least Mean Squares
WAV      Windows Audio Video
ZFA      Zero-Forcing Algorithm

Introduction

In many areas of today's communication environments we often encounter situations where the clarity and intelligibility of speech are impaired due to ubiquitous noise. Since speech has always been one of the most important carriers of information for people, it becomes a challenge to maintain its high quality. It is therefore desirable that noise and, as a matter of fact, everything which is not pertinent to the speech information carried, is canceled or suppressed. The problem of noise suppression is sometimes complicated by psychoacoustic aspects of speech perception and the peculiarities of human hearing.

There are many types of noise distortion in the field of speech communication. At the same time, there are many signal processing concepts which assist in mitigating their effect. While noise reduction is an art that dates back decades, it is still far from a perfect science and new approaches are still being proposed. Among them, the adaptive signal processing concept offers some advantages over the other methods that have made it attractive to developers since the 1980s. Sophisticated methods allowing efficient noise cancelation became feasible as the power of Digital Signal Processor (DSP) chips increased. This trend continues up to the present day.

The aim of the thesis

In this thesis we try to achieve three goals: to analyze, to develop and to apply adaptive algorithms in the noise cancelation of speech signals. By analyzing the structure and the performance of conventional adaptive algorithms such as the NLMS and the RLS we want to reveal the strengths and weaknesses of these methods and understand the crucial parts of the adaptive process. By developing brand new concepts or modifying the existing ones we seek to achieve better performance and increase the quality of the speech signal. For this purpose we specify several conditions and criteria (rate of convergence, computational complexity, residual error level, ability to track sudden changes of parameters) in order to determine which method yields the best results. Finally, we test the adaptive methods in certain applications that are often encountered in the field of speech communications. These include system identification, background noise suppression and inverse filtering.


Original contributions of this thesis

This thesis focuses on the use of adaptive algorithms developed for noise and echo cancelation in speech signals. The main objectives of the thesis may be summarized as follows:

• Comparative analysis and evaluation of the NLMS algorithm and the RLS algorithm in terms of computational complexity, steady-state level of error and rate of convergence.

• Mathematical description of the SPSA algorithm using the stochastic perturbation principle based on the conventional NLMS algorithm.

• Research and development of the OSS concept, analysis of its performance and experimental evaluation of the algorithm in three different applications.

• Description of the ITU-T G.168 Echo Canceler Testing application with the graphical user interface and support for the most important tests in the recommendation.

Scope of chapters

In this thesis we begin our discussion by outlining the basic properties of the speech enhancement system. We show various models for modeling speech signals as well as estimation criteria and cost functions which are commonly used to quantify the efficiency of adaptive algorithms. A short review of non-adaptive speech enhancement techniques at the end of the chapter should help readers gain an insight into the problem of noise cancelation.

In the second chapter we concentrate on adaptive algorithms, their properties and the concept of adaptive filtering for noise and echo cancelation. We provide a list of criteria and conditions that the adaptive methods should meet in order to be applicable in speech enhancement systems. Three typical applications are set apart and described in detail, since they are used for the experimental evaluation of the methods: system modeling, background noise cancelation and inverse filtering. Finally, we introduce two well-known and widely used adaptive methods, namely the NLMS (Normalized Least-Mean Square) and the RLS (Recursive Least-Square). At the end of the chapter we present some results of convergence rate analysis and computational complexity analysis to set a performance reference against which the other methods will be compared.

In the third chapter we investigate the possibility of enhancing the existing methods in order to achieve better performance. We first concentrate on alternative methods that stem from the LMS or NLMS algorithm. Most of these methods are nonlinear and more or less simple. Then we analyze the FBLMS (Frequency Block LMS) algorithm, which is highly suitable for practical use since it is computationally efficient and yet provides satisfactory performance. At the end of the chapter we provide a description of a novel approach called SPSA, which makes use of the principle of stochastic perturbation and whose structure is the same as that of the NLMS method. We evaluate its performance and draw a conclusion with respect to its efficiency.

In the last chapter, which is the crucial part of this thesis, we present a novel approach to adaptive filtering which we have developed and entitled the OSS (Optimal Step-Size). The algorithm is similar to the Affine Projection Algorithm (APA) but it differs in several aspects and has been developed from a different point of view. In the OSS we concentrate on the convergence process, its direction and speed. We try to achieve the best results even in environments with a sudden change of parameters. At the same time we try to keep the computational complexity as low as possible so that the method can be implemented in real time. Finally, in the last part we make some remarks on the system for testing echo cancelers according to recommendation ITU-T G.168. We outline our own solution to the problem of off-line testing and describe an application that we developed for this purpose.


Chapter 1

Basic features of the speech enhancement system

Before starting with a mathematical description of the adaptive algorithms, we must establish some framework for the speech enhancement system (noise cancelation system). In this chapter we describe its basic properties and expand this topic in more detail in subsequent chapters. First of all we specify two popular criteria for speech quality assessment, namely speech quality and speech intelligibility. These two distinct properties of a speech signal impose different requirements on the speech enhancement system. Typically, we would like to achieve both the maximum quality and the maximum intelligibility. However, both of these conditions can rarely be satisfied at the same time. Therefore, a suitable compromise has to be found so as to attain the maximum efficiency.

We also address the issue of noise and echo. We describe several types of noise that are frequently encountered in noise canceling applications. In particular, additive noise which is superimposed on the speech signal may represent everything that is unwanted with regard to speech, e.g. rush, hum, engine noise, babble, murmur, etc. Echo is a special type of noise signal which has speech-like characteristics. It is encountered in classical telephony and hands-free communications.

In many algorithms that are discussed here or in the literature it is assumed that speech is modeled by a simplified system that is easy to implement and yet efficient. In this chapter we outline the Auto-Regressive (AR) modeling approach, which is adopted by a wide range of algorithms. We also describe some basic properties of the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM). These two models play an important role in speech recognition. Recently, however, their relevance to speech enhancement has also been recognized.

1.1 Speech quality and intelligibility

Every environment in which people communicate has its own specifics that distinguish it from other environments. The human auditory system is a complex, sophisticated system which processes sounds (speech, audio, etc.) and extracts information for the brain.


Since humans have binaural hearing, the spectrum of information they are able to extract is rather versatile. The character of the environment, the number of sources, their mutual positions and distances, the loudness of the sound, its clarity or impression - all these factors contribute to the quality of speech perception. In the case of speech communication we find, however, that a certain type of information may be more important than another. It depends on the method and the purpose of communication. In particular, speech quality and speech intelligibility seem to be the two most important features, since they represent both the subjective and the objective quality of speech signals [18].

Speech quality

The quality of speech signals is a subjective measure which reflects the way the signal is perceived by listeners. It can be expressed in terms of how pleasant the signal sounds or how much effort is required from the listeners in order to understand what is being said.

Speech intelligibility

Intelligibility, on the other hand, is an objective measure of the amount of information which can be extracted by listeners from a given signal, whether the signal is clean or noisy. A given signal may be of high quality and low intelligibility and vice versa. Hence, the two measures are independent of each other.

Both the quality and the intelligibility of a given signal are evaluated based on tests performed on human listeners. In real systems, however, we need a well-defined quantitative measure of either the quality or the intelligibility which would be easy to implement and easy to execute at any time. Since no exact mathematical description of these measures, whether a formula or an empirical equation, is known to date, any implemented measure will be approximate. We can only hope that it will give reasonable results correlated with reality. Some of the objective measures used by speech enhancement algorithms will be discussed in section 1.4.

1.2 Additive noise and echoes

In the previous section we introduced the problem of speech enhancement and its necessity for a successful improvement of speech quality and/or intelligibility. In this section we specify the types of noise corrupting the speech signal. We present some statistical properties of these signals and we try to develop a framework for their elimination.

1.2.1 Speech communication system

Any system involving transmission, acquisition, or generation of speech is subject to a wide range of disparate influences. These can include external interferences, e.g. the background noise in a recording, but can also extend to echoic effects or nonlinear distortion introduced by analog electro-acoustic devices or amplifiers [12]. A generic speech communication system connecting two speaking persons is shown in Fig. 1.1.

Figure 1.1: Stages of the speech communication system, which are subject to noise corruption and/or distortion

We may mathematically describe the signal deterioration as

$$x(n) = f\left(s(n), \dots, s(n-S),\; w(n), \dots, w(n-W),\; x(n-1), \dots, x(n-X)\right), \qquad (1.1)$$

where s represents a clean speech signal, w represents the noise and x represents the noisy speech signal. This general formula encompasses several types of signal deterioration and distortion frequently encountered in speech enhancement applications.

1.2.2 Background noise

A microphone picking up the voice of a local speaker may be corrupted by various sources of noise which come from outside of the speaker's proximity. An example may be the engine noise in a car interior or airplane cockpit, the wind, the rush on a busy street or in a building, the babble of other persons not participating in the communication and many others. Usually, background noise is considered to be additive, since at any time instant the noise is simply added to the clean speech signal

$$x(n) = s(n) + w(n). \qquad (1.2)$$
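To make the additive model (1.2) concrete, the following minimal sketch (not part of the original thesis experiments; the synthetic signals and the 5 dB target SNR are illustrative) scales a noise signal so that it corrupts a clean signal at a prescribed signal-to-noise ratio.

```python
import numpy as np

# Illustration of the additive noise model x(n) = s(n) + w(n) of eq. (1.2):
# the noise is scaled so that the mixture has a chosen SNR.

def mix_at_snr(s, w, snr_db):
    """Return x = s + scaled noise, with the scale chosen to meet snr_db."""
    p_s = np.mean(s ** 2)                    # power of the clean signal
    p_w = np.mean(w ** 2)                    # power of the noise
    scale = np.sqrt(p_s / (p_w * 10.0 ** (snr_db / 10.0)))
    return s + scale * w

rng = np.random.default_rng(0)
fs = 8000                                    # 8 kHz sampling, as used in the thesis
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 200 * t)              # toy stand-in for a speech signal
w = rng.standard_normal(fs)                  # white background noise
x = mix_at_snr(s, w, snr_db=5.0)             # noisy speech at 5 dB SNR
```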

1.2.3 Acoustic echo

Echo represents a special type of additive noise which has speech-like characteristics. For example, during hands-free calls or teleconferencing sessions the incoming voice leaks, due to acoustic feedback, into the local microphone, where it is combined (added) with the local speaker's speech. The talker at the far end then hears his own voice with some delay. The signal he receives may be described as a combination of the local speech and the echo

$$x(n) = s_{ne}(n) + s_{fe}(n), \qquad (1.3)$$

where $s_{ne}$ is the near-end local speech and $s_{fe}$ is the far-end echo. A different situation arises when a single person is located in an acoustically closed environment. If, in this environment, the voice-acquiring device is located too far from the speaker, the recorded/transmitted signal is a linearly distorted (convolved) version of the original speech

$$x(n) = s(n) \otimes g(n). \qquad (1.4)$$

In (1.4), g(n) is treated as a linear, all-zero system, even though this is only an approximation to reality. In any case, acoustic echo represents a serious problem for the quality of speech communication even when the delay of the echo is only a few milliseconds.

1.2.4 Electric (line) echo

This type of echo has an additive effect (see eq. (1.3)) in conventional telephony. It arises in a circuit, called a hybrid, in which a four-wire transmission line is converted into a two-wire line. Since the transducer is imperfect, a portion of the incoming signal leaks into the outgoing direction and represents an echo which is audible and annoying. In contrast with the acoustic echo, the delay in this case rarely exceeds 100 ms.

1.2.5 Amplifier distortion

This is a highly nonlinear distortion introduced during signal amplification. In an extreme case, the speech signal becomes clipped. The distortion is severe when the amplification gain is high. Mathematically described, the noisy speech signal is the result of a nonlinear function applied to the clean speech signal

$$x(n) = g(s(n)). \qquad (1.5)$$

1.2.6 Multiplicative distortion

This type of noise is occasionally encountered in radar and sonar systems or in connection with fading communication channels. Here, the clean speech signal is simply multiplied by the noise signal on a sample-by-sample basis

$$x(n) = s(n) \cdot w(n). \qquad (1.6)$$

The noise is usually of random character but can also be a constant or of impulsive type.


1.2.7 Combined distortion

The effects of noise and distortion may be combined, such as in the case of teleconferencing, where additive background noise is combined with the effects of acoustic echo. Fig. 1.2 shows some waveforms of typical types of noise signals encountered in speech communications.

1.3 Models of the speech signal

The treatment of speech as an output of a parametric model is one of the milestones in speech processing. ’Parametric modeling’ means that a given stochastic signal may be described (approximated) by a combination of a filter and a simplified input signal, called the excitation. The filter and the excitation are characterized by a finite set of parameters. The filter is described by its coefficients and the excitation either by a finite set of discrete samples or by some statistical parameters (Probability Density Function (PDF), mean, variance, etc.). In the case of white noise excitation the set of statistical parameters effectively reduces to the mean, the variance and the gain. Parametric modeling is possible since a speech signal contains a lot of redundancy which can be exploited in an effective way using the characteristics of the signal. Following is an introduction to some of the most important speech-modeling approaches which are adopted in this work.

1.3.1 Autoregressive (AR) model

One way to model a speech signal is to represent it by an all-pole linear filter driven by white noise [41, 55, 13]. The power spectrum of the filter output is given by the spectrum of the white noise, which is constant, multiplied by the squared magnitude of the filter transfer function [26]. Thus, various speech signals may be produced in this way by choosing appropriate filter coefficients. The sequence of values s(n), s(n − 1), . . . , s(n − p) represents a realization of an AR process of order p if they satisfy the difference equation

$$s(n) - a_1 s(n-1) - \dots - a_p s(n-p) = e(n), \qquad (1.7)$$

where the constants $a_1, a_2, \dots, a_p$ are known as the AR parameters and e(n) is called the innovation process, which is assumed to be of "white" character. The above equation can be written in a compact form as

$$s(n) = \sum_{i=1}^{p} a_i s(n-i) + e(n) \qquad (1.8)$$

or, in matrix notation,

$$s(n) = \mathbf{a}^T \mathbf{s}(n-1) + e(n), \qquad (1.9)$$

where a is a column vector consisting of the AR coefficients and s(n − 1) = [s(n − 1), s(n − 2), . . . , s(n − p)] is the column vector of the past speech samples. Fig. 1.3a shows the block diagram of the AR process represented by (1.8).


Figure 1.2: Common types of noise signals in speech communication systems (left - waveform, right - spectrogram), a) clean speech, b) white noise, c) pink noise, d) machine gun, e) F16 airplane, f) babble, g) standing car (Volvo), h) a car passing by on the street

Figure 1.3: Speech modeling using AR system driven by a random noise, a) block diagram, b) signal flow graph
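As an illustration of the AR model (1.8), the following minimal sketch synthesizes a short signal by driving an all-pole recursion with white noise. The order, the coefficients and the innovation variance are illustrative values chosen only to give a stable filter; they are not parameters estimated in this thesis.

```python
import numpy as np

# Sketch of the AR model of eq. (1.8): s(n) = sum_i a_i s(n-i) + e(n),
# an all-pole filter driven by white noise.  The AR parameters below are
# illustrative only; they correspond to two stable pole pairs.

p = 4
a = np.array([0.625, -0.842, 0.168, -0.397])   # hypothetical a_1 .. a_p
sigma_e = 0.1                                  # innovation standard deviation

rng = np.random.default_rng(0)
N = 8000                                       # one second at 8 kHz
e = sigma_e * rng.standard_normal(N)           # white-noise excitation
s = np.zeros(N)

for n in range(N):
    # s(n) = a_1 s(n-1) + ... + a_p s(n-p) + e(n)
    for i in range(1, p + 1):
        if n - i >= 0:
            s[n] += a[i - 1] * s[n - i]
    s[n] += e[n]
```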

Speech signals are known to be quasi-stationary [13], meaning they are short-term stationary but long-term non-stationary. Changes in the statistics of the speech are due to the changes in the vocal tract and variations in the characteristics of the excitation process, and occur approximately every 10-100 milliseconds. Thus, it is a common strategy to process the speech signal in short intervals (frames, segments) during which its statistical parameters are constant. In practice, for the most common case of 8 kHz sampling frequency, the length of each frame is between 20 and 240 samples. In turn, every speech frame can be represented as a vector s(n) = [s(n), s(n − 1), . . . , s(n − N + 1)] of, let us say, N samples. From now on, we will sometimes omit the discrete time index n or replace it by a frame number index m. Using (1.9) we may define the AR process for the whole speech frame as

$$\mathbf{s}(n) = \mathbf{A}\,\mathbf{s}(n-1) + \mathbf{e}(n), \qquad (1.10)$$

in which e = [e(n), e(n − 1), . . . , e(n − N + 1)] is the excitation vector, s(n − 1) = [s(n − 1), s(n − 2), . . . , s(n − N − p)] is the vector of the N + p last speech samples and A is an N × (N + p) matrix consisting of the AR coefficients

$$\mathbf{A} = \begin{pmatrix}
a_1 & \dots & a_p & 0 & \dots & \dots & 0 \\
0 & a_1 & \dots & a_p & 0 & \dots & 0 \\
\vdots & \ddots & \ddots & & \ddots & \ddots & \vdots \\
0 & \dots & 0 & 0 & a_1 & \dots & a_p
\end{pmatrix}. \qquad (1.11)$$

1.3.2 Probabilistic model

Another useful model of a speech signal, which is widely adopted in the Bayesian speech enhancement framework, is the probabilistic (likelihood) model. It capitalizes on the fact that speech is a random signal, which may be characterized by its PDF. For Gaussian AR processes, it is defined as

$$\mathcal{P}(\mathbf{s}) = \frac{1}{(2\pi)^{N/2}\sqrt{\det\boldsymbol{\Sigma}_s}}\,\exp\left(-\frac{1}{2}(\mathbf{s}-\boldsymbol{\mu}_s)^T \boldsymbol{\Sigma}_s^{-1} (\mathbf{s}-\boldsymbol{\mu}_s)\right), \qquad (1.12)$$

where $\boldsymbol{\Sigma}_s = \sigma_e^2 (\mathbf{A}^T\mathbf{A})^{-1}$ is the covariance matrix of the AR process, $\boldsymbol{\mu}_s$ is the mean vector and $\sigma_e^2$ is the variance of the innovation process. Formally we may write

$$\mathcal{P}(\mathbf{s}) = \mathcal{N}(\boldsymbol{\mu}_s, \sigma_e^2). \qquad (1.13)$$
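As a numerical illustration of the Gaussian likelihood (1.12), the sketch below evaluates log P(s) for a given mean vector and covariance matrix; working in the log domain avoids underflow for long frames. The toy frame and covariance used here are arbitrary examples, not quantities estimated in this thesis.

```python
import numpy as np

# Log of the Gaussian likelihood of eq. (1.12) for an arbitrary covariance
# matrix Sigma_s and mean vector mu_s.

def gaussian_log_likelihood(s, mu_s, sigma_s):
    N = s.shape[0]
    diff = s - mu_s
    _, logdet = np.linalg.slogdet(sigma_s)            # log det Sigma_s
    quad = diff @ np.linalg.solve(sigma_s, diff)      # (s-mu)^T Sigma^{-1} (s-mu)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# toy example: a 3-sample "frame" with a simple diagonal covariance
s = np.array([0.2, -0.1, 0.05])
mu_s = np.zeros(3)
sigma_s = np.diag([0.04, 0.04, 0.04])
print(gaussian_log_likelihood(s, mu_s, sigma_s))
```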

1.3.3 Damped complex sinusoidal model

Sometimes it is useful to model the speech signal by a linear model based on damped complex sinusoids. This model assumes that each N-dimensional speech vector s(n) can be represented as a finite sum of complex exponentials

$$\mathbf{s}(n) = \sum_{i=1}^{K} e(n-i)\,\mathbf{v}_i, \qquad K \le N, \qquad (1.14)$$

where {e} is a zero-mean complex random process and $\mathbf{v}_1, \dots, \mathbf{v}_K$ are N-dimensional complex basis vectors consisting of damped exponentials

$$\mathbf{v}_i = \left(1,\; \rho_i e^{j\omega_i},\; \rho_i^2 e^{j2\omega_i},\; \dots,\; \rho_i^{N-1} e^{j(N-1)\omega_i}\right)^T. \qquad (1.15)$$

In eq. (1.15), 0 ≤ ω_i ≤ 2π is the frequency and 0 ≤ ρ_i ≤ 1 is the damping coefficient corresponding to the ith component. The model (1.14) can be rewritten in matrix notation as

$$\mathbf{s}(n) = \mathbf{V}\,\mathbf{e}(n), \qquad (1.16)$$

where V = [v_1, . . . , v_K] is an N × K matrix the rank of which is K.
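The matrix form (1.16) translates directly into code; the following sketch builds the basis matrix V of (1.15) from illustrative frequencies and damping coefficients and synthesizes one (complex-valued) speech vector. All numerical values are hypothetical.

```python
import numpy as np

# Sketch of the damped complex sinusoidal model of eqs. (1.15)-(1.16): s = V e,
# where column i of V is (1, rho_i e^{j w_i}, ..., rho_i^{N-1} e^{j(N-1) w_i})^T.

N, K = 160, 4                                   # frame length and number of components
omega = np.array([0.10, 0.35, 0.70, 1.20])      # frequencies (rad/sample), illustrative
rho = np.array([0.995, 0.99, 0.98, 0.97])       # damping coefficients, illustrative

n = np.arange(N)[:, None]                       # column of sample indices 0..N-1
V = (rho[None, :] ** n) * np.exp(1j * omega[None, :] * n)   # N x K basis matrix

rng = np.random.default_rng(0)
e = rng.standard_normal(K) + 1j * rng.standard_normal(K)    # complex random weights
s = V @ e                                        # synthesized speech vector, eq. (1.16)
```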

1.3.4 Hidden Markov model (HMM)

Speech modeling based on the HMM may be viewed as an effective technique that extends conventional spectral analysis of stationary signals to time-varying signals [52]. It uses a Markov chain to model the changing statistical characteristics that are only probabilistically evidenced through the actual observations. Therefore, the hidden Markov process is a doubly stochastic process in which there is an unobservable Markov chain, each state of which is associated with its PDF. The Markov chain is specified by an exhaustive set of state transition probabilities. On the other hand, the probabilistic functions, designated as the observation PDFs, are parametric representations. In the work [52], Poritz considered a single Gaussian autoregressive PDF per state. However, Juang et al. [28] further expanded the modeling approach by including a finite set of so-called mixtures. A mixture is a single realization of a Gaussian autoregressive PDF. The Gaussian mixtures gave the name to a brand new class of models based on hidden Markov processes - the GMM.

An R-state Markov chain is characterized by the set of state transition probabilities t = {t_{r,q}, r, q = 1, . . . , R}. Associated with each state r of the unobservable Markov chain is a PDF function b_r(s) of the observed N-dimensional speech vector s. The observed sequence, which is a sequence of M consecutive speech vectors, is usually defined as the set s = {s_m, m = 1, . . . , M}. The PDF of s given the knowledge of the parameters of the HMM can be expressed as

$$\mathcal{P}_{\lambda_s}(\mathbf{s}) = \sum_{\beta} \mathcal{P}_{\lambda_s}(\mathbf{s}, \beta) = \sum_{\beta} \pi \prod_{m=1}^{M} t_{\beta_{m-1}\beta_m}\, b_{\beta_m}(\mathbf{s}_m). \qquad (1.17)$$

In (1.17), λ_s is used to denote the parameter set of the HMM, λ_s = {π, t, b}, where π is the set of initial state probabilities and b is the set of parameters {b_r(s), r = 1, . . . , R}. The summation in (1.17) is over all possible state sequences, which are characterized by β = {β_m, m = 1, . . . , M}, β_m ∈ {1, 2, . . . , R}. The strategy of Juang et al. [28] leads to the Gaussian AR observation density

$$b_r(\mathbf{s}) = \sum_{l=1}^{L} c_{r,l}\, b_{r,l}(\mathbf{s}), \qquad r = 1, \dots, R, \qquad (1.18)$$

in which L is the number of mixture components, c_{r,l} is the weight for the lth mixture component, and the double-subscripted function b_{r,l} is the basis PDF for the lth mixture component, all related to state r. On the other hand, the approach of Poritz [52] leads to a slightly different observation PDF

$$\mathcal{P}_{\lambda_s}(\mathbf{s}) = \sum_{\beta}\sum_{\gamma} \mathcal{P}_{\lambda_s}(\mathbf{s}, \beta, \gamma) = \sum_{\beta}\sum_{\gamma} \pi \prod_{m=1}^{M} t_{\beta_{m-1}\beta_m}\, c_{\beta_m,\gamma_m}\, b_{\beta_m,\gamma_m}(\mathbf{s}_m), \qquad (1.19)$$

in which c_{r,l} has the meaning of the probability of choosing mixture l given that the process is in state r. In both cases, the basis Gaussian AR PDF b_{β_m,γ_m}(s_m) is given by (1.18). The parameter set λ_s of the HMM is usually obtained using training data sequences consisting of several independent utterances. The clean speech training material is presented to the model and an iterative re-estimation algorithm is employed to achieve convergence. The most popular re-estimation algorithms are the Baum algorithm and the segmental k-means algorithm [17] – more on this topic will be discussed in section 2.5.
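Evaluating (1.17) by enumerating all R^M state sequences is infeasible for realistic M; in practice the same quantity is computed recursively by the forward algorithm. The sketch below demonstrates this on a toy two-state model with scalar Gaussian observation densities; the transition matrix, initial probabilities and densities are illustrative, not models trained in this thesis.

```python
import numpy as np

# Forward algorithm: computes P_lambda(s) of eq. (1.17) without enumerating
# all state sequences.  pi, t and the observation densities are toy values.

def forward_likelihood(obs, pi, t, b):
    """obs: observation sequence, pi: (R,), t: (R, R), b: callable b(r, s)."""
    R = len(pi)
    alpha = pi * np.array([b(r, obs[0]) for r in range(R)])      # initialization
    for s_m in obs[1:]:
        alpha = (alpha @ t) * np.array([b(r, s_m) for r in range(R)])
    return alpha.sum()

# toy 2-state model with scalar Gaussian observation densities
pi = np.array([0.6, 0.4])
t = np.array([[0.9, 0.1],
              [0.2, 0.8]])
means, var = np.array([0.0, 1.0]), 0.25

def b(r, s):
    return np.exp(-(s - means[r]) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

obs = [0.1, 0.2, 0.9, 1.1]        # a short observation sequence
print(forward_likelihood(obs, pi, t, b))
```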

1.4 Estimation criteria and cost functions

In signal estimation theory, which is basically the theory applied to speech enhancement, a good estimate of the clean speech from the noisy data is conditioned by some criteria. In this section we discuss some properties of signal estimators with the emphasis on frequently used distortion measures. These are applied in order to obtain a quantitative evaluation of the quality of signal estimation. In the rest of this section we will use the notation $\hat{\theta}$ or $\hat{\boldsymbol{\theta}}$ to denote estimates of the true values $\theta$ or $\boldsymbol{\theta}$, respectively. We use θ to represent any variable of interest, e.g. a signal (clean speech, noise), a statistic (mean, variance, power spectrum) or a set of model parameters.

Unbiased estimator

Since the estimator $\hat{\theta}$ may be a random variable and may assume more than a single value, a "good" estimator should be unbiased. That means the mean of the estimate is equal to the true value

$$E[\hat{\theta}] = \theta. \qquad (1.20)$$

Minimum variance estimator

The fact that an estimator is unbiased does not necessarily guarantee that it is "good". The second condition imposed on the estimator is minimum variance. We say that a given estimator $\hat{\theta}$ is the Minimum Variance (MV) estimator among all estimators $\theta'$ if it satisfies

$$\mathrm{var}(\hat{\theta}) \le \mathrm{var}(\theta'). \qquad (1.21)$$

A common estimation strategy, also adopted in speech enhancement, is to minimize a cost function $C(\hat{\theta}, \theta)$ or a risk function $\mathcal{R}(\hat{\theta}, \theta) = E[C(\hat{\theta}, \theta)]$. In many problems, only the error $\theta - \hat{\theta}$ between the estimate and the true value is of interest. This is the case for the following three popular cost functions.

Squared error

$$C(\hat{\theta}, \theta) = (\theta - \hat{\theta})^2 \qquad (1.22)$$

Absolute value of error

$$C(\hat{\theta}, \theta) = |\theta - \hat{\theta}|^2 \qquad (1.23)$$

Uniform cost function

$$C(\hat{\theta}, \theta) = \begin{cases} 1, & |\theta - \hat{\theta}| \ge \frac{\Delta}{2} \\ 0, & |\theta - \hat{\theta}| < \frac{\Delta}{2} \end{cases} \qquad (1.24)$$

For the three cost functions to be applicable to vectors, we have to substitute the error term $\theta - \hat{\theta}$ by the Euclidean norm $\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\| = \sqrt{\sum_k (\theta_k - \hat{\theta}_k)^2}$.

Minimum Mean Square Estimator The estimator which minimizes the risk function (1.22) is referred to as the Minimum Mean Square Estimator. It is the most widely used estimator used in signal processing and therefore in speech enhancement. Itakura-Saito distance measure

13

Another particulary useful distortion measure which is used in Linear Predictive Coding (LPC) 1 is the Itakura-Saito (IS) distance measure. This measure is applied to determine if a specific estimated AR model is sufficiently accurate. ! ! Ã Ã 2 X |Θk (ω)|2 |Θ (ω)| k ˆ θ) = C(θ, − log −1 (1.25) 2 2 b b | Θ (ω)| | Θ (ω)| k k k ˆ respectively. There b k (ω) are the kth components of the spectra of θ, θ, In (1.25) Θk (ω), Θ are several versions of the IS distance measure known in literature up to date. The one defined by (1.25) is motivated by the measure in [24, p. 191]. Another IS measure [71] is defined as ¶ X µ θk θk ˆ C(θ, θ) = − log . (1.26) θˆk θˆk k

¹ A method for estimating the parameters of an AR model for speech signals.

Chapter 2

Overview of the speech enhancement methods

In this chapter we review some popular speech enhancement approaches known in the literature to date. Starting with spectral subtraction, which has attracted researchers and engineers for more than three decades, we describe the ideas of Wiener filtering and the Expectation-Maximization (EM) approach. These two methods employ concepts known from the theory of signal estimation, namely the Maximum Likelihood (ML) method and the Maximum A-Posteriori (MAP) estimation method. Next, we turn to the method of signal subspace decomposition, which is well suited to the case of speech degraded by colored noise. Finally, we present a basic framework of the HMM speech enhancement system. Although this framework is rather complex, it deserves special attention, since the concept of modeling speech signals by Markov chains and Gaussian mixtures seems to be the concept of the future.

2.1 Spectral subtraction

This is probably one of the most popular single-channel noise suppression techniques used in real-world applications. The basic idea here is to estimate the amount of additive noise in a noisy speech signal and filter it out in the frequency domain. The method was first proposed by Boll [4, 5] in 1979 and later expanded and generalized by McAulay and Malpass [42] in 1980, who performed spectral subtraction in the power domain. In this method it is assumed that the additive noise is uncorrelated with the speech signal. Under these assumptions we can express (1.2) in the Power Spectrum (PS) domain

|X(ω)|² = |S(ω)|² + |W(ω)|²
|S(ω)| = √( |X(ω)|² − |W(ω)|² ),   (2.1)

where S(ω) = DFT{s(n)} = Σ_{n=0}^{N−1} s(n) e^{−j(2π/N)ωn} is the Discrete Fourier Transform (DFT) spectrum of the speech signal s(n). The spectra X(ω) and W(ω) are defined similarly. Since the true spectrum W(ω) is unknown we must use its estimate Ŵ(ω). After subtracting this estimate from the power spectrum of the noisy speech signal we would like to reconstruct the "enhanced" speech signal ŝ(n). Since it is generally impossible to compute the true phase φ_s(ω) we must use its estimate.


Figure 2.1: Schematic structure of the spectral subtraction method

The best choice is to use the phase of the noisy speech signal φ_x(ω). Wang and Lim [75] found that the human ear is relatively insensitive to phase changes and thus this estimate is appropriate. The general form of the estimated speech spectrum can be written as

Ŝ(ω) = ( max( |X(ω)|² − k|Ŵ(ω)|², 0 ) )^{1/2} · e^{jφ_x(ω)},   (2.2)

where k > 1 is used to overestimate the noise to account for its variance. The inner term |X(ω)|² − k|Ŵ(ω)|² is limited to positive values, since the result of the subtraction may become negative. It is convenient to express the spectral subtraction operation as a filtering process. Specifically, the filter is implemented in the frequency domain as

Ŝ(ω) = X(ω)G(ω)
     = X(ω) ( max( (|X(ω)|² − k|Ŵ(ω)|²) / |X(ω)|², 0 ) )^{1/2}
     = X(ω) ( max( 1 − k|Ŵ(ω)|² / |X(ω)|², 0 ) )^{1/2}.   (2.3)

Eq. (2.3) represents a common form of the spectral subtraction method as proposed by McAulay and Malpass [42]. Fig. 2.1 illustrates the schematic structure of a typical spectral subtraction system. A major drawback of the above method is that it introduces a distortion, called "musical artifacts", into the enhanced speech signal ŝ(n). It has been found that by applying a noise floor according to Berouti et al. [3] one can efficiently reduce this annoying distortion. The filter then has the following form

G(ω) = max{ ( max( 1 − k|Ŵ(ω)|² / |X(ω)|², 0 ) )^{1/2}, α }.   (2.4)

The spectrum of the noise during speech periods is basically unknown; however, we can estimate it during speech pauses, which are identified using a Voice Activity Detector (VAD). The noise power spectrum |Ŵ(ω)|² can be computed using Welch's periodogram method [54] or Welch's modified periodogram method [45]. To reduce the variance of the noise estimate, the spectrum is smoothed by exponential averaging

|Ŵ_m(ω)|² = λ|Ŵ_{m−1}(ω)|² + (1 − λ)|X_m(ω)|²,   for m = 1, 2, . . . ,   (2.5)

where λ is called the noise forgetting factor and m is the frame number. It is noteworthy that (2.5) is performed for “noise-only” frames as detected by the VAD. The spectral subtraction method is a single-channel method which does not involve adaptive principles. Its advantage is inherent simplicity with relatively low computational complexity. The main drawback, though challenged in derived methods, is the introduction of musical artifacts and nonlinear distortion. Also, there are some assumptions on the noise process which are seldom satisfied in real-world applications. Nevertheless, its popularity led to several practical implementations in mobile telephony and speech recognition.
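As an illustration of the processing chain of Fig. 2.1, the following Python/NumPy sketch shows one possible frame-by-frame implementation of (2.3)-(2.5). The frame length, overlap, oversubtraction factor k, noise floor α, forgetting factor λ and the crude energy-based VAD are illustrative choices only and are not values prescribed by this thesis.

import numpy as np

def spectral_subtraction(x, frame_len=256, k=2.0, alpha=0.02, lam=0.98):
    """Frame-wise power spectral subtraction with a noise floor, eqs. (2.3)-(2.5)."""
    hop = frame_len // 2
    win = np.hanning(frame_len)
    noise_psd = np.zeros(frame_len)               # running estimate of |W_hat(omega)|^2
    out = np.zeros(len(x))

    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * win
        X = np.fft.fft(frame)
        psd = np.abs(X) ** 2

        # crude "VAD": treat low-energy frames as noise-only and update (2.5)
        if noise_psd.mean() == 0.0 or psd.mean() < 2.0 * noise_psd.mean():
            noise_psd = lam * noise_psd + (1.0 - lam) * psd

        # gain of eq. (2.4): half-wave rectified subtraction with spectral floor alpha
        gain = np.sqrt(np.maximum(1.0 - k * noise_psd / np.maximum(psd, 1e-12), 0.0))
        gain = np.maximum(gain, alpha)

        s_hat = np.real(np.fft.ifft(gain * X))    # real gain keeps the noisy phase
        out[start:start + frame_len] += s_hat * win   # overlap-add
    return out

if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    clean = 0.5 * np.sin(2 * np.pi * 440 * t)
    noisy = clean + 0.1 * np.random.randn(len(t))
    enhanced = spectral_subtraction(noisy)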

2.2 Wiener filtering

The structural concept of Wiener filtering of a noisy speech signal is similar to that of spectral subtraction. However, the basic idea here is to minimize the difference between the estimated speech ŝ(n) and the uncorrupted speech s(n) in an optimal sense. The criterion used is the Minimum Mean-Square Error (MMSE)

ξ = E{ (s(n) − ŝ(n))² },   (2.6)

where both s(n) and ŝ(n) are assumed to be long-term stationary and E{.} is the expectation operator. An optimal filter (most often referred to as a non-causal Wiener filter) that achieves the MMSE minimum is given by

H(ω) = |S(ω)|² / ( |S(ω)|² + |W(ω)|² ).   (2.7)

However, in (2.7) S(ω) is exactly the quantity that we would like to know. Moreover, since s(n) and w(n) are assumed to be only short-term stationary, we must use estimates instead of their true power spectra in (2.7). The noise power spectrum estimate |Ŵ(ω)|² can be calculated during periods of silence, e.g. using the exponential averaging already defined by (2.5). However, the problem of speech PS estimation is not as trivial. An approach of Hansen and Clements [23] leads to an iterative estimation of the speech power spectrum based on intermediate results of Wiener filtering. Specifically, an initial estimate of the enhanced speech power spectrum |Ŝ⁽⁰⁾(ω)|² is given as

|Ŝ⁽⁰⁾(ω)|² = |X⁽⁰⁾(ω)|² − |Ŵ(ω)|²,   (2.8)

where |X⁽⁰⁾(ω)|² and |Ŵ(ω)|² are assumed to be known a-priori. Then, in each iteration i a Wiener filter H⁽ⁱ⁾(ω) is applied in the same way as in (2.7), but using the estimates Ŝ⁽ⁱ⁾(ω) and Ŵ(ω). The filter has the following form

H⁽ⁱ⁾(ω) = |Ŝ⁽ⁱ⁾(ω)|² / ( |Ŝ⁽ⁱ⁾(ω)|² + |Ŵ(ω)|² ),   (2.9)


Figure 2.2: The principle of iterative Wiener filtering using speech modeling

and the speech spectrum estimate is updated by

Ŝ⁽ⁱ⁺¹⁾(ω) = H⁽ⁱ⁾(ω)X(ω).   (2.10)

The iterative process is repeated until some convergence criterion is satisfied. This method is called iterative Wiener filtering. The speech spectrum estimation may be further improved using a constraint of the AR speech production model, see section 1.3.1. In particular, using Linear Prediction (LP) analysis we can estimate the AR coefficients of the model and correspondingly the power spectrum of the speech signal

|Ŝ(ω)|² = σ_e² / | 1 + Σ_{i=1}^{p} â_i e^{−jωi} |²,   (2.11)

where â_i are the estimated AR coefficients and σ_e² is the variance of the excitation signal, which is white noise. A schematic description of the iterative Wiener filtering method using the speech constraint is shown in Fig. 2.2.
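The following Python sketch illustrates the iteration (2.8)-(2.11) on a single frame. It assumes the noise power spectrum is already known, uses the autocorrelation method for the LP analysis, and runs a fixed number of iterations instead of a convergence test; the LP order, FFT length and regularization constant are illustrative. Note that the sign convention A(ω) = 1 − Σ a_i e^{−jωi} is used here, so the a_i below correspond to −â_i of (2.11).

import numpy as np

def lp_coefficients(frame, order=10):
    """AR coefficients and excitation variance via the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    var_e = r[0] - a @ r[1:order + 1]
    return a, max(var_e, 1e-12)

def iterative_wiener(noisy_frame, noise_psd, n_iter=3, order=10, nfft=256):
    """Iterative Wiener filtering of one frame, following (2.8)-(2.11)."""
    X = np.fft.fft(noisy_frame, nfft)
    speech_psd = np.maximum(np.abs(X) ** 2 - noise_psd, 1e-12)   # initial estimate (2.8)
    s_hat = noisy_frame.copy()
    for _ in range(n_iter):
        H = speech_psd / (speech_psd + noise_psd)                # Wiener filter (2.9)
        S = H * X                                                 # spectrum update (2.10)
        s_hat = np.real(np.fft.ifft(S))[:len(noisy_frame)]
        a, var_e = lp_coefficients(s_hat, order)                  # AR-constrained PSD (2.11)
        w = np.exp(-2j * np.pi * np.arange(nfft) / nfft)
        A = 1.0 - sum(a[i] * w ** (i + 1) for i in range(order))
        speech_psd = var_e / np.abs(A) ** 2
    return s_hat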

2.3 Expectation-Maximization (EM) approach

If the non-causal Wiener filter (2.7) were implemented in practice, it would be necessary to know the power spectra of the clean speech signal and the noise. In iterative Wiener filtering these two estimates are computed recursively on the basis of inter-frame averaging (for the calculation of the noise power estimate) and intra-frame LP analysis (for the calculation of the speech power estimate). An alternative approach proposed by Dempster et al. [14] uses the EM method, which solves an ML problem or a MAP problem and produces these two quantities as a by-product. Specifically, the ML estimation problem involves a quantity called the log-likelihood function, which is defined as

L(θ) = log P(s|θ),   (2.12)

where θ is a parameter vector. In our case θ comprises a_i, the AR coefficients of the speech production model, σ_e², the variance of the excitation noise, and σ_w², the variance of the additive noise. The conditional probability in (2.12) signifies that the occurrence of the clean speech data s depends on a specific state of the model, θ. The aim is to maximize the log-likelihood of the clean speech signal by changing θ. In (2.12), however, we do not have direct access to the clean speech signal s. Instead we must use a vector of observations x, which together with s constitutes a set of complete data (x; s). The total probability P(s|θ) may be expanded in terms of the observations x as

P(s|θ) = Σ_x P(s|x, θ) P(x|θ).   (2.13)

The EM algorithm is an iterative procedure for maximizing L(θ). This is equivalent to maximizing the difference

L(θ) − L(θ_k) = log Σ_x P(s|x, θ) P(x|θ) − log P(s|θ_k),   (2.14)

where θ_k denotes the estimate of the parameter vector in the k-th iteration. Expanding the term inside the summation in (2.14) by P(x|s, θ_k)/P(x|s, θ_k) and applying Jensen's inequality [47] we get

L(θ) ≥ L(θ_k) + Σ_x P(x|s, θ_k) log [ P(s|x, θ) P(x|θ) / ( P(x|s, θ_k) P(s|θ_k) ) ].   (2.15)

Thus, we see that by maximizing the summation in the last equation we can improve the likelihood L(θ), which is what we want to achieve. The parameter estimate in the (k+1)-st iteration can therefore be expressed as

θ_{k+1} = arg max_θ { Σ_x P(x|s, θ_k) log P(s|x, θ) P(x|θ) }
        = arg max_θ { E_{x|s,θ_k}[ log P(s, x|θ) ] },   (2.16)

in which terms that are constant with varying θ were dropped. The operator E[.] denotes statistical expectation. In (2.16) the expectation and maximization operations are apparent. Thus the EM algorithm consists of iterating the

1. E-step: determine the conditional expectation E_{x|s,θ_k}[log P(s, x|θ)],
2. M-step: maximize this expression with respect to θ.

The algorithm is known to be iteratively convergent [6], since in each iteration we improve the log-likelihood (2.12), starting from the previous parameter estimate θ_k. For θ = θ_k, the right-hand-side summation in (2.15) evaluates to zero, and thus any θ which maximizes it also maximizes L(θ). This method was first applied by Lim and Oppenheim [38] for parameter estimation of speech degraded by additive background noise. In their proposal, however, the quantity to be maximized was log P(θ|s, x) and thus they performed MAP estimation. In any case, in order to solve either the ML or the MAP estimation problem by the EM approach, Gaussian distribution assumptions must hold for both e(n) and w(n). The general problem of statistical speech enhancement may also be solved within a fully Bayesian framework, which has been extensively studied and applied by Vermaak et al. [73].
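To make the E-step/M-step alternation concrete, the following sketch runs EM on a toy problem, a one-dimensional mixture of two Gaussians, rather than on the speech AR-parameter problem treated above; all initializations and the number of iterations are illustrative only.

import numpy as np

def em_two_gaussians(x, n_iter=50):
    """EM for a 1-D mixture of two Gaussians: a toy illustration of the E/M loop."""
    # crude initialization of theta = (weights, means, variances)
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component given the data and theta_k
        lik = np.stack([w[j] / np.sqrt(2 * np.pi * var[j])
                        * np.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(2)])
        resp = lik / lik.sum(axis=0)

        # M-step: maximize the expected complete-data log-likelihood w.r.t. theta
        nk = resp.sum(axis=1)
        w = nk / len(x)
        mu = (resp * x).sum(axis=1) / nk
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
    return w, mu, var

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(1.0, 1.0, 500)])
    print(em_two_gaussians(data))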


2.4 Signal subspace decomposition

This method represents the state-of-the-art in nonparametric speech enhancement. Since none of the methods discussed above, and perhaps none at all, was capable of improving both the quality and intelligibility of the noisy speech signal, the attempt here is to improve the overall quality of the speech signal while minimizing any loss in its intelligibility. The method was developed by Ephraim and Van Trees [16, 37] and its principle is to decompose the vector space of the noisy signal into a signal-plus-noise subspace and a noise subspace. The decomposition is theoretically performed by applying the Karhunen-Loève Transform (KLT) to the noisy signal. The clean signal is estimated using two perceptually meaningful criteria in either the time or the spectral domain. A speech signal may be represented as a sum of complex damped sinusoids [60]. We have already introduced this model in section 1.3.3 and here we reproduce it again for convenience of presentation

s = Σ_{i=1}^{M} g_i v_i,   M ≤ N
v_i = ( 1, ρ_i e^{jω_i}, ρ_i² e^{j2ω_i}, . . . , ρ_i^{N−1} e^{j(N−1)ω_i} )^T.   (2.17)

In (2.17), s is an N-dimensional speech vector, g = [g₁, . . . , g_M] is a vector of model coefficients and V = [v₁, . . . , v_M] is an N × M matrix of the complex basis vectors. The model (2.17) can be written as

s = Vg.   (2.18)

Clearly, for M = N such a representation is always possible. However, for speech signals it is also possible when M < N. In this case an N-dimensional speech vector can be represented by a linear combination of only M basis vectors. Thus, in the Euclidean vector space R^N, a speech vector occupies only a subspace, since some of its coordinates are zero. This subspace is called "the signal subspace". The covariance matrix of the vector s is given by

R_s = E{ss^H} = V R_g V^H,   (2.19)

where (.)^H denotes the Hermitian transpose¹ and R_g denotes the covariance matrix of the vector g, which is assumed positive definite. Hence, the rank of R_s is M and this matrix has only M non-zero eigenvalues. Let w = [w₁, . . . , w_N] denote an N-dimensional vector of an additive noise process which has corrupted a clean speech signal

y = Vg + w.   (2.20)

The noise process is assumed zero mean, white and uncorrelated with the speech signal (Lev-Ari and Ephraim extended this approach to account also for colored noise [37]). The noise has the following covariance matrix

R_w = E{ww^H} = σ_w² I.   (2.21)

¹ For complex matrices, the Hermitian transpose denotes simultaneous transposition and complex conjugation; for real matrices it reduces to the ordinary transpose.

In (2.21) I is the identity matrix. In contrast with R_s, this matrix is full rank, i.e. it occupies the whole Euclidean space R^N and has N non-zero eigenvalues. Thus the noise exists in the signal subspace as well as in a complementary subspace which is called the "noise subspace". The idea of the approach is to decompose the Euclidean space R^N into the two above-mentioned subspaces. It has already been shown in [72] that this operation may be carried out using the KLT. In particular, using the damped sinusoid model (2.17), we may express the noisy speech signal x as

x = Vg + w   (2.22)

and the covariance matrix of x as

R_x = E{xx^H} = V R_g V^H + R_w.   (2.23)

Let R_x = U Λ_x U^H be the eigendecomposition of R_x. In this relation, U = [u₁, . . . , u_N] denotes an orthonormal matrix whose columns are the eigenvectors u_i, each of length N, and Λ_x = diag[λ_x(1), . . . , λ_x(N)] denotes a diagonal matrix of the eigenvalues of R_x. Since rank(R_s) = M, the matrix R_s has M positive eigenvalues and N − M zero eigenvalues. On the other hand, the matrix R_w has N eigenvalues, all equal to σ_w². Thus, the eigendecomposition of R_x may be expressed as

R_x = U Λ_s U^H + U σ_w² I U^H
    = U [ Λ_{s,M} + σ_w² I      ∅     ]
        [        ∅            σ_w² I  ] U^H,   (2.24)

where ∅ denotes a null matrix of appropriate order and Λ_{s,M} denotes the M × M submatrix of Λ_s. Correspondingly, let U = [U_M, U_{N−M}], where

U_M = [u_i : λ_s(i) > 0],   U_{N−M} = [u_i : λ_s(i) = 0].   (2.25)

Since U is orthonormal, we may decompose the noisy speech vector x using projection operators [16]

x = U_M U_M^H x + U_{N−M} U_{N−M}^H x,   (2.26)

where U_M U_M^H x is the projection of x onto the signal subspace and U_{N−M} U_{N−M}^H x is the projection of x onto the noise subspace. The goal is to estimate the clean speech signal from the noisy speech signal. In this approach we look for a solution that minimizes the signal distortion and yet maintains a low level of the residual noise energy. In particular, let ŝ = Hx be a linear estimator of s, where H is an N × N estimation matrix. The residual signal obtained in this estimation is given by

r = ŝ − s = (H − I)s + Hw = r_s + r_w,   (2.27)

where r_s represents the signal distortion and r_w represents the residual noise. If the requirement is to minimize the signal distortion while maintaining the residual noise level below a certain threshold, Time-Domain Constraint (TDC) estimation is followed. Mathematically, the problem is described as follows

min_H ε_s²   subject to:   (1/N) ε_w² ≤ α σ_w²,   (2.28)

where ε_s² = ‖r_s‖² = tr{ (H − I) R_s (H − I)^H } is the energy of the signal distortion vector r_s, ε_w² = ‖r_w‖² = σ_w² tr{ H H^H } denotes the energy of the residual noise vector r_w, and α, which satisfies 0 ≤ α ≤ 1, is the spectral floor of the residual noise. Here tr{.} is the trace² operator. The optimal estimator in the sense of (2.28) can be found using the method of Lagrange multipliers

H_TDC = U_M Λ_{s,M} (Λ_{s,M} + µσ_w² I)^{−1} U_M^H,   (2.29)

where µ is the Lagrange multiplier, which must satisfy

α = (1/N) tr{ Λ_{s,M}² (Λ_{s,M} + µσ_w² I)^{−2} }.   (2.30)

Hence, in the signal subspace decomposition approach, the estimate of the clean speech signal ŝ_TDC = H_TDC x is obtained by applying the KLT to the noisy signal, appropriately modifying the components by a gain function and applying an inverse KLT to the modified components. A schematic description of the signal subspace approach based on TDC is given in Fig. 2.3.

Figure 2.3: Block diagram of the signal subspace method for speech enhancement with time-domain constraint; SVD denotes singular value decomposition

² The trace of an N × N matrix A is defined as tr{A} = Σ_{i=1}^{N} a_ii.
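A minimal numerical sketch of the TDC estimator (2.24)-(2.29) in Python follows. It assumes white noise with known variance, estimates R_x from a matrix of noisy frames, fixes the Lagrange multiplier µ to a constant instead of solving (2.30), and chooses the signal-subspace dimension by thresholding the eigenvalues against σ_w²; all of these are simplifications for illustration.

import numpy as np

def subspace_tdc_enhance(x_frames, sigma_w2, mu=1.0):
    """Signal-subspace TDC estimator applied to every row of a matrix of noisy frames.

    x_frames : (L, N) array, each row a length-N noisy speech vector
    sigma_w2 : white-noise variance
    mu       : Lagrange multiplier of (2.29), treated here as a fixed design constant
    """
    Rx = (x_frames.T @ x_frames) / x_frames.shape[0]      # sample covariance of x
    lam, U = np.linalg.eigh(Rx)                            # eigendecomposition of Rx
    lam, U = lam[::-1], U[:, ::-1]                         # sort eigenvalues descending

    lam_s = np.maximum(lam - sigma_w2, 0.0)                # estimated clean-speech eigenvalues
    M = int(np.sum(lam_s > 0.0))                           # dimension of the signal subspace
    U_M, lam_sM = U[:, :M], lam_s[:M]

    gain = lam_sM / (lam_sM + mu * sigma_w2)               # diagonal gain of H_TDC, eq. (2.29)
    H = U_M @ np.diag(gain) @ U_M.T
    return x_frames @ H.T                                  # s_hat = H x for every frame

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, L, sigma_w2 = 32, 400, 0.01
    t = np.arange(N)
    clean = np.array([np.sin(0.3 * t + p) for p in rng.uniform(0, 2 * np.pi, L)])
    noisy = clean + np.sqrt(sigma_w2) * rng.standard_normal((L, N))
    enhanced = subspace_tdc_enhance(noisy, sigma_w2)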


Figure 2.4: Schematic description of the GMM

2.5 HMM-based strategies in speech enhancement

A useful class of models for speech signals that overcomes many of the shortcomings of the speech enhancement methods known to date is the class of Hidden Markov Models [53]. An HMM can be used for modeling not only the clean speech signal but also the noise. It was found by Juang and Rabiner [28] that, for the specific problems of speech enhancement and speech recognition, the best HMM is the GMM. In this model a speech signal is assumed to be in one of R predefined states. Moreover, every state in a given GMM is assumed to be represented by one of L so-called mixture components. A mixture component is essentially an output of a Gaussian AR process. Thus, in any time instant n, the GMM produces a vector of clean speech signal s that is the result of a Gaussian AR process of order p. However, the principal virtue of the GMM is the fact that it incorporates the time evolution of the states. The evolution is a first-order Markov process which is usually characterized by a set of state transition probabilities t = {t_{r,q}, r, q = 1, . . . , R}. A GMM may be fully characterized by a set of parameters λ_s = {π, t, c, a}, where π = {π_r, r = 1, . . . , R} is the set of initial state probabilities, c = {c_{l,r}, l = 1, . . . , L, r = 1, . . . , R} is the set of probabilities of mixture selection given the states, and a = {a_{l,r}(0), a_{l,r}(1), . . . , a_{l,r}(p), σ²_{l,r}, l = 1, . . . , L, r = 1, . . . , R} is the set of coefficients of all Gaussian AR models, including their variances (AR gains), given the states and the mixtures. Fig. 2.4 gives a schematic description of the building blocks of the GMM model, including their mutual relationship.

The HMM parameters λ_s may be obtained through ML estimation using a training data set s = {s_{n,m}, n = 1, . . . , N, m = 1, . . . , M} consisting of N statistically independent utterances, each comprising M K-dimensional frames of clean speech material. The ML estimate is calculated using the Baum reestimation algorithm [2] or, alternatively, using the segmental k-means algorithm [56]. Specifically, the likelihood function to be maximized is given by

log P_{λ_s}(s) = Σ_{n=1}^{N} log P_{λ_s}(s_{n,m}),   m = 1, . . . , M,   (2.31)

where the summation is performed over all utterances of the clean speech material. The algorithm generates a sequence of HMMs with nondecreasing likelihood values (2.31). Each iteration of the Baum algorithm starts with an old set of parameters λ_s and estimates a new set of parameters λ′_s by maximizing the following auxiliary function [17]

max_{λ′_s} Σ_{n=1}^{N} Σ_{β_{n,m}} Σ_{γ_{n,m}} P_{λ_s}(β_{n,m}, γ_{n,m} | s_{n,m}) log P_{λ′_s}(β_{n,m}, γ_{n,m}, s_{n,m}),   m = 1, . . . , M,   (2.32)

subject to:   π′_r ≥ 0, Σ_{r=1}^{R} π′_r = 1;   t′_{r,q} ≥ 0, Σ_{r=1}^{R} t′_{r,q} = 1;   c′_{l,r} ≥ 0, Σ_{l=1}^{L} c′_{l,r} = 1;   r, q = 1, . . . , R,  l = 1, . . . , L,

where β_n and γ_n are, respectively, the sequences of states and mixtures corresponding to the n-th utterance of the training data. Alternately, we may choose to maximize

log P_{λ_s}(β, γ, s) = Σ_{n=1}^{N} log P_{λ_s}(β_{n,m}, γ_{n,m}, s_{n,m}),   m = 1, . . . , M,   (2.33)

which is consistent with the segmental k-means algorithm. Here, however, it is assumed that the set of all sequences {β_{n,m}, γ_{n,m}, m = 1, . . . , M}, for the n-th utterance, is dominated by one unique sequence {β*_{n,m}, γ*_{n,m}, m = 1, . . . , M}. Thus, given the most likely sequence {β*_{n,m}, γ*_{n,m}, m = 1, . . . , M}, a new parameter set λ′_s is obtained by maximizing the auxiliary function

max_{λ′_s} Σ_{n=1}^{N} Σ_{β_{n,m}} Σ_{γ_{n,m}} δ(β_{n,m} − β*_{n,m}, γ_{n,m} − γ*_{n,m}) log P_{λ′_s}(β_{n,m}, γ_{n,m}, s_{n,m}),   m = 1, . . . , M,   (2.34)

subject to the constraints associated with (2.32), where δ(.) denotes the Dirac function. For the sake of simplicity, the noise process is assumed to be Gaussian AR with the parameter set λ_w. Notwithstanding, Sameti et al. [61] proposed a fully structured HMM model with Gaussian AR mixtures to incorporate the nonstationarity of noise into the speech enhancement framework.

The enhancement problem is that of estimating the sequence of clean speech vectors s = {s_m, m = 1, . . . , M} from a sequence of noisy speech vectors x = {x_m, m = 1, . . . , M} using the MAP estimation approach. This method yields the following estimate

max_s log P_{λ_s,λ_w}(s|x),   (2.35)

which is consistent with the ML training procedure using the Baum algorithm (2.32). Alternatively, the maximization

max_{β,γ,s} log P_{λ_s,λ_w}(β, γ, s|x),   (2.36)

which is the MAP estimation of the clean speech signal along with the unique sequence of states and mixtures, is consistent with the segmental k-means algorithm (2.34). Let s(k), k = 1, . . . denote the current estimate of the speech signal. Using Jensen's inequality [47] we find that the MAP estimation problem (2.35) may be solved through the EM algorithm by

max_{s(k+1)} { log P_{λ_s,λ_w}(s(k+1)|x) − log P_{λ_s,λ_w}(s(k)|x) }
  ≥ max_{s(k+1)} Σ_β Σ_γ P_{λ_s,λ_w}(β, γ|s(k)) log [ P_{λ_s,λ_w}(β, γ, s(k+1)|x) / P_{λ_s,λ_w}(β, γ, s(k)|x) ],   (2.37)

which is the same as maximizing the auxiliary function

φ(s(k+1)) = Σ_β Σ_γ P_{λ_s,λ_w}(β, γ|s(k)) log P_{λ_s,λ_w}(β, γ, s(k+1)|x).   (2.38)

The maximum value of φ(s(k+1)) may be found by setting its gradient to zero. This yields the following reestimation formula for the clean speech signal

s_m(k+1) = [ Σ_β Σ_γ q_m(β, γ, s(k)) H_{γ,β}^{−1} ]^{−1} x,   m = 1, . . . , M,   (2.39)

where

q_m(β_i, γ_j, s(k)) = [ Σ_{β: β_m=β_i} Σ_{γ: γ_m=γ_j} P_{λ_s,λ_w}(β, γ, s(k)) ] / [ Σ_β Σ_γ P_{λ_s,λ_w}(β, γ, s(k)) ],   (2.40)

is the conditional probability of being in state β_i at time frame m and choosing the mixture component γ_j while in state β_i. Note that the sets β = {β_m, m = 1, . . . , M} and γ = {γ_m, m = 1, . . . , M} denote the sequences of states and mixtures, respectively, whereas β_i and γ_j are fixed values which denote a particular state and mixture, respectively. Moreover, H_{γ,β} in (2.39) is a Wiener filter for the output of a Gaussian process associated with the state β and the mixture γ. This filter can be formulated in the frequency domain, similarly to (2.7), as

H_{γ,β}(ω) = Γ_{γ,β}(ω) / ( Γ_{γ,β}(ω) + Γ_w(ω) ).   (2.41)

For the MAP estimation problem (2.36) to be solved, we first need to estimate the most likely sequence of states and mixture components. It was found useful, in this case, to employ the Viterbi decoding algorithm [19]. This algorithm yields the sequences β*, γ* which dominate the set of sequences β, γ and which in turn are the most likely sequences of states and mixtures, respectively. Given the knowledge of these sequences, the MAP estimation problem, solved through the EM algorithm, results in the following maximization

max_s { log P_{λ_s}(s|β*, γ*) + log P_{λ_w}(x − s) },   (2.42)

the result of which is the new estimate of the clean speech signal s(k+1). The two steps, i.e. the Viterbi decoding and the MAP estimation, are executed iteratively until some predefined convergence criterion is satisfied.

The HMM approach capitalizes on the statistical modeling of the clean speech and the noise process using training sequences for both of them. HMMs have been generally accepted as reliable models for speech signals in the speech recognition community. Their relevance to the speech enhancement framework was confirmed by several researchers, e.g. Yariv Ephraim, Lawrence Rabiner and Andreas Spanias, to name a few.
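For the state-sequence search mentioned above, the following sketch shows a textbook Viterbi decoder in the log domain for a plain (single-mixture) HMM; it assumes the per-frame log output probabilities are already computed, which is a simplification of the mixture case used in this chapter.

import numpy as np

def viterbi(log_pi, log_t, log_b):
    """Most likely state sequence for an HMM, computed in the log domain.

    log_pi : (R,)   log initial state probabilities
    log_t  : (R, R) log transition probabilities, log_t[r, q] = log t_{r,q}
    log_b  : (M, R) log output probabilities of each frame in each state
    """
    M, R = log_b.shape
    delta = log_pi + log_b[0]                 # best partial path score ending in each state
    psi = np.zeros((M, R), dtype=int)         # back-pointers

    for m in range(1, M):
        scores = delta[:, None] + log_t       # scores[r, q]: best path into q through r
        psi[m] = np.argmax(scores, axis=0)
        delta = scores[psi[m], np.arange(R)] + log_b[m]

    path = np.zeros(M, dtype=int)
    path[-1] = int(np.argmax(delta))
    for m in range(M - 2, -1, -1):            # backtracking
        path[m] = psi[m + 1, path[m + 1]]
    return path

if __name__ == "__main__":
    log_pi = np.log([0.6, 0.4])
    log_t = np.log([[0.7, 0.3], [0.4, 0.6]])
    log_b = np.log(np.array([[0.9, 0.2], [0.8, 0.3], [0.1, 0.7], [0.2, 0.9]]))
    print(viterbi(log_pi, log_t, log_b))      # prints [0 0 1 1]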


Chapter 3

Introduction to adaptive algorithms

Historically, the concept of adaptive signal processing evolved from techniques developed to enable adaptive control of time-varying systems. In the 1960s, mainly due to the work of Bernard Widrow and his colleagues, it began to be recognized as a separate category in digital signal processing. Adaptive systems refer to systems that are able to effectively change their own parameters and thereby "adapt" to the changes of the environment in which they operate. Hence, the basic nature of the problem is that some properties of the environment are changing in an unknown manner and therefore must be tracked. Most of these problems have been formulated as linear filtering problems. An adaptive filter is said to be linear if the estimated quantity of interest, computed adaptively, is obtained as a linear combination of the observable data. In this thesis we concentrate mainly on linear adaptive algorithms due to their inherent simplicity and efficiency.

In this chapter we review some basic properties of adaptive systems and adaptive algorithms. We begin our discussion by describing some typical applications of today's adaptive systems. This also provides some motivation for the research of new methods or the improvement of older ones. Next we set up an adaptive framework and describe the notation used throughout the rest of the thesis. Finally, the key aspects of the adaptive process are summarized, showing also the general requirements of adaptive filters.

3.1 Applications

An adaptive filter is useful whenever the statistics of the input signals to the filter are unknown or time-varying. Since its parameters change according to the conditions of the ambient environment, little or no a-priori information is required about the input signals. Hence, it is also of interest to note that adaptive systems are capable of working with nonstationary signals, provided they can be considered stationary at least over a short interval. Numerous applications of adaptive filters have been proposed in the literature. Some of the most noteworthy include:

Channel equalization and interference suppression
The application of adaptive filtering to channel equalization is among the oldest.


The problem rests in the imperfect transmission of information over communication channels. It is known that these channels are dispersive and the effects of distortion may be modeled by a raised cosine filter. Thus, by applying a filter which is an inverse of the raised cosine, one can equalize the response and suppress Inter-Symbol Interference (ISI) [48, 40, 22]. Since the channels are time-varying, the filter must be adaptive. The parameters of the channel are continuously estimated, e.g. using training sequences with a known response.

Teleconferencing and videoconferencing
In this more up-to-date application, a group of people is located in two rooms far apart. Each person willing to participate in a discussion may use their own microphone or a common array of microphones for all speakers. There is also a high-power loudspeaker present in each room which reproduces the speech of the opposite party. The system may be expanded to incorporate more loudspeakers in order to give the participants a spatial impression of the reproduced sound. If no filtering of the outgoing signals were applied, the acoustic feedback between the loudspeaker and the microphone(s) would cause the people to hear their own echo. In the worst case, when the signal is amplified, there is the well-known howling. The echo paths can be as long as 200 ms, which further complicates the design of filtering algorithms, as will become apparent in the following text. Moreover, people talking in the background, computer fans, air conditioning and all the environmental noise are sources of disturbances that degrade the quality and intelligibility of speech. A solution to both problems, i.e. echo cancelation and background noise reduction, can be found by employing an adaptive system. It has the indisputable advantage over other systems that it allows for application in a nonstationary environment, which is exactly the case of tele- and video-conference systems.

Hands-free telephony
This is a similar environment to that of teleconferencing systems. Hands-free mobile phones are usually used in cars to allow the driver to handle the steering wheel and communicate simultaneously. The most common implementation of a hands-free system involves a mobile phone, a single microphone and a loudspeaker. The acoustic paths in cars are much shorter than those in teleconferencing systems; they seldom exceed 30-100 ms. However, despite their short duration, they cause an echo due to the acoustic feedback between the loudspeaker and the microphone. This echo needs to be handled by an adaptive system as in the previous case. Here, however, the background noise is much more severe, since the engine, the wind, the tires or the in-car radio have higher power compared to the additive noise in teleconferencing systems.

Voice control
The possibility of controlling appliances by voice has received considerable attention in the past few years. Voice control technology can be found in almost every area of the digital electronics industry.



Figure 3.1: Schematic diagram of an adaptive system, parametrized by θ and driven by a reference signal

Speech recognition systems, which are at the heart of these applications, are often trained with clean speech (without noise) due to the efficiency and reusability of speech databases. In real situations, however, due to the presence of ubiquitous noise, the efficiency of the systems decreases. In this case it is advantageous to preprocess the input signals in such a way that noise is suppressed and the efficiency of speech recognition is increased. An adaptive solution is the commonly preferred method to accomplish this task, again due to its ability to deal with nonstationary signals.

Hearing aids
Adaptive noise cancelation techniques are also applied in hearing aids and cochlear implants. For hearing-impaired persons it is necessary to increase the volume of speech signals. A plain amplification of the signals is not a solution, since it does not increase intelligibility and, in addition, increases the power of noise. A better strategy, which is also commercially available, is to employ two microphones and an adaptive system.

3.2 The basic adaptive system

The adaptive system continually compares the output signal of an adaptive filter to a desired output signal. By observing the error between the output of the adaptive filter and the desired output, an adaptation algorithm can update the filter coefficients with the aim of minimizing an error function [76]. Fig. 3.1 shows the basic schematic diagram of an adaptive filter, where x(n), y(n), d(n), and e(n) are the input, output, desired output and error signals, respectively. For the convenience of presentation, the adaptive system considered in this chapter is assumed to be a Finite Impulse Response (FIR) filter with N coefficients, such that the output of the adaptive filter can be expressed as

y(n) = w^T(n) x(n),   (3.1)

where

w(n) = [w₁(n), w₂(n), . . . , w_N(n)]^T   (3.2)

is a vector containing the coefficients of the FIR filter, and

x(n) = [x(n), x(n − 1), . . . , x(n − N + 1)]^T   (3.3)

is a vector containing the input samples. The error signal, which has been mentioned in the previous paragraph, can be expressed as

e(n) = d(n) − y(n).   (3.4)

The structure of the adaptive system is shown in Fig. 3.2. This structure is also known as the tapped-delay line, since the input is "tapped" through the delay elements. The coefficients of the FIR filter are sometimes called the tap-weights. We will see in this thesis that most gradient adaptive algorithms are more or less based on this simple structure.

Figure 3.2: Block diagram of an adaptive filter (tapped-delay line).
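A minimal Python sketch of the tapped-delay line of Fig. 3.2 follows, showing how (3.1)-(3.4) translate into code; the filter length, the random test signals and the buffer-shifting convention are illustrative, and the weight update itself is deliberately left to the algorithms of sections 3.6 and 3.7.

import numpy as np

def filter_output_and_error(w, x_buffer, d_n):
    """One time step of the tapped-delay line of Fig. 3.2.

    w        : (N,) tap-weight vector, eq. (3.2)
    x_buffer : (N,) most recent input samples [x(n), x(n-1), ..., x(n-N+1)], eq. (3.3)
    d_n      : desired output d(n)
    """
    y_n = w @ x_buffer          # filter output, eq. (3.1)
    e_n = d_n - y_n             # error signal, eq. (3.4)
    return y_n, e_n

# usage: shift the buffer by one sample every iteration
N = 8
w = np.zeros(N)
x_buffer = np.zeros(N)
for x_n, d_n in zip(np.random.randn(100), np.random.randn(100)):
    x_buffer = np.concatenate(([x_n], x_buffer[:-1]))
    y_n, e_n = filter_output_and_error(w, x_buffer, d_n)
    # an adaptive algorithm (e.g. the NLMS of section 3.6) would now update w from e_n and x_buffer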

3.3 Objective function

An adaptive algorithm tries to minimize an objective function, usually denoted as J_w. The choice of this function has a profound impact on the properties of the whole algorithm. It influences the rate at which the iterative adjustment process converges to its steady state. This state represents a point at which the objective function achieves its minimal value. Moreover, the objective function determines the robustness of the system and its stability. By carefully selecting the objective function we can condition its shape to have a continuous, convex character with a single "easy-to-track" minimum. Among the most popular objective functions that are used in adaptive systems are:

The mean-square error (MSE)
J_w = E[e²(n)]

The least-square error (LS)
J_w = (1/N) Σ_{i=1}^{N} e²(i)

The weighted least-square error (WLS)
J_w = Σ_{i=1}^{N} λ^{(N−i)} e²(i)

In some research papers the objective function is also referred to as the error performance surface, since the signal e(n) usually denotes an estimation error. The index w in the definition of the functions above signifies their dependence on the tap-weight vector w. Since adaptive methods are iterative, the values of the tap-weights are continuously updated based on the current value of the objective function. It is therefore one of the most useful quantities for describing the state of the filtering process.
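The three objective functions listed above can be evaluated on a finite record of errors as in the following sketch; the MSE expectation is approximated here by a sample mean, and the forgetting factor λ is an illustrative choice.

import numpy as np

def mse(e):
    """Mean-square error J_w = E[e^2(n)], approximated by the sample mean."""
    return np.mean(e ** 2)

def ls(e):
    """Least-square error J_w = (1/N) * sum_i e^2(i)."""
    return np.sum(e ** 2) / len(e)

def wls(e, lam=0.99):
    """Weighted least-square error J_w = sum_i lam^(N-i) * e^2(i)."""
    N = len(e)
    weights = lam ** (N - np.arange(1, N + 1))
    return np.sum(weights * e ** 2)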

3.4 Properties of adaptive algorithms

3.4.1 Convergence rate

This is the primary parameter we look at when comparing different adaptive algorithms. It determines the number of iterations required to reach the vicinity of a steady-state solution. It is desirable to achieve the highest rate possible. Since in many applications the system has to meet stringent deadlines, the convergence must be fast enough to meet those deadlines without affecting the performance. This is also the case in speech signal processing, since speech is considered stationary only in short frames no longer than 30 ms.

3.4.2 Computational complexity

Hand in hand with algorithm’s efficiency goes the computational complexity. This value shows the number of operations a DSP would have to perform to manage to keep the algorithm working. Since the power of DSP chips tends to increase over years, the number of additions and multiplications no longer represents a limitation for the design of an adaptive system. However, we can also see a tendency in research to develop more and more complex methods and an obvious interest is still to achieve the lowest possible complexity. The computational complexity is the determining factor of algorithm’s demand of resources. By resources we usually mean the time of processing and the memory for data storage. Adaptive algorithms are iterative methods that repeat their job all round. Thus, to estimate the computational power needed, we have to count the number of operations in a single iteration. For block-based algorithms this value determines the

31

number of operations needed to process N consecutive samples whereas for sample-based algorithm’s it is only for a single sample. Therefore, when comparing different algorithms together, the strategy is to count the number of operations for a block of N samples.

3.4.3 Misadjustment (steady-state error)

The fact that an algorithm converges to its steady-state solution does not guarantee that the enhanced speech signal is of high quality and/or intelligibility. The steady-state error signifies that there is always some misadjustment. The requirement is to keep the error as low as possible with minimal perceptual effects.

3.4.4 Robustness

This may be viewed as a combined requirement of maximum immunity against internal errors, such as quantization and round-off errors and insensitivity to external errors. Sometimes, however, it is better for the algorithm to be sensitive to certain changes of the environment, such as nonstationarity of speech signals and noise processes. The trade-off between sufficient sensitivity and relative robustness is often a difficult task to solve.

3.4.5 Structural efficiency

In the era of powerful computational facilities certain questions arise when choosing the filtering structure. Finite-duration filters were, until lately, the preferred structures mainly due to their simplicity and efficiency. What we see now, however, in the field of computer technology is the application of parallel processors and multitasking at the lowest level. Texas Instruments, one of the worldwide leaders in DSP technology, started to integrate multithreading capabilities into DSP development tools a few years ago and thus provided support for the implementation of algorithms with a parallel structure.

3.5 Experimental evaluation used throughout this thesis

To prove the functionality of the adaptive techniques developed in this thesis we use three typical speech enhancement applications:

• Identification of an impulse response of an unknown environment
• Dual-channel suppression of background noise from the speech signal
• Inverse filtering of a distorted speech signal

These will be referred to in this thesis as EXPEV1, EXPEV2 and EXPEV3. A detailed description of each of the experimental scenarios is given further in this section.



Figure 3.3: Scatter plots of the input vectors x(n) = [x_n, x_{n−1}], n = 1, . . . , 3000, corresponding to a) white noise, b) low-pass filtered white noise with f_c = 0.6·f_s/2 and c) low-pass filtered white noise with f_c = 0.3·f_s/2

3.5.1 Input signals

In all experiments that we conduct we use some typical input signals encountered in the literature. In the first trials with new approaches we usually apply a white noise signal with Gaussian PDF and variance σ², i.e. N(0, σ²). To prove the efficiency of algorithms against colored input data we use another signal, in which the samples are mutually correlated. This signal is generated by filtering the white noise through a 20th order low-pass FIR filter with a cutoff frequency of f_c = 0.3. The problem of correlated input data is best illustrated by the scatter plots in Fig. 3.3. In the first example, which corresponds to the white noise signal (uncorrelated), the cluster has an approximately circular shape, meaning the vectors are mutually independent. If we pass the white noise signal through the FIR filter described above with a cutoff frequency of 0.6, we notice that the shape deforms into an ellipse, as illustrated in the middle scatter plot. A further reduction of the cutoff frequency to 0.3 deforms it even more (the rightmost scatter plot). Hence, the input vectors tend to be oriented more in a certain direction. A speech signal can be considered a special case of correlated input data. We will use it in various experiments to show the ability of an algorithm to deal with nonstationary signals. However, speech signals usually possess both noise-like parts, which are more or less uncorrelated, and periodic parts during which the signal is highly correlated. For the testing to be reliable we conducted several recordings in various acoustic environments to acquire some real speech signals. The recordings were not modified in any way. A snapshot of the 7 s long sentence "This is a simulated conversation for a project about mobile hands-free installation at Aalborg University" is shown in Fig. 3.4. This signal was narrated by a male colleague in a quiet room at the university.
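The two test inputs described above can be generated as in the following sketch; the FIR design via scipy.signal.firwin (with the cutoff normalized to the Nyquist frequency, as in Fig. 3.3), the sample count and the random seed are illustrative choices.

import numpy as np
from scipy.signal import firwin, lfilter

def make_test_inputs(n_samples=3000, order=20, cutoff=0.3, sigma2=1.0, seed=0):
    """White noise N(0, sigma2) and its low-pass filtered (correlated) version."""
    rng = np.random.default_rng(seed)
    white = np.sqrt(sigma2) * rng.standard_normal(n_samples)
    h = firwin(order + 1, cutoff)     # 20th-order low-pass FIR (21 taps)
    colored = lfilter(h, 1.0, white)
    return white, colored

white, colored = make_test_inputs()
# scatter data of Fig. 3.3: pairs of adjacent samples [x(n), x(n-1)]
pairs = np.column_stack((colored[1:], colored[:-1]))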



Figure 3.4: A snapshot of the testing speech signal and its grammatical segmentation; Fs = 8000Hz, 16bits/sample, duration 1.8125s.

3.5.2 Models of unknown systems

In all of the experiments we try to estimate the parameters of an unknown system. We use the term "unknown system" for any environment which lies between the input and the desired signal. It may be of any character, whether acoustic, electric or mechanical. However, acoustic environments are often encountered in real-world applications and therefore we concentrate our research on them without any loss of generality. A typical task that we solve in many of our experiments involves the estimation of the parameters of a car interior acoustic system. This is to verify the functionality of algorithms in a session similar to hands-free communication. For this purpose we conducted a few recordings in a Ford Sierra. We use Maximum-Length Sequence (MLS) signals¹ to estimate the impulse response and the magnitude spectrum of the car interior (see Fig. 3.5). It is noteworthy that, in general, the impulse response of the car interior is of infinite length (Infinite Impulse Response (IIR)). Since the adaptive algorithms that we propose and present in this thesis use FIR filters, caution must be taken when choosing the number of filter coefficients. If N is low, we can expect a large misadjustment between the estimation and the reality. On the other hand, if N is too high, the computational complexity can become extensive and even unbearable.

In many situations, however, we use some artificially created unknown systems. First, we make use of a simple FIR filter with a low-pass character and a pre-specified cutoff frequency. The coefficients of the filter are the parameters to be estimated. The impulse response and the magnitude spectrum of this model are shown in Fig. 3.6. In a few cases it is more appropriate to use an FIR filter with a raised cosine character. For these purposes we use an RC filter with a cutoff frequency of 0.25 and a transition bandwidth of 0.3. The impulse response and the magnitude spectrum are depicted in Fig. 3.7. The next model of an unknown system is a random impulse response with an exponentially decaying tail.

¹ MLS sequences have similar spectral properties to white noise signals but, due to easier hardware implementation, they are the preferred signals in practice.



Figure 3.5: (a) Car interior impulse response estimated using FIR filter (M = 256) and MLS sequence; (b) Magnitude spectrum

A system with such an impulse response is intended to closely resemble the characteristics of a closed acoustic environment (a small room or a car interior). The main difference is that in our case it is constant (stationary) and has a finite duration. The impulse response is created by

h(n) = 0.1 e^{−4n/M} z(n),   n = 0, 1, . . . , M − 1,   (3.5)

where M is the required number of filter coefficients and z(n) is a zero-mean random signal with unit variance. The plot of the impulse response and its magnitude spectrum are shown in Fig. 3.8. The plot depicts a response with 20 coefficients.
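Eq. (3.5) translates directly into code; the random generator and seed below are illustrative choices.

import numpy as np

def decaying_random_ir(M=20, seed=0):
    """Random impulse response with an exponentially decaying tail, eq. (3.5)."""
    rng = np.random.default_rng(seed)
    n = np.arange(M)
    z = rng.standard_normal(M)              # zero-mean, unit-variance random signal z(n)
    return 0.1 * np.exp(-4.0 * n / M) * z   # h(n) = 0.1 * exp(-4n/M) * z(n)

h = decaying_random_ir(20)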

3.5.3 EXPEV1 - Identification of an impulse response of an acoustic environment

In the first experimental scenario we try to estimate the parameters of an unknown system using an adaptive algorithm. The estimation is required to be accurate and reliable. Therefore, in experiments involving the estimation of e.g. the impulse response of a car interior, at least 256 tap-weights are required in the adaptive filter. The setup of the EXPEV1 system is shown in Fig. 3.9. The type of the common input signal to both systems is chosen on an individual basis, according to the objectives of the experiment. The signal v(n) in Fig. 3.9 is called the measurement noise and it represents all kinds of errors encountered in real-world measurements. In EXPEV1 it will be modeled as white noise with normal distribution N(0, σ_v²). Unless otherwise specified, the variance σ_v² will be fixed at 0.0316, which corresponds to a level of −15 dB.

3.5.4 EXPEV2 - Dual-channel suppression of background noise from the speech signal

This is a typical speech enhancement application in which the adaptive filter is used to cancel a background noise signal, a correlated version of which is available.



Figure 3.6: (a) Impulse response of the 20th order low-pass FIR filter with fc = 0.3; (b) Magnitude spectrum

The schematic description of the experimental scenario is shown in Fig. 3.10. Here, the clean speech signal s(n) is corrupted by additive noise, which is taken from the noise source v(n). Since the noise has to travel across the acoustic environment A (the unknown system to be identified), it changes to v′(n). It is assumed that a correlated version of v(n) is available (e.g. by recording the noise itself); we denote this version as x(n) and it is the input to the adaptive system. The coupling between the unattainable v(n) and the recorded x(n) is through an unknown acoustic environment B. The dual-channel scenario employed in the EXPEV2 experiments is in contrast with the single-channel techniques of background noise cancelation introduced in chapter 2, such as spectral subtraction or Wiener filtering. These methods do not use a secondary channel and estimate the correlated version of the noise from the corrupted speech signal.

3.5.5 EXPEV3 - Inverse filtering of a distorted speech signal

The technique of inverse filtering is based on the idea that if a certain signal is distorted by a system with a transfer function H(f), it can be restored by passing it through 1/H(f). The task is to find the inverse transfer function by means of adaptive processing. In Fig. 3.11 we can see what components the EXPEV3 experimental system consists of. The delay introduced in the upper branch of Fig. 3.11 is approximately equal to the combined delay caused by the unknown system and the adaptive filter. The speech excitation used in EXPEV3 will be the same as in EXPEV1 and EXPEV2, see Fig. 3.4.

3.6 Reference Algorithm 1 - The Normalized Least Mean Square (NLMS)

The NLMS is definitely the most popular adaptive algorithm and has been used for almost 40 years.



Figure 3.7: (a) Impulse response of a raised cosine filter with fc = 0.25 and ∆b = 0.3; (b) Magnitude spectrum

This algorithm conceptually stems from the Least Mean Squares (LMS) algorithm, which was first developed by Widrow and Hoff [77]. The LMS algorithm is a stochastic gradient method that iteratively updates the coefficients of its adaptive filter (the tap-weights) in the direction of the negative gradient vector. This vector is calculated as the partial derivative of the Mean-Square Error (MSE) function with respect to the tap-weights [24, pp. 231-238]. The LMS algorithm is described by the following equations

e(n) = d(n) − w^T(n) x(n)
w(n + 1) = w(n) + µ_LMS x(n) e(n).   (3.6)

The term x(n)e(n), which is incorporated in the tap-weight correction term in the second line of (3.6), is an "instantaneous estimate" of the gradient of the MSE function J_w. Specifically, the estimate is given by ∂J_w/∂w_n = −2e(n)x(n) [24, p. 236]. Thus, the adaptation proceeds down the gradient of the MSE function until a minimum point of the function is reached. The iterative weight-adjustment process of the NLMS algorithm is described by

w(n + 1) = w(n) + ( µ / ‖x(n)‖² ) x(n) e(n),   (3.7)

where µ is the step size controlling the convergence rate, stability, and misadjustment. According to Slock [65] it must be chosen in the range 0 < µ < 2 to ensure stable operation. The normalization term in (3.7) is the Euclidean norm².

² The Euclidean norm of a vector a is defined as ‖a‖ = √( a₁² + a₂² + · · · + a_K² ).
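A minimal sample-by-sample NLMS of (3.6)-(3.7), exercised under an EXPEV1-like identification setup, is sketched below. The small constant added to the norm (a common practical safeguard against division by zero, not part of (3.7)), the randomly drawn "unknown system" and the signal lengths are illustrative only.

import numpy as np

def nlms_identify(x, d, N=20, mu=0.5, eps=1e-8):
    """Sample-by-sample NLMS, eqs. (3.6)-(3.7); returns the tap-weights and the error."""
    w = np.zeros(N)
    e = np.zeros(len(x))
    x_buf = np.zeros(N)
    for n in range(len(x)):
        x_buf = np.concatenate(([x[n]], x_buf[:-1]))          # x(n) = [x(n), ..., x(n-N+1)]^T
        y = w @ x_buf                                          # filter output (3.1)
        e[n] = d[n] - y                                        # error (3.4)
        w = w + (mu / (x_buf @ x_buf + eps)) * x_buf * e[n]    # NLMS update (3.7)
    return w, e

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h_true = 0.2 * rng.standard_normal(20)                    # "unknown system" to identify
    x = rng.standard_normal(5000)                              # white input, 0 dB
    d = np.convolve(x, h_true)[:len(x)] + np.sqrt(0.0316) * rng.standard_normal(len(x))
    w, e = nlms_identify(x, d, N=20, mu=0.5)                   # w approaches h_true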


Figure 3.8: (a) Exponentially decaying random impulse response; (b) Magnitude spectrum

3.6.1 Quadratic nature of the MSE function

To study the convergence rate of the NLMS algorithm one has to investigate the properties of the objective function J_w and its dependence on the tap-weight vector w. From the theory of optimal filtering we know that the optimal solution to the problem of MSE estimation is the Wiener solution [24, p. 104]

w_o = R^{−1} p,   (3.8)

where R = E[x(n)x^T(n)] denotes the N-by-N correlation matrix of the inputs x(n), x(n − 1), . . . , x(n − N + 1) (see Fig. 3.2) and p = E[x(n)d(n)] is the N-by-1 cross-correlation vector between the inputs and the desired response d(n). Equation (3.8) is referred to as the Wiener-Hopf equations. From section 3.3 we recall that J_w = E[e²(n)]. Thus, J_w is a function of time, and for each time instant n it produces a single value of MSE. However, J_w is also a function of the tap-weight vector w [24]

J(w) = σ_d² − p^T R^{−1} p + (w − R^{−1}p)^T R (w − R^{−1}p).   (3.9)

Using the Wiener-Hopf equations (3.8), we may reformulate (3.9) as

J(w) = J_min + (w − w_o)^T R (w − w_o),   (3.10)

in which J_min is an irreducible term which corresponds to the minimum of the MSE function. No solution can achieve a value of J lower than J_min. This is clearly illustrated in Fig. 3.12, which shows the function J(w₀, w₁) for the case of two tap-weights. Notice the shape of the function, which is quadratic with respect to the tap-weights. Generally, since MSE estimation is employed, the function J(w) will always be a multidimensional paraboloid with a single global minimum. The minimal point represents the Wiener solution w_o. From the plot of w₀ versus w₁ we observe that the loci are elliptic. This signifies that the input data are correlated. For the case of uncorrelated input signals such as white noise, the shapes will be circular. More on this topic will be discussed further in section 3.6.4.
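The following sketch illustrates (3.8)-(3.10) numerically for a two-tap filter: R and p are estimated from data by sample averages, the Wiener solution is computed, and J(w) is evaluated on a grid of tap-weight pairs. The synthetic desired signal in the usage example is an assumption made only to show that w_o is recovered.

import numpy as np

def wiener_solution_and_surface(x, d, grid=np.linspace(-2, 2, 81)):
    """Wiener solution (3.8) and the quadratic MSE surface (3.10) for a two-tap filter."""
    X = np.column_stack((x[1:], x[:-1]))                  # rows: [x(n), x(n-1)]
    dn = d[1:]
    R = (X.T @ X) / len(dn)                               # correlation matrix R
    p = (X.T @ dn) / len(dn)                              # cross-correlation vector p
    w_o = np.linalg.solve(R, p)                           # Wiener-Hopf solution (3.8)

    J_min = np.mean(dn ** 2) - p @ w_o                    # irreducible MSE
    W0, W1 = np.meshgrid(grid, grid)
    J = np.empty_like(W0)
    for i in range(W0.shape[0]):                          # J(w) = J_min + (w-w_o)^T R (w-w_o)
        for j in range(W0.shape[1]):
            dw = np.array([W0[i, j], W1[i, j]]) - w_o
            J[i, j] = J_min + dw @ R @ dw
    return w_o, J

rng = np.random.default_rng(0)
x = rng.standard_normal(10000)
d = 0.8 * x + 0.3 * np.concatenate(([0.0], x[:-1])) + 0.05 * rng.standard_normal(10000)
w_o, J = wiener_solution_and_surface(x, d)                # w_o is close to [0.8, 0.3]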



Figure 3.9: Schematic diagram of the EXPEV1 system identification scenario

3.6.2 Experimental evaluation of the convergence performance

In this part we show the convergence behavior of the NLMS algorithm under the EXPEV1 scenario (see section 3.5.3). The input signal is white noise with a level of 0 dB. The measurement noise has a level of −15 dB, which corresponds to a variance of 0.0316. Fig. 3.13 shows the MSE functions of NLMS algorithms with different step sizes. Notice the steady-state values of all curves. The theoretically minimal steady-state error is the value of the measurement-noise variance, i.e. σ_v² = 0.0316. We see that by increasing the step size of the algorithm we achieve faster convergence but increase the steady-state error. By comparing the MSE curves of the NLMS algorithm with that of the LMS we see that the NLMS is faster (for the same value of the steady-state error). In the same experiment we can demonstrate the stochastic nature of the weight-adjustment process. This is illustrated on the contour plot in Fig. 3.14. In this case, however, we have used a first-order low-pass FIR filter as the unknown system (see section 3.5.3). From the contour plot we see that when the step-size parameter is small, the adjustment is smooth, whereas for a bigger one it is more aggressive. This phenomenon is also visible near the optimum point w_o, where for the bigger step size the chaotic movement has a higher variance. The results of the convergence rate analysis presented here are merely restatements of a well-known theory developed throughout the years by several researchers, e.g. [24, 13, 29, 27].



Figure 3.10: Schematic diagram of the EXPEV2 dual-channel system for background noise suppression

3.6.3 Computational complexity analysis

The weight adjustment (3.7) combined with (3.1) and (3.4) constitutes the NLMS algorithm. In a single iteration, N multiplications are performed to compute the output, and a further N multiplications are performed to compute the gradient vector. The last N multiplications are required by the Euclidean norm³. We also cannot forget the division operation in the correction term of (3.7). Hence, for a block of K output samples, the total number of multiplications is 3NK plus K divisions. The computation of e(n) consumes one addition⁴ and N additions are required by the weight adjustment. Thus, for a block of K output samples we need K(N + 1) additions. Up to now we have not yet considered the step-size multiplication of (3.7). This can be treated as a regular multiplication operation, but many DSPs can do it more efficiently if they perform a shift operation instead. The condition is that the value of µ must be a power of 2. However, to simplify comparisons, we will treat the shift operations, as well as rounding, normalization, saturation, read-write and other operations, simply as "other". The results are summarized in Table 3.1. It is common practice today to estimate and compare an algorithm's complexity in terms of the number of multiplications. However, this measure does not truly represent the demands of a given method, since most DSPs today have built-in hardware support for MAC operations. Usually they perform this operation in a single instruction cycle, as they do the addition or the subtraction. In this thesis we are not trying to provide a precise analysis of the computational complexity.

³ In fact, the Euclidean norm would require Multiply and Accumulate (MAC) operations, but today's DSPs are capable of performing MAC operations at the same rate as multiplications.
⁴ Here we use the term addition for both arithmetic operations, the addition and the subtraction.



Figure 3.11: Description of the EXPEV3 experimental system used for inverse filtering of a distorted speech signal

NLMS operation   general N     N = 64    N = 128   N = 1024
#MULT            3NK           49152     98304     786432
#DIV             K             256       256       256
#ADD             K(N + 1)      16640     33024     262400
#OTH             K             256       256       256

Table 3.1: Analysis of the number of operations in the NLMS algorithm for the frame length K = 256 and various numbers of filter coefficients

Rather, we concentrate on presenting rough comparisons between different adaptive methods in terms of the number of operations listed in Table 3.1.

3.6.4 Convergence rate deterioration due to correlated input data

In this part we show that the convergence rate of the NLMS algorithm, as examined in section 3.6.2, may significantly deteriorate when the input data is correlated. We employ the same EXPEV1 setup, this time with a speech signal as the input. For this purpose we use the speech signal in Fig. 3.4. We compare the results of the convergence analysis for the same step-size parameter, i.e. µ = 0.3, and the same unknown-system impulse response, the 20th order low-pass FIR filter with f_c = 0.3. The results are summarized in Fig. 3.15. Of the two plots presented, b) is the Mean Square Deviation (MSD) of the tap-weight vector, which is defined as [24, p. 325]



Figure 3.12: MSE performance of the filter with two tap-weights a) quadratic function J(w0 , w1 ) with single global minimum, b) elliptic character of the loci of points with equal J(w)

D(n) = E[ ||wo − w(n)||² ].        (3.11)

From both plots we see that, due to the statistical dependency of adjacent input samples, the convergence rate is significantly lower. Moreover, the MSE curve for the speech input has a "periodic-like" character, which is due to periodic components in the input signal. We cannot expect the same convergence behavior of the NLMS algorithm in general, since it depends on the character of the speech segment the algorithm converges on: for noise-like segments the convergence will be much faster than for voiced segments.

3.6.5 Convergence rate deterioration due to high number of filter coefficients

It is also a well-known fact that the NLMS convergence rate suffers when the number of coefficients increases. We have verified this phenomenon, again under EXPEV1 conditions, for the cases N = 20, 40 and 100. The unknown system was again the 20th-order low-pass FIR filter and the step-size parameter was fixed at µ = 0.3. The results are depicted in Fig. 3.16. We see that the rate decreases as the filter length is increased. The steady-state error remains at the same level in all three cases, which means that the redundant coefficients are adapted to zeros and do not increase the steady-state error. The problem of longer convergence time for long filters is crucial in acoustic echo cancelation. Acoustic echo is essentially a signal which needs to be canceled by an adaptive filter with a high number of coefficients.


[Plot: mean-square error vs. time in samples for the NLMS with µ = 0.5, 0.7 and 0.9 and the LMS with µLMS = 0.01; the level 0.0316 marks the variance of the measurement noise.]

Figure 3.13: Convergence of the MSE function of the NLMS algorithm for different step sizes

If the length of the filter does not correspond to the length of the impulse response of the acoustic environment, the suppression of the echo is imperfect and the steady-state error is increased.

3.7 Reference Algorithm 2 - The Recursive Least Squares (RLS)

As an alternative to the NLMS method we can also employ the RLS method to solve the system of equations (3.1) and (3.4). The idea is again to apply recursive updates in every iteration of the algorithm. It is very similar to the gradient-based methods, but the choice of the objective function is different: it is the Least Squares (LS) objective function listed in section 3.3. The objective function is similar to that of the NLMS algorithm, but the difference has a profound impact on the final form of the algorithm equations. Specifically, the input correlation matrix is now defined using time averaging,

Φ(n) = Σ_{i=1}^{n} λ^(n−i) x(i) x^T(i) + δ λ^n I,        (3.12)

where δ is a positive real number called the regularization parameter. This parameter is used to stabilize the recursive process, as we will see later.


[Contour plot of the MSE objective function with the tap-weight trajectories of the NLMS algorithm for µ = 0.1 and µ = 0.3.]

Figure 3.14: Contour plot of the 2-dimensional tap-weight adjustment process of the NLMS algorithm.

The parameter λ is called the exponential weighting factor or the forgetting factor and is a positive real number close to, but less than, unity. Accordingly, the N-by-1 cross-correlation vector z(n) between the input and the desired response is defined as

z(n) = Σ_{i=1}^{n} λ^(n−i) x(i) d(i).        (3.13)

In both equations, i.e. (3.12) and (3.13), it is assumed that the data prior to time n = 1 are zero (prewindowing method). According to the method of least squares, the optimum value of the N-by-1 tap-weight vector w(n), for which the LS objective function attains its minimum value, is obtained by solving

Φ(n) w(n) = z(n).        (3.14)

The system of equations (3.14) is called the normal equations and represents a counterpart to the Wiener-Hopf equations (3.8). The correlation matrix and the cross-correlation vector may be computed recursively by

Φ(n) = λ Φ(n−1) + x(n) x^T(n),
z(n) = λ z(n−1) + x(n) d(n).        (3.15)


Figure 3.15: Deterioration of the NLMS convergence performance due to correlated input data (speech); a) Mean-square value of the error signal (MSE); b) Mean-square deviation of the tap-weight vector (MSD)

In order to calculate the solution wo by the LS method in accordance with (3.14), we have to know the inverse correlation matrix Φ⁻¹(n). By applying a transform developed by Householder [25] to the first equation of (3.15) we obtain

Φ⁻¹(n) = λ⁻¹ Φ⁻¹(n−1) − [λ⁻² Φ⁻¹(n−1) x(n) x^T(n) Φ⁻¹(n−1)] / [1 + λ⁻¹ x^T(n) Φ⁻¹(n−1) x(n)].        (3.16)

The second term in (3.16) may be simplified by defining a gain vector [24, p. 441]

k(n) = [λ⁻¹ Φ⁻¹(n−1) x(n)] / [1 + λ⁻¹ x^T(n) Φ⁻¹(n−1) x(n)].        (3.17)

Using this definition, the recursive formula for matrix inversion becomes

Φ⁻¹(n) = λ⁻¹ Φ⁻¹(n−1) − λ⁻¹ k(n) x^T(n) Φ⁻¹(n−1).        (3.18)

By rearranging the terms in (3.17) and using (3.18) we may simplify the defining equation of the gain vector to

k(n) = Φ⁻¹(n) x(n).        (3.19)

We may now express the weight adjustment of the RLS algorithm in the form known from the literature [54, 24, 15]. By substituting (3.15) into the normal equations (3.14) and using (3.18) for Φ⁻¹(n), we get

w(n) = w(n−1) + k(n) [d(n) − x^T(n) w(n−1)]
     = w(n−1) + k(n) ξ(n),        (3.20)

where ξ(n) = d(n) − x^T(n) w(n−1) is the a priori estimation error, since it is computed using the old weight vector w(n−1). This completes the derivation of the RLS algorithm.
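For reference, one iteration of the recursion (3.16)-(3.20) can be sketched in a few lines of NumPy. This is our own minimal illustration (the function name and defaults are assumptions), with the inverse correlation matrix initialized as Φ⁻¹(0) = δ⁻¹ I, in line with the initialization Φ(0) = δI discussed in the next section.

import numpy as np

def rls_iteration(w, P, x_vec, d, lam=0.99):
    """One RLS iteration following (3.16)-(3.20); P is the inverse
    correlation matrix Phi^{-1}(n-1), typically initialized as (1/delta)*I."""
    pi = P @ x_vec                           # intermediate quantity pi(n)
    k = pi / (lam + x_vec @ pi)              # gain vector k(n), eq. (3.17)
    xi = d - w @ x_vec                       # a priori estimation error xi(n)
    w = w + k * xi                           # weight adjustment, eq. (3.20)
    P = (P - np.outer(k, x_vec @ P)) / lam   # inverse correlation update, eq. (3.18)
    return w, P, xi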



Figure 3.16: Deterioration of the NLMS convergence performance due to high number of filter coefficients; a) Mean-square value of the error signal (MSE); b) Mean-square deviation of the tap-weight vector (MSD)

3.7.1 Experimental evaluation of the convergence performance

Again, we would like to verify the theoretical results under EXPEV1 conditions, i.e. under the same conditions as for the NLMS algorithm. The comparison is made for an unknown system which is a 20th-order low-pass FIR filter. Based on a detailed study reported by Moustakides [44], the initial value of the correlation matrix is set to Φ(0) = δI, where δ is the regularization parameter. According to Moustakides, the parameter δ should be assigned a small value for a high Signal to Noise Ratio (SNR) of the input data and a large value for a low SNR. In our experiments we have used

δ = σx² (1 − λ)^α,        (3.21)

where α has been set equal to unity to emphasize the high-SNR case. For the other cases, i.e. when the SNR is lower (< 10 dB), we may use −1 ≤ α < 0. For the worst SNR values (< −10 dB) we usually use α ≤ −1. The results of the experiment are shown in Fig. 3.17. We see that the RLS algorithm exhibits better convergence than the NLMS. This is clearly exemplified by the MSE and MSD curves for the RLS with λ = 0.99, which correspond to the same steady-state level as the NLMS. Next we compare the results of the same experiment but with speech as the input signal - see Fig. 3.18. As seen from the plots, the performance of the RLS is superior to that of the NLMS in all aspects. Even in the steady state, the RLS attains a smaller level of the residual error.
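The choice of δ according to (3.21) can be wrapped in a small helper. The sketch below is our own illustration: the SNR thresholds are taken from the text, while the value of α used in the intermediate regime is a hypothetical choice from the admissible range −1 ≤ α < 0.

def rls_regularization(sigma_x2, lam, snr_db):
    """Choice of delta = sigma_x^2 * (1 - lambda)^alpha according to (3.21)."""
    if snr_db >= 10:
        alpha = 1.0     # high SNR: small delta
    elif snr_db >= -10:
        alpha = -0.5    # lower SNR: -1 <= alpha < 0 (illustrative midpoint)
    else:
        alpha = -1.0    # very low SNR: alpha <= -1
    return sigma_x2 * (1.0 - lam) ** alpha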


[Plots of MSE and MSD vs. time in samples for the NLMS with µ = 0.1 and the RLS with λ = 0.99, δ = 0.05 and with λ = 0.95, δ = 0.05.]

Figure 3.17: Convergence rate analysis of the RLS algorithm under EXPEV1 conditions and comparison with the NLMS; the mean values were obtained by averaging over 100 independent trials

3.7.2 Computational complexity analysis

We now concentrate on estimating the approximate complexity of the RLS algorithm. We start with the gain vector calculation (3.17). For reasons of numerical stability, the computation of the gain vector proceeds in two stages: first, an intermediate quantity π(n) = Φ⁻¹(n−1) x(n) is calculated, followed by k(n) = π(n) / (λ + x^T(n) π(n)). The computation of π(n) requires N² multiplications and the computation of the denominator requires another N multiplications; the gain vector further requires 1 addition and N divisions. Next we consider the calculation of the a priori error ξ(n) in (3.20), which requires N multiplications and N subtractions. The next step in the RLS algorithm is the weight-adjustment equation (3.20); this burdens the processing with N multiplications and N additions. Finally, the recursive update of the inverse correlation matrix is done using Φ⁻¹(n) = λ⁻¹ Φ⁻¹(n−1) − λ⁻¹ k(n) x^T(n) Φ⁻¹(n−1). This requires 2N² multiplications combined with 2N² other operations (for all the multiplications by λ⁻¹), and N² subtractions remain between the old value of the inverse correlation matrix and the correction. Finally, the output of the filter y(n) is obtained by performing N multiplications between the updated weight vector and the input vector. Hence, counting up the total number of each operation in the same way as in the NLMS complexity analysis (see Table 3.1), we arrive at the results in Table 3.2. The results presented in Table 3.2 are calculated for a block of K = 256 output samples. By comparing the numbers in Table 3.2 with those in Table 3.1 we immediately see that the complexity of the RLS dramatically surpasses that of the NLMS. The biggest discrepancy is in the number of multiplications, which depends on N² for the RLS. This is in deep contrast with the NLMS, which depends only on N.


[Plots of MSE and MSD vs. time in samples for the NLMS with µ = 0.3 and the RLS with λ = 0.995, δ = 0.05, both with speech input.]

Figure 3.18: Experimental comparison of the convergence rate of the NLMS and RLS algorithms under EXPEV1 conditions with speech input; the mean values were obtained by averaging over 100 independent trials; the parameters of the algorithms were tuned to achieve the highest rate of convergence

RLS operation   #MULT           #DIV     #ADD               #OTH
for N           K(3N² + 4N)     KN       K(N² + 2N + 1)     2KN²
for N = 64      3211264         16384    1081600            2097152
for N = 128     12713984        32768    4260096            8388608
for N = 1024    806354944       262144   268960000          536870912

Table 3.2: Analysis of the number of operations in the RLS algorithm for the frame length K = 256 and various numbers of filter coefficients

3.8 Conclusion

In this chapter we described the two most popular adaptive algorithms used for speech enhancement. The NLMS algorithm is used, in its simplest or in a modified form, in various applications including noise and echo cancelation. Adaptive systems are usually implemented on a DSP, which allows certain signal processing operations to be performed efficiently; that is why we concentrate our complexity analysis on multiplications, additions and other operations. From the presented convergence rate analysis it follows that the RLS algorithm outperforms the NLMS in all aspects: the MSE and MSD curves show that the RLS is faster and that its excess error level is lower. On the other hand, the analysis of the computational complexity shows that the RLS requires many more operations than the NLMS. Thus, there is a trade-off as to which algorithm to choose for speech enhancement applications. This is illustrated in Fig. 3.19.



Figure 3.19: Discrepancy between the computational complexity and the convergence time of NLMS and RLS algorithms; the symbol “?” indicates an algorithm that would be ideal in terms of low complexity and high convergence rate

The figure also depicts a hypothetical algorithm that would combine a low complexity with a high rate of convergence (a low convergence time). This is the motivation for the research into new adaptive algorithms that we address in the rest of this thesis.


Chapter 4
Alternative stochastic gradient adaptive algorithms

Gradient-based adaptive algorithms make use of the fact that the error performance surface is a quadratic function of the tap-weight vector. Thus, for each tap-weight vector there exists a single value of the objective function which determines the state of the adaptive process. The idea is to iteratively update the tap-weights in such a way that every new iterate lies in the direction of the gradient. Since the exact shape of the error surface is rarely known in advance and its computation would be extensive or even impossible, the algorithms use an instantaneous estimate of the gradient. This estimate is easily obtained using only the input data known up to the time of calculation. Since the data are usually samples of random processes (speech, seismic data, air pressure), such an estimate is stochastic rather than deterministic. Still, after a certain number of iterations we can assume that the tap-weights will approach the minimum point of the error surface, which is the goal. In this chapter we present several methods which we adopted and evolved in the initial phase of our research. The fundamentals of these methods are either known from the literature or were found by experimental analyses. We investigate the performance of these methods under unconventional conditions, such as speech input, an abrupt change of parameters or a high level of noise. The aim is not to provide an extensive analysis or comparison of methods but to show that certain modifications of the algorithms may lead to better results. First we consider simple methods derived from the LMS (NLMS) algorithm, such as the Leaky LMS, the Dead-Zone LMS or the Median LMS. Next we provide a discussion of a frequency-domain implementation of the LMS algorithm, and finally we propose an approach using the principle of simultaneous perturbations, which we named SPSA. In all experiments presented in this chapter we used 10 independent trials to obtain the ensemble-averaged results.


4.1 Methods derived from the LMS (NLMS) algorithm

Since the LMS has become a very popular adaptive algorithm, many researchers have attempted to modify it to alleviate some of its most serious problems, such as round-off errors and numerical instability [29, p. 88]. These variants include the leaky LMS (which combats the potential numerical problems during periods of very low SNR), the signed algorithms (which further simplify the numerical requirements), the dual-sign algorithm (and other quantized versions which attempt to simplify the numerics without trading off convergence speed), the "high-order" algorithms (which minimize l_p norms for p > 2), and the median LMS and other order-statistic algorithms (which attempt to optimize the LMS for use in impulsive environments).

4.1.1 Description of algorithms

Leaky LMS

The possible sensitivity to round-off errors and other disturbances exists in the LMS algorithm because the update

w(n + 1) = w(n) + µ e(n) x(n)        (4.1)

is essentially an integrator. The introduction of a small leakage into the tap-weight vector,

w(n + 1) = (1 − αγ) w(n) + α e(n) x(n),        (4.2)

protects the algorithm against such numerical problems. The parameter γ is called the leakage factor and is chosen such that αγ is greater than, but close to, 0. The leakage can provide an additional degree of stability. However, with the leakage applied, (4.2) no longer corresponds to an MSE estimation problem. The objective function minimized by the Leaky LMS is given by

J(w) = |e(n)|² + γ ||w(n)||².        (4.3)

Thus, the estimate of wo will be biased, and this bias is proportional to γ and also to the unknown wo. It is clear that the goal of the adaptive process is then not only to minimize the mean-square error but also the energy (the Euclidean norm) of the tap-weight vector. The Leaky LMS algorithm does not only help to solve the numerical problems of the LMS. It is also useful for improving the convergence properties when the input correlation matrix R(n) is ill-conditioned¹. This is exactly the case of correlated input data (e.g. the voiced parts of speech signals): the input correlation matrix will then exhibit unbalanced values on the main diagonal. From the theory of linear algebra [35] we know that such ill-conditioned matrices are hard to invert, which prevents the solution from being accurate.

¹ The condition number of a square matrix A is the ratio between its largest and smallest eigenvalues and is denoted χ(A).



Figure 4.1: Regularization of the input correlation matrix; (a) before regularization, χ(R) = 1.72; (b) after regularization with γ = 0.1, χ(R) = 12.07

By introducing the leakage factor γ in the Leaky LMS algorithm we can increase the energy of the values on the main diagonal and therefore regularize the correlation matrix,

Rr(n) = R(n) + γI,        (4.4)

where Rr(n) is the regularized matrix and I is the identity matrix. The effect of the regularization can be observed in Fig. 4.1.

Dead-zone LMS

Small values of the error signal e(n) may represent disturbances or noise but may also result from numerical instability. The Dead-Zone Least Mean Squares (DZ-LMS) algorithm is designed to mitigate the problems of round-off errors. The algorithm applies a dead-zone nonlinearity and stops updating the tap-weight vector when the error signal falls below a defined threshold. The dead-zone nonlinearity is defined as

g(x) = x − d   for x > d,
g(x) = 0       for −d < x < d,
g(x) = x + d   for x < −d,        (4.5)

where d > 0 is a threshold. When applied to the error signal, it converts the LMS update equation (4.1) to

w(n + 1) = w(n) + α g(e(n)) x(n).        (4.6)

The "dead-zone" nonlinear function has the shape shown in Fig. 4.2.

Signed-error LMS, Signed-data LMS and Signed-signed LMS

Although the LMS is computationally very efficient, there are always applications for which even 3NK multiplications (see Table 3.1) are too many. In all of the signed variants of the LMS algorithm, the sgn(.) function is used to replace the multiplication [43].



Figure 4.2: Nonlinear function of the type “dead-zone” used in DZ-LMS algorithm

The sgn(.) function is ordinarily defined as

sgn(x) = 1    for x > 0,
sgn(x) = 0    for x = 0,
sgn(x) = −1   for x < 0.        (4.7)

The sgn(.) function (4.7) can be applied either to the error signal (Signed Error Least Mean Squares (SE-LMS)), to the input data (Signed Data Least Mean Squares (SD-LMS)), or to both (Signed Signed Least Mean Squares (SS-LMS)):

SE-LMS:  w(n + 1) = w(n) + µ sgn(e(n)) x(n),
SD-LMS:  w(n + 1) = w(n) + µ e(n) sgn(x(n)),
SS-LMS:  w(n + 1) = w(n) + µ sgn(e(n)) sgn(x(n)).        (4.8)

It is intuitively clear that the gradient estimates of the SE-LMS and the SS-LMS may become rather chaotic. This has also been confirmed by Classen and Mecklenbrauker [9], who noted that the directions of the updates can differ significantly from the true gradient direction. In the worst case, which arises when the signed data sgn(x(n)) are used instead of the true data, there is a possibility of divergence and thus instability. Therefore, caution must be taken when employing these methods in practise, since they work only in specific environments. A precise analysis is not provided in this thesis; interested readers are referred to e.g. [29]. Nevertheless, we verified the convergence performance of all signed algorithms discussed above and the results are shown in Fig. 4.3.

Least Mean Fourth (LMF) algorithm

The LMF algorithm is closely related to the LMS. It tries to minimize a higher power of the error signal, specifically the fourth power:

Jw = E[e⁴(n)].        (4.9)

It is easy to show that the weight-adjustment formula of the LMF algorithm has the following form:

w(n + 1) = w(n) + µ e³(n) x(n).        (4.10)



Figure 4.3: Convergence performance of the signed algorithms - mutual comparison

Median LMS

The performance of the LMS and its derivatives degrades significantly when the input signals are corrupted by impulsive noise. One modification developed to combat this problem is the median LMS [78],

w(n + 1) = w(n) + µ med_K { x(n)e(n), x(n−1)e(n−1), . . . , x(n−K)e(n−K) }.        (4.11)

The median function is interesting since it tends to reject single occurrences of large spikes of noise. These spikes would otherwise affect the estimates of w(n) and introduce impulsive errors.
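The update rules (4.2), (4.6), (4.8), (4.10) and (4.11) differ only in how the correction term is formed from e(n) and x(n). The following minimal NumPy sketch collects them in one place; it is our own illustration, the parameter defaults are hypothetical, and µ plays the role of α in (4.2) and (4.6).

import numpy as np

def lms_variant_update(variant, w, x_vec, e, mu=0.1, gamma=1e-3, d_zone=0.01, hist=None):
    """One weight adjustment of the LMS variants described in section 4.1.1."""
    if variant == "leaky":                                    # eq. (4.2)
        return (1.0 - mu * gamma) * w + mu * e * x_vec
    if variant == "dead-zone":                                # eqs. (4.5)-(4.6)
        g = np.sign(e) * max(abs(e) - d_zone, 0.0)
        return w + mu * g * x_vec
    if variant == "signed-error":                             # eq. (4.8)
        return w + mu * np.sign(e) * x_vec
    if variant == "signed-data":
        return w + mu * e * np.sign(x_vec)
    if variant == "signed-signed":
        return w + mu * np.sign(e) * np.sign(x_vec)
    if variant == "lmf":                                      # eq. (4.10)
        return w + mu * e**3 * x_vec
    if variant == "median":                                   # eq. (4.11)
        # hist is a list of the past products x(n-i)*e(n-i), i = 0..K
        return w + mu * np.median(np.vstack(hist), axis=0)
    raise ValueError(variant)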

4.1.2 Experimental evaluation of the convergence rate

As usual, we evaluate the convergence rate of the discussed algorithms by conducting several experiments; here we again present results for the EXPEV1 scenario. Note that the convergence performance of the signed algorithms has already been evaluated in the previous section, with the results depicted in Fig. 4.3. In the EXPEV1 experiment we use a colored input signal obtained by filtering white noise through a low-pass FIR filter with the parameters M = 20 and fc = 0.8. The unknown system has a random exponentially decaying impulse response (see section 3.5). The filter order is set to 20. Plots of the MSE and MSD functions are given in Fig. 4.4.


[Plots of MSE and MSD vs. iteration; panel (a) compares the NLMS, RLS, Leaky LMS and Signed-error LMS, panel (b) compares the NLMS, RLS, LMF and Median LMS.]

Figure 4.4: Results of the EXPEV1 experiment for the alternative stochastic gradient algorithms; (a) Mean-square error; (b) Mean-square deviation

The parameters of the algorithms are set as follows:

Leaky LMS:         µ = 0.1, δ = 1·10⁻⁴
Signed-error LMS:  µ = 0.01
LMF:               µ = 200
Median LMS:        µ = 0.5, K = 50        (4.12)

4.1.3 Computational complexity analysis

Since some of the algorithms were designed to decrease the complexity of the conventional NLMS method, it is interesting to carry out a comparative analysis. The aim of this analysis is merely to estimate the number of operations involved in one iteration of the discussed algorithms. A deep investigation into e.g. the ANSI-C or assembler code of the methods may reveal properties that lead to results significantly different from those presented herein; this, however, is not an objective of this thesis. For the Leaky LMS algorithm the only difference from the conventional LMS is in the weight-adjustment equation (4.2). Here, an additional N "other" operations are needed to account for the leakage. In the DZ-LMS, a nonlinear function is applied to the error signal. This operation could be implemented, for example, as a look-up table; another solution is to employ two conditions, the overhead of which takes approximately 0-5 "other" operations and 3 additions/subtractions. A similar situation arises for all of the signed algorithms, where the sgn(.) function performs the nonlinear operation. Here, however, the operation takes just 1 or N "other" operations, depending on whether e(n) or x(n) is the argument of the sgn(.) function. The LMF algorithm requires 2 additional multiplications for the computation of e³(n). The Median LMS is the worst case of the above algorithms, since it incorporates the median function, which is not easily implementable: it would most probably be programmed using one or more loops with a large amount of overhead.


Oper.    Leaky-LMS   DZ-LMS     SE-LMS       LMF         Median LMS
#MULT    3NK         3NK        3NK          K(3N + 2)   3NK‡
#DIV     0           0          0            0           0‡
#ADD     K(N + 1)    K(N + 4)   K(N + 1)     K(N + 1)    K(N + 1)‡
#OTH     2NK         K(N + 5)   K(N + 1)†    NK          ‡

Table 4.1: Summary of the computational complexity analysis of the adaptive algorithms discussed in section 4.1; N denotes the filter order and K denotes the number of output samples; † - 2KN for SD-LMS, K(2N + 1) for SS-LMS; ‡ - the med(.) operation is not included

The interested reader is referred to the documentation of DSP vendors, such as Texas Instruments, since they usually equip their products with free libraries of useful algorithms, including a median function. The summary of the computational complexity analysis is given in Table 4.1.

4.1.4 Conclusion

The algorithms presented so far were designed to overcome some of the difficulties known from applications of the LMS or NLMS algorithm. We do not provide an extensive discussion of these methods, since they are not the main subject of this research. They were introduced in this thesis to point out some "tips and tricks" that can be used in stochastic gradient weight adjustment; we have used some of these techniques in the development of a novel adaptive approach based on an optimal step-size strategy. From the plots presented in Fig. 4.4 it is clear that the Leaky LMS algorithm outperforms the conventional NLMS. The performance of the Median LMS is also comparable with that of the NLMS and the Leaky LMS, mainly due to the improved estimate of the gradient vector. The signed algorithms will always exhibit suboptimal convergence performance compared to the conventional NLMS, since they discard some information in exchange for a lower complexity. From the summary of the computational complexity analysis in Table 4.1 it follows that the signed LMS algorithms, together with the Leaky LMS, outperform the other algorithms since they require the least amount of operations. On the other hand, the Median LMS is probably the most intensive method, as the actual procedure for its computation (which is not considered in Table 4.1) is an algorithm of its own.

4.2 Fast Block-LMS (FBLMS) algorithm

In the normalized version of the LMS algorithm described in section 3.6, the tap-weights (free parameters) of the FIR filter are adapted in the time domain. Recognizing that the Fourier transform maps time-domain signals into the frequency domain, we see that it is also possible to perform the adaptation of the filter coefficients in the frequency domain. The first account of this idea was published by Walzman and Schwartz in 1973 [74].


In a block-based adaptive filter, the incoming data sequence x(n) is sectioned into L-point blocks, where L is the number of samples in each block. The subsequent processing, i.e. the Fourier transform and the NLMS filtering process, is then applied on a block-by-block basis. The input data for a block k is defined by the set (x(kL + i)), i = 0, . . . , L−1, which can be written in matrix form as follows:

A^T(k) = [ x(kL)         x(kL+1)       · · ·   x(kL+L−1)
           x(kL−1)       x(kL)         · · ·   x(kL+L−2)
           ...           ...           · · ·   ...
           x(kL−M+1)     x(kL−M+2)     · · ·   x(kL+L−M) ].        (4.13)

The output produced by the filter consists of a set of L consecutive samples. Expressed as a vector, it is defined as

y(k) = A(k) w^T(k),        (4.14)

or, equivalently,

y(kL + i) = Σ_{j=1}^{M} wj(k) x(kL + i − j),   i = 0, 1, . . . , L − 1.        (4.15)

In (4.15) we can clearly recognize the operation of linear convolution. Next we turn to the tap-weight adjustment process (last line of (3.7)) and express it in block-based notation,

w(k + 1) = w(k) + (µ/L) Σ_{i=0}^{L−1} x(kL + i) e(kL + i).        (4.16)

Note that the weights are updated once per block and the gradient vector is computed as a sum of the products x(kL + i) e(kL + i) over all possible values of i. Compared to the weight-adjustment equation of the NLMS algorithm (3.7), we see that the normalization term has been dropped and that we now have a more accurate (averaged) estimate of the gradient vector. We may express the weighted sum in (4.16) in terms of the data matrix A,

φ(k) = A^T(k) e(k),        (4.17)

where e(k) = [e(kL), e(kL + 1), . . . , e(kL + L − 1)]^T is the error signal vector of the kth block. Equivalently we may write

φj(k) = Σ_{i=0}^{L−1} x(kL + i − j) e(kL + i),   j = 0, 1, . . . , M − 1.        (4.18)

The computation of (4.18) involves the operation of linear correlation. The FBLMS algorithm discussed here solves the question of how to implement the NLMS algorithm in a computationally efficient manner. It recognizes that in the frequency domain the operations of linear convolution (4.15) and linear correlation (4.18) may be computed more efficiently than in the time domain. We follow the overlap-save method [45] for the fast calculation of these two operations.


4.2.1 Calculation of the linear convolution term y(k) = A(k)w^T(k)

According to the overlap-save method, it is necessary to build frames of length L > N, because the first N samples produced by the fast convolution correspond to circular convolution and must be discarded. Haykin [24, p. 349] suggests using L = 2N, whereas Jones [11] reports 4N ≤ L ≤ 8N as the best choices of L. We follow Haykin's suggestion and use L = 2N, i.e. the frames consist of two consecutive blocks of input data. Thus, let

X(k) = FFT[ x(kN − N), . . . , x(kN − 1), x(kN), . . . , x(kN + N − 1) ]^T        (4.19)

denote a 2N-element vector constructed from two consecutive blocks of input data (the first N samples belonging to the (k−1)th block and the last N samples to the kth block). The symbol k is used to denote the block number and is related to the original time index by n = kN + i, k = 1, 2, . . ., i = 1, . . . , N − 1. Accordingly, let the 2N-by-1 vector

W(k) = FFT[ w(k); 0 ]        (4.20)

denote the Fast Fourier Transform (FFT) coefficients of the tap-weight vector w(k) padded with N zeros. Hence, by applying the overlap-save method to the linear convolution term in (3.1) we get the N-by-1 vector

y(k) = last_N { IFFT[ X^T(k) W(k) ] },        (4.21)

where the last_N{.} operation discards the first N samples of its argument.

4.2.2 Calculation of the linear correlation term φ(k) = A^T(k)e(k)

From the theory of discrete-time signal processing [45] we know that linear correlation is basically a "reversed" version of linear convolution. Thus, let the 2N-by-1 vector

E(k) = FFT[ 0; e(k) ]        (4.22)

denote the FFT of the zero-padded error signal vector; note that the N zeros must precede the error vector e(k). An application of the overlap-save method to (4.18) yields the M-by-1 vector

φ(k) = first_N { IFFT[ X^H(k) E(k) ] },        (4.23)

where the first_N{.} operation discards the last N samples of its argument. Finally, we may use the Fourier versions of the gradient vector and the tap-weights and transform (4.16) into the frequency domain as

W(k + 1) = W(k) + (µ/L) FFT[ φ(k); 0 ].        (4.24)

This completes the derivation of the FBLMS algorithm, a schematic diagram of which is shown in Fig. 4.5.



Figure 4.5: Schematic flow graph of the FBLMS algorithm; “the correlation constraint” involves the operations of DFT, replacement of the last L samples by zeros and inverse DFT, in that order

4.2.3 Normalization of the tap-weight correction term

Even though the weight adjustment (4.24) is used in many applications, there is a modified version which leads to better convergence properties. The modification incorporates a normalization of the correction term; the idea is attributed to Sommen et al. [66] and to Lee and Un [36]. We already know that the convergence rate depends strongly on the power of the input signal (a problem addressed by the NLMS algorithm in section 3.6). The idea is therefore to assign each frequency bin its own adaptation constant (step-size) that depends on the estimate of the power at that bin. Specifically, we define

µi = α / Pi,   i = 0, 1, . . . , 2N − 1,        (4.25)

where α is a constant and Pi is an estimate of the signal power in the ith frequency bin. In nonstationary environments, however, the power fluctuates and varies with time. We may solve this problem by calculating an averaged power instead, using the simple recursion

Pi(k) = γ Pi(k−1) + (1 − γ) |Xi(k)|²,   i = 0, 1, . . . , 2N − 1,        (4.26)

where Xi(k) is the ith element of the frequency-domain vector X(k) and γ is a forgetting factor chosen in the range 0 < γ < 1. In the experiments presented in the following section we employ the FBLMS algorithm in the form described above.
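Putting (4.19)-(4.26) together, one block iteration of the normalized FBLMS can be sketched in a few lines of NumPy. This is our own minimal illustration under the L = 2N overlap-save convention; the function and variable names, the default constants and the small term eps guarding the division are assumptions, and the per-bin step (4.25)-(4.26) replaces the fixed µ/L of (4.24).

import numpy as np

def fblms_block(x_prev, x_block, d_block, W, P, alpha=0.3, gamma=0.95, eps=1e-2):
    """One block iteration of the normalized FBLMS (L = 2N overlap-save).
    W and P are 2N-point arrays: frequency-domain weights and per-bin power."""
    N = len(x_block)
    X = np.fft.fft(np.concatenate([x_prev, x_block]))     # (4.19): two consecutive blocks
    y = np.fft.ifft(X * W).real[N:]                       # (4.21): keep the last N samples
    e = d_block - y
    E = np.fft.fft(np.concatenate([np.zeros(N), e]))      # (4.22): N zeros precede e(k)
    phi = np.fft.ifft(np.conj(X) * E).real[:N]            # (4.23): keep the first N samples
    P = gamma * P + (1.0 - gamma) * np.abs(X) ** 2        # (4.26): per-bin power estimate
    G = np.fft.fft(np.concatenate([phi, np.zeros(N)]))    # zero-padded gradient (constraint)
    W = W + (alpha / (P + eps)) * G                       # (4.24) with per-bin step (4.25)
    return W, P, y, e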

4.2.4 Experimental evaluation of the convergence performance

The FBLMS method is the first block-based method in this thesis, which means that a whole block of input samples is needed before the corresponding output samples can be computed. Block-based processing is well suited for implementation on a DSP, since DSPs provide powerful hardware and software capabilities for block-based operations.


The performance of this algorithm is evaluated under the EXPEV1 scenario. In the first case, the FBLMS method is used to identify an unknown system with a random exponentially decaying impulse response (see 3.5.2). The input to the system is the colored signal (K = 20, fc = 0.8) and the additive noise has a level such that the SNR is 20 dB. The impulse response of the unknown system is subject to a sudden change of parameters (sign inversion) during the experiment, occurring at n = 1000. The results are shown in Fig. 4.6. We tuned the parameters of the FBLMS algorithm so as to achieve the best convergence/misadjustment properties. The length of the input frames was set to M = 2N, where N is the filter order, which is 20 in this case. From all the plots in Fig. 4.6 we see that the convergence rate of the FBLMS method is roughly similar to that of the NLMS method. We can notice the staircase character of the curve, which is due to the block-based processing nature of the FBLMS: the values of the tap-weight vector are kept fixed during a single block. In Fig. 4.6b the final values of the tap-weight vector in both parts of the experiment are compared with the true values. It can be seen that the model fits the original very well. Fig. 4.6c shows what happens with the performance when the SNR is decreased to 5 dB: the FBLMS performs worse than the NLMS in the second part of the experiment. Finally, Fig. 4.6d shows how colored data may decrease the convergence speed and increase the misadjustment of the FBLMS method; in this case we decreased the cut-off frequency of the low-pass FIR filter used for generating the colored noise to fc = 0.3.

4.2.5 Computational complexity analysis

The most important virtue of the FBLMS algorithm is its computational complexity, which is significantly lower than that of the NLMS. The FBLMS uses fast techniques to calculate the linear convolution and the linear correlation: these two operations are performed in the frequency domain using the FFT, which saves a lot of processing time compared to performing the convolution in the time domain [64]. As seen from Fig. 4.5, there are exactly five frequency transforms (FFTs or IFFTs) that need to be performed (including the correlation constraint). Each N-point FFT requires approximately (N/2) log₂ N multiplications and N log₂ N additions [11]. Since we work with sequences of length L = 2N, the total number of multiplications due to the frequency transforms is 5N log₂ 2N and the total number of additions is 10N log₂ 2N. In addition, the computation of the frequency-domain output vector y(k) requires 2N multiplications and 2N additions, and the error vector e(k) takes another N additions. The gradient vector involves the same amount of operations as the output vector, i.e. 2N multiplications and 2N additions. Finally, the weight-adjustment process consists of 2N multiplications by the step-size, which are treated as "other" operations, and 2N additions. Counting up the number of operations involved in a single iteration of the FBLMS algorithm, we find that the total number of real multiplications is 10N + 5N log₂ N, the total number of additions is 16N + 10N log₂ N, and the total number of "other" operations is 2N. However, a single run of the FBLMS produces N samples of output data. To compare the FBLMS with the NLMS, we therefore have to recalculate these numbers in terms of K to obtain the number of operations for K output samples.


[Plots of MSD vs. iteration comparing the NLMS (µ = 0.5, δ = 0.0001), the RLS (λ = 0.999, δ = 0.2) and the FBLMS (δ = 0.1, γ = 0.95, α = 0.3); panel (b) shows the tap-weight estimates at n = 1000 and n = 2000 against the true values; panels (c) and (d) repeat the comparison for SNR = 5 dB (with NLMS µ = 0.05, RLS δ = 0.9, FBLMS α = 0.05) and for more colored input data, respectively.]

Figure 4.6: Experimental convergence analysis of the FBLMS algorithm under EXPEV1 scenario; colored input (K = 20, fc = 0.8), random exp. decaying impulse response, sign inversion of the parameters at n = 1000; (a) Mean-square deviation, (b) comparison of the final values of tap-weight estimates with true values in both parts of experiment, (c) MSD for decreased SNR (5 dB), (d) MSD for more colored input data (fc = 0.3)

The results are shown in Table 4.2. A graphical comparison of the computational complexity of the above-discussed algorithms is shown in Fig. 4.7. In this figure we have purposely used K = N for all plots to illustrate that the LMS algorithm may be more efficient than the FBLMS, in terms of multiplications, for block lengths shorter than approximately 15. However, as the length increases above this limit, the LMS quickly falls behind the FBLMS, and for N = 1024 the FBLMS is almost 50 times faster than the LMS, again in terms of multiplications. The NLMS is less efficient than the LMS due to its normalization operation.
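The curves in Fig. 4.7 can be reproduced from the per-block operation counts. The short script below is our own illustration; it uses the multiplication formulas as tabulated in Tables 3.1, 3.2 and 4.2, with K = N as in the figure.

import numpy as np

def mult_counts(N, K):
    """Multiplication counts per K output samples (Tables 3.1, 3.2 and 4.2)."""
    nlms = 3 * N * K                       # NLMS (Table 3.1)
    rls = K * (3 * N**2 + 4 * N)           # RLS (Table 3.2)
    fblms = 10 * K + 5 * K * np.log2(N)    # FBLMS (Table 4.2)
    return nlms, rls, fblms

for N in (16, 32, 64, 1024):
    print(N, mult_counts(N, N))            # K = N, as used in Fig. 4.7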


FBLMS operation   #MULT               #DIV   #ADD                 #OTH
for N             10K + 5K log₂ N     −−     16K + 10K log₂ N     2K
for N = 64        10240               −−     19456                512
for N = 128       11520               −−     22016                512
for N = 1024      15360               −−     29696                512

Table 4.2: Computational complexity analysis of the FBLMS algorithm for the frame length K = 256 and various numbers of filter coefficients


Figure 4.7: Comparison of the number of multiplications and additions of the LMS, NLMS and FBLMS algorithms depending on the block length N (for K = N in table 4.2)

4.2.6 Conclusion

The FBLMS algorithm performs the adaptation in the frequency domain in order to take advantage of calculating the convolution and correlation operations with a low complexity. It also assigns an individual value of the step-size parameter to each frequency bin to speed up the convergence. From the results of the experimental evaluation we see that the convergence performance of the FBLMS algorithm is roughly equal to that of the NLMS algorithm. The exception is the case with a low SNR (5 dB), where the convergence is worse in the second part of the experiment (after the sudden change of the system parameters). Another deterioration of the performance arises when the input data is correlated (low-pass filtered white noise with fc = 0.3); in this case the steady-state misadjustment is almost 1.5 times higher than that of the NLMS and RLS methods. The complexity analysis revealed an indisputable virtue of the FBLMS algorithm, which is its low complexity. The method exploits the properties of the FFT, whose cost essentially depends on the logarithm of the number of samples N. Compared to the LMS and the NLMS algorithms we see that, e.g., for N = 32 the method needs approximately


4 times fewer multiplications than the NLMS and 2 times fewer multiplications than the LMS. For higher values of N the discrepancy is even bigger. Finally, the FBLMS is a perfect candidate for implementation on a DSP (block-based processing, FFTs and IFFTs, low complexity).

4.3 Simultaneous perturbation stochastic approximation (SPSA) algorithm

In this section we propose a new algorithm for adaptive speech enhancement. We call it SPSA, after the method used for the gradient calculation. The SPSA method itself was developed by J. C. Spall in 1988 [67, 68] and was primarily intended for use in nonlinear control applications. In the conventional NLMS approach developed in section 3.6, the gradient vector of the objective function was approximated by the instantaneous estimate ∂Jw/∂w(n) ≈ x(n)e(n). Note that such an estimate is noisy, since the measurements of the input data are samples of a random process. The algorithm considered here also uses a form of stochastic approximation (SA) which, in contrast with the NLMS, uses more than one instantaneous estimate of the gradient vector. In the literature the approach is sometimes termed "gradient-free", to emphasize that it does not rely on direct measurements or estimates of the gradient vector (it only uses measurements of the objective function). Regardless of what measurements are needed, however, the goal is to find the most accurate estimate of the gradient vector; the term "gradient-free" is therefore misleading and will not be used in this thesis.

4.3.1 Robbins-Monro stochastic approximation (R-M SA)

In accordance with the theory of stochastic approximation [32], the classical method is the Robbins-Monro Stochastic Approximation (R-M SA) [57], developed in 1951. It is a gradient-based algorithm and it represents a generalization of the well-known methods of "steepest descent" [24, pp. 203-230] and Newton-Raphson. It relies on direct measurements of the gradient function with respect to the parameters being optimized. The algorithm has the following form:

w(n + 1) = w(n) − a(n) Y(w(n)),        (4.27)

where Y(w) represents the measurement of the gradient function g(w) and a(n) is a nonnegative gain sequence which must satisfy certain conditions. The measurements are allowed to include noise with zero mean and fixed variance. The convergence conditions of the R-M SA algorithm are usually given in terms of the gain sequence a(n). The most popular choice for a(n) is the scaled harmonic sequence a(n) = a/(n + 1), a > 0, which assures sufficient noise suppression near the optimum point wo (a(n) → 0) and asymptotic convergence, Σ_{n=0}^{∞} a(n) = ∞. The optimal choice of a is obtained by experimental evaluation. When studying the asymptotic distribution of the iterates in (4.27), Ruppert found [59] that, under appropriate conditions,

n^(α/2) (w(n) − wo) → N(0, Σ)   in distribution,        (4.28)

as n → ∞, where convergence in distribution is understood in the usual sense (see e.g. [34]) and Σ is some covariance matrix. The constant α introduced in (4.28) governs the decay rate of the SA gain sequence

a(n) = a/(n + 1)^α,   a > 0, α < 1.        (4.29)

The choice α < 1 is not optimal in the sense of the asymptotic theory, but most researchers and analysts have found it to yield superior performance over α = 1.

4.3.2 Kiefer-Wolfowitz finite-difference SA algorithm

There has been a growing interest in stochastic optimization theory in algorithms that do not depend on direct measurements of the gradient function. In section 3.6 we introduced a popular method (the NLMS), in which the gradient of the objective function was approximated by ∂Jw/∂w(n) ≈ x(n)e(n). Intuitively, gradient-based methods will converge faster than those using objective function measurements and some approximation. This results from the fact that usually neither the exact gradient nor its measurements are available and every estimate of its value is a random variable with certain statistics. However, using measurements of the objective function, one can estimate the gradient in such a way that the convergence w → wo is asymptotically equivalent to that of the R-M SA. The first SA method using only objective function measurements is the Kiefer-Wolfowitz algorithm [30], which can be expressed as

w(n + 1) = w(n) − a(n) g(w(n)),        (4.30)

where

gi(w(n)) = [ J(w(n) + c(n)ei) − J(w(n) − c(n)ei) ] / [ 2c(n) ],   i = 1, 2, . . . , N.        (4.31)

In (4.30), g(w(n)) = [g1(w(n)), g2(w(n)), . . . , gN(w(n))] is an estimate of the gradient vector ∂Jw/∂w at the iterate w(n), based on the above-mentioned measurements of the objective function. Furthermore, ei denotes a unit vector with a "1" in the ith place and zeros elsewhere, and c(n) denotes a small positive number which usually gets smaller as n gets larger. We see that the gradient estimate is based on measurements collected at finite differences from the actual parameter estimate w(n). Hence the described algorithm is sometimes called the Finite Difference Stochastic Approximation (FDSA), but more often the Kiefer-Wolfowitz FDSA.

4.3.3 Simultaneous perturbation

The simultaneous perturbation differs from the finite differences in that all elements of w(n) are perturbed together to obtain two measurements of J(.), and the differences are random. Each component of g(w(n)) is formed from a ratio involving the individual components of the perturbation vector and the difference of the two corresponding measurements. Hence, we have

gi(w(n)) = [ J(w(n) + c(n)∆(n)) − J(w(n) − c(n)∆(n)) ] / [ 2c(n)∆i(n) ],   i = 1, 2, . . . , N,        (4.32)


Figure 4.8: Influence of various parameters on the sequences a(n) and c(n); a) influence of the parameter α on the sequence a(n) (a = 0.16, A = 100); b) influence of the parameter γ on the sequence c(n) (c = 0.3); c) influence of the gain a on the sequence a(n) (α = 0.602, A = 100); d) influence of the stabilization constant A on the sequence a(n) (a = 0.16, α = 0.602); e) influence of the gain c on the sequence c(n) (γ = 0.101)

where ∆(n) = [∆1(n), ∆2(n), . . . , ∆N(n)] is a random vector that has to satisfy certain distribution conditions. Thus, the SPSA method uses only two measurements of the objective function, taken at random distances from the current estimate of the tap-weight vector w(n). Under reasonably general conditions, the SPSA and the FDSA achieve the same level of statistical accuracy for a given number of iterations, even though the SPSA uses N times fewer function evaluations than the FDSA [69]. Note that this result must be interpreted in an asymptotic sense, meaning that it may not hold for a finite number of iterations; we will try to verify these statements by an experimental evaluation. Before that, however, we must supply our algorithm with a precise definition of the sequences a(n), c(n) and ∆(n). The goal is to minimize the objective function Jw over all possible values of w, which is a vector with N components. The SPSA algorithm works iteratively from an initial guess of wo. The iterative process depends on the above-mentioned simultaneous perturbation approximation of the gradient g(w). The main conditions imposed on the parameters and sequences introduced above are discussed below.


Generation of the simultaneous perturbation vector ∆(n)

According to [68], the N-dimensional perturbation vector must be generated by Monte Carlo simulations. Each of the N components of ∆(n) should be independently generated from a zero-mean probability distribution satisfying the conditions in [68]. A much simpler choice for each component of ∆(n) is a Bernoulli ±1 distribution with probability 1/2 for each outcome. Spall [68] also notes that uniform and normal random variables are not allowed, since they violate some of the regularity conditions discussed in the introduction (they have infinite inverse moments).

The gain sequences a(n) and c(n)

These two sequences are critical to the performance of the SPSA algorithm. Spall [70] and many other researchers use the following gain sequences:

a(n) = a / (A + n + 1)^α,
c(n) = c / (n + 1)^γ,        (4.33)

where the value of α has the same meaning as in the R-M SA algorithm (see (4.29)). The constant γ determines the rate at which the differences between the actual and the perturbed measurements of the objective function decay to 0. In practise the values used for α and γ are

α = 0.602,   γ = 0.101,        (4.34)

respectively (the asymptotically optimal values of 1.0 and 1/6 may also be used) [70].

The difference-influencing constant c

The choice of c determines the distance of the current parameter estimate from the perturbed one in the N-dimensional parameter space. It is best to set c to a level approximately equal to the standard deviation of the measurement noise in Jw [70]. Note that the exact value of the objective function is assumed to be unknown and its estimate contains some estimation error; the value of c² should not exceed the variance of this error. The standard deviation can be estimated by collecting several measurements of Jw at the initial guess w(0); a precise estimate is not required.

The step-size constant a and the stability constant A

The constant A is typically not discussed in the SA literature, but Spall has shown [70] that it may improve the stability of the SPSA algorithm in the early iterations. It allows the use of a larger a in the numerator of (4.33) and therefore more aggressive steps in the early iterations. After a certain number of iterations, A becomes insignificant compared to n and its impact is minimized. A guideline suggested by Brennan and Rogers and also by Spall [8, 70] is to set A to approximately 10% (or less) of the maximum number of expected/allowed iterations. Subsequently, the constant a is chosen such that the movement of the tap-weight vector w(n) is sufficiently aggressive in the early iterations.
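A minimal NumPy sketch of the resulting SPSA iteration in the system-identification setting is given below. It is our own illustration: the function name and defaults are assumptions, the instantaneous squared error is used as the noisy measurement of J(w), and the Bernoulli ±1 perturbations and the gain sequences (4.33) follow the guidelines above.

import numpy as np

def spsa_identify(x, d, N, a=0.16, c=0.3, A=100, alpha=0.602, gamma=0.101, rng=None):
    """SPSA adaptation of an N-tap FIR model of an unknown system.
    x, d: NumPy arrays with the input and the desired (noisy system output) signals."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(N)
    for n in range(N, len(x)):
        x_vec = x[n - N + 1:n + 1][::-1]                   # current input vector
        a_n = a / (A + n + 1) ** alpha                     # gain sequences (4.33)
        c_n = c / (n + 1) ** gamma
        delta = rng.integers(0, 2, size=N) * 2 - 1         # Bernoulli +/-1 perturbation
        j_plus = (d[n] - (w + c_n * delta) @ x_vec) ** 2   # two noisy measurements of J(w)
        j_minus = (d[n] - (w - c_n * delta) @ x_vec) ** 2
        g = (j_plus - j_minus) / (2.0 * c_n * delta)       # SP gradient estimate (4.32)
        w = w - a_n * g                                    # R-M type update (4.27)
    return w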


4.3.4 The Bernoulli movement

The evolution of the sequences a(n) and c(n) and the impact of the constants α, γ, a, c and A are shown together in Fig. 4.8. Note that each plot shows several values of a specific parameter, each corresponding to a single curve; the optimal values are denoted as opt. and the minimal values as min.. As we see from the plots, both sequences decay to zero as n approaches infinity. In the case of a(n) this ensures that in the steady state the gradient noise becomes negligible compared to the excess error. In the case of c(n), the differences tend to decrease in order to obtain more accurate gradient estimates near the optimum. For the Bernoulli distribution of the ±1 vectors we have verified the behavior in the 2-dimensional case. First we simulated the movement of a 2-dimensional vector w = [w0, w1]^T in the R² space according to Bernoulli updates. For the movement to be visible, every adjustment is modified by a random perturbation, which is Gaussian with zero mean and a variance of 0.2. Thus, we have

w(n + 1) = w(n) + ∆B(n) + 0.2 t(n),        (4.35)

where ∆B = [∆0, ∆1]^T with each component ∆0, ∆1 ∈ {+1, −1}, and t(n) = [t0, t1]^T is a random vector with normal distribution N(0, 1). The Bernoulli movement is shown in Fig. 4.9a. The second plot, i.e. Fig. 4.9b, is a histogram of a random Bernoulli process: the process consists of performing Bernoulli trials to generate 100-dimensional vectors, and the plot shows the relative number of successes, i.e. of +1 outcomes, in 1000 attempts. The last plot shows the generation of 2-dimensional Bernoulli vectors with the outcomes [+1, +1], [+1, −1], [−1, +1] and [−1, −1] by pseudo-random generators with normal distribution.

4.3.5 Experimental evaluation of the convergence performance

We evaluate the performance of the SPSA algorithm on the application of system identification, i.e. under the EXPEV1 scenario. As the input we use the colored noise signal obtained by filtering white noise through a low-pass FIR filter - see section 3.5.1. The eigenvalue spread of the colored noise is approximately 6.0. First we illustrate the movement of the SPSA tap-weight vector and compare it with that of the NLMS algorithm for the case of N = 2, i.e. for 2 tap-weights. The contour plot of Fig. 4.10 shows the effort of the SPSA and the NLMS algorithms to approach the optimal point of the objective function. It is clear that the updates of the SPSA algorithm take place in exactly 2N directions in every step. In this example we identify an unknown system which is a second-order low-pass filter with the coefficients wo = [0.6125, −0.5124]. The output of the unknown system is corrupted by additive noise from the distribution N(0, 0.03). The parameters were set to achieve the highest possible rate of convergence while keeping the variance of the steady-state excess error under 0.05. As seen from the contour plot in Fig. 4.10a, the SPSA algorithm starts updating the parameter vector in big steps that get smaller as it approaches the minimum point of the surface. Since the parameters are fine-tuned for the conducted experiment, the convergence rate of the SPSA is better than that of the NLMS. The highest influence on the convergence performance is due to the variable step-size sequence a(n). We also conducted several experiments with a fixed step-size aF ∈ [0.3, 0.6], and the averaged squared error was still better than that of the NLMS method.



Figure 4.9: Properties of the Bernoulli random process for a 2-dimensional case; a) Bernoulli movement in the parameter space (starting point [0, 0], transition of each wi in approx. [+1, −1] steps, subject to white noise perturbation with the variance of 0.2); b) histogram of [+1] successes in 100 Bernoulli trials; c) probability distribution of the Bernoulli differences [+1,+1], [+1,−1], [−1,+1] and [−1,−1] generated by a pseudo-random generator with normal distribution

Computational complexity analysis

The analysis of the operations involved in the SPSA algorithm depends significantly on the application of interest.


SPSA operation   #MULT            #DIV   #ADD          #OTH
for N            3KN² + 9KN       0      3KN + 6K      KN
for N = 64       3293184          0      50688         16384
for N = 128      12877824         0      99840         32768
for N = 1024     807665664        0      787968        262144

Table 4.3: Analysis of the number of operations in the SPSA algorithm for the frame length K = 256 and various numbers of filter coefficients

on the application of interest. Here we consider the problem of system identification as introduced before. First, both sequences a(n) and c(n) can be generated in advance and stored in memory. The same applies to the Bernoulli perturbation vector c(n)∆(n). Again, we would like to count the operations for a block of K output samples. It is interesting to note that in the SPSA algorithm there is no need to calculate either the output of the filter y(n) or the error signal e(n). However, for the computational analysis presented here we assume that these two signals are needed for further processing and are calculated by the SPSA algorithm. The computation of y(n) takes KN multiplications and the computation of e(n) takes K subtractions. The two perturbed parameter vectors, w+ and w−, together require 2KN additions. The most intensive part involves the estimation of the objective function. According to Haykin [24, p. 107] the MSE objective function is given by

J(w) = σd² − wᵀp − pᵀw + wᵀRw,    (4.36)

in which R = E[x(n)xᵀ(n)] is the input correlation matrix and p = E[x(n)d(n)] is the cross-correlation vector. In the SPSA algorithm both the correlation matrix and the cross-correlation vector are estimated using the instantaneous input vector. The correlation term x(n)xᵀ(n) requires KN² multiplications, whereas the cross-correlation term x(n)d(n) requires only KN multiplications. Each estimate of the objective function (4.36) then requires K(N² + 3N) multiplications and 3K additions. With the two estimated objective-function values, the gradient vector (4.32) takes K subtractions and KN divisions by predefined values, which may be treated as “other” operations. Finally, the weight-adjustment equation involves KN multiplications and KN subtractions, which completes the computational complexity analysis of the SPSA algorithm. Again, the analysis performed so far gives only a rough measure of the algorithm’s demands. Tab. 4.3 summarizes the results. A glance at the number of multiplications per iteration cycle reveals that the SPSA is actually an O(N²) algorithm, i.e. the number of multiplications is a quadratic function of the number of filter coefficients N. The NLMS, on the other hand, is an O(N) algorithm, as its cost depends only linearly on N.
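To make the per-block operation count above more tangible, the following Python sketch implements a single SPSA weight update built around the instantaneous MSE estimate (4.36). It is our own illustration, not code from the thesis: the function and variable names are ours, the gain sequences a(n) and c(n) follow the standard Spall form, and the default constants are simply those quoted for Fig. 4.10.

import numpy as np

def spsa_update(w, x, d, n, a=0.66, A=200, alpha=0.602, c=1e-3, gamma=0.101, rng=None):
    # One SPSA tap-weight update for the system-identification setting.
    # w: current tap-weight vector, x: current input vector, d: desired sample,
    # n: iteration index. Parameter names and defaults are assumptions of this sketch.
    rng = rng if rng is not None else np.random.default_rng()
    a_n = a / (A + n + 1) ** alpha                   # decaying step-size sequence a(n)
    c_n = c / (n + 1) ** gamma                       # decaying perturbation-size sequence c(n)
    delta = rng.choice([-1.0, 1.0], size=w.shape)    # Bernoulli +/-1 perturbation vector

    def J(wv):
        # Instantaneous estimate of the MSE objective (4.36): with R ~ x x^T and
        # p ~ x d it collapses to the squared instantaneous error (d - w^T x)^2.
        return (d - wv @ x) ** 2

    g = (J(w + c_n * delta) - J(w - c_n * delta)) / (2.0 * c_n * delta)  # SPSA gradient estimate
    return w - a_n * g                               # move against the estimated gradient

Two objective-function evaluations per update are visible here; in the thesis these evaluations are carried out with the instantaneous correlation terms, which is where the O(N²) multiplication count in Tab. 4.3 comes from.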


Figure 4.10: Convergence analysis of the SPSA algorithm and comparison with the NLMS; SPSA parameters: N = 2, α = 0.602, γ = 0.101, c = 10⁻³, A = 200, a = 0.66; NLMS parameters: N = 2, µN = 0.01; a) loci of w0(n) versus w1(n) for the SPSA algorithm; b) squared-error convergence of the NLMS algorithm; c) squared-error convergence of the SPSA algorithm


Chapter 5

Optimal step-size strategy in adaptive filtering

In chapters 3 and 4 we presented several adaptive algorithms that may be employed to solve the speech enhancement problem. In the experiments conducted we analyzed the performance of these methods in terms of convergence rate, steady-state error level and computational complexity. The partial conclusions we arrived at may be summarized as follows. The LMS method and its derivatives all exhibit a very low computational complexity, since all of them need only O(N) operations per iteration cycle. On the other hand, their convergence rate and also their misadjustment are sometimes poor compared with the RLS method, which is more complex. In section 4.3 we proposed a novel stochastic gradient approach based on the SPSA approximation. This method uses a variable step-size in every iteration. Its drawback is that the number of iterations must be known in advance, which is impractical in real-time applications. Nevertheless, the idea of applying a variable step motivated our research into a new class of adaptive algorithms, which we call optimal step-size (OSS). In this chapter we describe the optimal step-size strategy and develop an OSS algorithm based on the LMS method. We compare its performance with the NLMS and RLS methods, which are considered as a reference. Finally, we show some experimental results in typical speech enhancement applications.

5.1 Affine projection algorithm (APA)

In terms of convergence rate and computational complexity, the affine projection algorithm represents an “intermediate stage” between the stochastic gradient algorithm (NLMS) and the least-squares method (RLS). Its inventors, Ozeki and Umeda [46], proposed it in 1984 as a generalization of the well-known NLMS algorithm with a single goal: to improve the convergence rate. The major problem solved by the APA is the slow convergence for colored input data, a drawback of all stochastic gradient methods. This is particularly useful in speech enhancement applications, since speech signals are known to be highly correlated. The APA algorithm serves as a motivation for the research into our proposed method.


Therefore, it is desirable to become familiar with its structure and some of its key properties.

5.1.1 The zero-forcing principle

The conventional APA algorithm updates the filter coefficients (weights) on the basis of multiple, unit-delayed input signal vectors. This is in contrast with the NLMS, where only one input vector is used in each iteration. We start our discussion with the conventional NLMS algorithm described in section 3.6. Its adaptation proceeds according to (3.7), which is reproduced here for convenience of presentation in a slightly modified form

w(n + 1) = w(n) + (µ / ‖x(n)‖²) x(n) e(n).    (5.1)

If the step-size parameter µ is equal to one, (5.1) updates the tap weights such that the a posteriori estimation error is zero. That is,

ẽ(n) = d(n) − wᵀ(n + 1) x(n) = 0.    (5.2)

This behavior leads to the interpretation that the algorithm tries to force a zero in the a posteriori error. Cabuk [10] expands on this topic and establishes a new class of algorithms named, appropriately, the Zero-Forcing Algorithm (ZFA). The graphical interpretation of the zero-forcing principle is illustrated in Fig. 5.1. Here, for the case of two-dimensional tap weights, every new estimate of the weight vector is projected in the direction of the input vector, e.g. from w(0) to w(1) along x(1). Moreover, the destination of the projection lies on a line which is perpendicular to the input vector x(1) and still contains the optimal tap-weight vector w0. It is obvious that by proceeding according to such a principle we will arrive at the solution. The number of iterations will never be less than N, but it can be significantly larger than N when the input vectors are oriented more or less in the same direction. This happens when the input data are highly correlated, which is actually why the convergence rate of the NLMS is worse for speech-like signals.
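As a quick numerical illustration of the zero-forcing behaviour (our own example, not taken from the thesis), the Python fragment below performs a single NLMS update (5.1) with µ = 1 and verifies that the a posteriori error (5.2) vanishes:

import numpy as np

rng = np.random.default_rng(0)
N = 4
w_true = rng.standard_normal(N)      # hypothetical "unknown" system, for illustration only
w = np.zeros(N)                      # current filter estimate
x = rng.standard_normal(N)           # input vector x(n)
d = w_true @ x                       # desired response (noise-free for clarity)

e = d - w @ x                        # a priori error e(n)
w = w + (1.0 / (x @ x)) * x * e      # NLMS update (5.1) with mu = 1
print(d - w @ x)                     # a posteriori error (5.2): zero up to rounding

Whatever the previous estimate, a single step with µ = 1 forces the a posteriori error to zero for the current input vector, which is exactly the projection shown in Fig. 5.1.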

5.1.2 The basic APA algorithm

To ameliorate the slow convergence of the zero-forced NLMS algorithm, the APA expands the number of input vectors to R, where R is called the rank. The error signal, defined previously in (3.4), now takes on a different form

e(n) = d(n) − Xᵀ(n) w(n),    (5.3)

where

d(n) = [d(n) d(n − 1) . . . d(n − R + 1)]ᵀ
X(n) = [x(n) x(n − 1) . . . x(n − R + 1)]    (5.4)

are the desired-response vector and the input sample matrix, respectively. The dimension of the matrix X(n) is N × R. The rank R is usually much smaller than the number of filter



Figure 5.1: Geometrical interpretation of the zero-forcing principle in the NLMS algorithm. The estimate of the weight vector is projected onto a line which is perpendicular to the input vector and contains the optimal filter vector w0. Taken from [58]

coefficients N, especially in the case of long filters (acoustic echo cancelation). In the typical case we have N = 500–2000 and R = 10–50. The most important part of the APA algorithm, the weight-adjustment equation, is defined as follows

w(n + 1) = w(n) + µ X(n) g(n),    (5.5)

where g(n) is the prewhitened error signal

g(n) = [Xᵀ(n) X(n) + δI]⁻¹ e(n),    (5.6)

since a multiplication by an inverse correlation matrix acts as a whitening filter. To control stability, convergence and residual error, a step-size µ is introduced, where 0 < µ < 2 [62]. To improve robustness and immunity against noise, a diagonal matrix δI is introduced to regularize the matrix inverse in the APA algorithm. The idea of moving the vector of coefficients according to the zero-forcing principle finally led us to the research of optimal step-size algorithms, which is the topic of the next section.
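For reference, one iteration of the basic APA (5.3)–(5.6) can be sketched in Python as follows; the function signature, the parameter defaults and the way the input matrix is assumed to be assembled are our own choices, not prescribed by the thesis:

import numpy as np

def apa_update(w, X, d_vec, mu=0.5, delta=1e-4):
    # One affine-projection update.
    # w     : current tap-weight vector of length N
    # X     : N x R matrix holding the R most recent input vectors, as in (5.4)
    # d_vec : length-R vector of the corresponding desired samples
    # mu    : step-size, 0 < mu < 2
    # delta : regularization constant for the R x R inverse in (5.6)
    e = d_vec - X.T @ w                                            # error vector (5.3)
    g = np.linalg.solve(X.T @ X + delta * np.eye(X.shape[1]), e)   # prewhitened error (5.6)
    return w + mu * X @ g                                          # weight adjustment (5.5)

Since R is much smaller than N (tens of vectors against hundreds or thousands of taps), the R × R system solved here remains cheap compared with the filter length.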


5.2 Optimal step-size strategy

5.2.1 Variable step algorithms

Our research into optimal step-size methods stems from a recent work of Shin and Sayed [63]. They introduced several techniques that use a variable step-size, specifically for the control of the weight-adjustment process of the NLMS and APA algorithms. For the case of the NLMS algorithm, the weight update is given by

w(n + 1) = w(n) + µ(n) (x(n) / ‖x(n)‖²) e(n)    (5.7)
µ(n) = µmax ‖p(n)‖² / (‖p(n)‖² + C)    (5.8)
p(n) = α p(n − 1) + (1 − α) (x(n) / ‖x(n)‖²) e(n).

We see that the step-size varies from iteration to iteration. If ‖p(n)‖² becomes large, µ(n) approaches µmax. On the other hand, when ‖p(n)‖² is small, the step-size is small. Thus, depending on ‖p(n)‖², µ(n) varies between 0 and µmax. To guarantee stability, µmax is chosen to be less than 2. The variable p(n), which is crucial to the step-size calculation, is called the projected error and is given by

p(n) = Uᵀ(n) [U(n) Uᵀ(n)]⁻¹ e(n),    (5.9)

where U(n) is a K-by-N input matrix constructed from K subsequent input vectors and e(n) is the K-by-1 error vector. The Euclidean norm of the projected error is averaged by a simple exponential filter. The authors have shown that the performance of the proposed algorithm, which they termed Variable-Step Normalized Least Mean Squares (VS-NLMS), is better than that of the conventional NLMS. In another research paper, published by Kwong and Johnston [33], the step-size parameter µ(n) depends on the squared instantaneous error

µ(n + 1) = α µ(n) + γ e²(n).    (5.10)

The main drawback of this principle is that it is sensitive to additive noise. To improve the immunity against Gaussian noise, Aboulnasr and Mayyas [1] used the squared autocorrelation of errors at adjacent time instants

µ(n + 1) = α µ(n) + γ p²(n)
p(n) = β p(n − 1) + (1 − β) e(n) e(n − 1).    (5.11)

Probably the most advanced approach may be attributed to Pazaitis and Constantinides [51], who propose to use the fourth-order cumulant of the instantaneous error

µ(n) = µmax (1 − exp(−α C₄ᵉ(n)))    (5.12)
C₄ᵉ(n) = f(n) − 3 p²(n)
f(n) = β f(n − 1) + (1 − β) e⁴(n)    (5.13)
p(n) = β p(n − 1) + (1 − β) e²(n).    (5.14)


It is clearly seen that the step-size is driven by an exponential function, the slope of which is determined by the kurtosis of the error. The performance of all the approaches described above, as well as of the approach proposed by us, is determined by how accurately they can estimate how far the filter is from its optimal performance. Another issue of concern is, of course, the robustness and the immunity against additive noise.
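A minimal Python sketch of the variable step-size idea, written in the spirit of the VS-NLMS recursions (5.7)–(5.8), is given below; the constants and the regularization term are illustrative assumptions of ours rather than values recommended by the cited authors:

import numpy as np

def vs_nlms_step(w, p, x, d, mu_max=1.0, alpha=0.95, C=1e-3, eps=1e-8):
    # One VS-NLMS update; p is the smoothed error-projection vector carried
    # between iterations, initialised e.g. to a zero vector.
    e = d - w @ x                                    # a priori error e(n)
    norm2 = x @ x + eps                              # regularized input energy ||x(n)||^2
    p = alpha * p + (1.0 - alpha) * (x / norm2) * e  # smoothed projected error p(n)
    mu = mu_max * (p @ p) / (p @ p + C)              # variable step-size mu(n), cf. (5.8)
    w = w + mu * (x / norm2) * e                     # NLMS-type weight update, cf. (5.7)
    return w, p

When the filter is far from the optimum, the error and hence ‖p(n)‖² are large and µ(n) approaches µmax; near convergence the step-size shrinks towards zero, which is precisely the behaviour the OSS strategy also aims at.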

5.2.2 The mathematical description of the OSS principle

We start our description with the objective function of the conventional LMS algorithm (4.36), which is reproduced here for convenience

J(w) = σd² − wᵀp − pᵀw + wᵀRw.    (5.15)

In the stochastic gradient approach, the tap weights of the filter are updated along the noisy gradient vector, which is calculated as

∇J(w) = −2 x(n) e(n).    (5.16)

Thus, it uses the instantaneous values of the input correlation matrix and the cross-correlation vector (input signal vs. desired response). In every step the tap weights move in a direction determined by the noisy gradient vector. The length of the movement is controlled by a fixed step-size parameter µLMS. We propose to calculate the step-size in such a way that the tap weights are moved to the position where the objective function achieves its minimum value along the search direction. This is called the optimal step-size. The line along which the movement is carried out is calculated in advance using an averaged correlation matrix and an averaged cross-correlation vector. The situation is illustrated in Fig. 5.2 for the case of 2 tap weights. If the gradient vector were known precisely, we would arrive at the optimal solution in exactly N steps. This is not the practical case, however. As mentioned in the introduction, the optimal step-size is calculated in every iteration. To provide a general analysis of the Optimal Step-size Strategy (OSS), we assume that the weight adjustment is carried out according to the following equation

w(n + 1) = w(n) + ro(n) t(n),    (5.17)

where t(n) represents a vector along which the optimal step ro(n) is searched for. Unless otherwise specified, from now on we will omit the time index n from the quantities that depend on it. Then, in every iteration we perform the following minimization

ro = arg min_r J(w + r t).    (5.18)

By substituting (5.15) into (5.18) and differentiating with respect to r, we recognize that the partial derivative consists of the following three terms

∂J(w)/∂r = −∂(wᵀp)/∂r − ∂(pᵀw)/∂r + ∂(wᵀRw)/∂r.    (5.19)


Figure 5.2: Optimal step-size weight adjustment; every line of the movement is calculated using averaged values of R and p

The variance of the desired response σd² does not depend on r and thus its derivative is zero. The first term of (5.19) may be written as

∂(wᵀp)/∂r = ∂{[w + rt]ᵀ p}/∂r = tᵀp.    (5.20)

The second term is similar to the first one and its derivative is therefore

∂(pᵀw)/∂r = pᵀt.    (5.21)

The last term of (5.19) is calculated as

∂(wᵀRw)/∂r = ∂{[w + rt]ᵀ R [w + rt]}/∂r = tᵀRw + wᵀRt + 2r tᵀRt.    (5.22)

By substituting (5.20), (5.21) and (5.22) into (5.19) we get

∂J(w)/∂r = −tᵀp − pᵀt + tᵀRw + wᵀRt + 2r tᵀRt.    (5.23)

In order for a function to achieve its minimum point, its derivative must be equal to zero. Thus, by setting (5.23) to zero we may calculate the optimal step-size ro as

ro = (1/2) [tᵀ(p − Rw) + (pᵀ − wᵀR)t] / (tᵀRt)
   = tᵀ(p − Rw) / (tᵀRt).    (5.24)

From the last equation we recognize that the term (p − Rw) closely resembles the gradient vector ∇J(w), which is defined as

∇J(w) = −2p + 2Rw.    (5.25)

Using (5.25) we may rewrite (5.24) as

ro = −(1/2) tᵀ∇J(w) / (tᵀRt).    (5.26)

Note that the gradient vector is computed using an instantaneous value of the tap-weight vector w(n). Finally, by substituting (5.26) back into (5.17), we obtain the OSS tap-weight update equation

w(n + 1) = w(n) − (1/2) [tᵀ∇J(w) / (tᵀRt)] t.    (5.27)

The OSS strategy does not require any a priori knowledge of the characteristics of the input data. Moreover, it is not controlled by any parameters that would have to be set before adaptation. The convergence rate, however, is still subject to the character of the input signal, especially its eigenvalue spread. This will be illustrated by the results of the experimental analyses given further in the thesis.
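The closed form (5.24) is easy to check numerically. In the Python fragment below (our own sanity check, not thesis code) we build a small quadratic objective of the form (5.15), pick a random search direction t, compute ro from (5.24) and confirm that it minimizes J(w + r t) along that line:

import numpy as np

rng = np.random.default_rng(1)
N = 3
A = rng.standard_normal((N, N))
R = A @ A.T + N * np.eye(N)          # symmetric positive-definite "correlation" matrix
p = rng.standard_normal(N)           # "cross-correlation" vector
sigma_d2 = 1.0                       # variance of the desired response (any constant)

def J(w):                            # quadratic objective function of the form (5.15)
    return sigma_d2 - 2.0 * (w @ p) + w @ R @ w

w = rng.standard_normal(N)           # current tap-weight vector
t = rng.standard_normal(N)           # search direction t(n)

r_o = (t @ (p - R @ w)) / (t @ R @ t)    # optimal step-size from (5.24)

# sample J along the line w + r t; the minimum must sit at r = r_o (index 100)
rs = np.linspace(r_o - 1.0, r_o + 1.0, 201)
vals = [J(w + r * t) for r in rs]
print(np.argmin(vals))               # prints 100, i.e. the sample taken at r = r_o

Because the objective is quadratic in r with a positive coefficient tᵀRt, this one-dimensional minimum is unique, which is what makes the per-iteration line search of the OSS well defined.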

5.2.3 Exponential averaging of the correlation matrix and the cross-correlation vector

The OSS weight-adjustment process (5.27) requires knowledge of the correlation matrix R and the cross-correlation vector p. Since the instantaneous values of these quantities are stochastic with a high variance, we propose to apply exponential averaging to improve the estimates over time

R(n + 1) = λ R(n) + (1 − λ) x(n) xᵀ(n)
p(n + 1) = λ p(n) + (1 − λ) x(n) d(n),    (5.28)

where λ is a so-called forgetting factor. It determines the amount of memory the averaging filter has, i.e. how many past samples it effectively uses. It is chosen in the range 0 < λ < 1, and the closer it is to one, the more past samples it takes into account. The introduction of exponential averaging requires the input signal to be stationary and ergodic. We know that a speech signal is stationary only over short intervals not longer than a few milliseconds. This corresponds to setting the forgetting factor to 1 λ