Electrocardiogram Signal Processing using Hidden Markov Models

Ph.D. Thesis

Daniel Novák

30th of September, 2003

Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Technická 2, 166 27, Praha 6, Czech Republic

Abstract

The processing of electrocardiogram (ECG) signals using the Hidden Markov Model (HMM) methodology is presented. The proposed framework enables complex ECG signal processing, with all the necessary steps resolved by the HMM approach. First, a general description and a comprehensive survey of the current state of the art of HMMs are given. Second, ECG preprocessing methods such as noise removal, characteristic points detection and baseline wandering removal are proposed. Next, further methods for ECG modelling, classification and clustering are developed. The methods are applied both to artificial data and to the MIT/BIH Arrhythmia ECG database. The results are particularly promising, suggesting that this new approach can be useful for the modelling, denoising, detection of important patterns, classification and clustering of biological signals. Finally, the conclusions summarize the work realized so far and suggest several directions for further work. Perhaps the main value of the thesis, however, is its comprehensive presentation of Hidden Markov Model theory and its applications in the field of biological signal processing.

Keywords: Hidden Markov Model, Electrocardiogram, Denoising, Baseline Removal, Characteristic Points Detection, Classification, Clustering

Acknowledgements

I would like to express my deep gratitude to my supervisor Vladimír Eck, who contributed many valuable remarks and observations in the area of biomedical engineering and who also helped to correct and improve the thesis. I am deeply indebted to David Cuesta, Technical University of Valencia, for his contribution, his many constructive inputs, and his invaluable help in the field of signal processing. I thank Luděk Müller, Pilsen Technical University, for his careful proof-reading of the proposal and for his valuable recommendations and helpful suggestions. I would like to acknowledge Vladimír Mařík and Lenka Lhotská, Czech Technical University in Prague, for their help with the thesis revision. I am particularly grateful to Pau Micó, Technical University of Valencia, for discussions which in one way or another have influenced parts of the thesis. Thanks very much to Alča for her continuous support and patience. Last but by no means least, thanks to Ruben for his MTB activities, Carlos for his good humor and Luis for his computer skills.

Symbols and abbreviations

Important symbols

F                          number of features
M                          number of Gaussian mixtures
N                          number of hidden states
L                          number of observation sequences
K                          number of clusters (HMM models)
T                          length of observation sequence (number of feature vectors)
π                          initial state distribution
κi                         hyper-parameter of the Dirichlet cdf corresponding to πi
A = aij                    state transition probability distribution
ηij                        hyper-parameter of the Dirichlet cdf corresponding to aij
B = bj(O)                  emission probability density function in state j
djm                        mixture coefficient of the mth mixture in state j
µj                         mean vector in state j
ϱj, Uj                     hyper-parameters of the Normal cdf corresponding to µj
Cj                         covariance matrix in state j
ζj, Vj                     hyper-parameters of the Wishart cdf corresponding to Cj
σjm                        variance vector of the mth mixture component in state j
λ = {A, B, π}              HMM model
Λ = {λ1, λ2, . . . , λK}   set of K HMM models (partition)
Ot                         observation (feature vector) at time t
O = O1, O2, . . . , OT     observation sequence
O = {O1, . . . , OL}       data set
S = (S1, S2, . . . , SN)   individual state set
qt                         state at time t
Q = q1, q2, . . . , qT     state sequence (development of the Markov chain in time)
α = P(O|λ)                 sequence-to-likelihood measure: probability of the observation
                           sequence given the model (forward component of the
                           Forward-Backward procedure)
ωj                         weight of the jth mixture in a finite mixture model
L(Λ|O)                     likelihood of partition Λ
D                          DTW dissimilarity measure
G                          DTW dynamic programming matrix
P                          alignment path
P(k)                       penalizing function
PK(O)                      finite mixture model with K components
W                          hierarchical clustering matrix (dissimilarity measure)

Abbreviations

AI         Artificial Intelligence
AIC        Akaike's Information Criterion
ADG        Acyclic Directed Graph
BCA        Bayes Clustering Approach
BCI        Brain Computer Interface
BIC        Bayesian Information Criterion
BSM        Bayes Selection Method
Classical  "Classical" Hidden Markov Model learning
cdf        continuous density function
CHMM       Continuous Hidden Markov Model
DHMM       Discrete Hidden Markov Model
DP         Dynamic Programming
DTW        Dynamic Time Warping
ECG        Electrocardiogram signal
ED         Evidence Density
EEG        Electroencephalogram
EMG        Electromyogram
EM         Expectation-Maximization algorithm
FHV        Fuzzy Hyper Volume
FIR        Finite Impulse Response filter
FSA        Finite State Automaton
FT         Fourier Transform
GA         Genetic Algorithm
GMM        Gaussian Mixture Model
HMM        Hidden Markov Model
HMT        Hidden Markov Tree
iid        independent and identically distributed
KL         Kullback-Leibler divergence (measure, information, criterion)
LDS        Linear Dynamical System
MAP        Maximum A Posteriori learning
MDL        Minimum Description Length
MIT/BIH    Massachusetts Institute of Technology/Beth Israel Hospital
ML         Maximum-Likelihood approach
MLP        Multilayer Perceptron
MML        Minimal Message Length
MRF        Markov Random Field
NA         Not Available
NN         Neural Network
pdf        probability density function
PC         Partition Coefficient
PCA        Principal Component Analysis
RMS        Root Mean Squared error
SD         Standard Deviation
SNR        Signal to Noise Ratio
SRG        Stochastic Regular Grammar
VAR        VARiational Bayes learning
WT         Wavelet Transform

Contents

1 Introduction                                                1
  1.1 Goals of the thesis                                     1
  1.2 Overview of the thesis                                  2
  1.3 A note on software                                      2
  1.4 A note on bibliography                                  3
  1.5 Declaration of previous work                            3

2 Theory of Hidden Markov Models                              4
  2.1 Gaussian Mixture Models                                 4
  2.2 Markov chain                                            5
  2.3 Hidden Markov Model                                     6
  2.4 Learning techniques                                     8
    2.4.1 Baum-Welch (EM) algorithm                           8
    2.4.2 Gradient search (on-line method)                    10
    2.4.3 Segmental k-means algorithm (Viterbi learning)      10
    2.4.4 Bayes adaptation                                    11
    2.4.5 Genetic algorithm optimization                      16
    2.4.6 Simulated annealing                                 16
  2.5 Analogies                                               17
    2.5.1 Graphical probabilistic models                      17
    2.5.2 Stochastic grammars                                 18
    2.5.3 Linear dynamical systems                            19
    2.5.4 Boltzmann networks                                  19
    2.5.5 Markov random fields                                20
  2.6 HMM enhancement                                         20
    2.6.1 Factorial HMM                                       21
    2.6.2 Input-Output HMM                                    22
    2.6.3 Hidden Markov decision trees                        23
    2.6.4 Fuzzy HMM                                           23
    2.6.5 HMM and neural nets                                 24
    2.6.6 Auto-regressive HMMs                                25
    2.6.7 Buried Markov Models                                25
    2.6.8 Mixed-memory Markov models                          25
    2.6.9 Coupled HMMs                                        27
    2.6.10 Hierarchical HMMs                                  27
    2.6.11 Variable-duration (semi-Markov) HMMs               27
    2.6.12 Segment models                                     28
    2.6.13 Markovian objects with acyclic structure           28
    2.6.14 Another enhancement                                29
  2.7 Applications                                            30
  2.8 Which method?                                           31
    2.8.1 Evaluation criteria                                 32
    2.8.2 Initialization methods                              33
    2.8.3 Data description                                    34
    2.8.4 Results                                             34
    2.8.5 Conclusion                                          39

3 Signal Preprocessing                                        40
  3.1 Wavelet transform                                       40
    3.1.1 Wavelet theory                                      40
    3.1.2 Wavelet-domain Hidden Markov Models                 42
  3.2 ECG characteristic points detection                     44
  3.3 Baseline wandering removal                              46
  3.4 Noise Filtering                                         50
    3.4.1 Basic concepts                                      51
    3.4.2 Which threshold method?                             52
    3.4.3 Which wavelet?                                      54
    3.4.4 Wavelet-domain HMMs denoising                       55
    3.4.5 Detection of noise                                  57
    3.4.6 Experimental set-up                                 57
    3.4.7 Results                                             58

4 Modelling concept                                           60
  4.1 Generation of the ECG signal                            61
  4.2 HMM ergodic model                                       61
  4.3 HMM full model                                          63
    4.3.1 Trace segmentation                                  63
    4.3.2 Full model: results                                 65
  4.4 Conclusion                                              68

5 Classification concept                                      69
  5.1 Method                                                  69
  5.2 Results                                                 69
    5.2.1 Synthetic data set                                  70
    5.2.2 Electrocardiogram                                   72
  5.3 Conclusion                                              74

6 Clustering concept                                          75
  6.1 Clustering methodology                                  75
    6.1.1 Feature selection                                   76
    6.1.2 Proximity measure                                   76
    6.1.3 Criterion function                                  76
    6.1.4 Search control structure                            77
  6.2 Time series clustering                                  78
    6.2.1 Proximity based clustering                          79
    6.2.2 Feature based clustering                            79
    6.2.3 Model based clustering                              81
  6.3 Method                                                  81
    6.3.1 Number of clusters selection                        82
    6.3.2 Dissimilarity measure                               89
    6.3.3 Bayes clustering approach                           92
    6.3.4 Hierarchical approach                               94
  6.4 Results                                                 95
    6.4.1 HMM Specification                                   95
    6.4.2 Pre-clustering phase                                95
    6.4.3 Synthetic data                                      96
    6.4.4 Electrocardiogram                                   98
  6.5 Conclusions                                             101

7 Conclusions and Outlook                                     104
  7.1 Mapping achievements to goals                           105
  7.2 Contribution of the thesis                              106
  7.3 Future directions                                       107

Bibliography                                                  108

A Implementation                                              i

B Update formulas                                             iii
  B.1 Classical HMM                                           iii
  B.2 Maximum a posteriori HMM                                v
  B.3 Variational HMM                                         vi
  B.4 Numerical stability and errors                          vii
    B.4.1 Scaling                                             vii
    B.4.2 Viterbi algorithm                                   vii
    B.4.3 Covariance singularities                            viii

C Distributions                                               ix
  C.1 Dirichlet distribution                                  ix
  C.2 Wishart distribution                                    x

Index                                                         xi

List of Figures

2.1  Two-state Markov chain                                   6
2.2  An example of three state discrete HMM                   7
2.3  HMM in Bayesian framework                                13
2.4  An ADG for a first-order HMM                             17
2.5  Boltzmann chain                                          19
2.6  Markov Random Field                                      21
2.7  Difference between HMM and FHMM                          21
2.8  Input-Output HMM Architecture                            22
2.9  Hidden Markov decision tree                              23
2.10 Simple Hybrid HMM/NN architecture                        24
2.11 An auto-regressive model                                 25
2.12 Buried Markov Models                                     26
2.13 Mixed-memory Markov models                               26
2.14 Coupled HMM                                              27
2.15 Variable-duration HMM                                    28
2.16 Segment HMM                                              29
2.17 Likelihood development                                   36
2.18 Development of mean and variance cdf                     37
2.19 Development of two first Dirichlet cdf states            38

3.1  Example of continuous wavelet analysis                   41
3.2  Wavelet tiling of the scale-time plane                   42
3.3  Statistical dependencies of wavelet transform            43
3.4  Probabilistic model for an individual wavelet coefficient 43
3.5  ECG points detection                                     46
3.6  Detection of significant ECG points                      47
3.7  Baseline over-fitting                                    48
3.8  Suitable baseline fitting                                48
3.9  Detection of the approximation level                     49
3.10 Another example of level determination                   49
3.11 Frequency bands for the wavelet decomposition            50
3.12 Basic denoising concept                                  51
3.13 Soft and hard thresholding                               51
3.14 Scheme of adapted denoising                              55
3.15 Wavelet coefficient behaviour                            56
3.16 Bayesian wavelet-based denoising                         56
3.17 The detection of noise level                             58
3.18 Wavelet denoising results                                59

4.1  Representative cycle of ECG signal                       60
4.2  Ergodic HMM model of ECG cycle                           62
4.3  Viterbi inference of an artificial ECG                   62
4.4  An artificial ECG generated by HMM-12 states             63
4.5  Trace segmentation process                               64
4.6  Trace segmentation algorithm demonstration               64
4.7  HMM model of ECG cycle, N = 15 states                    65
4.8  Viterbi inference of the ECG generated by HMM-15 states  66
4.9  HMM model of ECG cycle-25 states                         66
4.10 Viterbi inference of the ECG generated by HMM-25 states  67
4.11 Likelihood throughout training                           67
4.12 An artificial ECG generated by HMM, 25 states            68

5.1  Scheme of ECG classification                             70
5.2  ECG classes used for the classification                  73

6.1  Example of polygonal approximation                       80
6.2  Ambulatory ECG cluster selection results                 83
6.3  Holter ECG cluster selection results                     86
6.4  Synthetic data set for cluster selection                 87
6.5  DTW example                                              90
6.6  DTW Matrix characterization                              91
6.7  Symmetrized log-likelihood distance matrix               96
6.8  S measure dendrogram                                     97
6.9  KL measure dendrogram                                    98
6.10 BP measure dendrogram                                    99
6.11 DTW matrix                                               102

A.1  Holter ECG application                                   ii

List of Tables

2.1 Simulated annealing                                       17
2.2 Methods comparison-Data Set I: Percentage difference      34
2.3 Methods comparison-Data Set II: Percentage difference     35
2.4 Methods comparison-Data Set I: Likelihood                 35
2.5 Methods comparison-Data Set II: Likelihood                35
2.6 Methods comparison-Data Set I: Distance measure           35
2.7 Methods comparison-Data Set II: Distance measure          36
2.8 Method comparison-Data Set I-II: Iteration                36

3.1 Equivalent FIR filters coefficients                       45
3.2 Comparison of different denoising techniques              54
3.3 Control parameters of adapted wavelet denoising           55
3.4 Comparison of different denoising techniques              58

5.1 Data Set I classification results                         71
5.2 Data Set II classification results                        71
5.3 ECG classification results, N=10                          73
5.4 ECG classification results, N=15                          74
5.5 ECG classification results, N=25                          74

6.1 Cluster selection synthetic data results                  87
6.2 Partition initialization                                  93
6.3 HMM learning procedure                                    93
6.4 Synthetic Data Set I-II clustering results                97
6.5 Electrocardiogram clustering results, N=10                99
6.6 Electrocardiogram clustering results, N=15                100
6.7 Electrocardiogram clustering results, N=25                100
6.8 Best clustering results                                   100
6.9 Clustering results comparison                             101

Chapter 1

Introduction

    Don't be sad, don't be angry, if life deceives you!
    Submit to your grief – your time for joy will come, believe me.
                Eugene Onegin, Aleksandr Sergeyevich Pushkin (1799-1837)

Computer-based methods for the analysis and interpretation of biological signals have been the subject of intense research, and applications that perform signal processing and diagnostic interpretation of signals are widespread. One example is the visual analysis of long-term electrocardiogram (ECG) recordings (sometimes called Holter signals) or of sleep electroencephalograms (EEG), which is tedious, time-consuming and operator-dependent. It is obvious that automated systems for biological signal processing tasks such as noise removal, pattern detection, arrhythmia classification and clustering considerably reduce the amount of time medical staff need to spend.

One solution to the above problem is offered by the stochastic theory of Hidden Markov Models (HMM). Neither the theory of HMMs nor its application to the biological field is new. The first mention of a Markov probabilistic scheme is due to Markov [Markov, 1907] at the beginning of the twentieth century: Markov developed what later became known as the Markov framework in order to analyze the text of his fellow Russian Pushkin's masterpiece Eugene Onegin. HMMs were introduced by Baum and collaborators [Baum and Petrie, 1966] and have had their greatest application in speech recognition and, to a lesser extent, statistical language learning [Jelinek, 1999], as well as in sequence identification (such as DNA [Baldi and Brunak, 2000], [Krogh et al., 1994]). However, widespread understanding and application of HMM theory in the medical sciences, especially in biological signal processing, is very new, and only a few attempts to use HMM theory in the biological signal processing community have appeared within the past several years.

1.1 Goals of the thesis

The main objective of the thesis is the complex processing of electrocardiogram signals using Hidden Markov Models. The proposed thesis framework sets the following research goals:

1. State of the Art. A comprehensive review of the state of the art of HMM methodology will be carried out.

2. Electrocardiogram Preprocessing. This goal consists of the three most important ECG preprocessing steps: denoising, characteristic points detection and baseline wandering removal. The theory of HMMs will be applied in at least one of these steps.

3. Electrocardiogram Modelling. Illustrate the usefulness of HMMs in the context of ECG modelling.


4. Electrocardiogram Classification. Show that the HMM framework is able to solve the classification task.

5. Electrocardiogram Clustering. Demonstrate that HMMs can also be applied to unsupervised learning.

1.2 Overview of the thesis

The rest of the thesis is organized as follows. Chapter 2 reviews the state of the art and the application scope of Hidden Markov Model theory. We will discuss the standard HMM, its enhancements and its analogies, mainly across the area of machine learning. Different approaches to optimizing HMMs are described as well. Finally, a comparative study of different learning techniques and initialization methods is carried out.

Before any sophisticated analysis is developed, it is worth considering signal preprocessing. This stage can contribute considerably to improving the results of further steps such as classification or clustering. In Chapter 3 the most important stages in ECG preprocessing are addressed: baseline removal, detection of significant ECG complexes, and denoising. All stages take advantage of the wavelet transform approach. Moreover, in the case of denoising, a technique combining the wavelet transform and HMMs is described.

Classification is obviously a less demanding task than unsupervised learning. In Chapter 5 the main concept is explained, along with the evaluation of experiments both on a synthetic data set and on real-life ECG signals.

A major concern of biological signal analysis is to reveal the organization of patterns into sensible clusters (groups), which allows us to discover similarities and differences among patterns and derive useful conclusions about them. This idea is met in many fields, such as the life sciences (biology, zoology), social sciences (sociology, archeology), earth sciences, engineering and medical science. The last field is the one of interest here; we will focus mainly on the clustering of important events in biological signals such as the ECG. In Chapter 6, the clustering algorithms and their implementation are described. Moreover, this chapter also addresses the difficult task of determining the number of underlying clusters in a partition.

One of the most important steps in the clustering process is the evaluation of results. How can we trust that the clustering method has detected the number of clusters correctly and grouped similar events? One possible solution to these questions is to use an approach that provides, in addition to unsupervised learning, modelling capabilities as well. In that case each cluster is represented by a model, which makes the interpretation of results much easier. We cover this problem in Chapter 4, where ECG modelling is analyzed. Finally, Chapter 7 includes discussions, concluding remarks and future directions.

1.3 A note on software

Many of the algorithms and examples in this thesis have been implemented using my Biological Processing Toolbox for Matlab (BPT). It is open-source and freely available from hpk.felk.cvut.cz/~xnovakd1/PhDthesis/bpt.html. Nevertheless, many of the algorithms used throughout the thesis are based on Matlab code by other authors, and in this section I would like to thank them. In particular, I would like to thank Jan Macek (Czech Technical University in Prague) for providing the software support for characteristic ECG points detection (Section 3.2). Many thanks to Robert Nowak (Rice University) for letting me use the Wavelet HMM denoising toolbox (Section 3.4.4). I am also pleased to acknowledge Stephen Roberts and Iead Rezek (Oxford University) for giving me their approval to modify their Maximum a Posteriori and Variational Bayes HMM toolboxes (Section 2.4.4).

1.4 A note on bibliography

We have tried to include all the basic and necessary theoretical material in the thesis to save the reader from searching for further bibliography. Nonetheless, we realize that in many cases it is crucial to explore some topics in more detail through the given references. Since many of the references are freely available and can be found more or less easily on the Internet, we decided to provide them in pdf format on an enclosed CD. Such references are marked by an asterisk, for example [Rabiner∗, 1989], which afterwards appears in the Bibliography Section as:

[Rabiner∗, 1989] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77.

1.5 Declaration of previous work

This thesis is based on the following previously published material:

• Novak, D., Cuesta-Frau, D., Tormos, P. M., and Lhotska, L. (2003). Number of arrhythmia beats determination in Holter electrocardiogram: How many clusters? In 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, September 17-21, 2003, Cancun, Mexico.

• D. Cuesta, P. M., Aboy, M., Novak, D., Brezny, R., Samblas, L., Pastor, D., and Sancho, S. (2003). Biosignal laboratory: A software tool for biomedical signal processing and analysis. In 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, September 17-21, 2003, Cancun, Mexico.

• Lhotska, L., Fejtova, M., Macek, J., and Novak, D. (2003). Biological data preprocessing: A case study. In Intelligent and Adaptive Systems in Medicine, EUNITE Workshop 2003, Prague.

• Cuesta, D. and Novak, D. (2002). Automatic extraction of significant beats from a Holter register. In The 16th International EURASIP Conference BIOSIGNAL 2002, pages 3-5, Brno, Czech Republic.

• Cuesta, D., Novak, D., Perez, J., and Andreu, G. (2002). Feature extraction methods applied to the clustering of electrocardiographic signals: a comparative study. In International Conference on Pattern Recognition, ICPR-2002, Quebec, Canada.

• Novak, D. (2002). Wavelet image denoising: a comparative study. Technical Report, Department of Cybernetics, Czech Technical University in Prague.

• Novak, D., Cuesta, D., Perez, J., and Andreu, G. (2000b). Denoising electrocardiogram signal using adaptive wavelets. In The 15th International EURASIP Conference BIOSIGNAL, Brno, Czech Republic.

• Cuesta, D., Novak, D., Perez, J., and Andreu, G. (2000a). Baseline wandering removal using wavelets. In The 15th International EURASIP Conference BIOSIGNAL, Brno, Czech Republic.

Chapter 2

Theory of Hidden Markov Models

In this chapter we cover the basic theory of HMMs and related issues. First we look at the underlying HMM theory in Sections 2.2 and 2.3. We discuss the important problem of HMM learning in more detail in Section 2.4, followed by Section 2.5, where we survey the literature for relationships between HMMs and other architectures such as stochastic grammars or the Boltzmann machine. The graphical probabilistic nature of HMMs calls for further development of HMM structures, as discussed in Section 2.6. Some bibliographical remarks are presented in Section 2.7. Finally, practical issues are covered in Section 2.8, where a comparative study of different learning and initialization methods is carried out. First, however, we describe very briefly the theory behind Gaussian Mixture Models (GMM), which will also appear throughout the thesis. For example, in Section 6.3.1 GMMs are the cornerstone of a method that enables correct determination of the number of clusters in a partition.

2.1

Gaussian Mixture Models

Mixture models, in particular mixtures of Gaussians, have been a popular tool for density estimation, clustering, and unsupervised learning, with a wide range of applications in statistics, machine learning, pattern recognition [Moerland∗, 2000] and data mining (see for instance [Duda et al., 2001], [Bishop, 1995], [McLachlan and Peel, 2000], and the references therein). Mixture models are one of the most useful tools for handling incomplete data, in particular hidden variables. For Gaussian mixtures, the hidden variables indicate for each data point the index of the Gaussian that generated it. Thus the model is specified by a joint density over the observed and hidden variables. Let X = [X_1, ..., X_d] be a d-dimensional random variable, with x = [x_1, ..., x_d] representing one particular outcome of X. X is said to follow a K-component finite mixture distribution if its probability density function can be written as

p(x|Θ) = ∑_{k=1}^{K} p(k) p(x|Θ_k)    (2.1)

where p(1), ..., p(K) are the mixing probabilities, each Θ_k is the set of parameters defining the kth component, and Θ = {Θ_1, ..., Θ_K, p(1), ..., p(K)} is the complete set of parameters needed to specify the mixture. In this thesis we assume that all the components have d-variate Gaussian distributions (2.2), each characterized by Θ_k = {µ_k, σ_k}. Throughout the thesis we also assume that the individual features of the vectors x_n are independent of each other; the covariance matrix therefore degenerates into a variance vector σ_k.


p(x_n|Θ_k) = ∏_{i=1}^{d} 1/√(2π σ_{k,i}^2) · exp( −(x_{n,i} − µ_{k,i})^2 / (2 σ_{k,i}^2) )    (2.2)

Given a set of N independent and identically distributed (iid) samples X = {x_1, ..., x_N}, the log-likelihood corresponding to a K-component mixture is

log P(X|Θ) = L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = ∑_{i=1}^{N} log ∑_{k=1}^{K} p(k) p(x_i|Θ_k)    (2.3)
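For concreteness, the mixture density (2.1), the diagonal Gaussian (2.2) and the log-likelihood (2.3) can be evaluated directly. The following is a minimal Python sketch using variance vectors, as assumed above; the function names are illustrative, not from the thesis.

```python
import math

def gaussian_diag(x, mu, sigma):
    """Eq. (2.2): product of per-dimension Gaussians with a variance vector."""
    p = 1.0
    for xi, mi, si in zip(x, mu, sigma):
        p *= math.exp(-(xi - mi) ** 2 / (2 * si ** 2)) / math.sqrt(2 * math.pi * si ** 2)
    return p

def mixture_density(x, weights, mus, sigmas):
    """Eq. (2.1): sum_k p(k) p(x|Theta_k)."""
    return sum(w * gaussian_diag(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def log_likelihood(X, weights, mus, sigmas):
    """Eq. (2.3): log-likelihood of an iid sample under the mixture."""
    return sum(math.log(mixture_density(x, weights, mus, sigmas)) for x in X)
```

For a single standard Gaussian component, `mixture_density` reduces to the usual normal density.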

The standard method used to fit finite mixture models to observed data is the expectation-maximization (EM) algorithm, which converges to a maximum likelihood (ML) estimate of the mixture parameters [McLachlan and Krishnan, 1997]. The algorithm will be described in Section 2.4.1. The updates for a mixture of Gaussians can be accomplished by iterative application of the following equations for all components k ∈ {1, ..., K}:

p(k|x_i) = p(k) p(x_i|Θ_k) / p(x_i|Θ)    (2.4)

p(k) = (1/N) ∑_{i=1}^{N} p(k|x_i)    (2.5)

µ_k = ∑_{i=1}^{N} p(k|x_i) x_i / (N p(k))    (2.6)

σ_k = ∑_{i=1}^{N} p(k|x_i) (x_i − µ_k)^2 / (N p(k))    (2.7)

2.2

Markov chain

A Markov chain deals with a class of random processes that incorporate a minimum amount of memory. Let S_1, S_2, ..., S_N be a sequence of random variables taking their values in the same finite alphabet X = {1, 2, ..., c}. If nothing more is said, then Bayes' formula applies:

P(S_1, S_2, ..., S_N) = ∏_{i=1}^{N} P(S_i|S_1, S_2, ..., S_{i−1})    (2.8)

The random variables are said to form a Markov chain [Jelinek, 1999] if

P(S_i|S_1, S_2, ..., S_{i−1}) = P(S_i|S_{i−1})  for all values of i    (2.9)

As a consequence, for Markov chains

P(S_1, S_2, ..., S_N) = ∏_{i=1}^{N} P(S_i|S_{i−1})    (2.10)

Such a random process thus has the simplest memory: the value at time t depends only on the value at the preceding time and on nothing that went before. The Markov chain is time invariant or homogeneous if, regardless of the value of the time index i,

P(S_i = s′|S_{i−1} = s) = p(s′|s)  for all s, s′ ∈ X    (2.11)


p(s′|s) is called the transition function, and it can be represented as a c × c matrix. p(s′|s) must satisfy the usual conditions for all s ∈ X:

∑_{s′∈X} p(s′|s) = 1,    0 ≤ p(s′|s) ≤ 1,    s′ ∈ X    (2.12)
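A homogeneous Markov chain is thus fully specified by its initial distribution and transition matrix, and a state sequence can be simulated directly from them. The following is an illustrative sketch (the function name is hypothetical):

```python
import random

def simulate_chain(p0, P, n, rng=None):
    """Draw a length-n state sequence from initial distribution p0 and
    transition matrix P, whose rows satisfy the constraints (2.12)."""
    rng = rng or random.Random(0)

    def draw(dist):
        # Inverse-CDF sampling from a discrete distribution.
        u, acc = rng.random(), 0.0
        for s, p in enumerate(dist):
            acc += p
            if u < acc:
                return s
        return len(dist) - 1

    states = [draw(p0)]
    for _ in range(n - 1):
        states.append(draw(P[states[-1]]))
    return states
```

With the two-state chain of Figure 2.1 as input, the relative frequency of each state in a long simulated sequence approaches the chain's stationary distribution.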

[Figure: two-state Markov chain with states s and s′, initial probabilities P(S_0 = s), P(S_0 = s′), and the four transition probabilities P(S_t = ·|S_{t−1} = ·).]

Figure 2.1: Two-state Markov chain.

We can think of the values of S_i as states, and thus of the Markov chain as a finite state process, called a discrete Markov process, with transitions between states specified by the function p(s′|s). One example of a Markov chain is shown in Figure 2.1, where the state space is of size c = 2.

2.3

Hidden Markov Model

An HMM is a stochastic finite state automaton. A first-order HMM is used, in which the Markov property (2.9), that the state of the system at a particular time t depends only on the state of the system at the immediately preceding time point, is fulfilled. An HMM is characterized by the following [Rabiner∗, 1989], [Psutka, 1995]:

1. N, the number of states in the model. Although the states are hidden, for many practical applications they often have some physical significance. We denote the individual states as S = (S_1, S_2, ..., S_N), and the state at time t as q_t.

2. M, the number of distinct observation symbols per state, or the number of mixtures in the Gaussian pdf.

3. The state transition probability distribution A = {a_ij}, of size N × N, which defines the probability of a transition from state i at time t to state j at time t + 1:

a_ij = P(q_{t+1} = S_j|q_t = S_i),    1 ≤ i, j ≤ N    (2.13)

4. The initial state distribution π = {π_i}, which defines the probability of any given state being the initial state of the given sequences, where

π_i = P(q_1 = S_i),    1 ≤ i ≤ N    (2.14)

5. The emission probability, which we can further divide into two categories depending on whether the observation sequence is discrete or continuous:


- discrete emission probability B = {b_j(k)}, with M distinct observation symbols per state, V = {v_1, v_2, ..., v_M} (i.e., the discrete alphabet size), where b_j(k) is the observation symbol probability in state j:

b_j(k) = P(v_k at t|q_t = S_j),    1 ≤ j ≤ N, 1 ≤ k ≤ M    (2.15)

The observation symbols correspond to the physical output of the system being modelled. We will call these models discrete HMMs (DHMM). One example of an HMM with discrete emission probabilities is shown in Figure 2.2.

- continuous emission probability B = {b_j(O_t)}, where O = O_1, O_2, ..., O_T; the emission probability density function of each state is defined by a finite multivariate Gaussian mixture (2.16):

b_j(O_t) = ∑_{m=1}^{M} d_{jm} N(O_t; µ_{jm}, C_{jm}),    1 ≤ j ≤ N    (2.16)

where O_t is the feature vector of the sequence being modelled, d_{jm} is the mixture coefficient of the mth mixture in state j, and N is a Gaussian density with mean vector µ_{jm} and covariance matrix C_{jm} for the mth mixture component in state j. We will refer to these models as continuous HMMs (CHMM).

[Figure: three-state discrete HMM; each state S_1, S_2, S_3 has transition probabilities a_ij to the other states and emission probabilities b_jk over the four symbols v_1, ..., v_4.]

Figure 2.2: An example of a three-state discrete HMM with four discrete symbols per state.

A complete specification of an HMM requires specification of the two model parameters (N and M) and of the three probability measures A, B, π. We will refer to the complete set of HMM parameters as

λ = {A, B, π}    (2.17)

An important aspect of the probabilistic description is that the stochastic constraints on the HMM parameters must be satisfied, namely

∑_{i=1}^{N} π_i = 1    (2.18)


∑_{j=1}^{N} a_ij = 1,    1 ≤ i ≤ N    (2.19)

∑_{k=1}^{M} b_j(k) = 1,    1 ≤ j ≤ N    (2.20)
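The constraints (2.18)-(2.20) simply say that π and every row of A and B must be a probability distribution, which is easy to verify programmatically. A small illustrative helper (not from the thesis):

```python
def is_stochastic(A, pi, B, tol=1e-9):
    """Check the stochastic constraints (2.18)-(2.20) of a discrete HMM:
    pi and every row of A and B must be non-negative and sum to one."""
    rows = [pi] + list(A) + list(B)
    return all(abs(sum(r) - 1.0) <= tol and min(r) >= 0.0 for r in rows)
```

Such a check is useful after every re-estimation step, since numerical round-off can slowly violate the constraints.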

Given the form of the HMM, there are three basic problems of interest that must be solved:

Problem 1: Given the observation sequence O = O_1 O_2 ... O_T and a model λ, how do we compute P(O|λ), the probability of the observation sequence given the model? See [Rabiner∗, 1989].

Problem 2: Given the observation sequence O = O_1 O_2 ... O_T and a model λ, how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T which best "explains" the observations? This problem is solved by the so-called Viterbi algorithm [Viterbi, 1967]. In machine learning terminology this problem is called inference.

Problem 3: How do we adjust the model parameters λ = {A, B, π} to maximize P(O|λ)? This is also called the learning procedure.

The solution of the above problems is included in Appendix B.1. Given this HMM structure, a natural approach is to learn the parameters of the model using standard HMM estimation techniques so as to maximize the likelihood [Rabiner∗, 1989]. We discuss the learning process in the next Section 2.4.
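Problem 1 is classically solved by the forward algorithm (see Appendix B.1 and [Rabiner∗, 1989]). A compact sketch for a discrete HMM, with purely illustrative toy parameters in the test below:

```python
def forward(pi, A, B, obs):
    """Forward algorithm: returns P(O|lambda) for a discrete HMM.
    pi[i] are initial probabilities, A[i][j] transition probabilities,
    B[i][k] the probability of emitting symbol k in state i."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O|lambda) = sum_i alpha_T(i)
    return sum(alpha)
```

A useful sanity check is that the probabilities of all possible observation sequences of a fixed length sum to one. In practice the recursion is computed with scaling or in the log domain to avoid numerical underflow for long sequences.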

2.4

Learning techniques

We will present several different methods for HMM training, which can be divided into two groups: (i) hill-climbing algorithms (EM, k-means and gradient search) and (ii) global search algorithms (the genetic approach and simulated annealing). The first group depends quite strongly on the initial estimate of the model parameters: an arbitrary estimate of the initial model parameters will usually lead to a sub-optimal model in practice. Although this problem can be avoided by segmental k-means segmentation [Juang and Rabiner, 1990], this procedure is computationally intensive. The second group is able to escape from the initial guess and find optimal solutions thanks to its global search capability.

2.4.1

Baum-Welch (EM) algorithm

The Expectation-Maximization (EM) algorithm is a general method of finding the maximum-likelihood (ML) estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. Each iteration of the EM algorithm consists of two steps, called the expectation step (E-step) and the maximization step (M-step). The name was given by Dempster [Dempster et al., 1977] in his fundamental paper. There are two main applications of the EM algorithm. The first occurs when the data indeed has missing values, due to problems or limitations of the observation process. The second occurs when optimizing the likelihood function is analytically intractable but the likelihood function can be simplified by assuming the existence of, and values for, additional but missing (or hidden) parameters. The latter application is more common in the computational pattern recognition community [Bilmes∗, 1998]. At this point we recall the definition of the maximum-likelihood (ML) estimation problem. We have a density function p(x|Θ) that is governed by the set of parameters Θ (e.g., p might be a mixture of Gaussians and Θ the means and covariances). We also have a


data set of size N, supposedly drawn from this distribution, i.e., X = {x_1, x_2, ..., x_N}. That is, we assume that these data vectors are independent and identically distributed (iid) with distribution p. Therefore, the resulting density for the samples is

p(X|Θ) = ∏_{i=1}^{N} p(x_i|Θ) = L(Θ|X)    (2.21)

This function L(Θ|X) is called the likelihood of the parameters given the data, or just the likelihood function. The likelihood is thought of as a function of the parameters Θ with the data X fixed. In the maximum likelihood problem, our goal is to find the Θ that maximizes L. That is, we wish to find Θ* where

Θ* = argmax_Θ L(Θ|X)    (2.22)

Often we maximize log L(Θ|X) instead because it is analytically easier. Now we describe very briefly the two basic steps of the EM method according to [McLachlan and Krishnan, 1997]. As before, we assume that data X is observed and is generated by some distribution. We call X the incomplete data. We assume that a complete data set Z = (X, Y) exists and also assume a joint density function p(z|Θ) = p(x, y|Θ) = p(y|x, Θ) p(x|Θ). With this new density function we can define a new likelihood function, L(Θ|Z) = L(Θ|X, Y) = p(X, Y|Θ), called the complete-data likelihood. Note that this function is in fact a random variable, since the missing information Y is unknown, random, and presumably governed by an underlying distribution. The original likelihood L(Θ|X) is referred to as the incomplete-data likelihood function. The EM algorithm first finds the expected value of the complete-data log-likelihood log p(X, Y|Θ) with respect to the unknown data Y, given the observed data X and the current parameter estimates. That is, we define the following Q̃ function:

Q̃(Θ, Θ^{(i−1)}) = E[log p(X, Y|Θ)|X, Θ^{(i−1)}]    (2.23)

The evaluation of this expectation is called the E-step of the algorithm. Notice the meaning of the two arguments of the function Q̃(Θ, Θ^{(i−1)}). The first argument Θ corresponds to the parameters that will ultimately be optimized in an attempt to maximize the likelihood. The second argument Θ^{(i−1)} corresponds to the parameters used to evaluate the expectation. The second step (the M-step) of the EM algorithm is to maximize the expectation computed in the first step. That is, we find

Θ^{(i)} = argmax_Θ Q̃(Θ, Θ^{(i−1)})    (2.24)

These two steps are repeated as necessary. Each iteration is guaranteed to increase the log-likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function.

A modified form of the M-step is, instead of maximizing Q̃(Θ, Θ^{(i−1)}), to find some Θ^{(i)} such that Q̃(Θ^{(i)}, Θ^{(i−1)}) > Q̃(Θ^{(i−1)}, Θ^{(i−1)}). This form of the algorithm is called Generalized EM (GEM) and is also guaranteed to converge.

Now we can apply the introduced EM theory to the HMM case [Bilmes∗, 1998], where the EM algorithm is usually known as the Baum-Welch algorithm [Baum and Petrie, 1966]. Baum and his collaborators formulated this algorithm long before Dempster [Dempster et al., 1977] and established its convergence properties. We consider O = {O_1, O_2, ..., O_T} to be the observed data and the underlying state sequence Q = {q_1, ..., q_T} to be hidden or unobserved. The incomplete-data likelihood function is given


by P(O|λ), whereas the complete-data likelihood function is P(O, Q|λ). The Q̃ function therefore is

Q̃(λ, λ′) = ∑_{q∈Q} P(q|O, λ) log P(O, q|λ′)    (2.25)

where λ′ are our initial (or guessed, previous) estimates of the parameters and Q is the space of all state sequences of length T.

Given a particular state sequence q, P(O, q|λ′) can be represented as

P(O, q|λ′) = π_{q_0} ∏_{t=1}^{T} a_{q_{t−1} q_t} b_{q_t}(O_t)    (2.26)

If we now substitute (2.26) into (2.25), the Q̃ function splits into three terms, each of which can be optimized individually using the technique of Lagrange multipliers with respect to the constraints (2.18-2.20). For further details, see [Bilmes∗, 1998] or Rabiner [Rabiner∗, 1989], page 265. Lucke [Lucke, 1996] gave a comprehensive unifying framework for the original HMM and its corresponding variants. He shows that an algorithm similar to Baum's exists in the graph context (see Section 2.5.1) only if a certain graph-theoretical criterion, chordality, is satisfied. Considering the computational complexity of the EM algorithm, each training sequence requires one forward and one backward propagation. Thus the EM algorithm scales as O(LT^2) operations, where L is the number of training observation sequences and T is the length of an observation sequence.
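As a concrete instance of alternating E- and M-steps, the GMM updates (2.4)-(2.7) of Section 2.1 can be sketched for a one-dimensional mixture. This is a minimal illustration with a crude spread-out initialization, not the toolbox implementation used elsewhere in the thesis:

```python
import math

def em_gmm_1d(x, K, iters=50):
    """EM for a 1-D Gaussian mixture: the E-step computes the
    responsibilities (2.4); the M-step applies the weight, mean and
    variance updates (2.5)-(2.7)."""
    n = len(x)
    w = [1.0 / K] * K
    # Crude initialization: means spread over the data, unit variances.
    mu = [x[i * (n - 1) // (K - 1)] for i in range(K)] if K > 1 else [sum(x) / n]
    var = [1.0] * K
    for _ in range(iters):
        # E-step: posterior p(k|x_i) for every sample.
        r = []
        for xi in x:
            pk = [w[k] * math.exp(-(xi - mu[k]) ** 2 / (2 * var[k])) /
                  math.sqrt(2 * math.pi * var[k]) for k in range(K)]
            s = sum(pk)
            r.append([p / s for p in pk])
        # M-step: re-estimate weights, means and variances.
        for k in range(K):
            nk = sum(r[i][k] for i in range(n))
            w[k] = nk / n
            mu[k] = sum(r[i][k] * x[i] for i in range(n)) / nk
            var[k] = sum(r[i][k] * (x[i] - mu[k]) ** 2 for i in range(n)) / nk + 1e-6
    return w, mu, var
```

The small constant added to the variance is a common guard against the degenerate solutions (variance collapsing to zero) that plain ML estimation of mixtures allows.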

2.4.2

Gradient search (on-line method)

The EM algorithm presented above works in batch mode. It must be noted, however, that on-line use of the EM algorithm can be problematic. This is because the EM algorithm, unlike gradient descent, does not have a learning rate. The EM algorithm can take large steps in the descent directions generated by each training example in isolation, converging towards poor local optima of the Q-function (2.25). This "jumping" effect can be avoided with gradient-descent learning, as long as the learning rate is small. The gradient-descent equations on the negative log-likelihood −log P(O|λ) can be derived, instead of using the Lagrangian technique with the normalization constraints as above, via an equivalent and useful reparametrization [Baldi and Chauvin, 1994]. For the parameters λ we obtain the classical delta rule of gradient descent, so learning can be carried out on-line. Batch equations can be obtained by summing over all training sequences. Just like EM, the gradient-descent equations require one forward and one backward propagation, so O(LT^2) operations must be performed per training cycle. Unlike EM, on-line gradient descent is a smooth algorithm. Another approach to on-line estimation of HMM parameters can be found in [Stiller and Radons, 1999].

2.4.3

Segmental k-means algorithm (Viterbi learning)

In this method the parameters of the model λ = {A, B, π} are adjusted to maximize P(O, I|λ), where I is the optimum state sequence given by the Viterbi algorithm [Viterbi, 1967]. In the previous method (equation 2.25) we calculated P(O|λ) by summing P(O, I|λ) over all possible state sequences I. We assume we have a training set of observation sequences; let ω be the number of such sequences available. The algorithm then consists of the following steps:

1. Randomly choose N observation symbols and assign each of the ωT observation vectors to the one of these N vectors from which its Euclidean distance is minimum. We have thus formed N clusters, each called a state (1 to N).


2. Based on this state segmentation, updated estimates of the a_ij coefficients can be obtained by counting the number of transitions from state i to state j and dividing it by the number of transitions from state i to any state (including itself).

3. Calculate the mixture parameters for each state:
ĉ_jm = number of vectors classified in cluster m of state j divided by the number of vectors in state j,
µ̂_jm = sample mean of the vectors classified in cluster m of state j,
Û_jm = sample covariance matrix of the vectors classified in cluster m of state j.

4. Find the optimal state sequence I* by means of Viterbi learning for each training sequence, using λ̂_i = {Â_i, B̂_i, π̂_i}.

5. If any vector is reassigned to a new state in Step 4, use the new assignment and repeat Steps 2 through 5; otherwise, stop.
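The Viterbi algorithm used in Step 4 can be sketched as follows for a discrete HMM (an illustrative implementation; the same recursion applies to continuous models with b_j(O_t) in place of B[j][o]):

```python
def viterbi(pi, A, B, obs):
    """Viterbi algorithm: most probable state sequence for a discrete HMM.
    pi[i] initial probabilities, A[i][j] transitions, B[i][k] emissions."""
    N = len(pi)
    # Initialization: delta_1(i) = pi_i * b_i(O_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    back = []  # backpointers psi_t(j)
    for o in obs[1:]:
        psi, nd = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            psi.append(best)
            nd.append(delta[best] * A[best][j] * B[j][o])
        back.append(psi)
        delta = nd
    # Termination and backtracking of the optimal path.
    q = [max(range(N), key=lambda i: delta[i])]
    for psi in reversed(back):
        q.append(psi[q[-1]])
    return list(reversed(q))
```

As with the forward recursion, a practical implementation works with log-probabilities so that the products become sums and do not underflow.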

2.4.4

Bayes adaptation

In the classical approach the parameter λ is thought of as an unknown but fixed quantity (a point estimate). In the Bayesian approach λ is considered a quantity whose variation can be described by a probability function called the prior distribution. This is a subjective distribution, based on the experimenter's belief, and can be formulated before the data are seen. The parameter λ is updated using Bayes' rule, hence the name Bayes adaptation. In this section we explain two methods based on the Bayesian approach: first the maximum a posteriori (MAP) estimation is described, followed by variational learning estimation (VAR).

MAP learning

The MAP estimation framework provides a way of incorporating prior information into the training process, which is particularly useful for dealing with the problem posed by sparse training data, for which the ML approach gives inaccurate estimates [Gauvain∗ and Lee, 1994]. The difference between MAP and ML estimation lies in the assumption of an appropriate prior distribution for the parameters to be estimated. For an observation sequence O with probability density function P(O|λ), and with P_0(λ) the prior density function of λ, the MAP estimate is defined as

λ_MAP = argmax_λ P(λ|O) ≈ argmax_λ P(O|λ) P_0(λ)    (2.27)

where we used Bayes' theorem and O = {O_1, O_2, ..., O_L} is the data set at hand. If λ is assumed to be fixed but unknown, then there is no prior knowledge about λ, which is equivalent to assuming a non-informative prior P_0(λ) = constant. Under this assumption, equation (2.27) reduces to the familiar ML formulation. Given the MAP formulation, three key issues remain to be addressed: (i) the choice of the prior distribution family, (ii) the specification of the parameters of the prior densities, and (iii) the evaluation of the maximum a posteriori. These problems are closely related, since an appropriate choice of the prior distribution can greatly simplify the MAP estimation process [Gauvain∗ and Lee, 1994]. For mathematical tractability, conjugate priors are often used in Bayesian adaptation. A conjugate prior for a random vector is defined as a prior distribution for the parameters of the probability density function of the random vector such that the posterior distribution P(λ|O) and the prior distribution P_0(λ) belong to the same distribution family for any sample observation sequence O [Casella and Berger, 2002].

2.4. LEARNING TECHNIQUES

12

Regarding the MAP estimate, it follows that the same iterative EM procedure described in Section 2.4.1 can be used to estimate the mode (the argmax) of the posterior density, by maximizing the auxiliary function R(λ, λ′) at each iteration instead of maximizing Q(λ, λ′) as in the conventional ML procedure [Dempster et al., 1977]:

R(λ, λ′) = Q(λ, λ′) + log P_0(λ)    (2.28)

As we have already mentioned, the selection of prior densities is a very important step. The EM algorithm can be applied to the MAP estimation problem if the prior density belongs to the conjugate family of the complete-data density. First consider that the mixture gains of each mixture density (2.16) jointly follow a multinomial distribution. A practical candidate for modelling the prior knowledge about the mixture gain parameter vector is then the conjugate density, a Dirichlet density:

Dir(d|ν) ∝ ∏_{m=1}^{M} d_m^{ν_m − 1}    (2.29)

where M is the number of Gaussian mixtures, d is the mixture weight vector and ν is the hyperparameter of the Dirichlet density. In the case of the initial and transition probabilities, a Dirichlet density can also be used for the initial probability vector π and for each row of the transition probability matrix A:

Dir(π|κ) ∝ ∏_{i=1}^{N} π_i^{κ_i − 1}    (2.30)

Dir(A|η) ∝ ∏_{i=1}^{N} ∏_{j=1}^{N} a_ij^{η_ij − 1}    (2.31)

Finally, for the mean vector and covariance matrix of a Gaussian mixture the conjugate densities [Rezek∗ et al., 2002] are as follows: for the mean, an F-dimensional Normal density, and for the covariance, an F-dimensional normal-Wishart density (for more information on the Dirichlet and Wishart densities, see Appendix C):

N(µ|ϱ, U) ∝ exp( −(1/2) (µ − ϱ)^T U (µ − ϱ) )    (2.32)

Wi(C|ζ, V) ∝ |C|^{ζ − (F+1)/2} exp( −tr(V C) )    (2.33)

We collect the priors of the Gaussian mixture of state i (here denoted g(θ_i)) into one equation, assuming independence between the parameters of the individual mixture components (2.32), (2.33) and the set of mixture weights (2.29):

g(θ_i) = Dir(d_i) ∏_{m=1}^{M} N(µ_m) Wi(C_m)    (2.34)

Moreover, if we extend the independence assumption to the initial and transition matrix, we can express the prior density of all HMM parameters as

P_0(λ) ∝ ∏_{i=1}^{N} [ π_i^{κ_i − 1} g(θ_i) ∏_{j=1}^{N} a_ij^{η_ij − 1} ]    (2.35)

Note that for notational clarity we have dropped the state index i and mixture index m in the previous equations, i.e., instead of µ_im (the mean in state i and mixture m) and d_i (the weight vector for state i) we have used µ and d, respectively. Having substituted the prior distribution (2.35) into (2.28),


we can now apply the standard EM procedure to estimate the hyper-parameters of the posterior density P_0(κ_i, η_ij, ν_i, ϱ_im, U_im, ζ_im, V_im). It can also be shown [Gauvain∗ and Lee, 1994] that the EM reestimation formulas for the MAP and ML approaches are asymptotically similar. Thus, as long as the initial estimates λ′ are identical, the EM algorithms for MAP and ML will provide identical estimates with probability one as L → ∞ (i.e., for an infinite data set). Figure 2.3 summarizes the probabilistic relationships between the variables in a Hidden Markov Model; the hyper-parameters are shown in rectangles.

[Figure: directed graph with hidden states S_{t−1} → S_t emitting O_t, t = 1..T; Dirichlet hyper-parameters κ and η on π and A, and Normal (ϱ, U) and Wishart (ζ, V) hyper-parameters on the Gaussian means µ_1..µ_M and covariances C_1..C_M.]

Figure 2.3: Directed graph of an HMM with a Gaussian observation model for observation O_t and hidden state S_t in the Bayesian framework.

The convergence of MAP estimation is measured by the actual value of the total log-likelihood (2.36), which consists of two terms: the log-likelihood of the total data sequence set O and the log-prior density log P_0(λ):

log P(λ|O) = log P(O|λ) + log P_0(λ) + const    (2.36)
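For intuition, with a Dirichlet prior (2.31) the MAP (posterior-mode) estimate of one row of the transition matrix reduces to adding the pseudo-counts η_ij − 1 to the observed transition counts before normalizing. The following hypothetical sketch illustrates only this count-based special case, not the full EM reestimation of Appendix B.2:

```python
def map_transition_row(counts, eta):
    """MAP (posterior-mode) estimate of one row of A under a Dirichlet
    prior: a_ij proportional to n_ij + eta_ij - 1, where n_ij are the
    observed transition counts."""
    num = [n + e - 1.0 for n, e in zip(counts, eta)]
    z = sum(num)
    return [v / z for v in num]
```

With the non-informative choice η_ij = 1 the pseudo-counts vanish and the MAP estimate reduces to the ML relative-frequency estimate, mirroring the reduction of (2.27) to the ML formulation.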

The reestimation formulas can be found in Appendix B.2.

Variational learning

Variational methods, which have been used in Bayesian machine learning for several years ([MacKay∗, 1997], [Attias∗, 2000]), provide another approach to the design of approximate inference and learning algorithms. Chapters on variational calculus can be found in many mathematics textbooks, yet their use in statistics is relatively recent (see [Jordan∗ et al., 1997b] and [Jaakkola∗, 2000] for excellent tutorials). First we describe the variational framework in the general case of graphical models, and then we extend this framework to HMMs. It is helpful to start the explanation of the variational approach with graphical models because, as will turn out in Section 2.5.1, the HMM can be seen as one particular example of a graphical model. The problem of probabilistic inference in graphical models is the problem of computing a conditional probability distribution over the values of the hidden nodes H given the values of the observed nodes E:

P(H|E) = P(H, E) / P(E)    (2.37)


We wish to approximate the conditional probability P(H|E) because exact algorithms might not provide a satisfactory solution to inference and learning problems due to their time or space complexity. We introduce an approximating family of conditional probability distributions, Q(H|E, ψ), where ψ are variational parameters. From the family of approximating distributions Q, we choose a particular distribution by minimizing the Kullback-Leibler (KL) divergence D(Q ‖ P) with respect to the variational parameters:

ψ* = argmin_ψ D( Q(H|E, ψ) ‖ P(H|E) )    (2.38)

where the KL divergence for any probability distributions Q(V) and P(V), V = {H, E}, is defined as

D(Q ‖ P) = ∑_{V} Q(V) ln( Q(V) / P(V) )    (2.39)

The minimizing values of the variational parameters, ψ*, define a particular distribution, Q(H|E, ψ*), that we treat as the best approximation of P(H|E) in the family Q(H|E, ψ). One simple justification for using the KL divergence as a measure of approximation accuracy is that it yields the best lower bound on the probability of the observations P(E) (i.e., the likelihood) within the family of approximations Q(H|E, ψ) [Jordan∗ et al., 1997b]. This is an important conclusion, because the likelihood is our quantity of interest in learning problems, as we saw in Section 2.4.1. We bound the logarithm of P(E) using Jensen's inequality as follows:

ln P(E) = ln ∑_{H} P(H, E) = ln ∑_{H} Q(H|E) · P(H, E)/Q(H|E) ≥ ∑_{H} Q(H|E) ln( P(H, E)/Q(H|E) )    (2.40)

The difference between the left and right hand sides of this inequality is easily seen to be the KL divergence D(Q ‖ P). Thus, by the positivity of the KL divergence, the right-hand side of (2.40) is a lower bound on ln P(E). Moreover, by choosing ψ according to (2.38), we obtain the tightest lower bound. Now we make a link between this lower bound and parameter estimation via the EM algorithm [Beal∗ and Ghahramani, 1995]. We have just shown that the function

L(Q, ψ) = ∑_{H} [ Q(H|E) ln P(H, E|ψ) − Q(H|E) ln Q(H|E) ]    (2.41)

is a lower bound on the log-likelihood for any probability distribution Q(H|E). Suppose now that we allow Q(H|E) to range over all possible probability distributions on H. This suggests the following algorithm. Starting from an initial parameter vector ψ^(0), we iterate the E and M steps. First we maximize the bound L(Q, ψ) with respect to the probability distribution Q, which is equivalent to minimizing the KL divergence. Second, we fix Q and maximize the bound L(Q, ψ) with respect to the parameters ψ. More formally, we have

(E step):  Q^{(k+1)} = argmax_Q L(Q, ψ^{(k)})    (2.42)

(M step):  ψ^{(k+1)} = argmax_ψ L(Q^{(k+1)}, ψ)    (2.43)
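The relationship between the bound (2.41), the likelihood and the KL divergence can be checked numerically for a discrete toy model. The following is an illustrative sketch (distributions are plain probability lists; the function names are hypothetical):

```python
import math

def kl(q, p):
    """KL divergence D(Q||P) of two discrete distributions, eq. (2.39)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def lower_bound(q_h, joint):
    """Eq. (2.41): L(Q) = sum_H [Q(H) ln P(H,E) - Q(H) ln Q(H)], where
    joint[i] = P(H=i, E) for the fixed evidence E. The gap
    ln P(E) - L(Q) equals D(Q(H) || P(H|E))."""
    return sum(qi * math.log(ji) - qi * math.log(qi)
               for qi, ji in zip(q_h, joint) if qi > 0)
```

When Q equals the true posterior, the bound is tight and L(Q) = ln P(E); for any other Q the gap is exactly the KL divergence to the posterior, which is the content of (2.38)-(2.41).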


Coming back to the HMM case, the key restriction is the assumption that the hidden variables H are independent:

Q(H|E) = ∏_{i=1}^{T} Q(H_i)    (2.44)

where H_i denotes the hidden variables at time step i, i = 1, ..., T. There is the obvious constraint that each distribution sums to unity, that is ∑_{H_i} Q(H_i) = 1. The assumption (2.44) is known as the "mean-field" assumption. Under the mean-field assumption, the distribution Q(H_i) which minimizes the KL divergence can be shown to be [Rezek∗ and Roberts, 2002]

Q(H_i) = (1/Z) exp( ∑_{H̄_i} Q(H̄_i|E) ln P(H_i|H̄_i, E) )    (2.45)

where H̄_i = H \ H_i is the set of all hidden variables H excluding H_i, and Z is just a normalisation constant. This equation provides the key to the solution of the variational EM algorithm in the E (2.42) and M (2.43) steps. The hidden information consists of the state sequence S and all HMM (variational) parameters ψ. With regard to our previous discussion of graphical models, we can quite generally handle the HMM parameters as additional nodes in a graphical model [Jordan∗ et al., 1997b]. Thus H_i = {S_i, ψ_i}, where ψ_i are the hyper- or variational parameters of the initial probability π, the transition probability matrix A and the observation probability b for state j. Moreover, we treat all HMM parameters as independent variables ψ = {ψ_1, ..., ψ_N} and all hidden state variables as one set S = {S_0, ..., S_T}; this makes P(H, E|ψ) tractable. Regarding notation, we denote the state sequence as S instead of Q as in Section 2.4.1, and we note that the observations O = E. Thus we can write

Q(H|O) ≈ Q(S|O) Q(ψ) = Q(S_0) ∏_{t=1}^{T} Q(S_t|S_{t−1}) ∏_{j=1}^{N} Q(ψ_j)    (2.46)

After the assumption of model independence, the full posterior is given by

P(S, ψ, O) = P(S_0|π) ∏_{t=1}^{T} P(S_t|S_{t−1}, A) P(O_t|S_t) · P(π) P(A) P(b)    (2.47)

The resulting KL divergence then becomes

D(Q ‖ P) = ∑_{S} … ∑_{b} Q(S)Q(π)Q(A)Q(b) ln[ Q(S)Q(π)Q(A)Q(b) / P(S, A, π, b, O) ]    (2.48)

Minimising the KL divergence (2.48) with respect to the individual distributions Q(S), Q(π), Q(A), Q(b) using (2.45) results in update equations for the HMM parameters. The last question to answer is the choice of the family of Q functions; to obtain the HMM conjugate prior distributions we can use the same priors (2.29-2.35) as in the MAP case. The convergence of the variational algorithm is measured by the actual value of the KL divergence (2.48), and the estimation is terminated when (2.48) no longer changes significantly. Regarding terminology, variational learning is sometimes also known as ensemble learning, because it was originally introduced as a way of fitting an "ensemble" of neural networks to data, where each setting of the parameters can be thought of as a different member of the ensemble ([Jordan∗ et al., 1997a]).


Finally, we note that the ordinary EM algorithm is a special case of variational learning in which we constrain the approximating distribution to a point estimate (i.e., a Dirac delta function) [Beal∗ and Ghahramani, 1995]. The update formulas for variational learning can be found in Appendix B.3.

2.4.5

Genetic algorithm optimization

A genetic algorithm (GA) is a stochastic search method that performs a global search within a defined search space and can thereby obtain globally optimal solutions. In [Kwong∗ et al., 2001] a new training method based on GA and Baum-Welch algorithms is presented that obtains an HMM with an optimised number of states. By exploiting the global search capability and problem-independent nature of GAs, GA-based HMM training can find the optimal number of states and the optimal model parameters in a single step. Generally speaking, each GA consists of several components: an encoding mechanism, a fitness evaluation, a selection mechanism, and a replacement mechanism. We describe these basic components very briefly. In the encoding mechanism, each chromosome represents one HMM, where each gene expresses one HMM parameter. As described in Section 2.3, the likelihood P(O|λ) is an appropriate criterion to use in the fitness function to determine the quality of a chromosome. The selection mechanism is one of the most common, roulette-wheel selection. Only the genetic operations, mainly state mutation, allow the number of states of the HMMs in the chromosomes to change; they aim to explore the fitness of chromosomes with different numbers of states. Finally, steady-state reproduction is used as the replacement strategy. The main advantage of the proposed method [Kwong∗ et al., 2001] is that it found the optimal topology in all tested cases.
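Roulette-wheel selection, mentioned above, picks a chromosome with probability proportional to its fitness. A minimal illustrative sketch (not the implementation of [Kwong∗ et al., 2001]):

```python
import random

def roulette_select(population, fitness, rng=None):
    """Roulette-wheel selection: pick an individual with probability
    proportional to its (non-negative) fitness value."""
    rng = rng or random.Random(1)
    total = sum(fitness)
    u, acc = rng.random() * total, 0.0
    for ind, f in zip(population, fitness):
        acc += f
        if u < acc:
            return ind
    return population[-1]  # guard against floating-point round-off
```

In GA-based HMM training the fitness would be the (log-)likelihood P(O|λ) of the HMM encoded by each chromosome, suitably shifted so that all values are non-negative.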

2.4.6

Simulated annealing

Simulated annealing is a well known general heuristic approach to combinatorial optimisation. The basic idea is related to that of an interchange heuristic, in which at each iteration, one moves from a feasible solution to another in its neighbourhood with a better objective value, stopping when a locally optimal solution is found. Since there may be many local optima, it is usual to run an interchange heuristic many times from randomly chosen starting points. In contrast, in simulated annealing, one allows the objective value to worsen occasionally with a certain decreasing probability, in order to escape getting trapped in a local optimum. 0 Let Q and Q be two neighbouring solutions and f (Q) be the objective value of solution Q, then the general outline of the simulated annealing algorithm can be presented as follows [Hamam∗ and Al-Ani, 1996]. Given the observation sequence Q, a state sequence is generated at random and the logarithm of probability P (O|λ) of generating O is considered to be the objective value f (Q) to be minimized. The solution structure is based on the choice of a state trajectory. The various building blocks were proposed as follows: (i) The initial solution is obtained simply by generating a random state trajectory. (ii) The initial temperature should be high (close to 1) to allow virtually all transitions to be accepted. Thereafter, the temperature is decreased at each iteration by a factor β = 0.9. (iii) The number of trials at each temperature should progressively increase with the decrease in temperature. (iv) Defining the neighbourhood structure. Moving from one solution to the next is obtained simply by choosing at random a state at a randomly chosen instant and affecting it randomly to another state. (v) Updating the objective value. The objective function to be minimised is the overall probability of the observation sequence. 
Since only one state is changed at one period, cost updating is performed by calculating the probability differential.

2.5. ANALOGIES

17

Table 2.1: Simulated annealing. T, α, β are the temperature (control parameter), iteration rate and temperature reduction factor, resp.

Generate an initial feasible solution Initialise T and kmax repeat for k = 1 to kmax 0 Generate Q from Q 0 if f (Q ) ≤ f (Q) 0 f (Q)−f (Q )

T elseif exp( 0 then Q = Q

)

> rand[0, 1)

end kmax = kmax ∗ α T =T ∗β until termination

2.5

Analogies

Since HMM is a representative of a graphical model, it is not complicated to find throughout literature some common elements with other architectures such as dynamical systems or neural networks. In the following sections we will explore some analogies into more details.

2.5.1

Graphical probabilistic models

Let U = {U1 , . . . , UN } be a set of random variables representing for example, features in a pattern recognition problems. Let p(U ) represent the joint distribution for U . The central idea behind graphical models is to represent the independence structure in p(U ) by an annotated graph. The nodes of the graph are in one-to-one correspondence with the variables in U and the edges of the graph reflect no independence structure in p(U ). Annotation of the graph is achieved by factoring the underlying probability model p(U ) into conditional tables (for directed graphs) or potentional functions (for undirected graphs) [Smyth∗ et al., 1997]. In an artificial intelligence (AI) context [Pearl, 1988] independently developed a substantial body of theory for constructing and manipulating conditional independence relations using directed graphical models called belief networks.

B

q S1

b Y1

a

S2

b

a

S3

b Y2

Y3

SqT

b YT

Figure 2.4: An ADG for a first-order HMM.

Graphical models fall into two general classes, those based on acyclic directed graphs (ADGs) and those based on undirected graphs (UGs). For ADGs the factors are local conditional probabilities, for UGs they are local clique functions. The directed ADG formalism is primarily used in AI and statistics where cause-effect relationships can be made explicit by the use of directed arcs in the graph. The undirected UG formalism is popular in the statistical

2.5. ANALOGIES

18

physics and image processing communities where associations between variables (particles or pixels) are considered correlational rather than casual. Among variation of UGs we can find Boltzmann machines (Section 2.5.4), Markov random field (Section 2.5.5) or Markov networks. ADGs are often referred to as Bayesian networks, belief networks or recursive graphical models. Since we can represent an HMM as a simple graphical model – see Figure (2.4), in [Smyth∗ et al., 1997] authors show that the forward-backward algorithm and Viterbi algorithms are in fact directly equivalent to the Pearl, i.e., these algorithms were developed completely independently in both communities. Thus the graphical model algorithms (inference and MAP problems) are perfectly general and can handle arbitrary extensions to the standard ”first-order” model. No additional effort is required in terms of deriving new inference procedures for more complicated models, since the inference algorithms follow directly from the general specifications of Pearl [Pearl, 1988].

2.5.2

Stochastic grammars

Next we review the rudiments of the theory of formal grammars and the connection to HMMs. Grammars are natural tools for modeling string of letters and have been used extensively in the analysis and design of computer languages and compilers [Hopcroft and Ullman, 1979]. A formal grammar is a set of rules for producing all the strings that are syntactically correct, and only those strings. The formal grammar G consists of an alphabet A of letters, called the terminal symbols; alphabet W of variables called nonterminal symbols and a set R of production rules. Among the nonterminal symbols, there is a special s=start variable. Each production rule R consists of a pair (α, β), commonly denoted α → β. The language L(G) generated by the grammar G, is the set of all terminal strings that can be derived from the start state s. We will describe one of the simplest classes of the grammars-the regular grammar. A grammar G is regular if all the production rules are of the form u → Xv, or u → X, where u and v are single non-terminal symbols. A language is regular if it can be generated by a regular grammar. Now we can make connection between finite state automata and regular language. Without going into details, regular languages correspond to FSA, with typically one state per nonterminal symbol in the grammar, as in HMMs. In such an automata, there is no storage facility apart from the states themselves. So far we have considered deterministic grammars. Stochastic grammars are obtained by superimposing a probability structure on the production rules. P Specifically, each production rule α → β is assigned a probability P (α → β), so that β P (α → β) = 1. A stochastic ˜ and can be viewed as a probabilistic grammar therefore is characterized by a set of parameters λ generative model for the corresponding language. HMMs can be viewed exactly as stochastic regular grammars (SRGs). 
To see this, it suffices to replace the transition from a state sj to a state si in an HMM, together with the emission of the alphabet letter V with the SRG production rule sj → V si with associated probability bj (i) (2.15). Regarding computation aspects of expressing likelihood and learning technique, the expressions are very similar to those obtained in case of HMMs. In case of the best path in HMM context, the most likely parse tree of a string according to an SRG can be found by a similar version of dynamic programming, which generalizes the Viterbi algorithm of HMMs [Baldi and Brunak, 2000]. However, the complexity of SRGs is O(N 3 ) comparing with complexity of Viterbi algorithm O(N 2 ). In case of learning algorithms, we estimate model ˜ like in the HMM case: by maximizing the likelihood or the posterior through parameters λ some form of the EM algorithm, although one could also use other methods, such as gradient descent [Baldi and Brunak, 2000].

2.5. ANALOGIES

2.5.3

19

Linear dynamical systems

HMMs and Linear Dynamical Systems (LDSs) are based on the same assumption: a hidden state variable, of which we can make noisy measurements, evolves Markovian dynamics. Both have the same independence diagram and consequently the learning and inference algorithms have the same structure for both. The only difference is that HMM uses a discrete state variable with arbitrary dynamics and arbitrary measurement while LDS uses a continuous state variable with linear Gaussian dynamics and measurements. It is shown in [Minka∗ , 1999] how the forward-backward equations for the HMM, specialized to linear-Gaussian assumptions, lead directly to Kalman filtering (forward procedure) and Rauch-Tung-Streibel smoothing (backward procedure). Very good survey on the problematic of linear models and their connection not only to the HMMs (principal component analysis or vector quantization is included as well) can be found in [Roweis∗ and Ghahramani, 1999].

2.5.4

Boltzmann networks

The Boltzmann network is a stochastic machine whose composition consists of stochastic neurons. A stochastic neuron resides in one of two possible states in a probabilistic manner. The stochastic neurons of the Boltzmann machine partition into two functional groups: visible and hidden [Haykin, 1999]. During the training phase of the network, the visible neurons are all clamped onto specific states. The hidden neurons, on the other hand, always operate freely. The network described here represents a special case of the Boltzmann machine. It may be viewed as an unsupervised learning procedure for modeling a probability distributions that is specified by clamping patterns onto the visible neurons with appropriate probabilities. By doing so, the network can perform pattern completion. The state of Boltzmann network is described by the energy function (also called partition functions), from an analogy with thermodynamics.

..

..

.. .

...

t=1

2

visible units

..

..

..

..

...

... ....

...

...

3

4

Tf -1

Tf

Figure 2.5: Boltzmann chain.

HMM can be ”unfolded” in time to yield a trellis. A Boltzmann network with the same trellis topology, a Boltzmann chain, can be used to implement the same classification as the corresponding HMM [MacKay∗ , 1996]-see Figure 2.5. The discrete hidden states are grouped into vertical sets, fully interconnected by weights Aij . The discrete visible states are grouped into horizontal sets, and are fully interconnected with the hidden states by weights Bjk . In addition, boundary weights Πi1 = p1 , p2 , . . . , pN model an extra bias on the first hidden unit. Each configuration of units represents a state of energy [Saul∗ and Jordan, 1995] H[{il , jl }] = Πi1 −

L−1 X l=1

Ail il+1 −

L X l=1

Bil ,jl

(2.49)

2.6. HMM ENHANCEMENT

20

L where {il }L l=1 , {jl }l=1 is the sequence of states over the hidden, resp. visible units. The possibility to find the network in a particular configuration is given by

1 (−H/T ) e Z where T is the temperature, and the partition function P ({il , jl }) =

Z=

X

(2.50)

e(−H/T )

(2.51)

il ,jl

is the sum over states that normalizes the Boltzmann distribution (2.50). Comparing this to the HMM distributions, the correspondence between Boltzmann chain of Figure 2.5 at temperature T and the unfolded HMM implies Aij = T ln aij ,

Bij = T ln bij ,

Πi = T ln πi

(2.52)

Training the net with a single pattern, or a list of L visible states, consists of clamping the visible states and performing Boltzmann learning throughout full network, with the constraints that each of the rime shifted weights labelled by a particular Aij have the same numerical value [Duda et al., 2001].

2.5.5

Markov random fields

So far we have discussed the one-dimensional case. The current subsection is focused on the related two-dimensional generalization. That is, observations will be treated as two-dimensional sequences X(i, j). Such problems result in image processing and observation can be, for example, the gray levels of the image array pixels. Let us assume that we are given an array of observations X : X(i, j), i = 0, 1, . . . , Nx−1 , j = 0, 1, . . . , Ny−1 , and a corresponding array of states Ω : ωij , where each ωij can take one of M values [Theodoridis and Koutroumbas, 1999]. Our objective is, given the array of the observations, estimate the corresponding value of the state array Ω so that p(X|Ω)P (Ω) is maximum (Bayesian approach). The values of the element of Ω are assumed to be mutually dependent. We will assume that the range of this dependence is limited within a neighborhood. This leads us to Markov random fields (MRFs). Thus, for each (i, j) element of the array Ω a respective neighborhood Nij is defined so that ωij ∈ / Nij and ωij ∈ Nkl ⇐⇒ ωkl ∈ Nij . In words, the (i, j) element does not belong to its own set of neighbors, and, if ωij is a neighbor of ωkl , then ωkl is also a neighbor of ωij . The Markov property is then defined as ¯ ij ) = P (ωij |Nij ) P (ωij |Ω

(2.53)

¯ ij includes all the elements of Ω except the (i, j) one. The first order neighborhoods where Ω of a lattice point consists of its four neighbors (in the Manhattan metric), and the second order consists of its eight nearest neighbors [Luettgen∗ and Karl, 1994]. A good survey on this topic is provided by [Chellappa and Jain, 1993]. Unfortunately, two-dimensional case imposes limitations on the involvement of the computationally elegant dynamic programming technique. We will refer reader for detail learning procedure description to [Geman and Geman, 1984].

2.6

HMM enhancement

In literature there exists several types of modification of the HMM basic scheme. Most of these generalizations are based on graphical probabilistic models. In the next following paragraphs we will discuss some HMM extensions along with description of learning procedure enhancement.

2.6. HMM ENHANCEMENT

21

S

Figure 2.6: Markov Random Field.

2.6.1

Factorial HMM

In Factorial Hidden Markov Model (FHMM) the states are factored into multiple states variables and the model is therefore represented in a distributed manner [Ghahramani∗ and Jordan, 1997]. The comparison between classical structure and the propose extension is depicted in Figure 2.7 where probabilistic graph of those models specifying conditional relation for both models is shown. (1)

S t-1

(1)

S t+1

(2)

S t+1

(3)

St

(2)

(1)

(2)

S t-1

St

S t+1

S t-1

Yt-1

Yt

Yt+1

S t-1

St

S t+1

Yt-1

Yt

Yt+1

St

(3)

(a)

(3)

(b)

Figure 2.7: Difference between HMM and FHMM (a) An acyclic directed graph (ADG) specifying conditional independence relations for an HMM. Each node is conditionally independent from its nondescendant given its parents. (b)An ADG representing the conditional independence relations in a FHMM with S = 3 underlying Markov chains.

Using the independent relations (see Section 6.4.1), the joint probability for the sequence of states and observations can be factored in HMM as P ({St , Ot }) = P (S1 )P (O1 |S1 )

T Y

P (St |St−1 )P (Ot |St )

(2.54)

t=2

The HMM state representation can be expressed by letting the state be represented by a (1) (m) (M ) collection of state variables St = St , . . . , St , . . . , St , each of which can take K values. We refer to these models as factorial hidden Markov models, as the state space consists of the cross product of these state variables. Again, using independence relations, i.e., each state is a priori

2.6. HMM ENHANCEMENT

22

uncoupled from the other state variables, we get for the state probability M Y

P (St , St−1 ) =

(m)

P (St

(m)

|St−1 )

(2.55)

m=1

Due to the combinatorial nature of the hidden state representation, the Baum-Welch algorithm is intractable. As in other intractable systems, approximate inference can be carried out using Gibbs sampling or variational methods [Ghahramani∗ and Jordan, 1997].

2.6.2

Input-Output HMM

HMM provides a model of the unconditional density of the observation sequences. In certain problem domains such as learning with a teacher, some of the observations can be better thought of as inputs and the others as outputs. The goal, in these cases, is to model the conditional density of the output sequence given the input sequence. In machine learning terminology, unconditional density estimation is unsupervised while conditional density estimation is supervised. Several algorithms for learning in HMM that are conditioned on inputs have been recently presented in the literature [Bengio∗ and Frasconi, 1995], [Meila∗ and Jordan, 1996]. Given a sequence of input vectors {Xt }, the probabilistic model for an input-conditioned HMM is P ({St , Ot }|{Xt }) = P (S1 |X1 )P (O1 |S1 , X1 ) T Y

P (St |St−1 , Xt )P (Ot |St , Xt )

(2.56)

t=2

The model depends on the specification of P (Ot |St , Xt ) and P (St |St−1 , Xt ), which are conditioned both on a discrete state variable and on an input vector. The approach used in Bengio [Bengio∗ and Frasconi, 1995] Input-Output HMM suggests modeling P (St |St−1 , Xt ) as K separate neural networks as can be seen in Figure 2.8. This decomposition ensures that a valid probability transition matrix is defined at each point in input space if a sum-to-oneconstraint (e.g., softmax nonlinearity) is used on the output of these networks. current expected output, given past input sequence E[yt | u t ] 1

convex weighted sum

E[ y | x t-1 =1, t

P(x

1

P(x

t

ut

|

1

xt-1

)

delay convex weighted sum

yt-1

| u t-1 ) t-1 1

xt+1

yt

yt+1 HMM

x

xt+1

P( x | x t-1 =1, u ) t t

...

softmax

O

N

1

n

...

softmax

yt-1

N

yt

yt+1

n

ut-1 current input

x

xt-1

ut] O

current state distribution

ut

ut

ut+1 IOHMM

(a)

(b)

Figure 2.8: Input-Output HMM Architecture. (a) The proposed IOHMM architecture. (b) bottom:Bayesian network expressing conditional dependencies for an IOHMM, top: Bayesian network for a standard HMM.

2.6. HMM ENHANCEMENT

2.6.3

23

Hidden Markov decision trees

HMM decision trees is an interesting generalization of HMM [Jordan∗ et al., 1997b]. The architecture can be viewed as an HMM in which the state variable at each moment in time is factorized-see Figure 2.9. Or the resulting architecture can be seen as a probabilistic decision tree with Markovian dynamics linking the decision variables. How this probabilistic model would generate data at the first time step, t = 1 ? Given input X1 , the top node S11 can take on K values. This stochastically partitions X space into K decision regions. The next node down the hierarchy, S12 , subdivides each of these regions into K subregions, and so on. The output Y1 is generated from the input X1 , and the K-way decisions at each of the M hidden nodes. At the next time step, a similar procedure is used to generate data from the model, except that now each decision in the tree is dependent on the decision taken at that node in the previous time step. Thus, the ”hierarchical mixture of experts” architecture is generalized to include Markovian dynamics for the decisions [Jordan∗ and Jacobs, 1994]. X t-1

(1)

S t-1

Xt

X t+1

(1)

S t+1

(2)

S t+1

(3)

(1)

St

(2)

S t-1

(2)

St

(3)

(3)

S t-1

St

S t+1

Yt-1

Yt

Yt+1

Figure 2.9: Hidden Markov decision tree.

2.6.4

Fuzzy HMM

In the conventional hidden Markov model, the models are reestimated by an iterative procedure known as Baum-Welch method. The crisp approach is used, e.g., each sequence belong exclusively to a single state sequence (model) whereas in fuzzy scheme a observation can belong simultaneously to more than one model. We will introduce the fuzzy schemes in the HMM context based on fuzzy c-means clustering (FCM) [Tran∗ and Wagner, 1999]. Consider a generalised Q-function as follows Qm (U, Λ) =

X

um S (O) log(O, S|Λ)

(2.57)

all S

{um S}

where U = is a set of the membership functions, denoting the degree to which the observation sequence O belongs to the state sequence S and satisfies 0 ≤ uS (O) ≤ 1,

X

uS (O) = 1

(2.58)

all S

m ≤ 1 is a weighting exponent on each fuzzy membership uS (O) and is called the degree of fuzziness. As shown in [Tran∗ and Wagner, 1999], the generalised Q-function reduces to the Qfunction for the conventional HMM for m = 1, uS (O) = P (S|O, Λ). The task of this approach

2.6. HMM ENHANCEMENT

24

is to maximise the generalised Q-function in (2.57) on variables U and Λ. The minimization follows a scheme of FCM described in [Theodoridis and Koutroumbas, 1999].

2.6.5

HMM and neural nets

In order to overcome the limitations of HMMs, we shall look at the possibility of combining HMMs and NNs to form hybrid models that contain the expressive power of artificial neural networks with the sequential time series aspect of HMMs. The motivation why to combine two different approaches lies in several weaknesses of HMMs, namely [Boulard et al., 1991]: - Poor discrimination due to the training algorithm which maximizes likelihoods (see Section 2.4.1) instead of posterior probabilities. - A priori choice of model topology and statistical distributions, e.g., assuming that the emissions probabilities associated with the HMM states can be described as a mixture of multivariate Gaussian densities. - Assumptions that the state sequences are first-order Markov chains. - No contextual information is taken into account and the possible correlation of the successive feature vectors is overlooked. - For the sake of storage and computational efficiency, the need to treat certain processes that are not truly independent – see equation (6.30). To overcome these limitations, intensive research has been recently carried out suggesting that the combination of HMMs and NNs could solve most of the mentioned problems. The major advantage of NNs is that they provide discriminant-based learning (that is, models are trained to suppress incorrect classification as well as to accurately model each class separately). The main idea is to use multilayer perceptron (MLP) to compute the emission probabilities used in HMM systems [Boulard and Morgan, 1990] or [Baldi and Chauvin, 1996]. It was shown if each output unit of an MLP is associated with a particular state qk , it was possible to train MLP to generate probabilities like p(qk |on ) where on , a particular feature vector, is provided to its input.

Output Layer

Hidden Layer

Y

H

S HMM States Figure 2.10: Each state S of HMM is fully interconnected to the common hidden layer H, each unit in the hidden layer is fully interconnected to each output unit Y, which calculates the emission probability.

Probabilities like p(qk |on ) are generally refer to as Bayes probabilities or posterior probabilities and can be transformed into likelihoods for use as emission probabilities in HMMs by

2.6. HMM ENHANCEMENT

25

Bayes’ rule p(on |qk ) =

p(qk |on )p(on ) p(qk )

(2.59)

As shown in [Boulard and Morgan, 1990], the advantage of this hybrid HMM/MPL approach is the possibility to estimate the emission probabilities needed for the HMMs with better discriminant properties and without hypotheses about the statistical distributions of the data. One example of this hybrid architecture is shown in Figure 2.10. The schematic representation consists of [Baldi and Chauvin, 1996] (i) input layer, where at each time, all units are set to 0, except one that is set to 1, (ii) hidden layer with logistic transfer function and (iii) output layer with softmax functions.

q S1

S2

S3

SqT

Y1

Y2

Y3

YT

Figure 2.11: An auto-regressive model.

2.6.6

Auto-regressive HMMs

We will call the model in Figure 2.11 an auto-regressive HMM (confusingly, the term autoregressive HMM refers to two different models, the one being discussed here and the one in [Rabiner∗ , 1989], which is also called a linear predictive or hidden filter HMM, which looks like a regular HMM, but which uses a state-dependent auto-correlation function for the observation distribution). The AR-HMM model reduces the effect of the St ”bottleneck”, by allowing Yt−1 to help predict Yt as well; this often results in models with higher likelihood. If Y is continuous, one possibility for the cdf for Y is P (Yt = yt |St = i, Yt−1 = yt−1 ) = N (yt ; Ri yt−1 , Ci )

(2.60)

where Ri is the regression matrix given that St is in state i. This model is also known as a correlation HMM, conditionally Gaussian HMM switching regression model, switching Markov model, or switching regression model [Murphy∗ , 2002].

2.6.7

Buried Markov Models

Buried Markov models [Bilmes∗ , 1998] generalize auto-regressive HMMs by allowing non-linear dependencies between the observable nodes. Furthermore, the nature of the dependencies can change depending on the value of St : see Figure 2.12 for an example.

2.6.8

Mixed-memory Markov models

One of the simplest approaches to modelling sequential data is to construct a Markov model of order n − 1. These are often called n-gram models, e.g., when n = 2, we get a bigram model, and when n = 3, we get a trigram model.

2.6. HMM ENHANCEMENT

St-1

St

26

St+1 t

St-1

Y1

Y1

Y2

Y2

St

St+1 t

Figure 2.12: Buried Markov Models. Depending on the value of the hidden variables, St , the effective graph structure between the components of the observed variables, Yt , can change. Two different instantiations are shown. Based on Figure 2.6 of [Murphy∗ , 2002].

When Zt is a discrete random variable with many possible values (e.g., if Zt represents words), then there might not be enough data to reliably estimate P (Zt = k|Zt−1 = j, Zt−2 = i). A common approach is to create a mixture of lower-order Markov models: P (Zt |Zt−1 , Zt−2 ) = α3 (Zt−1 , Zt−2 )f (Zt |Zt−1 , Zt−2 )

(2.61)

+ α2 (Zt−1 , Zt−2 )f (Zt |Zt−1 ) + α1 (Zt−1 , Zt−2 )f (Zt ) where the α coefficients may optionally depend on the history, and f (·) is an arbitrary (conditional) probability distribution. This is the basis of the method called deleted interpolation [Jelinek, 1999]. As with mixtures of Gaussians, we can either represent mixtures of multinomials as a primitive cdf, or we can explicitly model the latent mixture variable as in Figure 2.13. Here Zt = 1 means use a unigram, Zt = 2 means use bi- and unigrams, and Zt = 3 means use tri-, bi- and uni-grams. Zt is called a switching parent, since it determines which of the other

Z t-3

Zt-2

Z t-1

St-3

St-2

St-1

Zt S t >=1

St

St >=2

Figure 2.13: A mixed-memory Markov model. The dashed arc from Zt to St means Zt is a switching parent: it is used to decide which of the other parents to use, either St or St−1 or both. The conditions under which each arc is active are shown for the St node only, to reduce clutter. The dotted arcs from St and St−1 are optional. Based on Figure 2.8 of [Murphy∗ , 2002].

2.6. HMM ENHANCEMENT

27

parents links are active, i.e., the effective topology of the graph changes depending on the value of the switch; the result is called a (dynamic) Bayesian multinet [Bilmes∗ , 2000]. Since Zt is hidden, the net effect is to use a mixture of all of these. The arcs from St model the fact that the coefficients can depend on the identity of the words in the window.

2.6.9

Coupled HMMs

In a coupled HMM [Saul∗ and Jordan, 1995], [Brand∗ , 1996] and [Rezek∗ et al., 2000] the hidden variables are assumed to interact locally with their neighbors. It is one of the possible extensions to multi-dimensional systems. We already mentioned multiple channel modelling in Section 2.6.5 (neural network) and in Section 2.6.2 (input-output models). Also, each hidden node has its own private observation. The current state is dependent on the states of its own chain S and that of the neighboring chain Y at the previous time-step. This is shown in Figure 2.14. 1

S1

1

S2

1

S3

Y1

1

Y2

1

Y3

S1

2

S2

2

S3

Y1

2

Y2

2

Y3

S1

3

S2

3

S3

3

Y2

3

Y3

Y1

1

2

2

3

3

Figure 2.14: A coupled HMM with 3 chains and with lag 1.

2.6.10

Hierarchical HMMs

The Hierarchical HMM (HHMM) [Fine et al., 1998] is an extension of the HMM that is designed to model domains with hierarchical structure and/or dependencies at multiple length/time scales. In an HHMM, the states of the stochastic automaton can emit single observations or strings of observations. Those that emit single observations are called production states, and those that emit strings are termed abstract states. The strings emitted by abstract states are themselves governed by sub-HHMMs, which can be called recursively. When the sub-HHMM is finished, control is returned to wherever it was called from; the calling context is memorized using a depth-limited stack.

2.6.11

Variable-duration (semi-Markov) HMMs

The self-arc on a state in an HMM defines a geometric distribution over waiting times. Specifically, the probability we remain in state i for exactly d steps is p(d) = (1 − p)pd−1 where p = A(i, i) is the self-loop probability. To allow for more general durations, one can use a

2.6. HMM ENHANCEMENT

28

semi-Markov model [Murphy∗ , 2002]. (It is called semi-Markov because to predict the next state, it is not sufficient to condition on the past state: we also need to know how long we have been in that state.) We can model this as a special kind of 2-level HHMM, as shown in Figure 2.15. The reason why there is no S to F arc is that the termination process is deterministic. In particular, the bottom level counts how long we have been in a certain state; when the counter expires (reaches 0), the F node turns on, the parent node S can change state, and the duration counter, S D , resets.

S1

S3

S2 F2

F1 D

F3

S1

D

S2

S3

Y1

Y2

Y3

D

Figure 2.15: A variable-duration HMM modelled as a 2-level HHMM. St represents the state, and StD represents how long we have been in that state (duration).

2.6.12

Segment models

The basic idea of the segment model [Ostendorf∗ et al., 1996] is that each HMM state can generate a sequence of observations instead of just a single observation. The difference from an HHMM is that the length of the sequence generated from state Si is explicitly determined by a random variable li , as opposed to being implicitly determined by when its sub-HHMM enters its end state. Furthermore, the total number of segments, τ , is a random variable. A schematic illustration of a segment model is depicted in Figure 2.16.

2.6.13

Markovian objects with acyclic structure

We would like to provide the reader some notes on [Hlavac and Schlesinger, 2002] where especially the authors in Lecture 8 treat the problem of HMM in a quite general in view of structural pattern recognition. Instead of expressing HMM in the form of the standard statistical model λ = (A, B, π) the model is described by the joint distribution p(O, Q) that fulfills the Markovian condition (2.9) resulting in the need of definition of one function p(q0 ) and T functions p(Oi , qi |qi−1 ) p(O, Q) = p(O1 , O2 , . . . , OT , q0 , q1 , . . . , qT ) = p(q0 )

T Y

p(Oi , qi |qi−1 )

(2.62)

i=1

Then the authors analyze all three problems of HMM theory mentioned in Appendix B.1 on this form of statistical model 2.62. They show that the first two Problems 1 and 2 (P (O|λ) computation and seeking the most probable state sequence) can be merged by suitable mathematical formulation into one task. Generalized matrix multiplications are proposed under an algebraic structure known as a semi-ring. The Problem 1 uses the generalized matrix product

2.6. HMM ENHANCEMENT

29

t l2

l1

Y1

Y2

...

...

S2

S1

Yl1

Yl1+1

...

St

...

YL1+l2

...

lt

YSli

...

YT

Figure 2.16: A schematic depiction of a segment model, from [Murphy∗ , 2002]. The Yt nodes are observed, the rest are hidden.

in the semi-ring (+, product) and the Problem 2 problem uses the semi-ring (max, product). Moreover it is also possible to solve the first two problems in a general framework where not all observations Oi are available and we are interested only in the values of some hidden states qi . In addition they cast more general look at the Problem 2 of seeking the most probable state sequences. They show that the standard formulation (B.5) is only a special case of a more general task of minimising the Bayesian risk with a concrete penalty function. Hence the important conclusion is that the Problem 2 can be solved under the classical Bayesian statistical decision making without the necessity of using dynamic programming which was introduced as Viterbi algorithm (B.6-B.12). Finally they give also more general insight on the last Problem 3 of learning model parameters. They formulate this problem as a task of supervised and unsupervised learning. In the first case both observation and state sequences are known while in the second case we know only the observation sequences. The supervised learning is solved under maximum likelihood framework. However, due to the well-known limitations of ML approach (the observation sequence must be random sample from, in our case, normal distribution (2.16) and must be iid) the general Minimax estimate of the statistical model p(O, Q) is proposed. Concerning the unsupervised learning that equals to our formulation of Problem 3, the task is developed only as a ML estimate. Besides, the above mentioned principles are extended also to Markovian objects with acyclic structure that were discussed into more detail in Section 2.5.1. The algorithm is divided into two steps: Recognition and Learning that resembles the two steps of E-M algorithm described in Section 2.4.1. It is interesting to note that the Recognition step uses the results of Problem 2 solved by minimising Bayesian risks.

2.6.14 Other enhancements

We briefly complete our review with techniques that have been developed in a theoretical rather than practical framework. It is possible to extend the HMM to a countably infinite number of hidden states. This concept is demonstrated by [Beal∗ and Ghahramani, 1995], where the models are called Infinite HMMs. By using the theory of Dirichlet processes, the infinitely many parameters are integrated out, leaving only three hyper-parameters that can be learned from data. It is also possible to allow the alphabet of emitted symbols to be infinite - for


example possible words in an English text. In [Thrun∗ et al., 1999] and in [Scott∗, 2002] the Monte Carlo HMM (MCHMM) is introduced. MCHMMs use samples and non-parametric density trees to represent probability distributions. This makes it possible to learn non-parametric HMMs with continuous state and observation spaces. MCHMMs can control the complexity of the state space during learning, in contrast to the classical discrete state space HMM, where the number of states must be determined in advance, before training.

So far we have analyzed first-order HMMs. In [Berchtold∗, 1999] the principle of the Double Chain Markov Model (DCMM) extends the HMM by using Markovian dependences of order greater than one. The DCMM combines two Markov chains, hence its name: an observed non-homogeneous Markov chain and a hidden homogeneous one whose state at each time t determines the transition matrix used in the visible process. Unfortunately, as the order of the chain increases, the number of independent parameters grows exponentially and rapidly becomes too large to be estimated efficiently. Therefore the Mixture Transition Model is introduced to approximate a high-order HMM with far fewer parameters than the fully parameterized model.

In [Laffery∗, 1995] Gibbs-Markov models are developed. Gibbs distributions are used to model state transitions and output generation, and parameter estimation is carried out using an EM algorithm whose M-step uses a generalized iterative scaling procedure. A new variant of the discrete HMM is proposed in [Choi et al., 1998]: the output distributions are estimated by state-dependent source quantizing modelling, and the output probability is weighted by the entropy of the feature parameters at each state. The model is used in speech processing tasks. Entropy seems to be a favorite topic in the speech processing community; in [Singer∗ and Warmuth, 1996] Entropy Based HMMs are proposed.
The authors developed new algorithms for HMM parameter estimation that use the relative entropy between two HMMs, thereby ensuring that the current parameter estimates stay "close" to the previous ones. A new learning method is proposed in [Arslan and Hansen, 1998] under a framework that is similar in its goal to maximum mutual information training; however, it requires less computation, and the convergence properties of maximum likelihood estimation are retained in the new formulation. Finally, Tran [Tran∗ and Wagner, 2000] suggests the so-called Frame-Level HMM. The alternative approach presented in this paper shows that the dependence of observations on states can be considered in the probability of each observation, in contrast to the classical discrete HMM, where it is considered in the probability of the whole observation sequence. The forward and backward variables are reduced to a new variable called the state distribution, and states are considered as a mixture at each time rather than as a sequence in time.

2.7 Applications

Neither the theory of HMMs nor their application to engineering problems is new. The basic theory was published in a series of classical papers by Baum and his colleagues [Baum and Petrie, 1966] in the late 1960s and early 1970s and was implemented for speech processing applications by Baker [Baker, 1975] and by Jelinek at IBM [Jelinek, 1976], [Jelinek, 1999]. Apart from those basic applications in speech processing, HMM theory has found its use in many fields such as molecular biology, robotics and handwriting recognition. In the following text we will try to give the reader at least the basic application areas:

- Speech processing. Among all fields, speech processing is where HMMs have the most successful tradition. The first employment of HMMs for speech recognition was proposed independently by Baker [Baker, 1975] and by the research group at IBM [Jelinek, 1976]. Since that time, an enormous number of successful examples of HMM implementation has been published. To


name just a few papers on particular problems in speech recognition: speaker clustering [M.J. Gales, 2000] or hybrid architectures [Baldi and Chauvin, 1996], [Boulard and Morgan, 1990], [Tran∗ and Wagner, 1999], [Bengio∗ and Frasconi, 1995], [Ghahramani∗ and Jordan, 1997].

- Molecular biology. In computational biology HMMs are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. A good survey of these much-discussed topics can be found in [Krogh and et al., 1994] or [Baldi and Brunak, 2000].

- Dynamic systems and robotics. In [Chang, 1994] HMMs are used for linear-switching models. Another interesting application is fault detection in dynamic systems [Smyth, 1994], applied to an antenna pointing system. The group of Yang [Yang et al., ] focuses on human-robot skill learning, including gesture recognition. A similar problem is solved in [Morimoto et al., 1996].

- Text recognition. Text recognition is a field of machine vision with many application areas. In [Aningbogu and Belaid, 1995] text recognition is built around first- and second-order HMMs. A good reference can also be found in [Vlontzos and Kung, 1992].

- Economy. Prediction of stock market indexes is a favorite topic of econometrics. Weigend [Weigend and Shi, 1998] introduced Hidden Markov Experts for predicting daily probability distributions of S&P500 returns. Another interesting task is to model share price development; one example, on thinly traded shares on the Johannesburg Stock Exchange, is given in [MacDonald and Zucchini, 2000].

- Image analysis. Image processing is by nature two-dimensional, which is where Markov random fields have been useful. A seminal paper that had an impact on the use of MRF modeling in image processing and analysis was that of [Geman and Geman, 1984]. But we can also find applications where the problem is solved using Markov chains [Aas et al., 1999]. Another interesting paper deals with the incorporation of HMM theory into the wavelet transform [Crouse∗, 1998]; a comparison with different wavelet-based denoising methods can be found in [Novak∗, 2002].

- Biological signal processing. In [Obermaier, 2001] HMMs are presented for the online classification of single-trial EEG data during imagination of a left or right hand movement. Other uses of HMMs applied to EEG multielectrode data analysis can be found in [Radons and Becker, 1994] or in [Penny∗ and Roberts, 1998], where HMMs detect changes in DC levels, correlation and frequency. In the area of Brain Computer Interfaces, [Panuccio∗ et al., 2002] proposed to use HMMs for the segmentation of cognitive tasks. [Rezek∗ et al., 2002] suggested using HMMs for the detection of sleep stages in EEG data. The same author [Rezek∗ and Roberts, 2002] used HMMs for heart-beat interval analysis and for the analysis of Cheyne-Stokes respiration data. Considering ECG processing, [Lepage∗ et al., 2001] and [Coast∗ et al., 1990] used HMMs for the detection of both the QRS complex and the P-wave. In [Cheng∗ and Chan, 1998] an approach to ECG classification is presented. Finally, Koski [Koski∗, 1996] describes how to model ECG signals using segmentation as a feature technique.

2.8 Which method?

In Section 2.4 we mentioned numerous learning techniques for HMM optimization. In this section we compare three of them: the classical maximum likelihood approach (Section 2.4.1) and the two Bayesian approaches of Section 2.4.4, namely maximum a posteriori estimation and the variational Bayes approach. Furthermore, we focus on the well-known


problem of the EM algorithm, its sensitivity to initial conditions. We will investigate four initialization methods: GMM, k-means, Viterbi and random initialization.

2.8.1 Evaluation criteria

We evaluate the quality of a derived HMM using the following criteria: (i) the likelihood of the training data computed from the converged HMM, (ii) the percentage difference of the HMM parameter values from those of the true HMM, (iii) a distance measure between two HMMs and (iv) the number of algorithm iterations. Next, we describe each of these criteria in more detail.

Data likelihood
The data likelihood measures the log likelihood of the data for a given HMM. We used equation (2.21) for its computation. Given the same data and the same HMM structure, the higher the data likelihood, the better the parameter configuration of the model.

Percentage difference in model parameter values
To evaluate the quality of a derived HMM, we compare the parameters of the derived model to those of the generating model. This is done in two steps. First, we establish the correspondence between states in the derived and the generating models by matching states that are closest in their emission definitions. In the second step, the percentage difference is computed for the corresponding emission parameter values and the transition probability values. Finally, the total percentage difference is a weighted sum over the HMM parameters π, A, B, where the weights correspond to the number of free parameters of π, A and B, respectively.

Distance measure
An important question associated with the comparison of the methods' performance is the following: given two HMMs, λ1 and λ2, what is a reasonable measure of the similarity of the two models? A key point here is the similarity criterion: for λ1 to be equivalent to λ2 means having the same statistical properties for the observation symbols. Thus it may happen that, in spite of a large percentage difference between the derived and the generating model, their distance measure is small, so that both models share similar statistical behavior.
We can define a distance measure D(λ1, λ2) between two Markov models λ1 and λ2 as [Rabiner∗, 1989]

$$D(\lambda_1, \lambda_2) = \frac{1}{T}\left[\log P(O^{(2)} \mid \lambda_1) - \log P(O^{(2)} \mid \lambda_2)\right] \tag{2.63}$$

where $O^{(2)} = O_1, O_2, \ldots, O_T$ is a sequence of observations generated by model λ2. Basically, (2.63) is a measure of how well model λ1 matches observations generated by model λ2, relative to how well model λ2 matches observations generated by itself. One of the problems with the distance measure (2.63) is that it is nonsymmetric. Hence a natural expression of this measure is the symmetrized version, namely

$$D_s(\lambda_1, \lambda_2) = \frac{D(\lambda_1, \lambda_2) + D(\lambda_2, \lambda_1)}{2} \tag{2.64}$$

Since every evaluation of (2.64) requires observation sequences to be generated from both models, we compute the distance measure over ten iterations to reflect its stochastic nature.
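As an illustration, the distance (2.63) and its symmetrized version (2.64) can be computed for Gaussian-emission HMMs by sampling a sequence from one model and scoring it under both with the scaled forward recursion. This is a hedged sketch: the helper names and the representation of a model as a tuple (π, A, µ, σ²) are our own, not part of the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(pi, A, mu, var, T):
    """Probabilistic walk through a Gaussian-emission HMM."""
    q = rng.choice(len(pi), p=pi)
    obs = np.empty(T)
    for t in range(T):
        obs[t] = rng.normal(mu[q], np.sqrt(var[q]))
        q = rng.choice(len(pi), p=A[q])
    return obs

def loglik(obs, pi, A, mu, var):
    """log P(O | lambda) via the scaled forward recursion."""
    B = np.exp(-0.5 * (obs[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    alpha, ll = pi * B[0], 0.0
    for t in range(len(obs)):
        if t:
            alpha = (alpha @ A) * B[t]
        c = alpha.sum()          # scaling factor keeps alpha from underflowing
        ll += np.log(c)
        alpha /= c
    return ll

def D(l1, l2, T=500):
    """Non-symmetric distance (2.63): score data drawn from l2 under both."""
    O2 = sample_hmm(*l2, T)
    return (loglik(O2, *l1) - loglik(O2, *l2)) / T

def Ds(l1, l2, T=500, n=10):
    """Symmetrized distance (2.64), averaged over n draws as in the text."""
    return float(np.mean([(D(l1, l2, T) + D(l2, l1, T)) / 2 for _ in range(n)]))
```

By construction D(λ, λ) = 0 for identical models, while for two clearly different models the measure is negative, since each model explains its own data better than the other does.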

Number of iterations
We compare the effectiveness of the HMM initializations in terms of the number of forward-backward algorithm iterations before model convergence. In variational learning, convergence is measured by the actual value of the KL divergence (2.48) and the estimation is terminated


when (2.48) no longer changes significantly. In the MAP case the stopping criterion is the absence of significant changes in the total log-likelihood (2.36). In all experiments, this threshold was set to ε = 1e−3.
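To close this section, the two-step percentage-difference criterion described above (state matching by closest emission means, then averaging the relative parameter differences) might be sketched as follows. The numbers and helper names are illustrative only, and zero-valued true parameters have to be excluded from the relative difference.

```python
import numpy as np

def match_states(mu_true, mu_est):
    """Step 1: match each true state to the unused estimated state
    whose emission mean is closest."""
    order, free = [], list(range(len(mu_est)))
    for m in mu_true:
        j = min(free, key=lambda k: abs(mu_est[k] - m))
        order.append(j)
        free.remove(j)
    return order

def pct_diff(true_vals, est_vals):
    """Step 2: mean absolute percentage difference of corresponding
    parameter values (zero-valued true parameters must be excluded)."""
    t, e = np.asarray(true_vals, float), np.asarray(est_vals, float)
    return 100.0 * np.mean(np.abs(e - t) / np.abs(t))

mu_true = [-2.0, 0.0, 2.0, 4.0]          # means of the generating model (2.65)
mu_est  = [4.2, -1.8, 1.9, 0.1]          # a converged model, states permuted
order = match_states(mu_true, mu_est)    # recover the state correspondence
mu_matched = [mu_est[j] for j in order]
d = pct_diff([-2.0, 2.0, 4.0], [-1.8, 1.9, 4.2])  # skipping the zero-mean state
```

In the thesis the per-parameter differences for π, A and B are then combined into a weighted sum; the weighting is omitted here for brevity.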

2.8.2 Initialization methods

In theory, the reestimation equations should give values of the HMM parameters corresponding to a local maximum of the likelihood function, of the posterior density, or to a local minimum of the KL divergence. A key question is therefore how to choose initial estimates of the HMM parameters so that the attained local optimum is the global one. Such initial estimates can be obtained in a number of ways, including random, k-means, GMM and Viterbi initialization. Next we present each of these initialization techniques in more detail.

Common setting
The settings of some HMM parameters remained the same across the different initialization methods. This is the case for the initial probabilities π and the transition matrix A, which were set randomly except for Viterbi initialization. The model priors in MAP and VAR were set to be as flat as possible. The hyper-parameters κi and ηij were fixed to 1, reflecting the fact that little is known a priori about the initial state and state transition probabilities. The observation mean had an associated Gaussian prior with the mean and covariance parameters set to the median and the squared range (the range being the difference between the maximum and minimum value) of the training data, respectively. The Wishart density parameters, namely the shape parameter ζ and the scale parameter V, were also the same for all methods. The value of ζ was set to half the dimensionality of the training data, while V was set to the total training covariance matrix scaled up by ζ. Practice has shown this to be the most robust initialization procedure; robust in the sense of being least sensitive to training data cluster shape and data range [Rezek∗ and Roberts, 2002].

Random initialization
All HMM parameters were randomly initialized (subject to the stochastic and the nonzero value constraints (2.18-2.20)).
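A minimal sketch of such a random initialization, assuming NumPy: rows are drawn uniformly, floored away from zero and renormalized so that the constraints (2.18)-(2.20) hold. The helper name is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_stochastic(n_rows, n_cols, floor=1e-3):
    """Random non-negative matrix whose rows sum to one and whose
    entries are bounded away from zero (constraints (2.18)-(2.20))."""
    M = rng.random((n_rows, n_cols)) + floor
    return M / M.sum(axis=1, keepdims=True)

A  = random_stochastic(4, 4)        # random transition matrix
pi = random_stochastic(1, 4)[0]     # random initial-state probabilities
```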
GMM and k-means initialization
The Gaussian observation model (2.16) was initialised by running a few iterations of the k-means algorithm or of Gaussian mixture maximum likelihood EM on the training data to obtain an initial clustering. In the classical approach, each HMM state mean was assigned the center of the corresponding mixture component, and the state variances were initialised to the cluster variances. In MAP and Bayesian learning, the hyper-parameters of the means µi were set as follows: (i) the posterior means ρi were assigned the centers of the mixture model and (ii) the means' posterior matrices Ui were all set equal to the covariance of the total training data. The hyper-parameters of the covariances Ci were set as follows: (i) the shape parameters ζi were set to the mixture weights multiplied by the length of the observation sequence T (equal to the expected counts per state) and (ii) the covariance matrices Vi were assigned the cluster variances.

Viterbi initialization
Viterbi initialization is quite similar to the Viterbi learning proposed in Section 2.4.3. It can reveal some information about the initial state probabilities π and the state transition probabilities A. Following a random model initialization, the set of training observation sequences is segmented into states. This segmentation is achieved by finding the optimum state sequence via the Viterbi


algorithm (Appendix B.1) and then backtracking along the optimal path. To get π and A we simply count the number of transitions from state i to state j and divide it by the number of transitions from state i to any state. The observation means and covariances are computed from the segmented clusters that belong to a particular state.
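The counting step described above might be sketched as follows (illustrative only; `init_from_paths` is a hypothetical helper, and a small floor `eps` guards against zero counts):

```python
import numpy as np

def init_from_paths(paths, obs_seqs, n_states, eps=1e-3):
    """Estimate pi, A and per-state Gaussian parameters from Viterbi
    state paths (one path per training observation sequence)."""
    pi = np.full(n_states, eps)
    A = np.full((n_states, n_states), eps)
    clusters = [[] for _ in range(n_states)]
    for path, obs in zip(paths, obs_seqs):
        pi[path[0]] += 1                      # count starting states
        for t in range(len(path) - 1):
            A[path[t], path[t + 1]] += 1      # count i -> j transitions
        for q, o in zip(path, obs):
            clusters[q].append(o)             # segment observations by state
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)         # divide by transitions out of i
    mu = np.array([np.mean(c) for c in clusters])
    var = np.array([np.var(c) for c in clusters])
    return pi, A, mu, var
```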

2.8.3 Data description

We constructed one HMM (2.65) for generating the data sets. The model is a fully connected model with four states. The first data set (Data Set I) consists of a single observation sequence of length T = 100, generated by a probabilistic walk through the HMM (Section 4.1). The second data set (Data Set II) contains five observation sequences of length T = 500. Since our data are generated by a probabilistic walk through the HMM, longer sequences provide more data for the estimation of the individual HMM parameters, increasing the chance that the model parameter and model structure estimates will be accurate.

$$A = \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix} \quad \pi = \begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{pmatrix} \quad B = \begin{pmatrix} \mu_1 = -2 & \sigma_1^2 = 0.5 \\ \mu_2 = 0 & \sigma_2^2 = 0.5 \\ \mu_3 = 2 & \sigma_3^2 = 0.5 \\ \mu_4 = 4 & \sigma_4^2 = 0.5 \end{pmatrix} \tag{2.65}$$

2.8.4 Results

To verify the effectiveness of the different initialization and learning methods, we performed ten runs in total (Nrun = 10) to obtain statistically significant results. In addition, we used a diagonal covariance matrix in all experiments and only one mixture component per state. The assumption of one mixture component per state is justified by the proof of Bicego [Bicego∗ et al., 2003], who showed the equivalence between HMMs with more than one Gaussian per state and HMMs with one Gaussian per state. Given an HMM λ with N states, where the emission probability of each state qi is a mixture of Mi Gaussians, there is another HMM λ' with $N' = \sum_{i=1}^{N} M_i$ states with only one Gaussian per state. The equivalence is understood in a likelihood sense, that is, P(O|λ) = P(O|λ'). Throughout all the following tables, the mean and the variance (in parentheses) of each criterion across the ten runs are shown. Concerning the experimental set-up, the maximum number of HMM iterations was CYC_HMM = 100; for k-means and GMM the maximum number of iterations was CYC_GMM = 100 as well. The convergence threshold was the same for all methods: ε = 1e−3.

Table 2.2: Methods comparison-Data Set I: Percentage difference.

              Classical         MAP               VAR
Random        102.2 (10.9)      150.2 (12.7)      65.8 (7.4)
k-means       80.6 (12.5)       113.6 (16.9)      65.7 (11.2)
GMM           82.9 (15.8)       117.3 (15.5)      62.9 (6.9)
Viterbi       117.6 (17.7)      131.3 (37.9)      62.4 (7.2)

It can be seen in Tables 2.2-2.3 that k-means and GMM initialization improve the final estimates, mainly in the second case (Table 2.3), where more data are available. VAR learning yields the best performance. The values of the total percentage difference, which vary between roughly 50% and 100%, may come as a slight surprise to the reader. As we discussed in the paragraph Distance measure, what matters most is statistical similarity: two models that differ considerably in their parameters may still be very close in terms of statistical behavior. We nevertheless think it is useful to include the parameter percentage differences, because we are interested in the relative changes across the methods rather than in the absolute values.

Table 2.3: Methods comparison-Data Set II: Percentage difference.

              Classical         MAP               VAR
Random        90.1 (17.7)       146.7 (23.4)      94.0 (10.7)
k-means       70.3 (8.0)        78.3 (8.8)        49.5 (11.3)
GMM           58.7 (10.9)       75.2 (15.8)       54.4 (7.6)
Viterbi       122.2 (34.4)      NA                55.1 (12.8)

Table 2.4: Methods comparison-Data Set I: Likelihood.

              Classical         MAP               VAR
Random        206.6 (6.1)       222.0 (4.9)       222.4 (3.4)
k-means       204.0 (5.5)       217.9 (5.1)       222.3 (3.5)
GMM           205.7 (4.3)       218.6 (4.6)       221.1 (3.4)
Viterbi       221.4 (7.5)       221.3 (3.8)       222.3 (3.5)

Table 2.5: Methods comparison-Data Set II: Likelihood.

              Classical         MAP               VAR
Random        5397.3 (23.7)     5521.0 (27.5)     5447.4 (18.9)
k-means       5382.0 (18.9)     5483.5 (23.5)     5448.7 (14.6)
GMM           5379.0 (16.5)     5482.7 (18.6)     5445.0 (16.8)
Viterbi       5507.1 (99.6)     NA                5457.9 (24)

We did not normalize the likelihood with respect to the cardinality of the training data set; therefore the likelihood values in Table 2.5 are much larger than those in Table 2.4. Again, both tables show improved performance when k-means or GMM initialization is used. It is also interesting to note that VAR learning is insensitive to initialization. Since the distance measure (2.64) is normalized, we can compare the effect of data sufficiency in Tables 2.6-2.7: the more data are presented to the learning methods, the smaller the distance measure becomes. Please note again that VAR learning is not very sensitive to the initialization method when sufficient data are available.

Table 2.6: Methods comparison-Data Set I: Distance measure.

              Classical           MAP                 VAR
Random        0.0171 (0.0154)     0.1452 (0.0492)     0.0865 (0.0372)
k-means       0.0150 (0.043)      0.1430 (0.0411)     0.0576 (0.0224)
GMM           0.0200 (0.0134)     0.0912 (0.0548)     0.0804 (0.0468)
Viterbi       0.0432 (0.0372)     0.0710 (0.039)      0.0452 (0.0326)

Table 2.7: Methods comparison-Data Set II: Distance measure.

              Classical           MAP                 VAR
Random        0.0071 (0.0048)     0.0039 (0.0065)     0.023 (0.0041)
k-means       0.0060 (0.0026)     0.0039 (0.0065)     0.0248 (0.0026)
GMM           0.0049 (0.0030)     0.0039 (0.0057)     0.0235 (0.0033)
Viterbi       0.0673 (0.0590)     NA                  0.0282 (0.0066)

Table 2.8: Method comparison-Data Set I-II: Number of iterations on the first/second data set.

              Classical     MAP       VAR
Random        36/51         10/24     50/37
k-means       45/75         33/31     36/36
GMM           48/80         43/25     18/34
Viterbi       17/29         7/NA      41/33

Surprisingly, in some cases (see for example the classical approach in Table 2.8) the number of iterations required for convergence was higher with an initialization method than without any sophisticated starting method. We therefore made a detailed inspection of the likelihood development for the Random and the k-means/GMM initialization techniques. When Random initialization is used, no further knowledge is given to the algorithm, and in most cases the likelihood converges quite quickly to the nearest local maximum, as can be seen in Figure 2.17(a). With a more elaborate initialization, the algorithm is expected to start near the global maximum, and on its way to the global maximum (or at least to a dominant local maximum) it sometimes takes a while to pass some minor local maximum, as shown in Figure 2.17(b). Last but not least, observe again that VAR learning produces a very similar number of iterations across Data Set II.

Since we use the Bayesian optimization approach, it is interesting to look at the development of the hyper-parameters which determine the HMM parameters λ = (π, A, B). We will especially investigate the development of the means and covariances of each state during variational Bayes learning. In Figure 2.18 the evolution of the means and covariances over 12 iterations is depicted. As we

(Figure 2.17 here: two likelihood-development curves, likelihood versus number of iterations, for observation sequence 4 / run 1 and observation sequence 1 / run 6.)

Figure 2.17: Sub-figure (a): Random initialization - the EM algorithm converges directly to the nearest local maximum. Sub-figure (b): GMM initialization - the algorithm first gets stuck in a minor local maximum and after a few iterations finally converges to the global (or a dominant local) maximum.


(Figure 2.18 here: for each of States 1-4, the mean cdf (left column) and the covariance cdf (right column), each showing the flat prior and the last posterior.)

Figure 2.18: Development of mean and variance cdf of the model (2.65) throughout variational Bayes learning. The total number of iterations was 12. The synthetic Data Set I was used.

mentioned in Section 2.8.2, the model priors were set to be as flat as possible. The mean priors, which follow the Normal cdf (2.32), are indeed very flat - the green (lighter grey) cdf in the first column of Figure 2.18. The covariance priors, which follow the Wishart cdf (2.32), were set to the total training covariance matrix; therefore the covariance cdf prior is the same for all states (the reader should not be confused by the different subfigure scales). As the training process continues, the refinement of the means and covariances can be observed. The final cdf is marked in red (darker grey). It is more difficult to display the Dirichlet distributions of the transition matrix A or of the initial probability π. Since we are working with the model (2.65), which has N = 4 states, the Dirichlet distributions are four-dimensional. To be able to illustrate at least some of the evolution of the Dirichlet distribution, we decided to show in Figure 2.19 the first two states of the initial probability π. At the beginning, the corresponding hyper-parameters κ1 and κ2 are set to be flat, i.e. κ1init = κ2init = 1. As with the means and covariances, the same refinement procedure can be seen in Figure 2.19. Even though the Dirichlet distribution is displayed in two dimensions, the final cdf is not a surface but a curve, due to the restriction (2.18): the points that satisfy (2.18) lie on the diagonal of the 2D mesh-grid x-y surface (the cyan - lighter grey - line connecting the [0, 1] and [1, 0] coordinates in Figure 2.19). It is interesting to note that after two steps, where the curve shape is linear, the curve


starts to take on a shape similar to that of the last curve. The final values of the initial-state hyper-parameters are κ1 = 1.6627 and κ2 = 1.0078, indicating that the observation sequence did not include any useful information about initial state 2, since only a very small difference of 7.8e−3 from the initial value κ2init = 1 has taken place.
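The behaviour of the Dirichlet hyper-parameters can be mimicked with the standard count-based variational update, in which the posterior hyper-parameter is the flat prior plus the expected number of sequences starting in each state. The responsibilities below are hypothetical round numbers chosen only to reproduce the qualitative picture, not the values reported above.

```python
import numpy as np

# Flat Dirichlet prior over the initial state (kappa_init = 1 for all states)
kappa_prior = np.ones(4)

# Hypothetical posterior responsibilities gamma_1(i) of the first observation:
# the sequence almost surely starts away from state 2.
gamma_1 = np.array([0.66, 0.008, 0.23, 0.102])

# Count-based update: add the expected starting counts to the prior
kappa_post = kappa_prior + gamma_1
# kappa_post[1] barely moves from its prior value of 1, signalling that the
# data carried almost no information about starting in state 2
```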

Figure 2.19: Development of the first two Dirichlet cdf states throughout variational Bayes learning. The total number of iterations was 38. The synthetic Data Set I was used.

Throughout all experiments we had to address the numerical problems of the classical and MAP approaches. Especially with the combination of MAP learning and Viterbi initialization, we could not proceed further, and therefore "not available" entries (NA) appear in Tables 2.2-2.8. In addition, we found no big differences in performance between updating the hyper-parameters of MAP learning during the training procedure and using them only for the initialization and keeping them fixed afterwards. As a result, instead of updating the hyper-parameters, the parameters λ were optimized directly - see Appendix B.2, where the reestimation formulas are provided. We follow the same approach in classification (Chapter 5). Apart from the main drawback of the EM algorithm, its sensitivity to initialization, the EM algorithm also led several times to meaningless parameter estimates when it converged to the boundary of the parameter space. There the likelihood is unbounded [McLachlan and Peel, 2000], and the computation had to be either restarted completely, or it was sufficient to reinitialize only the covariance diagonals. We discuss numerical stability and related problems in Appendix B.4.

2.8.5 Conclusion

We conclude that HMM initialization, mainly the k-means and GMM approaches, delivers better performance in terms of our evaluation criteria: percentage difference, data likelihood and distance measure. There are no big differences in performance between the k-means and GMM methods, but in terms of speed the k-means initialization is preferable. Viterbi initialization was outperformed by both GMM and k-means initialization. During the experiments we had to cope with the numerical instability of the EM algorithm, mainly in the classical and maximum a posteriori approaches. In some cases the MAP approach failed to find a local optimum even when the algorithm was restarted several times. No similar numerical problems occurred with the variational Bayes approach. We will also show in Chapter 5 that the variational Bayes approach is the fastest learning method. Concerning the MAP approach, its advantages when only sparse training data are available (i.e. our Data Set I) are argued in the literature [Gauvain∗ and Lee, 1994], [Rezek∗ et al., 2002]: it should give better estimates than, for example, the classical ML approach. However, we could not confirm these findings, as there were no apparent differences in performance compared to the other methods on Data Set I. To sum up, the most suitable approach for further work is variational Bayes, due to its numerical stability, partly due to its insensitivity to initialization given sufficient data, and finally due to its speed. We also prefer the k-means initialization method over GMM initialization because of its speed.

Chapter 3

Signal Preprocessing

Signal preprocessing plays a very important role for further analysis such as classification (Chapter 5) or clustering (Chapter 6). In this chapter we explore the three main problems which normally arise in signal processing tasks: characteristic points detection, baseline removal and signal denoising. All the mentioned methods will be presented within the wavelet framework. Therefore, in the first two sections we focus our attention on wavelet theory. Then a combination of the Hidden Markov Model approach with wavelet properties will be explained in Section 3.1.2. In addition, the application of the wavelet transform to characteristic points detection (Section 3.2), baseline removal (Section 3.3) and denoising (Section 3.4) will be demonstrated. We assume that the reader is familiar with the basic concepts of the wavelet transform and of wavelet packets; we present the fundamentals merely to fix terminology and notation conventions, and refer the reader to the excellent tutorials [Burrus, 1997] or [Daubechies, 1992].

3.1 Wavelet transform

The classical Fourier transform (FT) is not suitable if the signal has time-varying frequency content, i.e., if the signal is non-stationary. The FT only tells whether a certain frequency component exists or not; this information is independent of where in time the component appears. Any signal can instead be analysed using an alternative approach called the wavelet transform (WT). The WT analyses the signal at different frequencies with different resolutions: it is designed to give good time resolution and poor frequency resolution at high frequencies, and good frequency resolution and poor time resolution at low frequencies. This approach makes sense especially when the signal at hand has high-frequency components of short duration and low-frequency components of long duration, which is the case in most biological signals, mainly EEG, EMG and ECG. One example of wavelet analysis is shown in Figure 3.1.
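The claim that the FT discards timing information can be checked numerically: swapping the two halves of a signal is a circular shift, so the magnitude spectrum is unchanged even though the low- and high-frequency parts now occur at different times (a sketch, assuming NumPy; the frequencies are arbitrary):

```python
import numpy as np

n = np.arange(512)
low  = np.sin(2 * np.pi * 10 * n[:256] / 512)   # low-frequency segment
high = np.sin(2 * np.pi * 50 * n[:256] / 512)   # high-frequency segment

x1 = np.concatenate([low, high])   # low frequency first, then high
x2 = np.concatenate([high, low])   # same components in reversed time order

m1 = np.abs(np.fft.rfft(x1))
m2 = np.abs(np.fft.rfft(x2))
# m1 and m2 are identical: the FT magnitude cannot tell *when*
# each frequency occurred, only *whether* it occurred.
```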

3.1.1 Wavelet theory

The continuous wavelet transform (CWT) is defined as follows [Daubechies, 1992]:

$$C(a, b) = \int_{-\infty}^{\infty} f(t)\,\psi_{a,b}(t)\,dt \tag{3.1}$$

where ψa,b is a window function called the mother wavelet, a is a scale and b is a translation. The term wavelet means a small wave. The smallness refers to the condition that this (window) function is of finite length (compactly supported); the wave refers to the condition that the function is oscillatory. The term mother implies that the functions with different regions of support used in the transformation process are derived from one main function, or


Figure 3.1: At the top the ECG signal to be decomposed is shown. At the bottom the corresponding wavelet coefficients are depicted.

the mother wavelet. In other words, the mother wavelet is a prototype for generating the other window functions. The relation between scale and frequency is that low scales correspond to high frequencies and high scales to low frequencies. The illustration in Figure 3.2 is commonly used to explain how time and frequency resolutions should be interpreted. Every box in Figure 3.2 corresponds to a value of the wavelet transform in the time-scale plane. Note that the boxes have a certain non-zero area, which implies that the value at a particular point of the time-scale plane cannot be known. Regardless of the dimensions of the boxes, their areas are all the same and are determined by Heisenberg's inequality. In summary, the area of a box is fixed for each mother wavelet, whereas different windows or mother wavelets can result in different areas; however, all areas are lower bounded by 1/(4π). That is, we cannot reduce the areas of the boxes as much as we would like, due to Heisenberg's uncertainty principle. Regarding the practical aspects of wavelet implementation, we introduce the discrete wavelet transform (DWT). Here we have a discrete function f(n), and the DWT is given by [Burrus, 1997]:

$$C(a, b) = C(j, k) = \sum_{n \in \mathbb{Z}} f(n)\,\psi_{j,k}(n) \tag{3.2}$$

where ψj,k is a discrete wavelet defined as:

$$\psi_{j,k}(n) = 2^{-j/2}\,\psi(2^{-j} n - k) \tag{3.3}$$


Figure 3.2: The two basic wavelet operations, scale and translation, define a tiling of the scale-time plane.

The parameters a, b are defined in such a way that a = 2^j, b = 2^j k. The analysis under these constraints is sometimes called dyadic. The inverse transform is defined in a similar way as [Burrus, 1997]:

    f(n) = Σ_{j∈Z} Σ_{k∈Z} C(j, k) ψ_{j,k}(n)    (3.4)
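The analysis (3.2) and synthesis (3.4) steps can be sketched for the simplest dyadic case. The sketch below uses the Haar wavelet purely for self-containedness (an assumption; the detectors later in this chapter use spline and Daubechies wavelets):

```python
import math

def haar_dwt(signal):
    """One dyadic level: returns (approximation, detail) coefficients."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of one Haar level; perfect reconstruction by orthogonality."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([s * (a + d), s * (a - d)])
    return out

x = [4.0, 2.0, 5.0, 7.0]
a, d = haar_dwt(x)
y = haar_idwt(a, d)   # recovers x up to rounding
```

Iterating haar_dwt on the approximation branch yields the dyadic scales j = 1, 2, …; the constraint a = 2^j, b = 2^j k is what makes this recursion possible.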

3.1.2 Wavelet-domain Hidden Markov Models

The wavelet transform has several attractive properties that make it natural for signal and image processing. We call them the primary properties of the wavelet transform.

• Locality: Each wavelet coefficient is localized simultaneously in time and frequency.
• Multiresolution: Wavelet coefficients are compressed and dilated to analyze at a nested set of scales.
• Compression: The wavelet transform of real-world signals tends to be sparse.

The locality and multiresolution properties enable the wavelet transform to efficiently cope with a wide range of different signals, from high-frequency transients and edges to slowly varying harmonics. Another interesting property is the compression ability, which implies that the wavelet transforms of most signals are sparse, resulting in a large number of small coefficients and a small number of large coefficients. [Crouse∗, 1998] adopted a statistical approach to model dependencies between wavelet coefficients. This statistical model describes the following secondary properties:

• Clustering: If a particular wavelet coefficient is large/small, then adjacent coefficients are very likely to be large/small as well.
• Persistence: Large/small values of wavelet coefficients tend to propagate across scales.

These primary and secondary properties of the wavelet transform suggest a natural statistical model. Persistence suggests that wavelet coefficients can have strong dependencies across scale (vertically in Figure 3.3(b)), whereas clustering and locality suggest that coefficients can have strong dependencies within scale (horizontally in Figure 3.3(b)).


The problem is how to model the joint probability density function of all the wavelet coefficients f(w_i). A typical wavelet coefficient density or histogram is non-Gaussian: typically much more "peaky" at zero and heavy-tailed. The proposed stochastic model uses two components: probabilistic graphs and mixture densities.

Probabilistic graphs: To characterize the key dependencies between the wavelet coefficients, we introduce Markovian dependencies between the hidden state variables. These dependencies are described by a probabilistic graph or tree; see Figure 3.3(b). We will investigate three simple probabilistic graphs with the state-to-state connectivities in Figure 3.3(b). The independent mixture (IM) model leaves the state variables unconnected and hence ignores any inter-coefficient dependencies (Figure 3.3(a)). The Hidden Markov Chain model connects the state variables horizontally within each scale. The Hidden Markov Tree model connects the state variables vertically across scales. We will refer to these models collectively as wavelet-domain HMMs.

Mixture densities: To match the non-Gaussian nature of the wavelet coefficients, we model the marginal probability f(w_i) of each coefficient as a mixture density with a hidden state variable.

(a) The independent mixture model

(b) wavelet-domain HMMs

Figure 3.3: Statistical dependencies of wavelet transform. Sub-figure (a): The state variables are unconnected. Sub-figure (b): Horizontally connected states are represented by Hidden Markov Chain while vertically connected states are represented by Hidden Markov Tree.

The probabilistic model for an individual wavelet coefficient is depicted in Figure 3.4. The compression property of the wavelet transform states that most wavelet coefficients have small values and hence contain very little signal information. A few wavelet coefficients have large values that represent significant signal information. This property leads to the following simple model for an individual wavelet coefficient. Each coefficient is modelled as being in one of two states: "high", corresponding to a wavelet component containing significant contributions of signal energy, or "low", representing coefficients with little signal energy. If we associate with each state a probability density, say a high-variance zero-mean density for the "high" state and a low-variance zero-mean density for the "low" state, the result is a two-component mixture model for each wavelet coefficient.

Figure 3.4: Two-state, zero-mean Gaussian mixture model for a random variable W. In total, two mixture components, the "high" and "low" energy states, describe the statistical behavior of the wavelet coefficients.


As we see from Figure 3.4, the two-state, zero-mean mixture model is completely parameterized by the probability mass function of the state variable S, {p_S(1), 1 − p_S(1)}, and the variance corresponding to each state. In most applications of mixture models, the value of the coefficient W is observed, but the value of the state variable S is not; we say that S is hidden. In general, we can employ an M-state Gaussian mixture model for a random variable W. The pdf of W is then given, similarly to (2.16), by:

    f_W(w) = Σ_{m=1}^{M} p_S(m) f_W(w | S = m)    (3.5)
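The two-state instance of (3.5), with the zero means of Figure 3.4, can be evaluated directly. The priors and variances below are illustrative values only, not fitted parameters:

```python
import math

def gauss_pdf(w, var):
    """Zero-mean Gaussian density with variance var."""
    return math.exp(-w * w / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(w, priors, variances):
    """f_W(w) = sum_m p_S(m) f_W(w | S = m), as in (3.5)."""
    return sum(p * gauss_pdf(w, v) for p, v in zip(priors, variances))

# "low" state: frequent, small variance; "high" state: rare, large variance
priors, variances = [0.9, 0.1], [0.1, 4.0]
f0 = mixture_pdf(0.0, priors, variances)
f2 = mixture_pdf(2.0, priors, variances)
```

The resulting density is sharply peaked at zero yet heavy-tailed, matching the typical wavelet-coefficient histogram described above.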

Now we have prepared all the necessary framework to define the Hidden Markov Tree (HMT), which we construct by connecting state variables across scales. Using an M-state Gaussian mixture model for each wavelet coefficient W_i, the parameters of the HMT model are the following:

• p_{S_1}(m), the initial probability of the root node S_1 (similar to the initial state probability).
• p_{S_i|S_x}(m | S_x = r), the conditional probability that child S_i is in state m given that its parent S_x is in state r (analogous to the transition probability matrix A).
• µ_{i,m}, σ_{i,m}, the mean and variance, respectively, of the wavelet coefficient W_i given that S_i is in state m.

As in the case of classical HMMs, there are three canonical problems associated with HMTs: (i) Training: Given one or more sets of observed wavelet coefficients w_i, determine the HMT parameters that best characterize the wavelet coefficients. (ii) Likelihood determination: Given a fixed HMT with parameters, determine the likelihood of an observed set of wavelet coefficients w_i. (iii) State estimation: Given a fixed HMT with parameters, determine the most likely sequence of hidden states s_i. The framework of the EM algorithm is applied to these three problems in [Crouse∗, 1998].

3.2 ECG characteristic points detection

Characteristic points detection is difficult, not only because of the physiological variability of the QRS complex, but also because of the various types of noise that can be present in the ECG signal. Noise sources include muscle noise, artefacts due to electrode motion, power-line interference, baseline wander, and T waves with high-frequency characteristics similar to QRS complexes. As a preprocessing step, filtering can be used, but filtering can alter the signal and may require substantial computational overhead.

We give a brief survey of QRS wavelet detection algorithms here. In [Mallat and Zhong, 1992], the wavelet algorithm is used as a linear pre-processing tool for further detection by means of differentiation, time averaging and thresholding techniques. In [Yang et al., 1997] the WT is used for feature extraction for a neural network. This approach allows the classification of ECG signals (automatic ECG diagnosis). Moreover, in [Yang et al., 1997] all important points of the ECG signal are found; afterwards, the baseline is estimated using least-squares approximation. Finally, in [Lhotska∗ et al., 2003] and [Li et al., 1995] the wavelet transform is used as the main detection tool. The detection is performed both in the time and the wavelet domain.

Our approach is based on the work of [Mallat and Zhong, 1992] and [Li et al., 1995]. The basic idea stems from the properties of the function derivatives, which represent useful information about the analyzed signal in the time domain, i.e. the morphology of the signal. The algorithm developed to analyse the ECG and recognize the characteristic points uses WT detail components with different frequency contents; the characteristic points detection in time


has been performed on the first four details. We must construct an appropriate mother wavelet to reveal as much information as possible about the characteristic ECG complexes. From now on we denote by details the results of the following formulas, i.e. the j-th detail is denoted by W_{2^j} f(x). We use dyadic scales of the smoothing function, which we denote Θ_{2^j}(x) = (1/2^j) Θ(x/2^j). It follows that each of the detail derivatives W_{2^j} f(x) in Figure 3.5 can be described by the formula

    W_{2^j} f(x) = f ∗ (2^j (d/dx) Θ_{2^j})(x) = 2^j (d/dx)(f ∗ Θ_{2^j})(x)    (3.6)

where ∗ denotes convolution. We call the first derivative of the smoothing function Θ_{2^j}(x) the mother wavelet function and denote it by ψ_{2^j}(x). This allows us to rewrite formula (3.6) as

    W_{2^j} f(x) = (f ∗ ψ_{2^j})(x)    (3.7)

Table 3.1: Equivalent FIR filter coefficients and normalisation coefficients ξ_j.

    G                    [-2.0  2.0]
    H                    [0.125  0.375  0.375  0.125]
    ξ_j, j = 1, …, 5     {1.50, 1.12, 1.03, 1.01, 1}

Parameters of the smoothing function determine the parameters of a finite impulse response (FIR) filter, which is used to obtain all the details utilized for ECG signal analysis. In this work we use the cubic spline smoothing function and the quadratic spline wavelet function with compact support; for details see [Mallat and Zhong, 1992]. The smoothing function and the corresponding wavelet function are defined by their Fourier transforms

    Θ̂(ω) = (sin(ω/4) / (ω/4))^4    (3.8)

    ψ̂(ω) = iω (sin(ω/4) / (ω/4))^4    (3.9)

The parameters of the equivalent FIR filters for this pair of smoothing and wavelet functions are described in Table 3.1. All these FIR filter parameters are taken from [Mallat and Zhong, 1992]. The transfer functions H(ω) and G(ω) of the equivalent FIR filters are defined as

    H(ω) = exp(iω/2) (cos(ω/2))^3    (3.10)

    G(ω) = 4i exp(iω/2) sin(ω/2)    (3.11)

We obtain the j-th detail W_{2^j} of the analysed ECG signal by the following formulas, which implement a bank of dyadic filters:

    W_{2^{j+1}} f = (1/ξ_{j+1}) (S_{2^j} f ∗ G_j)    (3.12)

    S_{2^{j+1}} f = S_{2^j} f ∗ H_j    (3.13)

where S_{2^0} f is the input signal, ξ_j are the normalization coefficients, and G_j and H_j are the filters formed from the filters G and H in Table 3.1 by inserting 2^{j−1} zeros between the coefficients.
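The filter bank (3.12)-(3.13) can be sketched directly from the taps in Table 3.1. The zero-insertion indexing below (2^{j-1} − 1 zeros at stage j) is one common convention for this "algorithme a trous" and is an assumption here, as are the helper names:

```python
def convolve(x, h):
    """Full linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y

def insert_zeros(h, z):
    """Upsample a filter by inserting z zeros between successive taps."""
    out = []
    for i, c in enumerate(h):
        out.append(c)
        if i < len(h) - 1:
            out.extend([0.0] * z)
    return out

G = [-2.0, 2.0]
H = [0.125, 0.375, 0.375, 0.125]
XI = [1.50, 1.12, 1.03, 1.01, 1.0]          # normalisation coefficients, Table 3.1

def details(signal, levels):
    """Details W_{2^j} f, j = 1..levels, via the dyadic bank (3.12)-(3.13)."""
    s = list(signal)
    out = []
    for j in range(1, levels + 1):
        z = 2 ** (j - 1) - 1                # zeros inserted at stage j (assumed)
        out.append([c / XI[j - 1] for c in convolve(s, insert_zeros(G, z))])
        s = convolve(s, insert_zeros(H, z))
    return out

w = details([1.0] * 32, 4)                  # a constant signal has flat details
```

Since G sums to zero, the interior of every detail of a constant signal vanishes; only border samples of the full convolution are nonzero.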

The position of the P wave is selected as a zero cross-point between the two selected extremes. We select the offset of the P wave as a zero cross-point or the first local minimum after the second selected extreme. While the FIR filtering moves the details forward on the time axis, we have to subtract constants for sharp P localisation. These values are empirically set to ∆P1 = 7, ∆P2 = 10 and ∆P3 = 11 for the onset, position and offset of the P wave, respectively.


Figure 3.5: The input ECG signal S_{2^0}f and its first four details W_{2^1}f, …, W_{2^4}f, obtained by filtering with the equivalent FIR filters of the wavelet transform. Arrows illustrate detection of the QRS complex starting from the fourth detail.

Detection of the QRS, P and T complexes is performed using all four details of the processed signal, acquired by filtering with the equivalent FIR filters of the wavelet transform according to Figure 3.5. It can be seen that the procedure is a nested thresholding method, going from the fourth detail down to the first detail, which finally determines the QRS complex and the P and T complexes, respectively. More details can be found in [Lhotska∗ et al., 2003].

The search for the T wave is performed in the time window after the detected R wave. The width of this window is set to the value δT = 65 samples for the sampling frequency fs = 125 Hz. If the width of the time window exceeds two thirds of the time span between two neighbouring R waves, we change the width of the time window to two thirds of the R-R time span. In the time window we search for the pair of extremes with opposite signs that exceeds the threshold value εT = 0.0025 [-] and is the closest pair to the detected QRS complex. The pair signs have to be ordered [+ -] when the detected R wave is oriented to +∞, and [- +] otherwise. The onset of the T wave is detected as the zero cross-point or the first minimum that precedes the first extreme. The position of the T wave is selected as a zero cross-point between the two selected extremes. We select the offset of the T wave as a zero cross-point or the first local minimum after the second selected extreme.

One example of the algorithm output is depicted in Figure 3.6. The ECG signal is the second register in the MIT/BIH arrhythmia file 100.dat. Note that in this case all important points have been detected.

The wavelet transform has proven to be a good tool for ECG signal analysis. We can conclude from our experiments that the detection achieves a sufficiently high level of reliability of about 80-90 %. It detects the required values of the selected attributes in a few steps, which is important from the point of view of processing time. The extracted values are then used as input values for ECG modelling using HMM in Chapter 4 and for beat extraction in Chapter 6.
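The extreme-pair and zero-crossing rule used for the T (and P) waves can be sketched as below. The window bounds, threshold handling and the restriction to the [+ -] ordering are simplifying assumptions for illustration:

```python
def locate_peak(detail, start, end, eps):
    """Peak position as the zero crossing between a [+ -] pair of extremes
    exceeding eps inside detail[start:end] (sketch; the [- +] case is omitted)."""
    window = detail[start:end]
    i_max = max(range(len(window)), key=lambda i: window[i])
    i_min = min(range(len(window)), key=lambda i: window[i])
    if window[i_max] < eps or -window[i_min] < eps or i_max >= i_min:
        return None                       # no valid [+ -] pair in the window
    for k in range(i_max, i_min):
        if window[k] == 0.0 or window[k] * window[k + 1] < 0.0:
            return start + k              # first sign change between extremes
    return None

d = [0.0, 0.5, 1.0, 0.4, -0.6, -1.0, -0.2, 0.0]
peak = locate_peak(d, 0, len(d), 0.3)     # zero crossing between +1.0 and -1.0
```

Applied to a wavelet detail, the returned index marks the wave peak; the onset and offset rules of the text are analogous scans before the first and after the second extreme.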


3.3 Baseline wandering removal

One of the first stages in the preprocessing of ECG signals is the removal of baseline wandering. This interference appears in the acquisition stage due to different reasons: patient movement, breathing, physical exercise, etc. Baseline wandering can make the inspection of the ECG difficult. Moreover, in automatic inspection systems, other processing tasks such as wave detection, signal classification, etc. can be affected by it. Therefore, it is of great importance to reduce its effect as much as possible. A comparative study of some methods can be found in [Royo, 1998]. The proposed method is straightforward. The wavelet approximation of the signal is used to obtain a good approximation of the ECG baseline, since both are low-frequency signals. The resulting signal is then obtained by subtracting the approximation from the original signal.
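A minimal sketch of this subtraction scheme, using repeated Haar block averaging as a stand-in for the Daubechies 4 approximation actually used below (an assumption made so the sketch stays self-contained):

```python
import math

def coarse_baseline(signal, level):
    """Wavelet-style approximation: iterate the low-pass (averaging) branch
    `level` times, then hold each coarse sample over its 2**level support."""
    s = 1.0 / math.sqrt(2.0)
    a = list(signal)
    for _ in range(level):
        a = [s * (a[i] + a[i + 1]) for i in range(0, len(a) - 1, 2)]
    rescale = s ** level                   # undo the analysis gain
    baseline = []
    for v in a:
        baseline.extend([v * rescale] * (2 ** level))
    return baseline[:len(signal)]

def remove_baseline(signal, level):
    """Subtract the low-frequency approximation from the signal."""
    base = coarse_baseline(signal, level)
    return [x - b for x, b in zip(signal, base)]

flat = remove_baseline([5.0] * 16, 3)      # a constant offset is removed exactly
```

The choice of `level` is exactly the question addressed next: too low a level over-fits the signal itself, too high a level misses the wander.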



Figure 3.6: Detection of significant points of an ECG signal. Legend: o-complex onset, p-complex peak, f-complex offset.

The process described previously has been applied to signals with a synthetic baseline wander given by combined sinusoids. These sinusoids have been generated for a frequency range from 0.5 Hz to 5 Hz, in steps of 0.5 Hz. Both real signals from our database with baseline wandering and signals with the synthetic baseline wander described above have been tested. In both cases, the level of the approximation obtained is such that the error of the resulting signal, after baseline suppression, was found to be minimal. Nevertheless, to obtain good results, the level of the approximation, namely the degree of accuracy of the approximation, must be defined. Otherwise, there will be an over-fitting effect in the baseline approximation due to too low a level, as shown in Figure 3.7, or the opposite, a poor approximation due to too high a level. In this section a method to automatically ascertain the best level is presented [Cuesta∗ et al., 2000], which makes the process unsupervised. This method is based on measures of the resulting signal variance and on the spectrum energy dispersion of the approximation. The best level depends on the amplitude and main spectrum distribution of the baseline interference. In order to eliminate or reduce the baseline wandering, the approximations must have a narrow spectrum, as such interferences are usually almost pure sinusoids. Besides, the variance of the resulting signal should be as low as possible, since the approximations must not contain high-frequency components such as peaks following R peaks, and so the final signal must be flat. Once the level is obtained, the wavelet approximation is calculated and subtracted from the signal, so the baseline wander is greatly reduced. The whole process is therefore carried out without user intervention, which is an advantage over other traditional methods. For example, in Figure 3.8, a good baseline approximation using wavelets is shown.
In order to obtain the best approximations, several wavelets have been tested. After this experimental phase, Daubechies 4 wavelets, offering a lower level of approximations, have been



Figure 3.7: Over-fitting using the wavelet approximation with the Daubechies 4 wavelet. The level used is too low, so high-frequency features are also approximated.


Figure 3.8: Suitable approximation of the baseline using Daubechies 4 wavelet. The level used is correct.

finally applied. In Figures 3.9 and 3.10, featuring two kinds of results, the level calculations are shown. The functions plotted correspond to the error obtained after baseline wander removal, the variance, and the spectrum dispersion. It can be seen that the intersection point coincides with the level corresponding to the minimum error. In Figure 3.9, the intersection point corresponds to level 7. It can be seen that the dispersion of the approximations is higher before and after that point. The variance of the resulting signal also increases greatly after that point. The error, calculated from the difference between the ideal signal without baseline wandering and the signal obtained,



Figure 3.9: Level approximation detection using the variance and spectrum dispersion. The intersection point is approximated to the nearest integer to ascertain the level.

is also greater before and after that point, which shows that the proposed method is suitable for baseline removal. The question arising from this procedure is the choice of the approximation level range. Within this range, the variance and dispersion are calculated for each level and the intersection point is taken as a reasonable level, as was shown in Figures 3.9 and 3.10, where the range is r = <r1, r2> = <6, 12>. If the range is very wide, the computational cost increases; if it is too narrow, the optimal approximation level may not be found. As the signal with sampling


Figure 3.10: Level approximation detection using the variance and spectrum dispersion. The intersection point coincides with the minimum error.


frequency fs is decomposed, the frequency bandwidth is divided by two at each stage after downsampling. This results in a logarithmic set of bandwidths, as illustrated in Figure 3.11. The approximation cJ is computed by passing the signal through J low-pass filters; the resulting frequency band for the general case J is B = (0, π/2^J) in normalized frequency. Considering the sampling frequency fs = 250 Hz and, according to the Nyquist rule, a highest frequency component of 125 Hz (corresponding to π in the discrete frequency domain), we obtain for r1 = 6 and r2 = 12 the bandwidths B_{r1} ≈ (0, 2 Hz) and B_{r2} ≈ (0, 0.03 Hz), respectively. Hence we cover all baseline variations in the frequency range R = (0.03 Hz, 2 Hz). Since the baseline has a low-frequency character whose range, according to [Friesen et al., 1990], is covered by our proposed range R, it is sufficient to compute the approximation within r = <r1, r2> = <6, 12>. The presented method allows the off-line removal of baseline wandering without the need for preprocessing or baseline point selection, which makes its use easier than other methods. Besides, it is not affected by noise, because the details, where the noise is included, are not used in the calculations.
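The bandwidth bookkeeping above is easily verified numerically:

```python
def approximation_band_hz(fs, level):
    """Upper edge in Hz of the approximation band B after `level` dyadic stages."""
    return (fs / 2.0) / (2 ** level)

hi = approximation_band_hz(250.0, 6)       # about 2 Hz
lo = approximation_band_hz(250.0, 12)      # about 0.03 Hz
```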


Figure 3.11: Frequency bands for three stages of wavelet decomposition, J = 3; c_i is the approximation, d_i the detail.

Another advantage is the possibility of applying this approximation to a wide range of baseline wander fundamental frequencies, because the level is automatically adapted. The disadvantage of this method is that more approximations must be calculated to find the best one, although it is fast enough. Nevertheless, further work must be done to improve the results when the baseline bandwidth is wide, since the wavelet approximation covers only a narrow bandwidth. Synthetic baselines with a non-stationary frequency spectrum, such as sinc or chirp signals, should also be studied.

3.4 Noise Filtering

The wavelet denoising method, or nonlinear wavelet filtering, is based on taking the DWT of a signal, passing the transform through a threshold, which removes the coefficients below a certain value, and then taking the inverse DWT, as illustrated in Figure 3.12. The method is able to remove noise and achieve high compression ratios because of the "compressing" ability of the wavelet transform mentioned in Section 3.1.2. The DWT localises the most important spatial and frequency features of a regular signal or image in a limited number of wavelet coefficients. Moreover, the orthogonal transform of stationary white noise results in stationary white noise. This means that the expected noise energy is the same in all coefficients. If this energy is not too large, noise has a relatively small influence on the important large signal coefficients. These observations suggest that the small coefficients should be replaced by zero, because they


Figure 3.12: Basic denoising concept.

are dominated by noise and carry only a small amount of information. As we will see next, some kind of thresholding scheme is normally applied.

3.4.1 Basic concepts

We consider the following model of a discrete noisy signal [Donoho∗, 1995]:

    y(n) = f(n) + σ e(n),   n = 1, …, N    (3.14)

The vector y represents the noisy signal and f is an unknown, deterministic signal. We suppose that e is standard Gaussian white noise, e ~ N(0, 1). To reconstruct the original data, we use a wavelet representation: a simple non-redundant, orthogonal, discrete wavelet transform. This operation can be represented by an orthogonal matrix W. We consider the following transformation [Jansen∗ et al., 1997]:

    w = W y,   v = W f,   n = W e    (3.15)

We will use Donoho's "soft-thresholding" or "shrinking" function, shown in Figure 3.13: a wavelet coefficient w between −δ and δ is set to zero, while the others are shrunk in absolute value.


Figure 3.13: Comparison of the soft (full) and hard (dashed) thresholding functions.

The shrinking (soft-thresholding) operation can be represented as:

    S(j, k) = sgn(C_{j,k}) (|C_{j,k}| − δ)   if |C_{j,k}| ≥ δ
    S(j, k) = 0                              if |C_{j,k}| < δ

or, in matrix notation:

    w_δ = D_δ w    (3.16)

where

    D_δ = diag(d_ii)    (3.17)

with

    d_ii = 0              if |w_i| < δ
    d_ii = 1 − δ/|w_i|    otherwise
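Both thresholding rules are one-liners; applied coefficient-wise, they realise the diagonal operator D_δ of (3.16)-(3.17):

```python
def soft_threshold(w, delta):
    """Shrinkage (3.16): zero inside [-delta, delta], shrink the rest toward 0."""
    if abs(w) < delta:
        return 0.0
    return w - delta if w > 0 else w + delta

def hard_threshold(w, delta):
    """Keep-or-kill: zero the coefficients below the threshold, keep the rest."""
    return 0.0 if abs(w) < delta else w

coeffs = [-3.0, -0.5, 0.2, 1.5]
soft = [soft_threshold(c, 1.0) for c in coeffs]
hard = [hard_threshold(c, 1.0) for c in coeffs]
```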

Another possible thresholding function is the hard function, which is the usual process of setting to zero the elements whose absolute values are lower than the threshold; see Figure 3.13. The inverse transform gives the result:

    y_δ = W^{-1} w_δ    (3.18)

The overall operation can then be represented by:

    y_δ = A_δ y    (3.19)

where

    A_δ = W^{-1} D_δ W    (3.20)

We call A_δ the influence matrix. Through D_δ, it depends on the threshold value but also on the input signal y. A natural question arising from this procedure is how to choose the threshold. We will try to answer this question in the next section.

3.4.2 Which threshold method?

The choice of threshold should be optimal in a certain sense. If y_δ is the result of applying the threshold procedure to the wavelet coefficients of a signal y, and e_δ = y_δ − f is the noise of this result, then an often used criterion to measure the quality of this result is its signal-to-noise ratio (SNR):

    SNR = 10 log10( Σ_i f_i² / Σ_i e_{δ,i}² )    (3.21)

An optimal choice of δ should maximize the SNR. This is equivalent to minimizing the mean square error R(δ), also called the risk function in the literature:

    R(δ) = (1/N) Σ_{i=1}^{N} (y_{δ,i} − f_i)² = (1/N) ‖y_δ − f‖² = (1/N) ‖e_δ‖²    (3.22)

where we used the classical Euclidean vector norm. Because of the orthogonality of W, we can also compute R(δ) from the wavelet coefficients as:

    R(δ) = (1/N) ‖η_δ‖²    (3.23)
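The quality measures (3.21)-(3.22) in code, for a known clean reference signal:

```python
import math

def snr_db(clean, estimate):
    """SNR (3.21): 10 log10 of signal energy over residual-noise energy."""
    sig = sum(f * f for f in clean)
    err = sum((y - f) ** 2 for y, f in zip(estimate, clean))
    return 10.0 * math.log10(sig / err)

def risk(clean, estimate):
    """Mean square error R(delta) from (3.22)."""
    return sum((y - f) ** 2 for y, f in zip(estimate, clean)) / len(clean)

r = risk([3.0, 4.0], [3.0, 5.0])           # ||e||^2 / N = 1 / 2
s = snr_db([3.0, 4.0], [3.0, 5.0])         # 10 log10(25 / 1)
```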


where

    η_δ = w_δ − v = W e_δ    (3.24)

is the noise after the operation in the wavelet domain. However, because f is unknown, the function R(δ) is not computable and hence cannot be used to find an optimal δ. The optimal threshold has to be estimated.

VisuShrink

Donoho and Johnstone [Donoho∗, 1995] propose the "universal threshold" estimate

    δ = sqrt(2 ln N) σ̄    (3.25)

where σ̄ is an estimate of the noise level σ given by

    σ̄ = median(|C(j, k)|) / 0.6745    (3.26)
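Equations (3.25)-(3.26) in code; the median is conventionally taken over the finest-scale detail coefficients (an assumption here about which coefficients to feed in):

```python
import math

def mad_sigma(coeffs):
    """Robust noise-level estimate (3.26): median(|c|) / 0.6745."""
    a = sorted(abs(c) for c in coeffs)
    n = len(a)
    med = a[n // 2] if n % 2 else 0.5 * (a[n // 2 - 1] + a[n // 2])
    return med / 0.6745

def universal_threshold(coeffs, n_samples):
    """VisuShrink threshold (3.25): sqrt(2 ln N) * sigma_bar."""
    return math.sqrt(2.0 * math.log(n_samples)) * mad_sigma(coeffs)

sigma = mad_sigma([1.0, -1.0, 1.0, -1.0])
delta = universal_threshold([1.0, -1.0, 1.0, -1.0], 1024)
```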

They called the method VisuShrink in reference to the good visual quality of the reconstruction obtained by the simple "shrinkage" of wavelet coefficients. The resulting estimate y_δ is, with probability tending to 1 as N → ∞, at least as smooth as f.

SureShrink

In another paper [Donoho∗, 1991], Donoho proposed a threshold estimate based on Stein's Unbiased Estimate of Risk (SURE). The scheme is level dependent; this means that the shrinking function is applied at each scale j, j = 1, …, J. The estimated coefficients w_{δ,j} are then obtained based on the selected thresholds δ = [δ_1, …, δ_J]. Note that δ_j is the threshold for wavelet coefficients at scale j.

Now we briefly introduce the Stein estimator. Let f = (f_1, f_2, …, f_N) be an N-dimensional vector, and let y_i ~ N(f_i, 1) be a multivariate normal observation with that mean vector. Let f̂ = f̂(y) be a particular fixed estimator of f, written as f̂(y) = y + g(y). Stein introduced a method for estimating the loss ‖f̂ − f‖² in an unbiased fashion. Stein showed that when g(y) is weakly differentiable, then

    E_f ‖f̂ − f‖² = N + E_f { ‖g(y)‖² + 2 ∇_y · g(y) }    (3.27)

where ∇_y · g(y) = Σ_{i=1}^{N} ∂g_i/∂y_i. SURE is an unbiased estimator of the above risk (3.27), defined as

    SURE = N + ‖g(y)‖² + 2 ∇_y · g(y)    (3.28)

If we now consider equation (3.28), we can apply the result in the wavelet domain, where f̂(y) ≈ w_δ and f ≈ v, and if we use the soft threshold function w_δ = D_δ w we get:

    SURE(δ, w) = N − 2 #{i : |w_i| ≤ δ} + Σ_{i=1}^{N} min(|w_i|, δ)²    (3.29)

We can select the threshold δ as:

    δ_S = argmin_{δ ∈ {|w_1|, |w_2|, …, |w_N|}} SURE(δ, w)    (3.30)

Notice that we need not minimize (3.29) over a continuum of thresholds; the optimisation problem is computationally straightforward. Suppose, without any loss of generality, that the w_i have been reordered in order of increasing |w_i|. Then, on intervals of δ which lie between two consecutive values of |w_i|, SURE (3.29) is strictly increasing. Therefore, the minimising value δ_S is one of the data values |w_i|.
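The selection (3.29)-(3.30), assuming the coefficients have been rescaled to unit noise variance:

```python
def sure(delta, w):
    """Stein's unbiased risk estimate (3.29) for soft thresholding at delta."""
    n = len(w)
    n_small = sum(1 for wi in w if abs(wi) <= delta)
    return n - 2.0 * n_small + sum(min(abs(wi), delta) ** 2 for wi in w)

def sure_threshold(w):
    """Minimise SURE over the data values |w_i|, per (3.30)."""
    return min((abs(wi) for wi in w), key=lambda d: sure(d, w))

delta_s = sure_threshold([0.1, 0.2, 5.0])   # small coefficients are noise-like
```

The chosen threshold kills the two small (noise-like) coefficients while leaving the large one nearly intact.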


Comparison of denoising methods

In order to obtain the best performance of the proposed algorithm, we tested the combinations of VisuShrink and SureShrink with soft and hard thresholding. The results can be seen in Table 3.2. The first column gives the SNR of the noisy signal; the remaining columns give the gain of denoising, defined as: Gain = SNR(denoised signal) − SNR(noisy signal).

Table 3.2: Comparison of different denoising techniques using the Daubechies 4 wavelet and 4 levels of decomposition; the signal used is sel100 1 (MIT/BIH database, first lead).

    Noisy SNR [dB]   VisuShrink Hard   VisuShrink Soft   SureShrink Hard   SureShrink Soft
    10               +4.54             +7.73             +3.25             +4.23
    5                +6.30             +8.38             +3.28             +4.10
    0                +8.12             +9.32             +4.69             +5.12
    -5               +9.94             +9.84             +5.23             +5.58

It is clear from the results that the combination of VisuShrink with the soft-thresholding approach yields the best overall performance. Thus, this denoising policy is used in the next section.

3.4.3 Which wavelet?

Because we have a choice among an infinite set of basis functions, we may wish to find the best basis function for a given representation of a signal (3.14). A basis of adapted waveforms is the best basis for a given signal representation. The chosen basis carries substantial information about the signal, and if the basis description is efficient (that is, very few terms in the expansion are needed to represent the signal), then that signal information has been compressed [Wickerhauser∗, 1994]. For adapted waveform analysis, we seek a basis in which the coefficients, when rearranged in decreasing order, decrease as rapidly as possible. To measure rates of decrease, tools from classical harmonic analysis are used, including the calculation of information cost functions. Examples of such functions include the number of coefficients above a threshold, concentration, entropy, logarithm of energy, Gauss-Markov calculations, and the theoretical dimension of a sequence. We used the modified scheme proposed by [Wickerhauser∗, 1994], based on adaptive waveform denoising:

• Select the best basis Ψbest, where Ψbest is the mother wavelet.
• Put the coefficients of Ψbest into decreasing order of amplitude.
• Take only the N0 significant coefficients. We obtain N0 as the number of coefficients with amplitudes above some threshold; the threshold is selected by VisuShrink, described in the previous section.
• The algorithm consists of iterative decomposition and reconstruction, and its flow is controlled by two parameters: the fraction η and the percentage factor ξ. The behaviour of the wavelet coefficients during each iteration, depicted in Figure 3.15, suggests the following scheme for terminating the algorithm. After the first iteration, when the best basis has been selected, we take the maximum of the wavelet coefficients m1 = max(C1(i, j)) times the percentage factor ξ as the basis of the termination condition, m1 ξ.

The algorithm stops if the difference of two successive maxima satisfies m_{i−1} − m_i < m1 ξ. Further decompositions then yield no better results, as the remaining signal contains mostly noise.


Table 3.3: Control parameters of adapted wavelet denoising.

    Noise Level   Fraction η   Percentage ξ
    Low           0.7          0.25
    Medium        0.8          0.20
    High          0.9          0.15

If the condition is not fulfilled, we split the signal f into two parts c1 and r1, where c1 is reconstructed from the ηN0 largest coefficients and r1 is the remainder, reconstructed from the rest of the coefficients. We proceed by using r1, r2, … as the signal and iterating the splitting until the condition defined above is fulfilled. Finally, only the ck parts of the signal are superposed; the remainder is qualified as noise. The two parameters ξ and η are adjusted heuristically by feedback to get the cleanest-looking signal according to the different noise levels. The detection of the noise level will be discussed in Section 3.4.5. The set-up of the parameters is summarized in Table 3.3. For only one iteration, the algorithm reduces to standard wavelet packet analysis. We give up computational speed in exchange for denoising efficiency. The whole algorithm is depicted in Figure 3.14.
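A compact sketch of the iterative splitting follows. A fixed one-level Haar basis stands in for the best-basis search, and the keep-fraction η is applied directly instead of the VisuShrink-derived N0 (both simplifying assumptions):

```python
import math

def haar_fwd(x):
    """One-level Haar transform: concatenated (detail, approximation)."""
    s = 1.0 / math.sqrt(2.0)
    d = [s * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    a = [s * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return d + a

def haar_inv(c):
    """Inverse of haar_fwd (even-length input assumed)."""
    s = 1.0 / math.sqrt(2.0)
    h = len(c) // 2
    d, a = c[:h], c[h:]
    out = []
    for ai, di in zip(a, d):
        out.extend([s * (ai + di), s * (ai - di)])
    return out

def extract_coherent(signal, eta=0.8, xi=0.20, max_iter=10):
    """Peel off coherent parts c1, c2, ...; stop when two successive
    coefficient maxima differ by less than m1 * xi (rest is noise)."""
    residual = list(signal)
    coherent = [0.0] * len(signal)
    m1 = prev_max = None
    for _ in range(max_iter):
        c = haar_fwd(residual)
        cur_max = max(abs(v) for v in c)
        if m1 is None:
            m1 = cur_max
        elif prev_max - cur_max < m1 * xi:
            break
        prev_max = cur_max
        order = sorted(range(len(c)), key=lambda i: -abs(c[i]))
        keep = set(order[:max(1, int(eta * len(c)))])
        kept = [ci if i in keep else 0.0 for i, ci in enumerate(c)]
        rest = [0.0 if i in keep else ci for i, ci in enumerate(c)]
        coherent = [p + q for p, q in zip(coherent, haar_inv(kept))]
        residual = haar_inv(rest)
    return coherent, residual

x = [4.0, 2.0, 5.0, 7.0, 1.0, 3.0, 2.0, 2.0]
coh, res = extract_coherent(x)
```

By linearity of the transform, the coherent part and the residual always sum back to the input signal; the residual is what the scheme finally discards as noise.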

3.4.4 Wavelet-domain HMM denoising

In Section 3.1.2 we emphasised the combination of the wavelet transform with HMMs by applying statistical dependencies across vertical and horizontal scales. In our previous work [Novak∗, 2002] we applied this approach to image denoising. We will take advantage of this approach and use the HMT for signal denoising as well. We first fit an HMT model to the v_i from the noisy data and then use this model as a prior signal distribution to compute conditional mean estimates of the v_i given w_i. We begin by estimating the HMT parameters Θ for the signal wavelet coefficients using the noisy signal observation. This approach is called a Bayesian estimation procedure.

Figure 3.14: Scheme of adapted denoising.

Figure 3.15: The behaviour of the wavelet coefficients in the medium noise section of Figure 3.17. Only 3 iterations have been processed, where Daubechies wavelets with 6 to 8 vanishing moments have been applied (coif: Coiflet wavelet, sym: Symlet wavelet). The signal sel100 1 was used.

Figure 3.16: Bayesian wavelet-based denoising. The noisy signal w = v + n is decomposed (W), the HMT parameters are fitted by the EM algorithm, the coefficients are shrunk by Bayesian estimation, and the denoised signal is obtained by reconstruction (W⁻¹).

To use the HMT theory developed in Section 3.1.2, the key observation is that if the signal has a wavelet-domain HMM pdf, then the noisy signal has one as well. Adding independent zero-mean white Gaussian noise increases each mixture model variance σ²_{i,m} by σ̄² but leaves the other parameters unchanged. Hence we can obtain the signal wavelet model from the noisy signal by fitting an HMT to the noisy signal wavelet coefficients and then subtracting the added variance due to noise. If we denote the estimated mixture variance of the noisy wavelet coefficient at location i in the m-th state as γ²_{i,m}, then

σ²_{i,m} = (γ²_{i,m} - σ̄²)+  (3.31)

with (x)+ = x for x ≥ 0 and (x)+ = 0 for x < 0. The noise power σ̄² is estimated by (3.26).


Once we have trained the HMT, estimation of the true signal wavelet coefficients (denoising) is straightforward. The conditional mean estimate of v_i given w_i and the state m is:

E[V_i | W_i = w_i, S_i = m] = σ²_{i,m} / (σ̄² + σ²_{i,m}) · w_i  (3.32)

In other words, (3.32) is a generalisation of Wiener filtering (3.33): if the signal coefficient is i.i.d. Gaussian with zero mean and variance σ²_y, then the Wiener filter is simply a one-tap filter (it can be regarded as a linear rescaling of the measurement w_i):

E[V_i | W_i] = σ²_y / (σ̄² + σ²_y) · w_i  (3.33)

The EM algorithm also outputs the hidden state probabilities p(S_i | w, Θ), given the model and the observed wavelet coefficients. Using these state probabilities, we obtain conditional mean estimates for v_i via the chain rule for conditional expectation:

v̂_i = E[v_i | w, Θ] = Σ_m p(S_i = m | w, Θ) · σ²_{i,m} / (σ̄² + σ²_{i,m}) · w_i  (3.34)

Note that this is a weighted version of Wiener filtering, with the likelihood of each state as the weight. The final signal estimate (denoised signal) is computed as the inverse wavelet transform of these estimates of the signal wavelet coefficients. Figure 3.16 summarizes the proposed denoising scheme.
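Equations (3.31) and (3.34) amount to a posterior-weighted one-tap Wiener filter applied per coefficient. A minimal sketch (variable names are ours), assuming the EM step has already produced the noisy mixture variances γ²_{i,m} and the state posteriors p(S_i = m | w, Θ):

```python
def hmt_shrink(w, gamma2, post, noise_var):
    """Conditional-mean estimate of the signal wavelet coefficients.

    w         : list of noisy wavelet coefficients w_i
    gamma2    : gamma2[i][m], mixture variances fitted to the noisy coefficients
    post      : post[i][m], state posteriors p(S_i = m | w, Theta) from EM
    noise_var : estimated noise power (sigma-bar squared)
    """
    denoised = []
    for i, wi in enumerate(w):
        est = 0.0
        for m, p in enumerate(post[i]):
            # (3.31): subtract the noise variance, clipped at zero
            sig2 = max(gamma2[i][m] - noise_var, 0.0)
            # (3.34): posterior-weighted one-tap Wiener gain
            est += p * sig2 / (noise_var + sig2) * wi
        denoised.append(est)
    return denoised
```

The denoised signal is then the inverse wavelet transform of the returned coefficients. Note that with zero noise power the gain is 1 and the coefficients pass through unchanged.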

3.4.5 Detection of noise

In our case, an ECG signal of long duration, standard denoising methods fail because the variance of the noise varies over the whole signal, which invalidates the model (3.14). One possible solution is to divide the signal into l rectangular windows. The problem is the selection of l: if it is very high, the computational costs are high as well; on the contrary, if it is very low, we are drawn back to the invalidity of model (3.14). To overcome these limitations we assume that the first detail carries the largest quantity of information about the noise, and we obtain this detail using the Daubechies 4 wavelet. Afterwards, the energy function of the first detail is computed to amplify the noisy parts of the signal, and an "average" white noise variance is estimated by taking the median value of the wavelet coefficients at the finest scale [Donoho∗, 1995]. We used the estimate (3.14) as a threshold for the squared first detail; the amplitudes above the threshold define windows of variable length. In each window we compute the standard deviation and normalise it to three levels that reflect the amount of noise. An example of the detection is shown in Figure 3.17.
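The windowing scheme just described can be sketched as follows. This is a simplified stand-in, not the thesis code: the first difference of the signal replaces the Daubechies 4 first detail, and the level boundaries used for grading are illustrative assumptions.

```python
import math
from statistics import median, pstdev

def detect_noise_levels(x, min_len=8):
    """Split x into variable-length windows where the squared first
    'detail' exceeds a median-based threshold, then grade each window."""
    d = [x[i] - x[i - 1] for i in range(1, len(x))]       # crude first detail
    sigma = median(abs(v) for v in d) / 0.6745            # Donoho's noise estimate
    thr = (sigma * math.sqrt(2 * math.log(len(x)))) ** 2  # threshold on energy
    windows, start = [], None
    for i, v in enumerate(d):
        if v * v > thr:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                windows.append((start, i))
            start = None
    if start is not None:
        windows.append((start, len(d)))
    graded = []
    for a, b in windows:
        s = pstdev(x[a:b + 1])
        # illustrative three-level grading of the window's standard deviation
        level = "low" if s < sigma * 2 else "medium" if s < sigma * 4 else "high"
        graded.append((a, b, level))
    return graded
```

Each returned triple gives a window's start sample, end sample and noise grade, mirroring the low/medium/high labelling of Figure 3.17.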

3.4.6 Experimental set-up

For denoising the signal, we implemented the methods described in Sections 3.4.3 and 3.4.4. The wavelet transform, wavelet packet analysis and wavelet-domain HMMs were all used to compute wavelet coefficients. We evaluated our scheme on signals obtained from the MIT/BIH arrhythmia database. In this experiment we use the ECG signal 100 1 (the first lead). To simulate an ECG signal corrupted by noise, we generated three sources of zero-mean Gaussian noise with standard deviations corresponding to SNRs of 0, 5 and 10 dB, respectively. Afterwards, we split the signal into several variable-length sections and added the synthetic noise, where the order of the sources and the section lengths were determined by a uniform random distribution. We assume that the two sources with the lowest SNR simulate motion artefacts, while the last one represents muscle contractions, respiration and noise generated by the amplifier of the acquisition system. The final signal with different randomly generated sections of noise level is shown in Figure 3.17.

Figure 3.17: The detection of noise level; the different noise levels are graded as low, medium and high (signal 100 1).
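Corrupting a section at a prescribed SNR reduces to scaling the noise standard deviation from the section's power. A small sketch of this set-up (the helper names are ours, not from the thesis toolchain):

```python
import math, random

def snr_db(signal, noisy):
    """SNR of `noisy` relative to the clean `signal`, in dB."""
    p_sig = sum(s * s for s in signal) / len(signal)
    p_err = sum((n - s) ** 2 for n, s in zip(noisy, signal)) / len(signal)
    return 10 * math.log10(p_sig / p_err)

def add_noise(section, target_snr_db, rng=random):
    """Add zero-mean Gaussian noise so that the SNR is about target_snr_db."""
    p_sig = sum(s * s for s in section) / len(section)
    sigma = math.sqrt(p_sig / 10 ** (target_snr_db / 10))
    return [s + rng.gauss(0, sigma) for s in section]
```

The denoising gain reported later in Table 3.4 is then simply snr_db(clean, denoised) - snr_db(clean, noisy).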

3.4.7 Results

In order to measure the performance of the proposed techniques, we compared the obtained results with two different filters that are commonly used in classic ECG processing: an elliptic filter (cut-off frequency 32 Hz, three taps) [Sanjit, 1993] representing a linear method and a median filter [Oppenheim, 1989] (window size 7, 3 iterations) as a nonlinear technique. In the case of HMT denoising, we used M = 4 mixture components and the maximum number of decomposition levels, which depends on the signal length.

Table 3.4: Comparison of different denoising techniques (SNR of the noisy signal and gains in dB).

Noisy signal   HMT wavelet   Adapted wavelet   Adapted packets   Elliptic filter   Median filter
2.18           +4.23         +5.74             +6.78             +2.73             +4.74
5.54           +6.01         +4.53             +5.52             +0.29             +3.91
9.21           +4.23         +3.28             +4.36             -2.88             +2.46
14.39          +3.11         +3.27             +2.54             -7.77             +1.25

The results of the wavelet and filter denoising methods are summarized in Table 3.4. The first column gives the SNR of the noisy signal; the columns referring to the distinct filtering methods give the denoising gain, defined as Gain = SNR(denoised signal) - SNR(noisy signal). One example of output filtering is shown in Figure 3.18.

Figure 3.18: Example of denoised signals; the original signal 100 1 is solid (red) and the denoised signal is dashed (blue). The first panel shows the noisy signal; the remaining panels depict the outputs of wavelet HMT, adaptive wavelet, adaptive packet, median and elliptic filtering. Note that the cleanest-looking ECG signals are those filtered by HMT and adaptive wavelet denoising. The other methods do not filter noise and artefacts out so efficiently, mainly in the case of adaptive packets and the elliptic filter, where the artefacts remain clearly visible.

By investigating Table 3.4, some conclusions can be drawn in terms of method comparison [Cuesta∗ et al., 2000], [Novak∗, 2003]. The more the signal is corrupted by noise (the lower the SNR), the higher the achieved gain. The HMT denoising results are comparable to those of the adapted wavelet and wavelet packet approaches. In the former case we take advantage of the improved description of the wavelet dependency structure offered by the HMT (Figure 3.3), while in the latter case the denoising improvement comes from the iterative denoising scheme (Figure 3.14). In most cases, the HMT and adapted wavelet approaches outperform the classical methods such as elliptic and median filtering. From the point of view of visual inspection, the cleanest-looking ECG signal is yielded by HMT and adaptive wavelet denoising; the other methods do not filter noise and artefacts out so efficiently.

Chapter 4

Modelling concept

ECG signals are biomedical signals originating from the action of the human heart. The heart operates cyclically, pumping blood into the arterial system, which causes bioelectrical variations detectable on the human skin. The action of the heart causing the ECG signal is depicted in Figure 4.1, which shows one real cardiac cycle of an ECG signal. The whole ECG recording consists of several consecutive cardiac cycles. Within the cardiac cycle there are fairly periodic waves and peaks corresponding to the consecutive phases of the heart's action.

Figure 4.1: Representative cycle of an ECG signal, showing the P wave, QRS complex and T wave, the PQ and ST segments, and the R-R interval between consecutive R peaks.

A cardiac cycle of an ECG signal includes a QRS complex as its primary and most dominant subpattern. Before the QRS complex there is a P wave, and after the QRS complex a T wave, which is larger than the P wave. The flat segments between the above-mentioned components are the PQ, ST and TP segments. The RR interval, which is the delay between two consecutive QRS complexes, gives us useful information about the action of the heart. The most important parameters for cardiologists are the durations and the amplitudes of the above-mentioned subpatterns. In this part we will describe two slightly different approaches to ECG signal modelling using HMMs. The achieved results and conclusions will create an important foundation for further ECG analysis by HMMs, mainly in the next two chapters, where the classification and clustering schemes are proposed. The main motivation for the research on ECG modelling is the verification of results in the clustering phase. Moreover, in this section we will determine the HMM topology (the number of hidden states) by visual inspection of the artificial ECG signal generated from the HMM. As we saw in Section 2.8, the HMM learning algorithms are very sensitive to the initial conditions. In addition, taking into account that the more states an HMM has, the longer it takes to optimise its parameters, we present two different HMM structures: the left-to-right model and the full model. In this chapter we concentrate only on ECG modelling using the classical HMM approach, because during the experimental phase we found that there are no big differences among the different learning approaches, and the results can be interpreted in the same way as in the case of classical learning. Before we describe the left-to-right and full HMM modelling in more detail, we explain how the ECG signal is generated from the HMM model.

4.1 Generation of the ECG signal

In the case of ECG signal analysis, the input string is either the amplitude or a broken-line approximation of the ECG signal (4.3) achieved by the method described in Section 4.3.1. In the latter case, the first feature is the duration of the line segment, t_i, and the second feature is the amplitude of the start point of the line segment, h_i. We tested the construction of a hidden Markov model on the ECG signal 100 2 (the second lead) from the MIT/BIH arrhythmia database, whose length after segmentation was approximately 300 segments, T = 300. Therefore the input string can be described as follows:

O_i = (t_i, h_i)^T ∈ R², 1 ≤ i ≤ T  (4.2)
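Generating an artificial observation string from a trained model is a plain probabilistic walk: draw an initial state from π, emit from that state's output distribution, then transition according to A. A generic sketch for a Gaussian-output HMM follows (the parameter values used in the test are placeholders, not the trained ECG model):

```python
import random

def sample_hmm(pi, A, means, stds, T, rng=random):
    """Sample T observations from a Gaussian-output HMM.

    pi    : initial state distribution
    A     : row-stochastic transition matrix
    means : per-state output means
    stds  : per-state output standard deviations
    """
    def draw(dist):
        # inverse-CDF sampling from a discrete distribution
        u, acc = rng.random(), 0.0
        for k, p in enumerate(dist):
            acc += p
            if u < acc:
                return k
        return len(dist) - 1
    s = draw(pi)
    states, obs = [], []
    for _ in range(T):
        states.append(s)
        obs.append(rng.gauss(means[s], stds[s]))
        s = draw(A[s])
    return states, obs
```

For the two-feature string (4.2), the same walk would simply emit a pair (t_i, h_i) per state instead of a scalar.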

4.2 HMM ergodic model

The left-to-right model, here referred to as the ergodic model, assumes that one state is assigned to each wave or interval of the ECG waveform (see Figure 4.2); the states are connected in left-to-right order. This topology reflects the sequential activity of the cardiac conduction system. In the hidden Markov modelling approach, one observation sequence element is generated for each state transition, so that there is a one-to-one correspondence between the observation sequence and the underlying state sequence. Transitions from each state back to itself allow repetitions of model state symbols, so that the variable duration of each waveform segment can be accounted for.

Figure 4.2: Ergodic HMM model of an ECG cycle with 12 states and their transition probabilities. Each state has a physical interpretation: o = complex onset, p = complex peak, f = complex offset (e.g. Po, Pp, Pf for the P wave).

Figure 4.2 illustrates the representation of the Markov chain topology. The left-to-right model is unable to learn the structure without proper initialization. Therefore, we proposed the following technique, based on the detection of characteristic ECG points described in Section 3.2. As can be seen in Figure 4.2, each state represents one waveform segment.

Figure 4.3: Viterbi inference of an artificial non-interpolated ECG generated by the HMM. The top panel shows the artificial signal; the bottom panel shows the inferred state sequence.

Furthermore, as Figure 3.6 in Section 3.2 suggests, by taking the significant detected points over several beats, e.g. the average ST duration, the average ST amplitude and the ST variance, an appropriate starting initialization point on the likelihood surface is more likely to be found. The EM algorithm is used to obtain the final parameter estimates; it usually converges within five to ten iterations. Figure 4.3 depicts an automatically derived state sequence obtained using the Viterbi algorithm (for a further description of the Viterbi algorithm see Appendix B.1), represented as a step-like waveform (the bottom trace); the levels of the waveform steps have no significance other than to indicate different model states. The sequence corresponds to the best path through the state-time trellis representing the possible correspondence between the model states and the ECG data. Circular transitions from the final state back to the initial state enable the analysis of continuous ECG data.

Figure 4.4: An artificial interpolated ECG generated by the HMM with 12 states, compared with the original ECG signal.

Finally, Figure 4.4 shows the comparison between the original and an artificially generated beat. We have used the polyphase resampling method [IEEE, 1979] to adjust the artificially generated ECG data to the same time scale as the original signal.

4.3 HMM full model

In this section we will try to solve the initialization problem discussed in the previous section, where the characteristic points detection algorithm had to be used. Our approach is motivated by the work of Koski [Koski∗, 1996]. If we do not wish to use any sophisticated initialization method, we must expand the modelling capacity of the ECG model. The simplest way to improve the HMM modelling capacity is to increase the number of HMM states, but at the same time we must then cope with the time-consuming operation of parameter learning. A reasonable compromise is to use some kind of compression that preserves all the necessary information about the characteristic features of the ECG, e.g. its morphology. In the next section we describe a method for feature extraction which enables us to adjust signals of different lengths and at the same time compresses the signal.

4.3.1 Trace segmentation

Trace segmentation [Pieraccini, 1984] is originally a method used to reduce the length of a discrete signal without a great loss of information. Here, a modified approach called the non-uniform sampling method, proposed by [Cuesta∗, 1999], is applied. This procedure is based on detecting changes in the signal from the amplitude of its derivative. To perform this, an auxiliary signal is calculated, where each sample corresponds to the summation of all the previous sample derivatives of the original signal x(i):

x_TS(i) = |x(i) - x(i - 1)| + x_TS(i - 1)  (4.3)

Figure 4.5: Trace segmentation process. Entering the curve at multiples of L, the projection onto the horizontal axis gives the sampling points j0, j1, j2, . . .

Therefore, a rising curve is obtained (Figure 4.5), starting with zero amplitude and ending with the amplitude representing the global change in the signal. Finally, in order to ascertain the sampling points, the range of this curve is split into as many equal intervals as the number of samples desired. By obtaining the position corresponding to each interval, the indices of all the final samples in the original signal are obtained as well. When this procedure is used, the beats are approximated with the same number of samples, and then, for example, the Euclidean distance between the resulting vectors can be applied in further processing steps. An example of the algorithm output is shown in Figure 4.6.

Figure 4.6: Trace segmentation algorithm demonstration. The traced signal is dashed, the original one is solid.
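The non-uniform sampling of (4.3) and Figure 4.5 can be sketched as follows. This is our own minimal version, not the thesis implementation:

```python
def trace_segmentation(x, n_samples):
    """Resample x to n_samples points, dense where the signal changes fast.

    Builds the rising curve x_TS of cumulative absolute differences (4.3),
    then, for each of n_samples equally spaced levels of the curve's range,
    picks the first sample whose x_TS value reaches that level."""
    xts = [0.0]
    for i in range(1, len(x)):
        xts.append(xts[-1] + abs(x[i] - x[i - 1]))
    total = xts[-1]
    picks, j = [], 0
    for k in range(n_samples):
        level = total * k / (n_samples - 1)
        while j < len(xts) - 1 and xts[j] < level:
            j += 1
        picks.append(j)
    return picks, [x[j] for j in picks]
```

Flat stretches of the signal contribute nothing to the rising curve, so almost no sampling points fall there, while steep stretches (such as the QRS complex) attract many; this is exactly why all beats end up with the same number of samples while keeping their morphology.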

4.3.2 Full model: results

In contrast to the ergodic model, where the transition matrix A was composed of the main diagonal and one subdiagonal, the full model allows any connection between two states. First we trained an HMM with 15 states. After saturation (about 80 iterations in our tests) we obtained the model described in Figure 4.7. The log probability of the training data after training was 2098. The artificial signal produced by this model and the corresponding Viterbi path are shown in Figure 4.8. We remark that this model is unable to model the ECG cardiac cycle properly; some complexes, such as P waves, are not visible. Next we increased the number of states to 25. The trained model can be seen in Figure 4.9. The log probability of the training data after training was 2409, which is better than for the previous model. The artificial data produced by the model are shown in Figure 4.10.

Figure 4.7: HMM model of an ECG cycle with 15 states. Using Figure 4.8, the corresponding complexes can be assigned to some of the states, as in the case of the QRS complex, which is modelled by states 10, 11 and 3. In the circles, next to the state numbers, the mean values of each state are shown.

In fact, a cardiac cycle in the training data has about 20 segments: the heart beats at an interval of almost 1 s, the sampling frequency is 360 Hz and, on average, every 20th point enters the approximating broken line. Therefore, more than 20 states is in principle enough to model the whole ECG signal. Regarding the initialization method, the transition probabilities were chosen randomly and the observation probabilities were all set to the N(µ, σ²) distribution, where µ is the mean and σ² the variance over all the training data. Thus, the separation of the states for different observation vectors started after only a small number of iterations (5 in this case), as can be observed in Figure 4.11. The disadvantage of full-HMM modelling is the black-box problem. In contrast to the ergodic model, where each state has a direct physical meaning, here the HMM structure learns the data without any regard to its physical interpretation. However, it is very easy, at least in the ECG case, to analyze the trained models and find the corresponding ECG structures using the Viterbi algorithm. For example, in Figure 4.9 the complexes corresponding to each state are depicted. States 22, 21 and 16 model the QRS complex, and states 19, 7 and 13 correspond to the T wave. It is interesting to point out that we can find other state series which model the T complexes as well, as in the case of states 13, 24 and 25. The same trend can be observed for the P complexes (states 1 and 8, resp. 6, 9, 15, 17 and 18).

Figure 4.8: Viterbi inference of an artificial non-interpolated ECG generated by the HMM with 15 states.

We can ask ourselves whether the full HMM model could be used for QRS detection, as suggested in [Coast∗ et al., 1990]. Coast used the ergodic model to detect the QRS complex. However, that scheme is semi-automatic, because a manual segmentation of the ECG signal over a few beats must be performed by an operator to properly initialize the HMM. To overcome this major disadvantage, the full model might be used for QRS detection.

Figure 4.9: HMM model of an ECG cycle with 25 states. All important complexes could be labelled; the labelling of the states was derived using Figure 4.10.

Figure 4.10: Viterbi inference of an artificial non-interpolated ECG generated by the HMM with 25 states.

Before any detection takes place, some resampling or interpolation scheme must be applied to warp the artificial ECG signal onto the same scale as the original one. We have used an interpolation method described in [IEEE, 1979] and implemented in Matlab as resample.m. Nevertheless, we found that the interpolation introduces artefacts and, at the same time, due to the stochastic nature of HMM modelling, time lags are present in the generated ECG signal, as can be seen in Figure 4.12. To sum up, the idea of detecting QRS complexes with a full HMM model does not seem suitable.

Figure 4.11: Likelihood learning curve of the HMM model with N = 25 states throughout training. Stopping criterion: ε = 1e-5.

Figure 4.12: An artificial ECG generated by the HMM with 25 states, compared with the original signal. Note that the interpolation scheme introduces artefacts; also, a significant time phase-shift between the original and the artificially generated ECG signal is present.

4.4 Conclusion

We have modelled the ECG signal with HMMs and found that, with either a good initialization or a suitable number of states, we are able to model the ECG signal. The disadvantages of hidden Markov models are much like the usual counterarguments against black-box models: we do not actually know how the model has learnt the training data; the knowledge is hidden. We need to analyze the trained models carefully in order to fully trust them, as discussed in the previous section, where Figure 4.9 was used for detailed analysis. An artificial ECG generated from the model is a very good way to visually check the result of learning (see Figures 4.3, 4.8 and 4.10). In fact, we observed that some deterministic transition chains were formed within the models. We conclude that hidden Markov models are a suitable method for modelling the ECG signal.

Chapter 5

Classification concept

Automatic systems for the analysis and classification of the electrocardiogram have been developed since the early 1960s. We present an HMM classification framework and believe that it provides a good mathematical basis for solving the classification problem. The objective of this chapter is to develop HMMs for ECG classification. First the classification scheme and preprocessing steps are discussed briefly, then the classification results on both synthetic data sets and the MIT/BIH arrhythmia ECG database are summarized, and finally conclusions are drawn.

5.1 Method

Assume we have a set of K classes to be recognised (for example normal and pathological ECG beats) and that each class is to be modelled by a distinct HMM. Further assume that for each class in our data set we have a training set of Ltrain occurrences (e.g. ECG beats), where each occurrence constitutes an observation sequence and the observations are appropriate representations of the characteristics of the class. In order to perform classification, we must do the following:

1. For each class k in our data set, we must build an HMM λk, i.e., we must estimate the model parameters (A, B, π) that optimize the likelihood of the training set observation vectors for the k-th class.

2. For each unknown observation to be recognized, the processing shown in Figure 5.1 must be carried out. Namely, in the case of ECG processing, first the baseline wandering is removed (Section 3.3), second the noise is filtered out of the signal (Section 3.4), then QRS detection follows to isolate the beats (Section 3.2), and finally feature extraction is performed using trace segmentation (Section 4.3.1). The preprocessing block is followed by the calculation of the model likelihoods P(O|λk), 1 ≤ k ≤ K, for all possible models, and finally the class whose model likelihood is highest is chosen, i.e.

k* = argmax_{1≤k≤K} P(O|λk)  (5.1)

The probability computation step is performed using the Viterbi algorithm.
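The decision rule (5.1) only needs one likelihood per model. A compact sketch follows; note that where the text scores with the Viterbi algorithm on Gaussian-output models, this toy version uses the scaled forward algorithm on a discrete-output HMM, and all names are ours:

```python
import math

def log_forward(obs, pi, A, B):
    """Log-likelihood log P(O | lambda) of a discrete-output HMM
    via the scaled forward algorithm."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for t in range(1, len(obs) + 1):
        c = sum(alpha)              # scaling factor; log P accumulates log c_t
        loglik += math.log(c)
        alpha = [a / c for a in alpha]
        if t < len(obs):
            alpha = [sum(alpha[s] * A[s][j] for s in range(n)) * B[j][obs[t]]
                     for j in range(n)]
    return loglik

def classify(obs, models):
    """(5.1): pick the index of the model with the highest likelihood."""
    scores = [log_forward(obs, *m) for m in models]
    return max(range(len(models)), key=lambda k: scores[k])
```

Each model is a (pi, A, B) triple; the per-coefficient scaling keeps the computation numerically stable for long observation sequences.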

5.2 Results

In this section we summarize the results of our classification scheme. Before going directly to ECG processing, we evaluate the proposed framework on two different synthetic data sets, Data Set I and Data Set II. In order to test the classification accuracy of the method, we used the following testing procedure:

Figure 5.1: Scheme of ECG classification. The ECG signal passes through the preprocessing block (baseline removal, filtering, QRS detection, trace segmentation); the resulting observation is scored against the trained models λ1, . . . , λK and the class k* = argmax_k P(O|λk) is selected.

• Three training sets are generated according to three models, each corresponding to one of three classes.

• Three HMMs, one for each class, are trained.

• Three test sets are then generated from the same true models.

• The classification accuracy on these test sets is finally estimated.

For each model, the training set contained Ltrain = 1 sequence of length T = 100, resp. Ltrain = 5 sequences of length T = 500. The test sets were composed of 30 sequences, Ltest = 10 sequences for each model. To increase the statistical significance, the experiments were repeated Nrun = 10 times.

5.2.1 Synthetic data set

In this experiment, the HMM models used for the classes are those shown in (5.2)-(5.4), which differ only slightly in the transition matrix A and the observation matrix B. The last two models differ only in the transition matrix A, while the first two are almost the same: the only differences are in the variances of the Gaussians of the first two states (0.2 and 2.0 instead of 0.5) and in the mean vector. The third model is obtained from the second one by changing the transition matrix. In total, the three HMMs are quite similar, but, as we can see in Tables 5.1-5.2, the classification performance is very good. In more detail, we tested how the classification depends on the initialization and learning methods, following the experimental framework proposed in Section 2.8.2. Four different initialization methods were used: random, k-means, Gaussian mixture model (GMM) and Viterbi initialization. Data Set I consists of only one observation sequence of length T = 100, generated by a probabilistic walk through the HMMs (5.2)-(5.4); see Section 4.1. Data Set II contains five observation sequences of length T = 500. Regarding the experimental set-up, the maximum number of HMM iterations was CYC_HMM = 100; in the case of k-means


Table 5.1: Data Set I classification results. The means and, in parentheses, the variances are displayed, together with the number of iterations of the HMM learning algorithm; the syntax is {Mean (Variance)/Iterations}.

           Classical         MAP              VAR
Random     80 (16.3)/45      67.3 (8.1)/16    78.7 (12.1)/35
k-means    87.3 (9.9)/38     77.7 (13.5)/40   77.3 (12.5)/37
GMM        86 (8.9)/39       77.3 (14.3)/35   78.0 (12)/38
Viterbi    79.3 (19.3)/40    70 (9.2)/7       77.0 (12.5)/38

Table 5.2: Data Set II classification results. In addition, the total elapsed time of the classification scheme (training and testing included) is shown in minutes (the experiments were run on an AMD Athlon XP 1700, 1.47 GHz, 512 MB RAM, Matlab 6.5); the syntax is {Mean (Variance)/Iterations/TimeElapsed}.

           Classical            MAP                   VAR
Random     88.7 (9.8)/57/42     87.3 (14.1)/31/17     86.3 (9.0)/32/9.11
k-means    91.0 (7.0)/77/74     86.3 (7.4)/27/34      85.7 (9.3)/36/28.83
GMM        90.7 (6.8)/75/74     86.7 (8.6)/20/32.83   85.7 (8.6)/34/30.78
Viterbi    81.7 (15.5)/36/47    NA                    85.7 (8.9)/30/29

and GMM the maximum number of iterations was CYC_GMM = CYC_k-means = 100. The convergence threshold was the same for all methods: ε = 1e-3. A careful selection of the observation sequence length must be carried out. Naturally, if the length is long enough, the models have sufficient data to learn from, and the classification accuracy increases with the length of the observation sequence. On the other hand, we can improve the algorithm's performance using suitable initializations when the data are insufficient. Investigating Tables 5.1 and 5.2 confirms the above assumptions. On Data Set I, the performance increases when GMM or k-means initialization is applied. The best performance is achieved by the classical HMM approach, while the MAP and VAR methods give similar performance.

$$
A_1 = \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}, \quad
\pi_1 = \begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{pmatrix}, \quad
B_1: \; \mu = (-2, 0, 2, 4), \; \sigma^2 = (0.5, 0.5, 0.5, 0.5) \tag{5.2}
$$

$$
A_2 = \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}, \quad
\pi_2 = \begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{pmatrix}, \quad
B_2: \; \mu = (0, 0, -4, 4), \; \sigma^2 = (0.2, 2.0, 0.5, 0.5) \tag{5.3}
$$

$$
A_3 = \begin{pmatrix} 0.85 & 0.05 & 0.05 & 0.05 \\ 0.05 & 0.85 & 0.05 & 0.05 \\ 0.05 & 0.05 & 0.85 & 0.05 \\ 0.05 & 0.05 & 0.05 & 0.85 \end{pmatrix}, \quad
\pi_3 = \begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{pmatrix}, \quad
B_3: \; \mu = (0, 0, -4, 4), \; \sigma^2 = (0.2, 2.0, 0.5, 0.5) \tag{5.4}
$$
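Synthetic observation sequences from models such as (5.2)-(5.4) can be drawn with a few lines of code. The sketch below (names are illustrative) samples one sequence from the third, diagonally dominant model:

```python
import numpy as np

def sample_hmm(pi, A, mu, var, T, rng):
    """Draw one observation sequence of length T from a Gaussian-emission HMM."""
    N = len(pi)
    states = np.empty(T, dtype=int)
    states[0] = rng.choice(N, p=pi)
    for t in range(1, T):
        states[t] = rng.choice(N, p=A[states[t - 1]])
    # one Gaussian draw per time step, conditioned on the visited state
    obs = rng.normal(mu[states], np.sqrt(var[states]))
    return states, obs

rng = np.random.default_rng(0)
# Model 3 of Eq. (5.4): sticky transitions, uniform initial distribution
A3 = np.full((4, 4), 0.05) + 0.80 * np.eye(4)
pi3 = np.full(4, 0.25)
mu3 = np.array([0.0, 0.0, -4.0, 4.0])
var3 = np.array([0.2, 2.0, 0.5, 0.5])
states, obs = sample_hmm(pi3, A3, mu3, var3, T=200, rng=rng)
```

The sticky diagonal of A3 produces long runs in the same state, which is what makes this model harder to distinguish from model 2 by emissions alone.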

In addition, when Data Set II is concerned, even better performance is achieved. Comparing classical learning with k-means initialization, we obtain on average about a 3.7% improvement, and in the case of MAP and VAR learning the improvements are even more evident. Clearly, the classification rate improves overall when a sufficient amount of data is used.


Moreover, comparing the time complexity of the algorithms, MAP and VAR learning are much faster than the classical approach. Here we must again emphasise the numerical stability of VAR learning, in contrast to the classical approach, where many singularities occurred, and in contrast to MAP learning, where we were not able to compute the results using Viterbi initialization (hence the Not Available (NA) value in Table 5.2). Another interesting point is the number of iterations of the classical approach: on average it equals 74 for k-means and GMM initialization on Data Set II, which means that in many cases the maximum of 100 iterations was reached. This contrasts with VAR learning, where the average number of iterations never surpassed 40. As in Section 2.8.4, we can draw very similar conclusions: the most reliable algorithm in terms of performance, numerical stability and speed is the combination of VAR learning and k-means initialization. In order to compare performance, the next section includes both the classical and VAR approaches with k-means initialization, this time excluding MAP learning due to its numerical problems.

5.2.2 Electrocardiogram

ECG classification has always been a difficult task in the realization of computer-aided ECG diagnosis. Different patients exhibit large variations in the morphology of their ECG signals, so a classifier that performs well on a large set of testing data may still fail on an individual patient's ECG. The approaches in use include template matching [Trahanias and Skordalakis, 1989], artificial neural networks [Hu et al., 1997], [Silipo et al., 1997], linear discriminant analysis [Lepage∗ et al., 2001] and a classifier based on purpose-oriented feature extraction [Fujimora∗ and Kiyasu, 2001]. In this section we summarize the experimental results obtained when an HMM was applied as a classification tool. Regarding the experimental set-up, the ECG signal was first preprocessed as described in Section 5.1 and shown in Figure 5.1; this step is very important because QRS complex detection and noise elimination are all but prerequisites for automatic ECG classification. Regarding trace segmentation, we fixed the length of the observation sequence at T = 100. For classification evaluation we used different arrhythmia beats included in the MIT/BIH arrhythmia database. For each beat class we took Ltrain = 5 representatives to form the training sets; this number had proved sufficient in our previous experiments. As in the evaluation of the synthetic results, the test sets were composed of 30 sequences, Ltest = 10 sequences for each model. This time we ran the classification scheme over the same training sets only three times, Nrun = 3; this number is adequate for statistical processing of the results, since the standard deviations are quite small (the maximum value is 6.4% in Table 5.5). In each run a different start-up initialization point was picked and different training beats from the MIT/BIH arrhythmia database were selected. Having explored the structure of the MIT/BIH arrhythmia database, we finally decided to test at most K = 7 classes in total.
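Trace segmentation, used here to fix the observation length at T = 100, resamples a beat at points equally spaced along its cumulative amplitude path, so that fast-changing regions such as the QRS complex receive more samples than slowly varying ones. A minimal sketch of this standard formulation (the thesis implementation may differ in details):

```python
import numpy as np

def trace_segmentation(x, T=100):
    """Resample signal x to T points spaced equally along its cumulative
    trace length, so fast-changing parts (e.g. the QRS) get more samples."""
    d = np.abs(np.diff(x))
    trace = np.concatenate(([0.0], np.cumsum(d)))  # cumulative amplitude path
    # target positions: T equal steps along the total trace length
    targets = np.linspace(0.0, trace[-1], T)
    idx = np.searchsorted(trace, targets)
    idx = np.clip(idx, 0, len(x) - 1)
    return x[idx]
```

Applied to a beat of arbitrary length, this yields a fixed-length sequence directly usable as HMM observations.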
The largest number of arrhythmia classes is found in file 207, where 6 classes occur (in MIT/BIH arrhythmia notation: [L], [R], [A], [V], [!], [E]). We therefore believe that testing up to 7 different beat classes is sufficient. Examples of the classes used for classification are depicted in Figure 5.2. The syntax used in Tables 5.3-5.5 is the same as for the synthetic data evaluation: {Mean (Variance)/Iterations/TimeElapsed}. The experiments were run on an AMD Athlon XP 1700, 1.47 GHz, 512 MB RAM. Furthermore, we evaluated three data sets, each consisting of a particular number of classes, so as to assess classification performance on examples containing more or less similar classes. The first contains only three classes that are quite dissimilar from a morphological point of view (see Figure 5.2); the second and third sets include 5 and 7 classes, respectively. Below we describe each data set using the notation of Figure 5.2. The K = 3 data set consists of [Class1, Class2, Class4]. The final K = 5 classes

5.2. RESULTS

73

Figure 5.2: ECG classes used for the classification (seven panels, one example beat per class, Class 1-Class 7).

data set consists of [Class1, Class2, Class4, Class6, Class7], and the final K = 7 data set of [Class1, Class2, Class3, Class4, Class5, Class6, Class7]. Finally, we also address the question of HMM topology: how many states should the model have in order to yield the best results? As in Chapter 4, we test the classification performance on three different HMM topologies. By topology we mean the number of states, not the underlying graphical structure; the full model is used by default (see Section 4.3). To sum up the proposed experiments, we present three tables, one for each number of HMM states N = 10, 15, 25. In addition, each topology is tested on ECG data sets differing in the number of classes.

Table 5.3: ECG classification results using k-means initialization, N = 10.

N=10    Classical                VAR
K=3     100.0 (0.0)/46/6.3       100.0 (0.0)/20/1.57
K=5     100.0 (0.0)/48/11.16     99.3 (1.2)/19/3.6
K=7     76.2 (3.0)/40/14.59      87.1 (1.4)/20/5.35


The first data set (K = 3) contains quite dissimilar classes, so it is no surprise that N = 10 states are sufficient to describe the underlying beat structure. We will not consider the K = 3 case in the subsequent analysis, since its performance is already saturated. For the second data set (K = 5) the performance is also very satisfactory; nevertheless, we keep this data set for further analysis because the VAR method shows a slight standard deviation of 1.4%, and it will be interesting to see whether overtraining (as the number of states N increases) has any impact on performance. The last case (K = 7) includes all the classes; classes 6 and 7 are easily confused with class 3.

Table 5.4: ECG classification results using k-means initialization, N = 15.

N=15    Classical                VAR
K=5     71.3 (3.1)/70/24.70      89.3 (2.3)/20/4.81
K=7     77.6 (0.8)/72/34.69      91.4 (0.0)/19/7.89

When the number of states is increased to N = 15, numerical problems arise, mainly in the classical learning optimization. In both cases the average number of iterations is high (about 70), suggesting that in many runs the EM algorithm did not converge within 100 steps. The monotonicity property of the EM algorithm was also violated, so the classical approach had to be restarted many times; the VAR algorithm, on the other hand, remained stable. In terms of performance, no improvement can be observed in the classical case. In the VAR case a slight improvement can be noted for the third data set (K = 7) when the number of HMM states was increased.

Table 5.5: ECG classification results using k-means initialization, N = 25.

N=25    Classical                VAR
K=5     95.3 (6.4)/97/51.51      98.7 (2.3)/24/8.79
K=7     77.7 (5.7)/67/75         91.0 (0.8)/20/14.22

In the VAR case we achieved the best performance for both data sets, K = 5 and K = 7. Note the remarkable difference in speed between the two approaches: the VAR method is in general five or six times faster than the classical approach.
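The restart policy used for the classical runs above (abandoning an EM run whenever the log-likelihood decreases or becomes non-finite, then re-initializing) can be sketched as a generic wrapper; `em_step` and `init_params` are placeholders for the actual reestimation and initialization routines, not names from the thesis toolbox:

```python
import numpy as np

def run_em_with_restarts(em_step, init_params, max_iter=100, tol=1e-3,
                         max_restarts=5, rng=None):
    """Run an EM-style loop, restarting from a fresh initialization whenever
    the log-likelihood decreases or becomes non-finite."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(max_restarts):
        params = init_params(rng)
        prev_ll = -np.inf
        for it in range(max_iter):
            params, ll = em_step(params)
            if not np.isfinite(ll) or ll < prev_ll - 1e-8:
                break                          # monotonicity violated -> restart
            if ll - prev_ll < tol:
                return params, ll, it + 1      # converged
            prev_ll = ll
        else:
            return params, prev_ll, max_iter   # hit the iteration cap
    raise RuntimeError("EM failed to converge after all restarts")
```

Wrapping the reestimation step this way makes the iteration counts and restart counts reported in the tables easy to collect.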

5.3 Conclusion

We presented HMM-based classification on both synthetic data sets and automatic ECG classification. Regarding the synthetic data, the results agree with the conclusions of Section 2.8.4: in short, the more training data are available, the better the achievable classification rate. Again, the numerically most stable and fastest method turned out to be the combination of VAR learning and k-means initialization. For the ECG case, models representing various types of beat were trained using the MIT/BIH arrhythmia database. We found the HMM to be a very suitable method for classifying ECG signals. It produces probability values, which carry more information than a simple binary yes/no decision. This matters when a cardiologist receives two probabilities that are very close to each other: the models then do not discriminate the signal well and further examination is required. Similarly, if both probabilities are very low, we may have encountered a new kind of signal not previously found in the training data.
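The decision logic suggested here (reject when the two best scores are close, flag a possibly novel beat when all scores are low) can be sketched as follows; the `margin` and `floor` thresholds are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def decide(log_probs, margin=2.0, floor=-500.0):
    """Turn per-model log-likelihoods into a decision with two reject cases:
    'ambiguous' when the two best models are closer than `margin` nats,
    'novel' when even the best model scores below `floor`."""
    order = np.argsort(log_probs)[::-1]
    best, second = order[0], order[1]
    if log_probs[best] < floor:
        return "novel", None                       # all models fit poorly
    if log_probs[best] - log_probs[second] < margin:
        return "ambiguous", (int(best), int(second))  # needs expert review
    return "accept", int(best)
```

In practice the thresholds would be calibrated on held-out beats rather than fixed a priori.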

Chapter 6

Clustering concept

In this chapter we review past and current work in the area of time-series clustering. First, in Section 6.1, we describe the basic steps of the clustering process itself, such as feature selection, similarity measures and the criterion function, which are crucial for the subsequent clustering. Next, in Section 6.2, we explain three basic approaches to time-series unsupervised learning: (i) proximity-based methods, (ii) feature-based methods and (iii) model-based methods; we discuss each of these methodologies and give examples of applications that use them. Then, in Section 6.3.1, we focus on the difficult task of determining how many underlying clusters are present in a partition; we solve the problem using a Gaussian mixture model and several objective criteria. Finally, we describe two methods for the clustering itself: the first is based on partitional clustering (Section 6.3.3), and the second belongs to the family of hierarchical techniques (Section 6.3.4). We conclude in Section 6.4 that a model-based approach, particularly the HMM approach to time-series modelling, provides the best framework for biological signal clustering and model derivation.

6.1 Clustering methodology

Clustering is one of the most primitive mental activities of humans, used to handle the huge amount of information we receive every day. In that sense, it is something very natural for us. Clustering may be found under different names in different contexts, such as unsupervised learning and learning without a teacher (in pattern recognition), numerical taxonomy (in biology and ecology), typology (in the social sciences), and partition (in graph theory). The basic steps that must be followed in order to develop a clustering task are the following: feature extraction and selection, similarity measure, clustering criterion and algorithms [Duda et al., 2001]. The mathematical definition of clustering is as follows. Let $\mathcal{X}$ be our data set composed of vectors $x_i$, $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$. We define an $m$-clustering of $\mathcal{X}$ as the partition of $\mathcal{X}$ into $m$ sets (clusters) $C_1, \ldots, C_m$, such that the following three conditions are met:

• $C_i \neq \emptyset$, $i = 1, 2, \ldots, m$

• $\bigcup_{i=1}^{m} C_i = \mathcal{X}$

• $C_i \cap C_j = \emptyset$, $i \neq j$, $i, j = 1, \ldots, m$

Note that, under the preceding definition of clustering, each vector belongs to a single cluster. This type of clustering is sometimes called hard or crisp. An alternative definition of clustering in terms of fuzzy sets is the following: a fuzzy clustering into $m$ clusters is characterized by $m$ membership functions $u_j$,

$$u_j : \mathcal{X} \rightarrow [0, 1], \qquad j = 1, \ldots, m$$

and

$$\sum_{j=1}^{m} u_j(x_i) = 1, \qquad i = 1, 2, \ldots, N.$$
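The three hard-partition conditions can be checked mechanically on finite data sets; a small set-based sketch (for illustration only):

```python
def is_hard_partition(clusters, X):
    """Check the three m-clustering conditions: non-empty clusters,
    union equal to X, and pairwise disjointness."""
    if any(len(c) == 0 for c in clusters):
        return False                      # C_i must be non-empty
    union = set().union(*clusters)
    if union != set(X):
        return False                      # union of all C_i must equal X
    if sum(len(c) for c in clusters) != len(union):
        return False                      # clusters must be pairwise disjoint
    return True
```

The disjointness test works because overlapping clusters make the summed sizes exceed the size of the union.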

0 500 and the number of HMM states was N > 30. The only solution is to check the log-likelihood (B.4); if an overflow occurs, the whole training is restarted, obviously with a new initialization applied to the HMM coefficients. In the second case (ii) we check whether the minimum singular value of the covariance matrix is less than the pre-selected constant CheckCov = 2.2e−16. Small singular values indicate that the matrix is singular, and the covariance matrix is then reset to the value it had before the training procedure started.
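The covariance guard of case (ii) can be sketched as follows; `guard_covariance` is an illustrative name, while the threshold constant matches the text:

```python
import numpy as np

CHECK_COV = 2.2e-16   # threshold from the text (machine epsilon)

def guard_covariance(cov, cov_init):
    """Reset a (near-)singular covariance matrix to its pre-training value,
    mirroring the minimum-singular-value check described above."""
    if np.linalg.svd(cov, compute_uv=False).min() < CHECK_COV:
        return cov_init.copy()
    return cov
```

Such a guard would be called after each reestimation step, before the covariance is used in the next E-step.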

Appendix C

Distributions

Three distributions play an important role throughout the thesis, mainly in the parts where Bayesian learning techniques are described and applied: the Normal, the Dirichlet and the Wishart. Here we discuss only the last two, assuming the normal distribution to be sufficiently well known.

C.1 Dirichlet distribution

The Dirichlet distribution is the conjugate prior of the parameters of the multinomial distribution. The multinomial distribution is a discrete distribution which gives the probability of choosing a given collection of m items from a set of n items with repetition, the probability of each choice being given by $p_1, \ldots, p_N$; these probabilities are the parameters of the multinomial distribution. The probability density of the Dirichlet distribution for variables $p = (p_1, \ldots, p_N)$ with parameters $\kappa = (\kappa_1, \ldots, \kappa_N)$ is defined by

$$\mathrm{Dir}(p|\kappa) = Z(\kappa) \prod_{i=1}^{N} p_i^{\kappa_i - 1} \tag{C.1}$$

where $p_1, \ldots, p_N \geq 0$, $\sum_{i=1}^{N} p_i = 1$ and $\kappa_1, \ldots, \kappa_N > 0$. The parameters $\kappa_i$ can be interpreted as prior observation counts for events governed by $p_i$. In (2.30) the symbol $\propto$ is used for simplicity to indicate equality up to a constant term. The normalisation constant $Z(\kappa)$ is

$$Z(\kappa) = \frac{\Gamma\!\left(\sum_{i=1}^{N} \kappa_i\right)}{\prod_{i=1}^{N} \Gamma(\kappa_i)} \tag{C.2}$$
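Equations (C.1)-(C.2) are conveniently evaluated in the log domain via the log-gamma function; a small sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_logpdf(p, kappa):
    """log Dir(p | kappa) as in Eq. (C.1), with log Z(kappa) from Eq. (C.2)
    computed via gammaln for numerical stability."""
    p, kappa = np.asarray(p, float), np.asarray(kappa, float)
    log_Z = gammaln(kappa.sum()) - gammaln(kappa).sum()
    return log_Z + ((kappa - 1.0) * np.log(p)).sum()
```

For the uniform case κ = (1, ..., 1) the density is constant on the simplex, which gives an easy sanity check.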

where $\Gamma$ is the gamma function. Often we need to calculate some kind of point estimate. Since log-likelihood is used throughout the thesis as a measure of how well data fit a partition, mainly in the classification and clustering sections, a reasonable choice is the expectation $E(\log p_i)$. This expectation can be reduced to an expectation over a two-dimensional Dirichlet distribution [Gelman et al., 1995],

$$(p, 1-p) \sim \mathrm{Dir}(\kappa_i, \kappa_0 - \kappa_i) \tag{C.3}$$

where $\kappa_0 = \sum_{i=1}^{N} \kappa_i$. The expectation over a two-dimensional Dirichlet distribution is given by the integral

$$E(\log p_i) = \int_0^1 \frac{\Gamma(\kappa_0)}{\Gamma(\kappa_i)\,\Gamma(\kappa_0 - \kappa_i)} \, p^{\kappa_i - 1} (1-p)^{\kappa_0 - \kappa_i - 1} \log p \; dp \tag{C.4}$$
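The value of this integral can be checked by Monte Carlo against the digamma expression $\Psi(\kappa_i) - \Psi(\kappa_0)$ derived next; a sketch assuming NumPy/SciPy:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
kappa = np.array([2.0, 3.0, 5.0])
samples = rng.dirichlet(kappa, size=200_000)
mc = np.log(samples[:, 0]).mean()                 # Monte Carlo estimate of E[log p_1]
analytic = digamma(kappa[0]) - digamma(kappa.sum())
```

With 200,000 samples the two values agree to about three decimal places.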


This can be evaluated analytically to yield

$$E(\log p_i) = \Psi(\kappa_i) - \Psi(\kappa_0) \tag{C.5}$$

where $\Psi(x) = \frac{d}{dx} \log \Gamma(x)$ is known as the digamma function.

C.2 Wishart distribution

The Wishart distribution (named after its discoverer) plays a prominent role in the analysis of the estimated covariance matrix $C$. It is the multivariate analogue of the chi-squared distribution, and is related to the multivariate normal in the same way that the chi-squared is related to the univariate normal. The conjugate prior for a multivariate normal distribution $N(\mu, \Sigma)$ (2.16) with unknown mean and unknown variance is a normal-Wishart distribution $Wi(C|\zeta, \Sigma)$. Thus, if the data set $O(F \times T) \sim N(\mu, \Sigma)$, then the first parameter is $\zeta = F$ and the second, covariance, parameter equals $\Sigma$. Here $C$ is the covariance matrix estimated from the data set $O$,

$$C = \sum_{i=1}^{T} (O_i - \bar{O})(O_i - \bar{O})', \qquad \bar{O} = \frac{1}{T} \sum_{i=1}^{T} O_i.$$

The density of $Wi(C|\zeta, \Sigma)$ for a positive definite $C$ is given by

$$Wi(C|\zeta, \Sigma) = \frac{1}{Z(\zeta, T)} \, |C|^{\zeta - \frac{F+1}{2}} \, e^{-\mathrm{tr}(\Sigma C)} \tag{C.6}$$

where the constant term $Z(\zeta, T)$ is given by

$$Z(\zeta, T) = 2^{\frac{\zeta(T-1)}{2}} \, \pi^{\frac{F(F-1)}{4}} \, |\Sigma|^{\frac{T-1}{2}} \prod_{i=1}^{T} \Gamma\!\left(\frac{T-i}{2}\right) \tag{C.7}$$

The sample covariance matrix $C_S = \frac{C}{T-1}$, which is an estimator of the matrix $\Sigma$, has a Wishart distribution with $\zeta = T - 1$ degrees of freedom. To sum up, the Wishart distribution is a basic distribution in multivariate statistical analysis; it is the $F$-dimensional generalization of the 1-dimensional chi-squared distribution.
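The relation between the scatter matrix $C$ and $\Sigma$, namely $E[C] = (T-1)\Sigma$ for data drawn from $N(0, \Sigma)$, can be verified numerically; a sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
F, T, runs = 2, 50, 4000
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
L = np.linalg.cholesky(Sigma)

acc = np.zeros((F, F))
for _ in range(runs):
    O = rng.standard_normal((T, F)) @ L.T     # T draws from N(0, Sigma)
    Obar = O.mean(axis=0)
    acc += (O - Obar).T @ (O - Obar)          # scatter matrix as in the text
mean_C = acc / runs                           # should approach (T - 1) * Sigma
```

Averaging over a few thousand repetitions brings the empirical mean of the scatter matrix within a fraction of a percent of $(T-1)\Sigma$.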
