Tutorial – ICDAR 2011
Title: Discriminative Markovian Models for sequence recognition

Length: half day
Presenters: Thierry Artières (Professor), Alain Biem (Researcher)

Thierry Artières
Affiliation: Computer Science Lab (LIP6), Université Pierre et Marie Curie (UPMC), Paris, France
Tel: 33-1 44 27 72 20
Fax: 33-1 44 27 70 00
E-mail: [email protected]
Web: http://www-connex.lip6.fr/~artieres/
Address: LIP6, Université Pierre et Marie Curie (UPMC), 4 Place Jussieu, 75005, Paris, France

Resume
Thierry Artières earned a PhD at Paris Sud University in 1995. He joined the University of Cergy-Pontoise, France, as an Assistant Professor in 1996 and moved to the Computer Science Lab of Pierre et Marie Curie University in Paris, France (LIP6-UPMC) in 2001, where he has been a full Professor since 2007. His main research interests are statistical machine learning, sequence and signal processing, signal labeling, Hidden Markov Models and Conditional Random Fields, and hybrid systems. He belongs to the PASCAL European network of excellence on machine learning. He is author or co-author of about 50 international journal and conference papers. He was co-organizer of IWFHR 2006 at La Baule, France. He has served on a number of program committees and is a frequent reviewer for journals and conferences in data mining, pattern recognition and machine learning, document analysis, and handwriting recognition.

Selected bibliography (in relation with the tutorial)

1. Soullard Y., Artieres T., Hybrid HMM and HCRF model for sequence classification, European Symposium on Artificial Neural Networks (ESANN), 2011.
2. T.M.T. Do and T. Artieres, Neural Conditional Random Fields, International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
3. Trinh Minh Tri Do and Thierry Artières, Learning mixture models with support vector machines for sequence classification and segmentation, Pattern Recognition, 2009.
4. Rudy Sicard, Thierry Artières, Modelling sequences using pairwise relational features, Pattern Recognition, 2009.
5. T.M.T. Do and T. Artieres, Large Margin Training for Hidden Markov Models with Partially Observed States, International Conference on Machine Learning (ICML), 2009.
6. T.M.T. Do and T. Artieres, Maximum Margin Training of Gaussian HMMs for Handwriting Recognition, International Conference on Document Analysis and Recognition (ICDAR), 2009.
7. Volkmar Frinken, Tim Peter, Andreas Fischer, Horst Bunke, Trinh-Minh-Tri Do, and Thierry Artieres, Improved Handwriting Recognition by Combining Two Forms of Hidden Markov Models and a Recurrent Neural Network, International Conference on Computer Analysis of Images and Patterns (CAIP), 2009.
8. Trinh Minh Tri Do and Thierry Artières, Max-Margin Learning of Gaussian Mixtures with Sequential Minimal Optimization, International Conference on Frontiers in Handwriting Recognition (ICFHR), 2008.
9. Artières T., Marukatat S., Gallinari P., On-line Handwritten Shape Recognition using Segmental Hidden Markov Models, IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE Trans. PAMI), February 2007.
10. Do T.M.T., Artières T., Polynomial Conditional Random Fields for signal processing, European Conference on Artificial Intelligence (ECAI), poster, 2006.
11. Do T.M.T., Artières T., Conditional Random Fields for Online Handwriting Recognition, International Workshop on Frontiers in Handwriting Recognition (IWFHR), 2006.
12. Artières T., Gauthier N., Dorizzi B., Gallinari P., A Hidden Markov Models combination framework for Handwriting Recognition, International Journal on Document Analysis and Recognition (IJDAR), Vol. 5, No. 4, 2003.
13. Marukatat S., Sicard R., Artières T., Dorizzi B., Gallinari P., "A Flexible recognition engine for complex online handwritten character recognition", International Conference on Document Analysis and Recognition (ICDAR), 2003.
14. Artières T., Dorizzi B., Gallinari P., Li H., Marukatat S., "From character to sentences: a hybrid neuro-markovian system for on-line handwriting recognition", in "Hybrid Methods in Pattern Recognition", H. Bunke, A. Kandel (eds), World Scientific Publ., 2002.
15. Artières T., Marchand J-M., Gallinari P., Dorizzi B., "Multi-modal segmental models for on-line handwriting recognition", International Conference on Pattern Recognition (ICPR), Madrid, 2000.

Alain Biem
Affiliation: IBM T. J. Watson Research Center

Resume
Alain Biem received the "Diplôme d'ingénieur" in Telecommunications (MS EE) in 1991 and the PhD degree in Computer Science from the University of Paris VI, France, in 1997 with Summa Cum Laude honors. From 1992 to 1999, he was a Researcher at the Advanced Telecommunication Research (ATR) laboratories, Japan, where he was involved in research on signal processing and speech recognition, with a particular focus on HMM-based minimum-error training approaches. In 2000, he joined the IBM T. J. Watson Research Center in New York as research lead on discriminative training approaches to large-vocabulary multilingual cursive handwriting recognition and signature verification. Currently, he is participating in or leading research and development projects in areas including multimedia signal processing, machine learning, high-performance computing, and real-time data analytics, with applications to anomaly detection and prediction, cyber-security, healthcare, and economic sciences. Dr. Biem is an IEEE member and a frequent reviewer and technical committee member for leading pattern recognition journals and conferences.

Selected bibliography (in relation with the tutorial)

1. M. Adankon, M. Cheriet, A. Biem, "Semisupervised Least Squares Support Vector Machine", IEEE Transactions on Neural Networks, Vol. 20, No. 12, December 2009.
2. A. Biem, "Minimum Classification Error Training for On-line Handwriting Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 28, No. 7, pp. 1041-1051, July 2006.
3. R. Nopsuwanchai, A. Biem, W. Clocksin, "Off-line Thai Handwriting Recognition Using Discriminative Training and Maximization of Mutual Information", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 28, No. 8, pp. 1347-1351, August 2006.
4. R. Nopsuwanchai, A. Biem, "Discriminative training of tied mixture density HMMs for online handwritten digit recognition", Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, April 6-10, 2003, pp. 817-820.
5. A. Biem, S. Katagiri, B.-H. Juang, E. McDermott, "Discriminative Feature Extraction Applied To Filter-Bank Design", IEEE Transactions on Speech and Audio Processing, February 2000.
6. F. Russell, J. Hu, A. Biem, A. Heilper, D. Markman, "Dynamic Signature Verification Using Discriminative Training", Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 1260-1264.
7. Jin-Young Ha, Mina Park, Alain Biem, "A Study of Various Model Selection Criteria for HMM Topology Optimization", Proceedings of the International Conference on Graphonomics, 2003.
8. A. Biem, "A model selection criterion for classification: application to HMM topology optimization", Proceedings of the Seventh International Conference on Document Analysis and Recognition, Aug. 3-6, 2003, pp. 104-108.
9. A. Biem, "Optimizing Features and Models Using the Minimum Classification Error Criterion", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2003.
10. A. Biem, "Minimum classification error training for online handwritten word recognition", Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, 6-8 Aug. 2002, pp. 61-66.
11. A. Biem, J.-Y. Ha, J. Subrahmonia, "A Bayesian model selection criterion for HMM topology optimization", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, 13-17 May 2002, pp. I-989-I-992.
12. A. Biem, "Minimum Classification Error Training of Hidden Markov Models for Handwriting Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2001.
13. D. Li, A. Biem, J. Subrahmonia, "HMM Topology Optimization for Handwriting Recognition", IEEE International Conference on Acoustics, Speech and Signal Processing, 2001.
14. E. McDermott, A. Biem, S. Tenpaku, S. Katagiri, "Discriminative Training for Large Vocabulary Telephone-based Name Recognition", Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2000.

Description of the tutorial

Sequence modeling is a key component of pattern recognition and data mining, and Hidden Markov Models (HMMs) have been widely used for modeling sequences of patterns. HMMs are broadly used in automatic speech recognition [Rabiner 1990] and handwriting recognition. The HMM is typically used as a generative model of the observed sequence. Its parameters are estimated through Maximum Likelihood Estimation (MLE), which optimizes the joint likelihood of the sequence of observations and the sequence of states via the Expectation-Maximization (EM) algorithm. This assumes a supervised training approach based on a partially labeled training set comprising examples in the form of observation sequences and their corresponding character sequences. At test time, the segmentation, that is, the most likely state sequence, is retrieved through Viterbi decoding, which maps an observation sequence into a state sequence. Based on the underlying semantics of the state transition matrix (e.g., passing through the three states of a left-right HMM corresponding to a particular character means this character has been recognized), the sequence of states translates into a sequence of labels. The generative approach to HMMs is very popular as it is both simple and efficient, and it scales well to large corpora. However, such a learning strategy does not focus on what is of primary concern, namely minimizing the classification (or segmentation) error rate. Two approaches have been explored to improve the discriminative power of HMM-based systems. The first approach replaces the MLE criterion with a discriminative criterion that maximizes the separation between models; an optimization algorithm is then derived and used to optimize the HMM parameters with respect to this criterion. A number of attempts have been made in this direction.
The first studies were performed in the speech recognition field and proposed to optimize a probabilistic criterion such as the Maximum A Posteriori (MAP) criterion or Conditional Maximum Likelihood (CML), or an information-theoretic criterion such as the Maximum Mutual Information (MMI) criterion [Woodland 2002]. Another approach focused on the definition of a criterion even more closely related to the error rate; a first attempt was the Minimum Classification Error (MCE) criterion [Juang 1992]. Similar ideas have been applied in the handwriting recognition field [Zhang 2007], [Nopsuwanchai 2006]. Lastly, in recent years, several authors have proposed to make use of a large margin criterion, building on the success of support vector machines for vectorial data [Jiang 2006, 2007, Sha 2006, 2007]. To overcome the difficulty of formalizing such a training procedure, various simplifying assumptions have been proposed in these studies [Yu 2007]. The second approach proposes new Markovian structures to achieve intrinsic discrimination, through either probabilistic or non-probabilistic models. Among probabilistic models one should note Maximum Entropy Markov Models and Conditional Random Fields (CRFs). These are conditional models that aim at learning the conditional distribution over state sequences given the observed sequence, rather than the joint distribution. CRFs are an instance of Markov networks where the distribution of the state sequence is conditioned on the observation sequence. They have been extended to deal with partially labeled datasets by introducing hidden variables: [Quattoni 2007] first proposed Hidden Conditional Random Fields (HCRFs), and [Morency 2007] extended these to the more general Latent-Dynamic Conditional Random Fields. A number of systems have since been proposed in the speech community [Gunawardana 2005; Reiter 2007; Yu 2009] as well as in the handwriting and gesture recognition communities.
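The MCE idea can be illustrated with a minimal sketch: a misclassification measure compares the score of the correct class against a soft maximum over competing classes, and a sigmoid smooths the resulting 0/1 error so it becomes differentiable. This is only the loss-function core, not the full HMM-embedded training procedure of [Juang 1992]; the function name and hyperparameter values here are illustrative assumptions.

```python
import numpy as np

def mce_loss(scores, label, eta=4.0, alpha=1.0):
    """Smoothed Minimum Classification Error loss for one example.

    scores : (C,) discriminant scores g_c(x), e.g. per-class HMM log-likelihoods
    label  : index of the correct class
    eta    : sharpness of the soft maximum over competing classes
    alpha  : slope of the sigmoid smoothing the 0/1 error
    """
    g_true = scores[label]
    rivals = np.delete(scores, label)
    # Soft maximum over competing classes (tends to the hard max as eta grows).
    g_rival = np.log(np.mean(np.exp(eta * rivals))) / eta
    d = g_rival - g_true                       # misclassification measure: > 0 means error
    return 1.0 / (1.0 + np.exp(-alpha * d))    # smoothed 0/1 loss, in (0, 1)
```

The loss is near 0 when the correct class wins by a clear margin and near 1 when it loses, so minimizing its average over the training set directly targets the classification error rate rather than the data likelihood.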
Non-probabilistic models have also been investigated for dealing efficiently with sequences and structured data. They are related to what is called structured output prediction in the machine learning community [Tsochantaridis 2004, 2005, Sarawagi 2008]. Among non-probabilistic models, Hidden Markov Support Vector Machines (HMSVMs) [Altun 2003] and Maximum Margin Markov Networks (M3N) [Taskar et al., 2004] play a central role. Few extensions of such models for dealing with signals have been proposed in the literature. While a CRF achieves discriminative training by defining a conditional probability and using maximum conditional likelihood as the training criterion, large margin methods such as M3N and HMSVM directly focus on the definition of a discriminative function exploiting the same Markov structure as the CRF. This tutorial aims to provide an overview of existing methods in the two families mentioned above: training generative models discriminatively and designing structurally discriminative models. It will provide the technical basis for understanding the strengths and weaknesses of these methods and for identifying potential implementation difficulties.
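The contrast between the two training criteria can be made concrete on a shared linear chain. The sketch below computes, for the same potentials, the CRF negative conditional log-likelihood (via the forward algorithm in log space) and a max-margin loss of the M3N/HMSVM flavour (Hamming-loss-augmented Viterbi). The potentials are arbitrary arrays standing in for feature-weight products; this is an illustrative sketch under those assumptions, not any published implementation.

```python
import numpy as np

def path_score(unary, trans, states):
    """Linear-chain score: sum of unary and transition potentials along one path.
    unary : (T, S) per-position state scores; trans : (S, S) transition scores."""
    s = unary[0, states[0]]
    for t in range(1, len(states)):
        s += trans[states[t - 1], states[t]] + unary[t, states[t]]
    return s

def log_partition(unary, trans):
    """log of the sum over all state sequences (forward algorithm in log space)."""
    alpha = unary[0].copy()
    for t in range(1, unary.shape[0]):
        alpha = unary[t] + np.logaddexp.reduce(alpha[:, None] + trans, axis=0)
    return np.logaddexp.reduce(alpha)

def crf_neg_log_likelihood(unary, trans, states):
    """CRF criterion: -log p(states | observations) = log Z - score(true path)."""
    return log_partition(unary, trans) - path_score(unary, trans, states)

def margin_violation(unary, trans, states):
    """Max-margin criterion (M3N/HMSVM flavour): score of the best Hamming-loss-
    augmented rival path minus the true path score, via Viterbi on augmented potentials."""
    T, S = unary.shape
    aug = unary + 1.0                      # +1 Hamming loss at every position...
    aug[np.arange(T), states] -= 1.0       # ...except where a path agrees with the truth
    best = aug[0].copy()
    for t in range(1, T):
        best = aug[t] + (best[:, None] + trans).max(axis=0)
    return best.max() - path_score(unary, trans, states)

# Tiny example: 3 positions, 2 states, arbitrary illustrative potentials.
unary = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
trans = np.array([[0.2, -0.1], [0.0, 0.3]])
y = [0, 1, 1]
nll = crf_neg_log_likelihood(unary, trans, y)
hinge = margin_violation(unary, trans, y)
```

Both objectives are driven by the same Markov structure: the CRF sums (soft-max) over all rival paths, while the margin criterion considers only the single worst loss-augmented rival.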

References

[Altun et al., 2003] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In ICML, pages 3-10, 2003.
[Bahl et al., 1986] Bahl et al. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proceedings of ICASSP 1986, pages 49-52.
[Bishop, 1995] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[Chou et al., 1992] W. Chou, B.-H. Juang, and C.-H. Lee. "Segmental GPD training of HMM based speech recognizer." In Proc. IEEE ICASSP, vol. 1, March 1992, pages 473-476.
[Duda et al., 1973] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley Interscience, 1973.
[Gunawardana et al., 2005] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt. Hidden conditional random fields for phone classification. In International Conference on Speech Communication and Technology, 2005.
[Jebara et al., 1998] T. Jebara and A. Pentland. Maximum conditional likelihood via bound maximization and the CEM algorithm. In NIPS 11, 1998.
[Jiang et al., 2006] H. Jiang, X. Li, and C. Liu. Large margin hidden Markov models for speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 2006.
[Jiang et al., 2007] H. Jiang and X. Li. Incorporating training errors for large margin HMMs under semi-definite programming framework. In ICASSP, 4:IV-629-IV-632, April 2007.
[Juang et al., 1992] B.-H. Juang and S. Katagiri. Discriminative learning for minimum error classification. IEEE Trans. Signal Processing, 40(12), 1992.
[Katagiri et al., 1991] S. Katagiri, C.-H. Lee, and B.-H. Juang. New discriminative training algorithms based on the generalized descent method. In Proc. IEEE Workshop on Neural Networks for Signal Processing, 1991, pages 309-318.
[Lafferty et al., 2001] J. Lafferty. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282-289. Morgan Kaufmann, 2001.
[Morency et al., 2007] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic discriminative models for continuous gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[Nadas, 1983] A. Nadas. "A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood." IEEE Trans. on Acoustics, Speech, and Signal Processing, 31(4):814-817, 1983.
[Nathan et al., 1995] K. S. Nathan, H. Beigi, J. Subrahmonia, G. J. Clary, and H. Maruyama. "Real-time on-line unconstrained handwriting recognition using statistical methods." In Proceedings of ICASSP, 1995.
[Nopsuwanchai et al., 2006] R. Nopsuwanchai, A. Biem, and W. F. Clocksin. Maximization of mutual information for offline Thai handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell., 28(8):1347-1351, 2006.
[Normandin, 1991] Y. Normandin. Hidden Markov Models, Maximum Mutual Information Estimation, and the Speech Recognition Problem. Ph.D. dissertation, McGill University, Montreal, Department of Electrical Engineering, 1991.
[Quattoni et al., 2007] A. Quattoni, S. B. Wang, L.-P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell., 29(10):1848-1852, 2007.
[Rabiner, 1990] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
[Reiter et al., 2007] S. Reiter, B. Schuller, and G. Rigoll. Hidden conditional random fields for meeting segmentation. In ICME, pages 639-642, 2007.
[Sarawagi et al., 2008] S. Sarawagi and R. Gupta. Accurate max-margin training for structured output spaces. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 888-895, 2008.
[Sha, 2006] F. Sha. Large margin training of acoustic models for speech recognition. PhD thesis, 2006.
[Sha et al., 2007] F. Sha and L. K. Saul. Large margin hidden Markov models for automatic speech recognition. In NIPS. MIT Press, 2007.
[Schlueter et al., 2001] R. Schlueter and H. Ney. "Model-based MCE bound to the true Bayes' error." IEEE Signal Processing Letters, 8(5):131-133, 2001.
[Soong et al., 1991] F. K. Soong and E.-F. Huang. "A tree-trellis based fast search for finding the N-best sentence hypotheses in continuous speech recognition." In Proceedings of ICASSP, vol. 1, April 1991, pages 705-708.
[Taskar et al., 2004] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS 16, 2004.
[Tsochantaridis et al., 2004] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[Tsochantaridis et al., 2005] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.
[Vapnik, 1998] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[Woodland et al., 2002] P. C. Woodland and D. Povey. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, January 2002.
[Yu et al., 2007] D. Yu and L. Deng. Large-margin discriminative training of hidden Markov models for speech recognition. In ICSC, 2007.
[Yu et al., 2009] D. Yu, L. Deng, and A. Acero. Hidden conditional random field with distribution constraints for phone classification. In Interspeech, 2009.
[Zhang et al., 2007] Y. Zhang, P. Liu, and F. Soong. Minimum error discriminative training for radical-based online Chinese handwriting recognition. In ICDAR, 2007.

Outline of the tutorial

1. Introduction
   a. Sequence modeling: issues and opportunities
   b. Generative vs. discriminative models for sequences

2. Hidden Markov Models for sequences
   a. Fundamentals and notations
   b. Learning and inference
   c. Search and decoding
   d. Model selection and system design
   e. Pros and cons

3. Discriminative training for HMMs
   a. Probabilistic criteria
      i. Maximum A Posteriori
      ii. Conditional likelihood
   b. Information-theoretic criteria
      i. Maximum Entropy
      ii. Maximum Mutual Information
   c. Error-based criteria
      i. Minimum Classification Error
      ii. Maximum Margin
      iii. Unit Error Minimization
   d. Discriminative model selection

4. Conditional models and structured output prediction
   a. Structured output prediction framework
   b. Popular models
      i. Conditional Random Fields (CRFs)
      ii. Maximum Margin Markov Networks (M3N)
      iii. Relations to Hidden Markov Models
   c. Extensions to signal labeling tasks
      i. Segmental Conditional Random Fields
      ii. Hidden Conditional Random Fields
      iii. Non-linear Conditional Random Fields

5. Conclusion

Potential target audience and prerequisite knowledge

The objective of this tutorial is to provide a panorama of existing methods for discriminative learning of Markovian models for signal labeling tasks. It first provides a survey of discriminative learning schemes for Hidden Markov Models as used in handwriting recognition, both on-line and off-line. Secondly, it introduces the structured output prediction framework and the corresponding conditional models and algorithms. This tutorial is aimed at Ph.D. students as well as researchers wishing to learn more about discriminative learning of HMMs, structured output prediction, Conditional Random Fields, etc.

Prerequisite knowledge: basics of Hidden Markov Modeling.

Why the tutorial topic would be of interest to a substantial part of the ICDAR audience

While Hidden Markov Models are a well-known and very popular technology in the handwriting recognition field, they are most often trained using a non-discriminative criterion (Maximum Likelihood). A number of discriminative criteria for learning HMMs (MMI, MCE, large margin, etc.) have been proposed but are still rarely used in HWR, even though they have been shown to bring significant improvements both in other application fields and in HWR itself. Using a discriminative criterion does not mean one has to throw away one's good old HMM system, which usually encodes much valuable prior knowledge. On the contrary, discriminative training usually requires starting from a good initial (e.g., non-discriminatively trained) solution, so that it may be viewed as a fine-tuning step able to improve existing non-discriminatively trained systems. Besides, recent advances in the machine learning community have opened a new and promising area of research named structured output prediction. Work in this field has drawn much interest to log-linear models such as Conditional Random Fields, which have shown good results in tasks such as information extraction and POS tagging, and more recently in speech and handwriting recognition. Understanding what these models are and how they can be extended to build HWR systems could become a main focus for future researchers in the HWR field.