Sign Language Number Recognition

Iwan Njoto Sandjaja
Informatics Engineering Department, Petra Christian University, Surabaya, Indonesia
[email protected]

Nelson Marcos, PhD
Software Technology Department, De La Salle University, Manila, Philippines
[email protected]

Abstract— A sign language number recognition system lays the foundation for handshape recognition, which addresses real and current problems in signing in the deaf community and leads to practical applications. The input to the sign language number recognition system is 5000 Filipino Sign Language number videos with a frame size of 640 x 480 pixels at 15 frames/second. The color-coded gloves use fewer colors than the color-coded gloves in existing research. The system extracts important features from the video using a multi-color tracking algorithm that is faster than existing color tracking algorithms because it does not use a recursive technique. The system then learns and recognizes the Filipino Sign Language numbers in training and testing phases using a Hidden Markov Model (HMM). The feature extraction module could track 92.03% of all objects, and the recognizer could recognize Filipino Sign Language numbers with 85.52% average accuracy.

Keywords- computer vision, human computer interaction, sign language recognition, hidden markov model, hand tracking, multi-color tracking

I. INTRODUCTION

Sign language is local, in contrast with the general opinion that it is universal. Different countries, and at times even regions within a country, have their own sign languages. In the Philippines, for example, there are 13 variations of Filipino Sign Language based on regions [1]. Sign language is a natural language for the deaf. It is a visual language conveyed primarily through hand and arm movements (called manual articulators, consisting of the dominant hand and the non-dominant hand), accompanied by other parts of the body such as facial expression, eye movement, eyebrow movement, cheek movement, tongue movement, and lip motion (called non-manual signals) [2]. Most hearing people do not understand any sign language and know very little about deafness in general. Although many deaf people lead successful and productive lives, this communication barrier can have problematic effects on many aspects of their lives.

There are three main categories in sign language recognition, namely handshape classification, isolated sign language recognition, and continuous sign recognition. Handshape classification, or finger-spelling recognition, is one of the main topics in sign language recognition since a handshape can express not only some concepts but also special transition states in temporal sign language. During the period of 1994-1998, finger-spelling was largely treated as the sign language itself, with sign language viewed as just a string of signs. Isolated words are widely considered as the basic unit in sign language, and many researchers [3] [4] focus on isolated sign language recognition. Some researchers [5] [6] also pay attention to continuous sign language recognition. Much of the work on continuous sign language recognition applies HMMs for recognition. The use of HMMs offers the advantage of being able to segment a data stream into its continuous signs implicitly, and thus bypasses the hard problem of segmentation entirely.

System architectures for sign language recognition can be categorized into two main classes based on their input. The first is dataglove-based, whose input comes from gloves with sensors. The weakness of this approach is that it limits movement; the advantage is higher accuracy. The second is vision-based, whose input comes from a camera (a stereo camera or a web/USB camera). The weaknesses of this approach are lower accuracy and higher computing requirements; the advantages are that it is cheaper and less constraining than datagloves. To make human hand tracking easier, color-coded gloves are usually used. A combination of both architectures, called a hybrid or mixed architecture, is also possible.

In the vision-based approach, the architecture of the system is usually divided into two main parts. The first part is feature extraction, which should extract important features from the video using computer vision or image processing methods such as background subtraction, pupil detection, hand tracking, and hierarchical feature characterization (shape, orientation, and location). The second part is the recognizer. From the features already extracted and characterized, the recognizer should be able to learn the pattern from training data and recognize testing data correctly. The recognizer employs machine learning algorithms; Artificial Neural Networks (ANN) and Hidden Markov Models (HMM) are the most commonly used. In the vision-based approach, the camera setup could be a single camera, two or more cameras, or a special 3D camera. A stereo camera uses a two-camera configuration that imitates how human eyes work. The most recent approach is the virtual stereo camera, which uses only one camera and generates the second camera view virtually.

The main problem in sign language recognition is that it involves multiple channels and simultaneous events, which create a combinatorial explosion and a huge search space. This research also touches on problems in computer vision, machine learning, machine translation, and linguistics. It can be seen as a computer-based lexicon/dictionary that translates sign language phrases, specifically numbers, into words/text.

II. RELATED LITERATURE

A vision-based medium-vocabulary Chinese Sign Language (CSL) recognition system [3] is developed using tied-mixture density hidden Markov models (TMDHMM). Their experiment is based on a single frontal view; only a USB color camera is employed, placed in front of the signer to collect the CSL video data at an image size of 320 x 240 pixels. In this system, the recognition vocabulary contains 439 CSL signs, including 223 two-handed and 216 one-handed signs. Their experimental results show that the proposed methods achieve an average recognition accuracy of 92.5% on 439 signs: 93.3% on two-handed signs and 91.7% on one-handed signs, respectively.

In more recent sign language recognition research [4], a novel viewpoint-invariant method is proposed. Under this method, the recognition task is converted to a verification task: the method verifies the uniqueness of a virtual stereo vision system formed by the observation and the template. The recognition vocabulary of this research contains 100 CSL signs, and the image resolution is 320 x 240 pixels. The proposed method achieves an accuracy of 92% at rank 2.

In recent work [7], the design and implementation of a hand mimicking system is discussed. The system captures hand movement, analyzes it using MATLAB, and produces a 3D graphical hand model that imitates the movements of the user's hand using OpenGL. The system captures hand movement using two cameras and approximates the user's 3D hand pose using stereo vision techniques. In their tests, the average difference between the ideal and actual hand part orientation is about 20 degrees. Furthermore, 72.38% of the measured hand angular orientations differ by less than 20 degrees, while 86.35% of the test cases have angular orientation errors of less than 45 degrees.

Two real-time hidden Markov model-based systems for recognizing sentence-level continuous American Sign Language (ASL) using a single camera to track the user's unadorned hands were presented in [5]. For this recognition system, sentences of the form "personal pronoun, verb, noun, adjective, (the same) personal pronoun" are to be recognized. Six personal pronouns, nine verbs, twenty nouns, and five adjectives are included, making up a total lexicon of forty words. The first system observes the user from a desk-mounted camera and achieves 91.9% word accuracy. The second system mounts the camera in a cap worn by the user and achieves 96.8% accuracy (97% with unrestricted grammar).

A portable letter sign language translator was developed using a specialized glove with flex/bend sensors [8]. Their system translates hand-spelled words to letters through the use of a Personal Digital Assistant (PDA). They use a Neuro-Fuzzy Classifier (NEFCLASS) for the letter translation algorithm. Their system cannot recognize the letters M and N, but it recognizes the other letters with a minimum accuracy of 65% (letter Z), a maximum accuracy of 100%, and an average accuracy of 90.2%.

A framework for recognizing American Sign Language (ASL) was developed using hidden Markov models [6]. The data set consists of 499 sentences, between 2 and 7 signs long, for a total of 1604 signs from a 22-sign vocabulary. They collected these data with an Ascension Technologies MotionStar system at 60 frames per second. In addition, they collected data from the right hand with a Virtual Technologies Cyberglove, which records wrist yaw, pitch, and the joint and abduction angles of the fingers, also at 60 frames per second. The results show clearly that the quadrilateral-based description of the handshape (95.21%) is far more robust than the raw joint angles (83.15%). The best result is achieved using PaHMMs modeling the right-hand movement channel and right-hand handshape, with 88.89% sentence accuracy and 96.15% word accuracy.

CopyCat, an educational computer game that uses gesture recognition technology to develop American Sign Language (ASL) skills in children ages 6-11, is presented in [9]. Data from the children's signing is recorded using an IEEE 1394 video camera and wireless accelerometers mounted in colored gloves. The dataset consists of 541 phrase samples and 1,959 individual sign samples of five children signing game phrases from a 22-word vocabulary. The vocabulary is limited to a subset of ASL that includes single- and double-handed signs but does not include more complex linguistic constructions such as classifier manipulation, facial gestures, and level emphasis. Each phrase is a description of an encounter for the game character, Iris the cat. The students can give warning of a predator's presence, such as "go chase snake", or identify the location of a hidden kitten, such as "white kitten behind wagon". Brashear et al. achieve an average word accuracy of 93.39% for the user-dependent models. The user-independent models are generated by training on a dataset consisting of four children and testing on the remaining child's dataset; here they achieve an average word accuracy of 86.28%. They achieve on average 92.96% word-level accuracy with 1.62% standard deviation when samples are chosen across all samples and users (training and testing using data from all students).

A vision-based interface for controlling a computer mouse via 2D and 3D hand gestures was presented in [10] [11].

The proposed algorithm addresses three subproblems: (a) hand hypothesis generation (i.e., a hand appears in the field of view for the first time); (b) hand hypothesis tracking in the presence of multiple, potentially occluding objects (i.e., previously detected hands move arbitrarily in the field of view); and (c) hand hypothesis removal (i.e., a tracked hand disappears from the field of view). Their algorithm also involves simple prediction, using a simple linear rule to predict the location of hand hypotheses at time t based on their locations at times t-2 and t-1. Having already defined the contour of a hand, finger detection is performed by evaluating a curvature measure on contour points at several scales. As confirmed by several experiments, the proposed interface achieves accurate mouse positioning, smooth cursor movement, and reliable recognition of gestures activating button events. Owing to these properties, their interface can be used as a virtual mouse for controlling any Windows application.
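Such a linear rule amounts to constant-velocity extrapolation; in our own notation (not the authors'), the predicted position of a hand hypothesis would be

\hat{p}_t = p_{t-1} + (p_{t-1} - p_{t-2}) = 2p_{t-1} - p_{t-2}.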

III. ARCHITECTURAL DESIGN

The general architectural design for sign language number recognition is shown in Fig. 1. The input of the sign language number recognition system is Filipino Sign Language number video. In general, there are two main modules in the sign language recognition architecture: the feature extraction module and the recognizer module. The feature extraction module extracts important features from the video per frame. The recognizer module learns and recognizes the video from its features.

The feature extraction module consists of face detection, hand tracking, and feature characterization. The face detection module is used to detect the face area. The hand tracking module tracks the movements of the dominant hand and the non-dominant hand. Feature characterization takes important features such as the position of the face as a reference, the position of the dominant hand and its fingers, the area of the dominant hand and of each finger, the orientation of the dominant hand and of each finger, the position of the non-dominant hand, the area of the non-dominant hand, and the orientation of the non-dominant hand. In this research, the feature characterization used as the feature vector is the position of the dominant hand's thumb in x and y coordinates and the x and y coordinates of the other fingers relative to the thumb position, as sketched below.

The output of feature characterization becomes the input to the second block, the recognizer. The recognizer employs a Hidden Markov Model and consists of two main parts: a training module and a testing module. In the training module, the recognizer learns the pattern of each sign language number using annotated input from the feature extraction module. In the testing or verification module, the recognizer receives input that has never been learned before, yet is annotated for verification purposes.

Figure 1. System Architecture
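As a concrete illustration of this feature characterization, a minimal sketch (not the authors' code; the finger names, data layout, and offset sign convention are assumptions) of the 10-element per-frame vector follows.

# Minimal sketch of the feature characterization described above.
FINGERS = ["thumb", "index", "middle", "ring", "little"]

def characterize_frame(centers):
    """centers: dict mapping finger name -> (x, y) center of its fitted ellipse."""
    tx, ty = centers["thumb"]
    features = [tx, ty]                      # absolute thumb position (2 values)
    for finger in FINGERS[1:]:
        fx, fy = centers[finger]
        features.extend([fx - tx, fy - ty])  # each finger's offset from the thumb (4 x 2 values)
    return features                          # 10 values per frame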

A. Feature Extraction

The feature extraction module uses the OpenCV library [12]. The detailed flowchart of feature extraction is shown in Fig. 2. For each frame in the video, the feature extraction module begins by calling a smoothing procedure to eliminate noise from the camera. The frame size is 640 x 480 pixels in the BGR (Blue, Green, Red) color space. After smoothing the frame, the feature extraction module converts the frame's color space from BGR to Hue, Saturation, and Value (HSV).

Figure 2. Feature extraction flowchart
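As an illustration of these first two steps, here is a sketch using OpenCV's modern Python binding rather than the authors' original C-API code; the video file name and the smoothing kernel size are assumptions.

import cv2

cap = cv2.VideoCapture("fsl_number_sample.avi")       # hypothetical input video
while True:
    ok, frame = cap.read()                            # 640 x 480 BGR frame
    if not ok:
        break
    smoothed = cv2.GaussianBlur(frame, (5, 5), 0)     # suppress camera noise
    hsv = cv2.cvtColor(smoothed, cv2.COLOR_BGR2HSV)   # BGR -> HSV color space
    # saturation/value filtering and color tracking follow (Fig. 2)
cap.release()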

The saturation and value filtering module extracts the hue channel based on specific saturation and value parameters. The saturation and value filtering gives two outputs: a skin frame and a color frame.

The skin frame is processed by the skin tracking module. The skin tracking module is basically the color tracking module with different size filtering, because the face and the non-dominant hand have a larger area than each finger of the dominant hand. The skin tracking procedure produces a face ellipse. The skin tracking module also draws a black-filled contour of the face on the color frame to remove the lips from the color frame.

The color frame is processed by the color tracking module, as shown in Fig. 2, with a different hue-range parameter for each finger. The hue parameter of each finger is found by searching for the maximum and minimum hue value of that finger, and the hue ranges of the fingers do not overlap. The color tracking module returns the ellipse area for each finger. The Merge algorithm simply executes the cvFindContours procedure again with the connected contours from the color tracking procedure as input, and each resulting contour is fitted with an ellipse. If the number of ellipses is more than two, the Merge algorithm returns the first ellipse; in effect, the Merge algorithm always returns the first detected ellipse. By always returning the first ellipse, the Merge algorithm avoids a long or endless recursive process. The next module is the Draw and Print module, which draws each ellipse and prints its parameters in XML format. Fig. 3 shows a sample frame captured with the resulting ellipses.

Figure 3. Sample image with resulting ellipses
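A minimal sketch of this per-finger color tracking, again in OpenCV's Python binding rather than the paper's C API; the threshold values and the OpenCV 4.x findContours signature are assumptions.

import cv2

def track_color(hsv, hue_lo, hue_hi, sat_lo=60, val_lo=60):
    """Return the first fitted ellipse for pixels whose hue lies in [hue_lo, hue_hi]."""
    mask = cv2.inRange(hsv, (hue_lo, sat_lo, val_lo), (hue_hi, 255, 255))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    ellipses = [cv2.fitEllipse(c) for c in contours if len(c) >= 5]  # fitEllipse needs >= 5 points
    if not ellipses:
        return None          # the finger is non-trackable in this frame
    return ellipses[0]       # "Merge": always keep the first detected ellipse

One such call per finger, with that finger's non-overlapping hue range, yields the five ellipses whose centers feed the feature characterization sketched earlier.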

There are six color trackers: one for skin color (the face and the non-dominant hand) and five for the dominant hand, one for each finger. The ellipse and its parameters are shown in Fig. 2. The whole process is repeated until no more frames remain to be processed. Finally, feature characterization converts each ellipse and its parameters into feature vectors. The feature vectors contain the position of the dominant hand's thumb in x and y coordinates and the x and y coordinates of the other fingers relative to the thumb position. The first two feature values are taken from the center coordinates (x, y) of the thumb ellipse. The remaining values are the distances between the thumb and each other finger in x and y. Thus, there are 10 feature values for each frame. The feature vectors are saved in XML format.

B. Recognizer

The recognizer learns the pattern from the feature vectors generated by the feature extraction module using machine learning algorithms. A Hidden Markov Model (HMM) is used as the machine learning algorithm, and the Cambridge University HMM Toolkit (HTK) [13] is used as the HMM library. The recognizer consists of three main parts. The first part is the data preparation module, in which the recognizer generates all directories and files needed for the HMM processes. The second part is the HInit module, in which the recognizer creates an HMM model for each sign language number and initializes the models with the forward-backward algorithm, using labeled training feature vectors from the feature extraction module as input. The last part is the HModels module, in which the recognizer uses the labeled training feature vectors to re-estimate the HMM model parameters with the Baum-Welch method. After the re-estimation of the HMM model parameters has finished, the recognizer recognizes the testing data, which is not included in the training data yet is already labeled for verification purposes. The Viterbi algorithm is used to recognize the testing data. Lastly, the recognizer interprets and evaluates the results and generates a report.
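The paper's recognizer is built on HTK. Purely to illustrate the same train-then-score idea (one HMM per number, Baum-Welch training, then picking the best-scoring model), a sketch with the hmmlearn Python library might look like the following; hmmlearn is not used in the paper, its default HMM is not the left-to-right topology used here, and the scoring below uses the forward log-likelihood rather than HTK's Viterbi decoder.

import numpy as np
from hmmlearn import hmm   # illustrative stand-in; the paper uses the HTK toolkit

def train_models(train_data, n_states=10):
    """train_data: dict mapping a sign label -> list of (n_frames, 10) feature arrays."""
    models = {}
    for label, sequences in train_data.items():
        X = np.concatenate(sequences)                   # stack all frames of this sign
        lengths = [len(seq) for seq in sequences]       # per-sequence frame counts
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)                               # Baum-Welch re-estimation
        models[label] = m
    return models

def recognize(models, sequence):
    """Return the sign whose HMM assigns the highest log-likelihood to the sequence."""
    return max(models, key=lambda label: models[label].score(sequence))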

IV. RESULTS AND ANALYSIS

A. Feature Extraction

Table I shows the summary of results from the feature extraction module in terms of time. The feature extraction module ran for 5 hours, 42 minutes, and 35 seconds to extract the features of the 5000 Filipino Sign Language number videos. Feature extraction took a long time because it had to play all the videos one by one. Playing a video of a number from 1 to 9 takes 2 seconds (±30 frames).

Table I. Feature extraction results in terms of time

              Time (HH:MM:SS)
Start time    2:32:55 PM
End time      8:15:30 PM
Duration      05:42:35

Playing a video of a number in the ranges 10-109, 201-209, 301-309, 401-409, 501-509, 601-609, 701-709, 801-809, and 901-909 takes 3 seconds (±45 frames). Videos of the remaining numbers take 4 seconds to play (±60 frames).

Thus, the total time for playing all the videos was 5 hours, 16 minutes, and 50 seconds. The feature extraction module took a little longer because it had to switch from one video to another and save the results to XML files.

Table II shows the summary of results from the feature extraction module in terms of accuracy. For each frame there are five objects to be tracked, representing the five fingers. A non-trackable object means that the color tracking module could not track the object (the finger). An incorrectly trackable object means that the color tracking module detected the object but found more than one object even though the Merge algorithm was already applied.

Table II. Feature extraction results in terms of accuracy

Result                          Objects
Correctly trackable objects     1,322,537
Non-trackable objects           109,814
Incorrectly trackable objects   4,664
Total objects                   1,437,015
%Tracking                       92.03%

The feature extraction module could track 1,322,537 of 1,437,015 objects. In other words, 92.03% of all objects could be tracked. The little finger was most often recorded as untrackable because of its small size. The second most untrackable object was the index finger, because it was occluded by the thumb at the beginning and the end of each video. The color tracking also detected more than one object in some cases even though the Merge algorithm was applied, but only in a very small number of cases (0.0032%). The causes of untrackable objects were occlusion, image blurring due to fast object movement, and changes in lighting conditions. Occlusion happened when one finger occluded another; most of the time, the index finger was occluded by the thumb at the beginning and end of each video. Image blurring happened when the hand moved too fast, for example when signing twin numbers (11, 22, 33, and so on) and the tens (10, 20, 30, etc.). The changes in lighting conditions happened because the video was recorded using natural light from 9 am to 3 pm; the movement of the hand also created shadows that changed the lighting.

B. Recognizer

Two validation methods were used in this research. The first was five-fold validation. Five-fold validation generated five sets of testing and training data from the five video samples of each number, namely samples A, B, C, D, and E. Set A used the first sample of each sign language number as testing data and the other samples of the same sign language number as training data; set B used the second sample, and so on, until set E, which used the last sample of each sign language number as testing data and the other samples of the same sign language number as training data. Thus, five-fold validation created five validation sets, each consisting of 4000 training samples and 1000 testing samples, as sketched below.

The second validation procedure was leave-one-out validation. Leave-one-out validation used all of the data except one sample as training data and the remaining sample as test data. Leave-one-out validation does this for every possible permutation and can take a very long time; for this research it created 120 sets of testing and training data.

Initially, this research began with a four-state HMM model and then increased the number of states until the maximum accuracy was found. After that, the experiment continued by adding skip states. The 10-state HMM without skip states has the highest average accuracy, 85.52%.
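A minimal sketch of how the five train/test splits described above could be constructed (illustrative only; the data structures are assumptions, not the authors' code):

def five_fold_sets(samples):
    """samples: dict mapping each sign language number -> its 5 clips [A, B, C, D, E]."""
    splits = []
    for held_out in range(5):                        # sets A..E
        train, test = [], []
        for number, clips in samples.items():
            for i, clip in enumerate(clips):
                (test if i == held_out else train).append((number, clip))
        splits.append((train, test))
    return splits                                    # each split: 4000 training / 1000 testing samples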

Table III. Recognizer results

Result                 Set A      Set B      Set C      Set D      Set E
Time HInit (MM:SS)     32:08      31:32      31:36      31:45      31:33
Time HModels (MM:SS)   2:50       3:01       3:13       3:15       3:08
Time Total (MM:SS)     34:57      34:33      34:49      34:59      34:42
Correct                767        881        890        888        850
Wrong                  233        119        110        112        150
Total Samples          1000       1000       1000       1000       1000
%Correct               76.70%     88.10%     89.00%     88.80%     85.00%
Accuracy               76.70%     88.10%     89.00%     88.80%     85.00%

The recognizer using the 10-state HMM without skip states achieved 85.52% accuracy on average. The maximum accuracy was 89.00%, using set C as input; the minimum was 76.70%, using set A as input. Set A has the lowest accuracy because it contains the model's first attempt at each sign, so the first sample differs significantly from the other samples. Set C has the highest accuracy, probably because by then the model had become used to producing the signs and produced more consistent, similar signs. The average accuracy of leave-one-out validation is 85.52%, the same as the five-fold validation result. This happened because the samples are similar to one another.

V. CONCLUSION

The sign language number recognition system in this research was able to design a model for recognizing sign language numbers that is suitable for numbers in Filipino Sign Language. The system was also evaluated in terms of accuracy and time. The feature extraction could track 92.03% of all objects in 5 hours, 16 minutes, and 50 seconds using an Intel Core 2 Duo E4400 2 GHz computer with 2 GB of memory. It can be concluded from the feature extraction results that this research has implemented computer vision techniques for robust, real-time color tracking, which is used in the feature extraction of the dominant hand and of the skin (face and non-dominant hand). The recognizer could also recognize Filipino Sign Language numbers using the features from the feature extraction module. The 10-state HMM without skip states has the highest average accuracy, 85.52%. The total average running time for the 10-state HMM without skip states was 34 minutes and 48 seconds. The leave-one-out validation for the 10-state HMM without skip states results in the same accuracy, 85.52%.

This research is a pioneering work in sign language recognition in the Philippines. It is far from perfect, but it provides a framework to be extended in future research. The video used as input in this research could be improved, because the framing of the model seems too distant and the model had her hand too far down. In natural discourse, the placement of the dominant hand is about 3-4 inches to the side of (and an inch or so in front of) the mouth, in what is called the finger-spelling space. Deaf signers who converse never look at the interlocutor's hands but at the eyes; this close placement of the hand to the face enables the signer to use peripheral vision to catch the manual signal completely. The signing space should be the three-dimensional space from mid-torso to the top of the head, with an additional third of the shoulder width on either side. The video samples could also include all the variants of each number. For instance, there are two ways of signing each of 10, 16, 17, 18, and 19, and there are additional unique signs for 21, 23, and 25 (with internal movement).

For further research, it is advisable to use another color system such as YCrCb or CIE instead of HSV, and to use a more advanced color tracking algorithm such as K-Means or another tracking algorithm such as the Lucas-Kanade feature tracker. Another possibility is to use only skin color, without gloves, together with fingertip detection algorithms for feature extraction. The recognizer module could use another machine learning algorithm for time-series data, such as fuzzy clustering or neuro-fuzzy methods. The exploration of the grammar features of the Hidden Markov Model Toolkit is also possible in further research.

REFERENCES

[1] Philippine Federation of the Deaf (2005). Filipino Sign Language: A compilation of signs from regions of the Philippines Part 1. Philippine Federation of the Deaf.

[2] Philippine Deaf Resource Center & Philippine Federation of the Deaf (2004). An Introduction to Filipino Sign Language. Part I-III. Philippine Deaf Resource Center, Inc, Quezon City, Philippines.

[3] Zhang, L.-G., Chen, Y., Fang, G., Chen, X., & Gao, W. (2004). A vision-based sign language recognition system using tied-mixture density hmm. In ICMI '04: Proceedings of the 6th international conference on Multimodal interfaces, pages 198–204, New York, NY, USA. ACM.

[4] Wang, Q., Chen, X., Zhang, L.-G., Wang, C., & Gao, W. (2007). Viewpoint invariant sign language recognition. Computer Vision and Image Understanding, 108:87–97.

[5] Starner, T., Weaver, J., & Pentland, A. (1998). Real-time american sign language recognition using desk and wearable computer based video. Transactions on Pattern Analysis and Machine Intelligence, 20(12):1371–1375.

[6] Vogler, C. & Metaxas, D. (2004). Handshapes and movements: Multiple-channel ASL recognition. In Springer Lecture Notes in Artificial Intelligence. Proceedings of the Gesture Workshop'03, Genova, Italy., pages 247–258.

[7] Fabian, E. A., Or, I., Sosuan, L., & Uy, G. (2007). Vision-based hand mimicking system. In ROVISP07: Proceedings of the International Conference on Robotics, Vision, Information, and Signal Processing, Penang, Malaysia.

[8] Aguilos, V. S., Mariano, C. J. L., Mendoza, E. B. G., Orense, J. P. D., & Ong, C. Y. (2007). APoL: A portable letter sign language translator. Master's thesis, De La Salle University Manila.

[9] Brashear, H., Henderson, V., Park, K.-H., Hamilton, H., Lee, S., & Starner, T. (2006). American sign language recognition in game development for deaf children. In Assets '06: Proceedings of the 8th international ACM SIGACCESS conference on Computers and accessibility, pages 79–86, New York, NY, USA. ACM.

[10] Argyros, A. A. & Lourakis, M. I. A. (2004). Real time tracking of multiple skin-colored objects with a possibly moving camera. In the European Conference on Computer Vision (ECCV’04), volume 3, pages 368–379, Prague, Chech Republic. Springer-Verlag.

[11] Argyros, A. A. & Lourakis, M. I. A. (2006). Vision-based interpretation of hand gestures for remote control of a computer mouse. In ECCV Workshop on HCI, pages 40–51, Graz, Austria. Springer Verlag. LNCS 3979.

[12] Intel Software Product Open Source (2007). Open Source Computer Vision Library [online] Available: http://www.intel.com/technology/computing/opencv/ (March 6, 2008)

[13] Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., & Woodland, P. (2006). The HTK Book. [online] Cambridge University Engineering Department. Available: http://htk.eng.cam.ac.uk/docs/docs.shtml (March 6, 2008)