
The Tenth International Conference on Digital Information Management (ICDIM 2015)

HMM-Based Arabic Sign Language Recognition Using Kinect

Noha A. Sarhan, College of Engineering and Technology, Arab Academy for Science and Technology, Alexandria 1029, Egypt, noha [email protected]

Yasser El-Sonbaty, College of Computing and Information Technology, Arab Academy for Science and Technology, Alexandria 1029, Egypt, [email protected]

Sherine M. Youssef, College of Engineering and Technology, Arab Academy for Science and Technology, Alexandria 1029, Egypt, [email protected]


Abstract—With research in Arabic Sign Language Recognition (ArSLR) still in its infancy, we present a method for recognition of Arabic Sign Language (ArSL) using Microsoft Kinect. A new dataset of ArSL words was collected using Kinect, since none existed. With the intention of use in medical hospitals to aid communication between a deaf or hard-of-hearing patient and the doctor, the foremost goal was presenting a robust system that does not impose any constraints on either the signer or the background. The proposed system combines skeletal data and depth information for hand tracking and segmentation, without relying on any color markers or skin color detection algorithms. The extracted features describe the four elements of the hand that are used to describe the phonological structure of ArSL: articulation point, hand orientation, hand shape, and hand movement. A Hidden Markov Model (HMM) was used for classification with ten-fold cross-validation, achieving an accuracy of 80.47%. Signer-independent experiments resulted in an average recognition accuracy of 64.61%.

Keywords—Arabic Sign Language Recognition; Kinect; RGB-D; Human-Computer Interaction; Computer Vision; HMM

I. INTRODUCTION

Being visually transmitted, Sign Language (SL) has become the primary means of communication amongst the deaf and hard-of-hearing. A combination of hand gestures, body postures, and facial expressions conveys the message to be delivered, which gives rise to complete, highly structured Sign Languages (SLs). Similar to spoken languages, different countries and regions have developed their own SLs.

The rest of society's lack of awareness of SL drove researchers to develop Sign Language Recognition (SLR) systems in order to break communication barriers between all citizens [1]. While earlier work in the field involved the use of complex gloves, advances in computer vision caused research to shift towards more natural HCI (Human-Computer Interaction), where a video camera is used to capture the hand. Image processing, segmentation techniques, and feature extraction comprise the major steps in the vision-based approach. Vision-based techniques either use simple colored gloves or, more naturally, work on free hands, which is in turn more complex. Both methods require feature extraction techniques that are invariant to scale, rotation, and translation changes [2].

In comparison to the recognition of other SLs, such as American SL [3], German SL [4], Korean SL [5], and Australian SL [6], research in ArSLR is still in its infancy. A survey on current research trends and open problems in SLR is presented in [7]. The development of affordable sensors such as Microsoft Kinect [8] and its success in hand gesture analysis [9] led to its employment in various SLR systems. The use of Kinect tackles the most critical problem faced in computer-vision techniques, namely hand detection. Using the depth information provided by Kinect, we present a robust segmentation technique that uses no color markers and does not rely on skin detection algorithms; therefore it is invariant to illumination changes. In addition, segmentation is unaffected by cluttered backgrounds (whether they contain static or dynamic objects), the signer's clothing, or other visible body parts (e.g., face, neck, arms). The use of Kinect not only grants an unrestricted environment but also provides natural human-computer interaction.

To the best of our knowledge, Kinect has not been employed in ArSLR; therefore, we developed a new dataset composed of 16 ArSL words captured using Kinect. The dataset will be available upon request and is explained in detail in Section IV. The proposed method is also signer independent. A ten-fold cross-validation test was then carried out using HMM to test the system's accuracy.


The remainder of this paper is structured as follows: Section II reviews recent work on ArSLR and on the recognition of other SLs using Kinect. In Section III, the characteristics of ArSL are briefly explained. The proposed method is explained in detail in Sections IV to VIII, followed by experimental results in Section IX. Finally, Section X concludes the paper.


TABLE I. LIST OF ARSL WORDS AND THEIR ENGLISH TRANSLATION

Number  Word            Number  Word
1       Pain            9       Correct/Precise
2       Time            10      Low
3       After           11      Before
4       You             12      Ear drops
5       Liver           13      Age
6       Kidney          14      Fever
7       Stomach ache    15      Headache
8       Cough           16      Diabetes

II. RELATED WORK

In this section, we focus on works in the literature that use Kinect in SLR. A thorough survey of ArSLR systems, covering both vision-based and sensor-based approaches, can be found in [10].

The increased interest in more natural and convenient HCI led to advances in hand gesture recognition, of which SLR is one application. Oszust and Wysocki [11] tested two sets of features for recognition of Polish SL. The first involved only skeletal joints, while the second combined skin color detection with depth information in order to segment the hand; features describing the hand shape were then extracted. Applying Dynamic Time Warping (DTW) to align variable-length time series, they were able to recognize 30 words with an accuracy of 89.33% using the first approach and 98.33% using the second feature set. The use of skin color detection is unfavorable as it is affected by changes in lighting conditions; moreover, the dataset involved only one signer, and a dataset with more than one user needs to be tested. Capilla [12] also used DTW to recognize 14 homemade signs using skeletal and depth data, reaching an accuracy of 95.24%; nevertheless, hand shape was not taken into consideration.

Agarwal and Thakur [13] relied on the processing of only depth images. For every gesture, depth and motion profiles were extracted from the depth input stream. Recognition of ten digits of Chinese SL using a multi-class Support Vector Machine (SVM) was promising, achieving 92.13% accuracy. Geng et al. [14] recognized 20 Chinese SL words using depth data and skeleton joints by extracting the 3D trajectories of the right hand, wrist, and elbow, achieving an accuracy of 69.32%. Utilizing only depth and skeletal data allows the segmentation process to be robust against illumination variations.

To allow for scalability, Almeida et al. [15] presented a method for recognition of Brazilian SL using features that relate to the phonological structure of the SL. The extracted features were vision-based, and SVM was employed to classify the signs, resulting in above 80% accuracy on average.

HMM was the classifier of choice for Zafrulla et al. [16] and Akram et al. [17]. Zafrulla et al. recognized a total of 1000 American SL phrases relying on skeletal and depth data, resulting in a 76.2% sentence verification rate. Akram et al. used a skin color algorithm and required the user to wear long sleeves in order to recognize 94 Swedish SL words. An accuracy of 94% was achieved in signer-dependent mode, and a much lower accuracy of 47% in signer-independent mode.

In an attempt to tackle one of the main challenges of developing SLR systems, Mohandes et al. [18] worked on a medium-size dictionary of 300 signs by extracting geometrical features of the hands. HMM was used for classification, achieving 95% recognition accuracy. Although this algorithm gives high recognition rates, face detection and region growing algorithms were used, which are both computationally expensive. Memis and Albayrak [19] recognized dynamic gestures of Turkish SL using motion differences and accumulation of temporal gesture analysis. The Discrete Cosine Transform was applied to transform the data to the frequency domain. Using KNN for classification with Manhattan distance resulted in over 91% recognition accuracy. Compressing sets of similar captured images is sometimes needed to retain only the principal images with the most significant features [20].

SLR using Kinect is expanding; other attempts include [21-23]. Object matching techniques used in 2D [24] can be promising if extended to 3D data acquired from Kinect. With its tracking capabilities, the use of Kinect improves the system's robustness, complexity, and interactivity. It also helps tackle challenging computer vision problems, including object detection, tracking, and segmentation.

III. CHARACTERISTICS OF ARSL

With over 350 million speakers worldwide, Arabic is a widespread language, albeit spoken in different dialects according to country or region. Similarly, ArSL differs as well; however, there are endeavors to unify ArSL [25]. The main components of ArSL are: (i) the hands, (ii) the face, (iii) the eyes, and (iv) the body.

The most essential component in conveying the sign is the hands. Their shape, orientation, articulation point, and type of movement together form the gesture. Various body postures, lip patterns, eye gaze, and movement reinforce the meaning of the performed sign. Being the prime element in expressing the sign, the hand is the main focus of this paper.

IV. DATA ACQUISITION

A single Kinect for Windows sensor was used to capture videos of signers performing isolated ArSL words. For each sample, three input streams were recorded: (i) RGB video, (ii) depth video, and (iii) skeleton joints. The frame rate of both the RGB and depth cameras was set to 30 frames per second, at a resolution of 640x480 pixels.

In collaboration with Asdaa' (Association of Serving the Hearing Impaired), four participants (3 males and 1 female) volunteered to perform the signs for the generation of the dataset. It is noteworthy that the female participant wears a veil, which is common amongst Arabs, thereby demonstrating the applicability of our ArSLR system in Arab countries.

The dataset is made up of 16 words (Table I), which can be used in a hospital. Each signer performed each gesture at least three times, changing their position and distance from the camera, the background, and the lighting, thereby verifying the robustness of the recognition against such changes. A total of 215 instances were gathered; a sample of the word "You" is shown in Fig. 1.

Owing to the nature of the chosen words, it is probable that they will be used in cases of emergency; therefore, it was of great importance to enforce no restrictions on the signer. All participants used their free hands, without wearing any color markers. In addition, there were no constraints on their clothing or the background.

Fig. 1. ArSL for the word "You". RGB image (left); depth image (center); skeletal joints' locations superimposed on the color image (right).

V. DATA PREPROCESSING

After acquiring the three input streams, the frames are first extracted from the color and depth videos at a rate of 30 fps, the same rate at which the data was acquired.

Although the RGB and the depth cameras are synchronized, the acquired images must be aligned. This problem occurs due to the difference in position of the two cameras. Therefore, as a second step prior to segmentation, the Kinect color image is aligned to its corresponding depth image.
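The Kinect SDK exposes a coordinate-mapping facility for this registration. As a rough illustration of what the alignment involves, the sketch below reprojects every depth pixel into the color camera; the intrinsic matrices K_d and K_c and the rigid transform (R, t) between the two cameras are assumed to come from calibration and are not values taken from the paper.

import numpy as np

def register_depth_to_color(depth_m, K_d, K_c, R, t):
    # depth_m: HxW depth image in metres; pixels with no reading are 0 (Kinect "holes").
    # Returns an HxWx2 array giving, for every depth pixel, its coordinates in the color image.
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m.astype(np.float64)
    # Back-project depth pixels to 3D points in the depth-camera frame.
    x = (u - K_d[0, 2]) * z / K_d[0, 0]
    y = (v - K_d[1, 2]) * z / K_d[1, 1]
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Move the points into the color-camera frame and project with the color intrinsics.
    pts_c = pts @ R.T + t
    with np.errstate(divide="ignore", invalid="ignore"):
        u_c = K_c[0, 0] * pts_c[:, 0] / pts_c[:, 2] + K_c[0, 2]
        v_c = K_c[1, 1] * pts_c[:, 1] / pts_c[:, 2] + K_c[1, 2]
    # Zero-depth pixels produce invalid coordinates and should be masked by the caller.
    return np.stack([u_c, v_c], axis=-1).reshape(h, w, 2)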

VI. SEGMENTATION

In order to acquire a final image containing only the Region of Interest (ROI), which is the right hand performing the gesture, each extracted frame undergoes a two-step segmentation phase: background removal, and hand detection and segmentation.

A. Background Removal

The signer is segmented from the background using the segmentation mask provided by the Kinect sensor, where the pixels corresponding to the tracked human body are set to 1 and all other pixels are set to 0. Morphological operations are applied to the mask beforehand to further ensure that no pixels of interest are lost (as depicted in Fig. 2).

Background removal ensures that other objects (stationary or dynamic) in the background will not affect any further processing. Furthermore, it does not impose the restriction that the signer must be the closest object to the camera, since any pixel not belonging to the human body, regardless of its depth, is discarded.

Fig. 2. ArSL for the word "Before" after background segmentation. RGB image (left); depth image (right).
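A minimal sketch of this step using OpenCV is shown below; the choice of a 5x5 kernel and of a closing followed by a dilation is an assumption, since the paper only states that morphological operations are applied to the Kinect body mask.

import cv2
import numpy as np

def remove_background(color_aligned, body_mask):
    # body_mask: binary mask from the Kinect sensor, 1 for the tracked body and 0 elsewhere.
    kernel = np.ones((5, 5), np.uint8)
    # Close small gaps and slightly grow the mask so no pixels of interest are lost.
    mask = cv2.morphologyEx(body_mask.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    mask = cv2.dilate(mask, kernel, iterations=1)
    # Keep only the pixels that belong to the tracked body; everything else becomes 0.
    return cv2.bitwise_and(color_aligned, color_aligned, mask=mask)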

B. Hand Detection and Segmentation

The hand is located using the coordinates of the skeleton joint representing the right hand. The hand is segmented based solely on the depth and skeletal information as follows:
1) A square mask comprising 6% of the total body area is prepared, with the location of the right hand as its center.
2) The mask is applied to the corresponding depth image. The resulting image includes the hand, along with other unwanted pixels that may belong to either the body or the already segmented background.
3) The pixels representing the hand are extracted by validly assuming that the hand is now the nearest object to the camera (i.e., with the smallest depth) in that ROI.

The strength of the applied segmentation technique lies in its ability to accurately segment the hand when it is placed next to the body (i.e., at the same depth level), or even at a greater depth than the body. This allowed for the inclusion of gestures such as "Kidney", where the signer directs their hand towards their kidney.
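The sketch below mirrors the three steps above on a background-removed depth frame. The right-hand joint is assumed to have already been projected into depth-image coordinates, and the depth tolerance that separates the hand from pixels behind it (depth_band) is an assumed parameter not given in the paper.

import numpy as np

def segment_hand(depth_m, body_mask, hand_uv, depth_band=0.05):
    # Step 1: square ROI covering about 6% of the tracked body area, centred on the right hand.
    body_area = int(np.count_nonzero(body_mask))
    side = max(1, int(round(np.sqrt(0.06 * body_area))))
    cu, cv = hand_uv
    u0 = max(int(cu) - side // 2, 0)
    v0 = max(int(cv) - side // 2, 0)
    # Step 2: apply the ROI to the depth image (background pixels are already 0).
    roi = depth_m[v0:v0 + side, u0:u0 + side]
    valid = roi > 0
    if not valid.any():
        return np.zeros_like(roi, dtype=bool), (u0, v0)
    # Step 3: after background removal the hand is the nearest surface inside the ROI,
    # so keep only the pixels within a small band behind the minimum depth.
    nearest = roi[valid].min()
    hand_mask = valid & (roi <= nearest + depth_band)
    return hand_mask, (u0, v0)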

VII. FEATURE EXTRACTION

For every gesture, eleven features are extracted from each frame. All of the chosen features are scale, rotation, and translation invariant. The selected features relate to the phonological structure of the hands in ArSL.

A. Features Representing the Articulation Point Location of the Hands (r, θ, φ) [12]

The hand location (x, y, z) is acquired from the skeletal joints and is expressed relative to the shoulder-center joint, since it remains static throughout the gesture. Thus, the hand location is invariant to translation changes (i.e., the user can stand in any position in the room). The position of the hand is then transformed to spherical coordinates (r, θ, φ).
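As an illustration of this step, the sketch below expresses the hand joint relative to the shoulder-center joint and converts it to spherical coordinates; the particular axis convention (z taken as the polar axis) is an assumption, since the paper's equations are not reproduced here.

import numpy as np

def articulation_point_features(hand_xyz, shoulder_center_xyz):
    # Express the hand position relative to the shoulder-center joint (translation invariance).
    dx, dy, dz = np.asarray(hand_xyz, dtype=float) - np.asarray(shoulder_center_xyz, dtype=float)
    r = np.sqrt(dx * dx + dy * dy + dz * dz)      # radial distance
    theta = np.arccos(dz / r) if r > 0 else 0.0   # polar angle
    phi = np.arctan2(dy, dx)                      # azimuth
    return r, theta, phi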

The remaining features describe the hand orientation, hand shape, and hand movement. Among the hand shape descriptors, Solidity (s) describes the extent to which the shape is convex or concave.
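A common definition of solidity is the ratio of the region's area to the area of its convex hull; the sketch below uses that definition (with the OpenCV 4 findContours signature) as an illustration rather than the paper's exact equation.

import cv2

def hand_solidity(hand_mask):
    # hand_mask: binary image of the segmented hand (non-zero = hand).
    contours, _ = cv2.findContours(hand_mask.astype("uint8"), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    hand = max(contours, key=cv2.contourArea)     # largest blob is taken as the hand
    hull_area = cv2.contourArea(cv2.convexHull(hand))
    # Solidity close to 1 indicates a convex shape (e.g. a closed fist); lower values indicate concavities.
    return float(cv2.contourArea(hand) / hull_area) if hull_area > 0 else 0.0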


VIII. CLASSIFICATION USING HMM

This section briefly describes HMM. While several other classifiers exist, such as Neural Networks, Support Vector Machines, Bayesian Networks, and Template Matching, HMMs have been the most widely used and have proven successful. This is due to their ability to work on temporal sequences of data and model time-varying patterns, where temporal-domain information needs to be preserved. Detailed information on HMM can be found in [26].

An HMM is a probabilistic model representing a given process with a set of states and transition probabilities between states. These states are hidden, and all that can be seen is a sequence of observations.

An HMM consists of N states, S_i, where the transition from one state to another, at each time step t, is a probabilistic process. The set of state transition probabilities forms a transition probability matrix A = {a_ij}, where a_ij is the transition probability from state S_i to state S_j.

A set of elements is used to define an HMM:
• Matrix A, describing the state transition probabilities;
• Matrix B, describing the observation probability matrix, i.e., the PMF at each of the states;
• A set of initial probability distributions, π.

Using the above elements, the notation λ = (π, A, B) is used to define an HMM.

A 3-state, a 4-state, and a 5-state HMM were tested. The 5-state HMM gave the highest recognition accuracy and was therefore used throughout the experiments. A left-to-right HMM architecture was employed, where it is only possible to transition to the immediately neighboring state on the right or to stay in the same state.
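The paper does not state which HMM implementation was used. As an illustration, a 5-state left-to-right Gaussian HMM can be set up with the hmmlearn library (an assumed choice) by fixing the start probabilities and zeroing all backward transitions; one such model would be trained per word.

import numpy as np
from hmmlearn import hmm

def make_left_to_right_hmm(n_states=5):
    # Left-to-right topology: each state can only persist or move to its immediate right neighbour.
    transmat = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        transmat[i, i] = 0.5
        transmat[i, i + 1] = 0.5
    transmat[-1, -1] = 1.0                        # last state is absorbing
    startprob = np.zeros(n_states)
    startprob[0] = 1.0                            # every observation sequence starts in state 0
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            init_params="mc", params="mct", n_iter=50)
    model.startprob_ = startprob
    model.transmat_ = transmat                    # zero entries remain zero during re-estimation
    return model

# Training, one model per word; each sequence is a (frames x 11) feature matrix:
# model.fit(np.vstack(word_sequences), lengths=[len(s) for s in word_sequences])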

IX. EXPERIMENTAL RESULTS

Once the features were collected, a 10-fold cross-validation test was carried out on the 3-state, 4-state, and 5-state HMMs in order to assess the system's strength and evaluate its accuracy. The three models yielded the following overall accuracies:
• 3-state HMM: 73.06%
• 4-state HMM: 78%
• 5-state HMM: 80.47%
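A sketch of this 10-fold evaluation, under the assumption of one left-to-right HMM per word, is given below; a test sequence is assigned to the word whose model yields the highest log-likelihood. The variable names (sequences, labels) and the use of stratified folds are assumptions, not details from the paper.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def ten_fold_accuracy(sequences, labels, build_model):
    # sequences: list of (frames x 11) feature arrays; labels: the word performed in each sequence.
    # build_model: e.g. make_left_to_right_hmm from the previous sketch.
    labels = np.asarray(labels)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in folds.split(np.zeros(len(labels)), labels):
        models = {}
        for word in np.unique(labels[train_idx]):
            train_seqs = [sequences[i] for i in train_idx if labels[i] == word]
            m = build_model()
            m.fit(np.vstack(train_seqs), lengths=[len(s) for s in train_seqs])
            models[word] = m
        # Classify each held-out sequence by the model with the highest log-likelihood.
        correct = sum(
            max(models, key=lambda w: models[w].score(sequences[i])) == labels[i]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return float(np.mean(accuracies))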

Since the 5-state HMM achieved the highest recognition rate, the True Positive (TP) Rate, False Positive (FP) Rate, and accuracy for each individual word using the 5-state HMM are shown in Table II.

TABLE II. FP RATE AND ACCURACY FOR EACH WORD USING THE 5-STATE HMM

FP Rate: 0.010, 0.010, 0.010, 0.000, 0.000, 0.000, 0.025, 0.035, 0.020, 0.000, 0.010, 0.020, 0.030, 0.025, 0.000, 0.013
Accuracy: 76.92%, 75.00%, 85.71%, 76.92%, 66.66%, 92.31%, 60.00%, 100.00%, 78.57%, 92.31%, 92.86%, 84.62%, 62.23%, 100.00%, 78.57%, 66.67%

Analyzing the results, we notice that the word "Cough" has the lowest recognition accuracy, 60%. This is due to the great similarity between the way this gesture is performed and the gesture for "Diabetes": 21% of the Cough instances are misclassified as Diabetes. The only difference between the two gestures is that Diabetes has the index finger up, while Cough does not, as shown in Fig. 3.

Fig. 3. The similar words "Cough" (left) and "Diabetes" (right).

For the signer-independent experiment, the classifier was trained using all samples from three signers, leaving one signer out for testing. This was repeated four times, once for each signer. An average accuracy of 64.61% was achieved. The recognition accuracy for Signer 2 (Girl) is the lowest; this is because this signer is proficient in ArSL, and omitting her data from the training set had a significant impact.

TABLE III. SIGNER-INDEPENDENT RESULTS

Test Signer        Accuracy
Signer 1           70.83%
Signer 2 (Girl)    56.25%
Signer 3           63.46%
Signer 4           67.92%
Average            64.61%
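This protocol amounts to a leave-one-signer-out loop. The sketch below assumes a signers list giving the signer of each sample, and hypothetical helpers train_word_models and classify that wrap the per-word HMM training and log-likelihood classification from the previous sketches.

import numpy as np

def leave_one_signer_out(sequences, labels, signers, train_word_models, classify):
    # train_word_models(train_idx) -> dict of per-word HMMs (as in the 10-fold sketch);
    # classify(models, sequence) -> predicted word (highest log-likelihood).
    labels, signers = np.asarray(labels), np.asarray(signers)
    per_signer = {}
    for held_out in np.unique(signers):
        train_idx = np.where(signers != held_out)[0]
        test_idx = np.where(signers == held_out)[0]
        models = train_word_models(train_idx)
        correct = sum(classify(models, sequences[i]) == labels[i] for i in test_idx)
        per_signer[held_out] = correct / len(test_idx)
    return per_signer                             # one accuracy per held-out signer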

X. CONCLUSION AND FUTURE WORK

With the intention of use in medical hospitals, the primary focus of the proposed system was its robustness and applicability in situations where no restrictions on the surrounding environment or the signer need to be imposed. A new dataset for ArSL was collected using Kinect for the first time.

Refraining from the use of color markers and skin color algorithms alleviated the need to enforce any rules in order to use the system appropriately. First, no restrictions were imposed on the background or the surrounding objects. Second, the extracted features are not affected by illumination variations or lighting conditions. Third, the signer is free to wear any clothes; neither bare skin of the face, neck, or arms, nor the color of clothing, will affect segmentation. Offering a constraint-free environment increases the naturalness of HCI. In addition, the computationally expensive hand detection and segmentation task in color images is bypassed by relying only on the skeletal and depth data to track and segment the hands. The implemented segmentation method is robust in situations where the hands are over the face or at the same depth level as other body parts.

On the other hand, the data acquired from Kinect poses some problems. Due to its low resolution, both the RGB and depth images suffer from noise [9]. The depth data also suffers from holes, where some pixels have no depth information [8]. Additionally, the skeletal joints are not always accurately located, and skeletal tracking is sometimes affected by processing delays. As a result, recognition accuracy is negatively affected.

The lack of a similar or benchmark dataset for ArSLR makes it difficult to compare our results. Existing research on ArSLR uses different acquisition methods, vocabularies, and datasets, and thus a comparison would not be meaningful in this case. Our proposed system achieves a recognition accuracy of 80.47%, while providing a natural, convenient setup for the signer regardless of their physical characteristics and gender.

The presented work can be further extended to incorporate more vocabulary. In addition, hand tracking and segmentation can be enhanced to handle occlusion of the hand. Information about other body parts, such as the arms, and facial expressions can be included to improve recognition accuracy. Finally, developing a real-time version would be of great benefit.

REFERENCES

[1] M. E. Al-Ahdal and N. M. Tahir, "Review in sign language recognition systems," IEEE Symposium on Computers & Informatics, pp. 52-57, March 2012.
[2] Y. El-Sonbaty and M. A. Ismail, "Matching occluded objects invariant to rotations, translations, reflections, and scale changes," 13th Scandinavian Conference on Image Analysis, pp. 836-843, 2003.
[3] T. Starner, J. Weaver, and A. Pentland, "Real-time American sign language recognition using desk and wearable computer based video," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1371-1375, August 2002.
[4] B. Bauer and H. Hienz, "Relevant features for video-based continuous sign language recognition," IEEE International Conference on Automatic Face and Gesture Recognition, pp. 440-445, 2002.
[5] J.-S. Kim, W. Jang, and Z. Bien, "A dynamic gesture recognition system for the Korean sign language (KSL)," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, pp. 354-359, 2002.
[6] E.-J. Holden, G. Lee, and R. Owens, "Automatic recognition of colloquial Australian sign language," Seventh IEEE Workshops on Application of Computer Vision, pp. 183-188, 2005.
[7] S. Kausar and M. Y. Javed, "A survey on sign language recognition," Frontiers of Information Technology, pp. 95-98, 2011.
[8] L. Cruz, D. Lucio, and L. Velho, "Kinect and RGBD images: challenges and applications," SIBGRAPI Conference on Graphics, Patterns, and Images Tutorials, pp. 36-49, 2012.
[9] J. Han, L. Shao, D. Xu, and J. Shotton, "Enhanced computer vision with Microsoft Kinect sensor: a review," IEEE Transactions on Cybernetics, pp. 1318-1334, 2013.
[10] M. Mohandes, M. Deriche, and J. Liu, "Image-based and sensor-based approaches to Arabic sign language recognition," IEEE Transactions on Human-Machine Systems, pp. 551-557, 2014.
[11] M. Oszust and M. Wysocki, "Polish sign language words recognition with Kinect," The Sixth International Conference on Human System Interactions (HSI), pp. 219-226, 2013.
[12] D. Capilla, "Sign language translator using Microsoft Kinect XBOX 360," M.Sc. Thesis, Department of Electrical Engineering and Computer Science, University of Tennessee, 2012.
[13] A. Agarwal and M. Thakur, "Sign language recognition using Microsoft Kinect," The Sixth International Conference on Contemporary Computing, pp. 181-185, 2013.
[14] L. Geng, X. Ma, B. Xue, H. Wu, J. Gu, and Y. Li, "Combining features for Chinese sign language recognition with Kinect," 11th IEEE International Conference on Control & Automation, pp. 1393-1398, 2014.
[15] S. Moreira Almeida, F. Guimarães, and J. Arturo Ramírez, "Feature extraction in Brazilian sign language recognition based on phonological structure and using RGB-D sensors," Expert Systems with Applications, pp. 7259-7271, 2014.
[16] Z. Zafrulla, H. Brashear, T. Starner, H. Hamilton, and P. Presti, "American sign language recognition with the Kinect," 13th International Conference on Multimodal Interfaces, pp. 279-286, 2011.
[17] S. Akram, J. Beskow, and H. Kjellström, "Visual recognition of isolated Swedish sign language signs," arXiv preprint, 2012.
[18] M. Mohandes et al., "A signer independent Arabic Sign Language recognition system using face detection, geometric features and a Hidden Markov Model," Computers and Electrical Engineering, Vol. 38, 2012.
[19] Memis and Albayrak, "A Kinect based sign language recognition using spatio-temporal features," Proceedings of the SPIE, Vol. 9067, 2013.
[20] Y. El-Sonbaty, M. Hamza, and G. Basily, "Compressing sets of similar medical images using multilevel centroid technique," Proceedings of Digital Image Computing: Techniques and Applications, 2003.
[21] K. Fujimura and X. Liu, "Sign language recognition using depth image streams," 7th International Conference on Automatic Face and Gesture Recognition, pp. 381-386, 2006.
[22] S. Lang, M. Block, and R. Rojas, "Sign language recognition using Kinect," LNCS, Heidelberg, 2012.
[23] H.-D. Yang, "Sign language recognition with the Kinect sensor based on conditional random fields," Sensors, pp. 135-147, 2014.
[24] Y. El-Sonbaty, M. A. Ismail, and E. A. El-Kawe, "New algorithm for matching 2D objects," Electronic Imaging, International Society for Optics and Photonics, pp. 340-347, 2002.
[25] M. Jemni, S. Semreen, A. Othman, Z. Tmar, and N. Aouiti, "Toward the creation of an Arab gloss for Arabic sign language annotation," 4th ICTA, pp. 1-5, 2013.
[26] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, Vol. 77, pp. 257-286, 1989.

[20] Y. EI-Sonbaty, M. Hamza and G. Basily, "Compressing sets of similar medical images using multilevel centroid technique ", In Proceedings of Digital Image Computing: Techniques and Applications,2003 [21] K. Fujimura,and X. Liu, "Sign language recognition using depth image streams," in 7th International Conference on Automatic Face and Gesture Recognition,pp. 381-386,2006. [22] S. Lang, M. Block, and R. Rojas, "Sign language reocgniton using Kinect," LNCS,Heidelberg,2012. [23] HD. Yang, "Sing langaue recognition with the Kinect sensor based on conditional random fields. " Sensors,pp. 135-147,2014. [24] Y. El-Sonbaty, M. A. Ismail, and E. A El-Kawe, "New algorithm for matching 20 objects," In Electronic Imaging. International Society of Optics and Phonetics,pp. 340-347,2002 [25] M. Jemni,S. Semreen,A. Othman,Z. Tmar,and N. Aouiti, "Toward the creation of an Arab gloss for Arabic sign language annotation," in 4tl' ICTA,pp. I-5,2013. [26] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of IEEE,Volume 77, pp. 257-286,1989.