Real-Time Arabic Sign Language (ArSL) Recognition

Nadia R. Albelwi
College of Computer Science and Engineering, Taibah University, Madinah, Saudi Arabia
[email protected]

Yasser M. Alginahi
College of Computer Science and Engineering, Taibah University, Madinah, Saudi Arabia
[email protected]

Abstract— This paper presents a vision-based system that provides a feasible solution to Arabic Sign Language (ArSL) recognition of static alphabet gestures. The proposed method does not require signers to wear gloves or any other marker devices to simplify the process of hand segmentation. A Haar-like feature-based algorithm is used to track the hand across consecutive frames, and the bounded images become the region of interest after two preprocessing steps: skin detection and size normalization. The resulting images are transformed into the frequency domain using the Fourier transform to form the feature vectors. Classification is then performed by the K-Nearest Neighbor (KNN) algorithm. The proposed system achieved a recognition accuracy of 90.55%.

Keywords- Arabic Sign Language (ArSL); Fourier descriptors; Haar-like features; skin detection; K-Nearest Neighbor.

I. INTRODUCTION

Achieving real-time performance is an important area of research in computer vision, and several applications have been built on this requirement in security, education, criminology, etc. Intelligent parking systems for different vehicles are a well-known example. Tracking moving objects in video is an attractive task, but the more important task is to classify those moving objects into meaningful classes. Object recognition in a dynamic scene is not easy, given that many influential factors, such as object size, illumination, viewing angle, and movement speed, must be borne in mind when designing a recognition system [1]. Since fast response is the distinguishing characteristic of real-time applications, a rapid object detection method is needed. Sign language has one of the most structured sets of gestures; it is the natural means of communication among deaf people all over the world. There is an obvious communication gap between deaf and hearing people due to the difficulty hearing people face in learning and comprehending sign languages and the difficulty deaf people face in learning an oral language [2]. Several solutions have been suggested to reduce this gap. One solution is to have an interpreter; however, an interpreter is not always available. Technology has also played a significant role in this field, and there has been much research with different approaches concentrating on developing automatic sign language recognition. Unfortunately, ArSL has not received as much attention as other languages, such as ASL (American Sign Language), except for a few attempts, which are explained in the related work section.

In this work, a vision-based approach is proposed to develop an automatic ArSL recognition system with the help of advanced image processing techniques. The paper is organized as follows: section 2 presents the related work on the topic, section 3 gives a brief overview of the proposed system, section 4 explains the data collection and preprocessing, section 5 describes the feature extraction methodology, section 6 presents the proposed classification method, section 7 discusses the experimental results, and section 8 concludes the paper.
II. RELATED WORK

There has been surging interest in recognizing human hand gestures. Much image processing research has taken this path in order to find a robust way to translate the sign language used by deaf people into a form understandable by others; the recognized sign is expressed either as text or as speech. Several methods have been suggested for recognizing hand gestures, varying in the way they treat the problem, and no single method has emerged as the best. The methodologies used in sign language recognition can be categorized based on the feature extraction method, the input type, and the hardware dependency. Traditionally, there have been three main types of sign language recognition: hand-shape classification, isolated sign recognition, and continuous sign classification [3]. Hand-shape classification is considered one of the most important topics in sign language recognition; during the period 1994-1998, a sign was treated simply as a string of hand shapes. Many researchers focus on isolated signs, since the isolated sign is the basic unit of a sign language [3-4]. Based on how the features of gestures are extracted, sign language recognition methods are either vision based or device based. A vision-based system uses a video camera with an image processing pipeline to recognize and classify the sign within the image. This approach has the advantages that the user does not have to wear clumsy devices and that it can incorporate facial expressions. However, it depends on complex image processing and large amounts of data, and its obvious weaknesses are lower accuracy and high computing power consumption [5]. In a device-based system, the user wears a device that measures the physical features of gestures, e.g., dimensions, angles, motions, and colors. These devices are equipped with a number of sensors that generate a set of electrical signals characterizing the intended sign. Instrumented gloves, e.g., the CyberGlove, have been conceived as useful devices for recognizing sign languages [6].
Attempts at machine sign language recognition began to appear in the literature in the 1990s. Charayaphan and Marble [5] investigated a way of using image processing to understand ASL. Their method involves hand motion detection, hand location tracking based on the motion, and classification of signs using adaptive clustering of stop positions, the simple shape of the trajectory, and matching of the hand shape at the stop position; the system can recognize 27 of the 31 ASL symbols correctly. Takahashi and Kishino used a range classifier to recognize 46 Japanese Kana manual gestures [7]. Based on experiments, the hand gestures were encoded with data ranges for joint angles and hand orientations. Their system recognized 30 of the 46 hand gestures correctly, but the remaining 16 signs could not be reliably recognized. Since isolated words are considered the basic unit of sign language, many researchers focus on isolated sign language recognition. In regard to ArSL recognition, there has been very little work on developing such applications. In 2004, Al-Buraiky [8] proposed a system for automatically recognizing ArSL using an instrumented glove as the interface device and the support vector machine as the classification algorithm; the support vector machine was chosen because it was a relatively new approach to machine learning with many attractive features compared to competing approaches. In 2006, Ibrahim Mohammad [9] presented a pioneering work on the automation of ArSL recognition using the CyberGlove as an interface device and principal component analysis for feature extraction. Mohandes et al. [10] proposed an image-based system for ArSL recognition in 2007. They used a Gaussian skin color model to detect the signer's face, and the centroid of the detected face served as a reference for tracking the hand movement, using region growing, across the sequence of images comprising each sign. A number of features were then selected from the detected hand regions across the image sequence, and recognition was performed with a Hidden Markov Model, achieving an accuracy of about 93% on a data set of 300 signs under the leave-one-out method. In 2007, Al-Jarrah and Halawani [11] used neuro-fuzzy systems to recognize gestures in ArSL. They designed a collection of adaptive neuro-fuzzy inference system (ANFIS) networks, each trained to recognize one gesture. Their system did not rely on gloves or visual markings; instead, it deals with images of bare hands, which allows the user to interact with the system in a natural way. The system was able to recognize the 30 Arabic manual alphabets with an accuracy of 93.55%. AL-Rousan et al. [12] introduced the first automatic ArSL recognition system based on HMMs in 2009. They used a large set of samples to recognize 30 isolated words from standard ArSL. Their system did not rely on input devices such as gloves, which allows deaf signers to perform gestures freely and naturally, and it obtained a high recognition rate in both offline and online modes. The recognition system presented in this paper differs from previous work in that it uses Fourier descriptors as the feature extraction method, which has not been tried before in the field of sign language recognition.
III. SYSTEM ARCHITECTURE

The general architecture of the proposed ArSL recognition system is shown in Figure 1. A high-resolution video camera captures real-time video as the input to the recognition system. The architecture can be divided into two stages: hand detection and sign recognition. In hand detection, a Haar-like classifier tracks the hand movement frame by frame to decide on the final region to be processed; motion detection is then applied to determine the exact moment at which recognition should take place. Sign recognition involves the usual image processing steps: first, the preprocessing methods, namely size normalization and skin detection, are applied; then features are extracted, and finally classification is performed using the KNN algorithm.

Figure 1. System Architecture.
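As a rough illustration of this two-stage loop, the following Python/OpenCV sketch shows how the pieces could fit together; it is not the authors' implementation (which was built in Visual C#). The cascade file name hand_cascade.xml is a hypothetical placeholder, and the helper functions named in the comments are sketched in the later sections.

```python
import cv2

# "hand_cascade.xml" is a hypothetical file name standing in for the
# open/closed-hand cascades trained in Section IV.
hand_cascade = cv2.CascadeClassifier("hand_cascade.xml")

cap = cv2.VideoCapture(0)  # default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Stage 1: hand detection -- candidate bounding boxes per frame.
    hands = hand_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in hands:
        roi = frame[y:y + h, x:x + w]
        # Stage 2: sign recognition -- preprocessing, features, classification.
        # normalize_size, detect_skin, fourier_descriptors and knn_classify
        # are sketched in the sections below.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("ArSL recognition", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```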

IV. DATA COLLECTION AND PREPROCESSING

Four sets of images are used, either to generate the classifiers or as training samples for the system. A total of 200 positive samples of open and closed hands were collected. The pool of positive samples was then expanded by generating new samples derived from the originals using createsamples, a free utility from the OpenCV (Open Source Computer Vision) library that multiplies positive samples by varying certain factors. Negative samples, on the other hand, were obtained from a ready-made database designed for the task of eye detection [13]. The positive and negative samples, together with the generated positive samples, are shown in Figure 2. Signs for six alphabets are used as the training set in the system: ص (Saad), ك (Kaaf), ب (Baa), ز (Za'a), ل (Laam), and ي (Ya'a). Three adult individuals provided the training data, each giving about 35 samples per alphabet. The collected samples were then labeled and stored for training. In practice, a training set cannot cover all variants of hand appearance and thus may not correctly classify an unknown test hand. To overcome this problem, test images were processed and saved as new training images.
Figure 2. (a) Set of the positive samples (containing hand shapes). (b) Set of the original negative samples. (c) Set of the samples generated from (a) and (b) using the createsamples utility.
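The paper relies on OpenCV's createsamples utility for this expansion. Purely as an illustration of the idea, not of the utility's actual behavior, the sketch below derives perturbed variants of a positive image; the rotation and brightness parameters are arbitrary and the input file name is hypothetical.

```python
import cv2

def perturb(img, angle_deg, brightness):
    """Rotate by a small angle and shift brightness: one synthetic variant."""
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    rotated = cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REPLICATE)
    return cv2.convertScaleAbs(rotated, alpha=1.0, beta=brightness)

img = cv2.imread("hand_positive_000.png")  # hypothetical file name
variants = [perturb(img, a, b) for a in (-10, 0, 10) for b in (-30, 0, 30)]
for i, v in enumerate(variants):
    cv2.imwrite(f"hand_generated_{i:03d}.png", v)
```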

A. HAND TRACKING
The Haar classifier is a learning method for object detection originally developed by Viola and Jones [14]. The algorithm uses simple features reminiscent of Haar basis functions, which had been used by Papageorgiou et al. [15]. Figure 3 shows the three main kinds of Haar features needed: a two-rectangle feature gives the difference between the sums of the pixels within two rectangular regions; a three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle; and a four-rectangle feature computes the difference between diagonal pairs of rectangles.

Figure 3. Haar-like features.

The algorithm is described in three steps. First, the integral image representation is used to rapidly calculate the values of the rectangular features. The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), as given in equation (1):

ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')    (1)
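A minimal NumPy sketch of equation (1), together with the constant-time rectangle sum it enables (OpenCV's cv2.integral performs the same computation):

```python
import numpy as np

def integral_image(img):
    """ii(x, y) = sum of all pixels above and to the left of (x, y), inclusive."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y),
    recovered from at most four lookups in the integral image."""
    total = ii[y + h - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0 and y > 0:
        total += ii[y - 1, x - 1]
    return total
```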
The second step is to train the classifier using a large training set of positive images, which contain the object, and negative images of the same size. Given these sets, in addition to the feature set, any number of classifiers can be trained. The final step is to apply these classifiers to a region of interest having the same size as the images used in the training step: the classifier outputs "1" if the region is likely to show the object (the hand) and "0" otherwise. To search for the object in the whole image, one moves the search window across the image and checks every location with the classifier. The classifier is designed so that it can easily be "resized" in order to find objects of interest at different sizes, which is more efficient than resizing the image itself; so, to find an object of unknown size, the scan procedure is repeated several times at different scales [16]. Since the Haar stage is used only to set the position of the object box in the image, two classifiers are needed, one for the open hand and one for the closed hand.
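A sketch of this multi-scale search using OpenCV's cascade API, where detectMultiScale rescales the detector between passes as described above; the two cascade file names are hypothetical stand-ins for the open- and closed-hand classifiers trained in this work.

```python
import cv2

# Hypothetical file names for the two cascades trained in this work.
open_hand = cv2.CascadeClassifier("open_hand.xml")
closed_hand = cv2.CascadeClassifier("closed_hand.xml")

def detect_hand(gray_frame):
    """Return the first hand bounding box found, trying both cascades.
    scaleFactor controls the step between detector scales per pass."""
    for cascade in (open_hand, closed_hand):
        boxes = cascade.detectMultiScale(gray_frame,
                                         scaleFactor=1.1,
                                         minNeighbors=5,
                                         minSize=(40, 40))
        if len(boxes) > 0:
            x, y, w, h = boxes[0]
            return (x, y, w, h)
    return None
```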
B. SIZE NORMALIZATION
In practice, robust classification requires a low-dimensional vector space. However, a binary image of size x × y is naturally represented by a vector in an x · y dimensional space. Size normalization, in addition to dimension reduction techniques, is widely used to resolve this problem; it is a crucial preprocessing stage in the development of robust object recognizers wherever significant size variations appear in the training or testing patterns. Therefore, prior to feature extraction, size normalization is applied: the size of each image is normalized to 150 × 150 pixels. This procedure reduces the amount of data to be processed while keeping the layout the same. Size normalization is applied to the training images in the database and is also implemented in the program, so that test samples of varying sizes are normalized to the standard output size.
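A one-step OpenCV sketch of this normalization, using the 150 × 150 target size stated above:

```python
import cv2

STANDARD_SIZE = (150, 150)  # output size used in this work

def normalize_size(hand_img):
    """Rescale a detected hand region to the standard 150 x 150 size
    so every image yields a feature vector of the same dimensionality."""
    return cv2.resize(hand_img, STANDARD_SIZE, interpolation=cv2.INTER_AREA)
```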
C. SKIN DETECTION
Prior to any recognition step, the object must be localized in the image and separated from the background. Because of the characteristic color values of human skin, its invariant properties, and its computational simplicity, the color cue is a good choice for this [17]. In general, to detect skin in a color image, the image is first converted to the HSV (hue, saturation, value) color space. Every pixel is then set to black or white according to the values of its H and S channels: if the value lies in the predefined skin-color range, the pixel is set to white, otherwise to black. The ranges differ according to the skin colors of people in different regions. The resulting image shows all the skin-colored areas in the image, including the desired object, the hand in our case.
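A minimal sketch of this thresholding; the H and S bounds below are illustrative placeholders only, since the paper notes that the valid range depends on regional skin tones:

```python
import cv2
import numpy as np

# Illustrative H/S bounds only; the paper tunes these per regional skin color.
LOWER_SKIN = np.array([0, 40, 60], dtype=np.uint8)
UPPER_SKIN = np.array([25, 255, 255], dtype=np.uint8)

def detect_skin(bgr_img):
    """Convert to HSV and keep pixels whose H and S fall in the skin range:
    skin pixels become white (255), everything else black (0)."""
    hsv = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)
    return cv2.inRange(hsv, LOWER_SKIN, UPPER_SKIN)
```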
V. FEATURE EXTRACTION

Selecting the right set of features is the decisive key to avoiding ambiguity in any pattern recognition system. These features should discriminate between patterns of different categories while grouping variants of the same category together. In the case of sign language, it is necessary to choose features that reflect the phonetic structure of the language, and, in order to gain better recognition, shift- and rotation-invariant features are extracted [18]. Each image in the training samples is decomposed into its components. In the case of hand gestures, the abstract representation of the image is what matters (Figure 4); therefore, contour-based methods are usually used, in which only the shape boundary information is exploited and the interior information is ignored completely.
Figure 4. Set of hand gesture shapes.

Fourier descriptors (FDs) have many advantages over other contour shape representations, including easy normalization, low computation complexity, robustness, and good retrieval performance [19]. FDs are generated by applying the Fourier transform to a shape signature; the transform coefficients are called the FDs of the shape. The shape signature is a one-dimensional function derived from the coordinates of the shape boundary. Although different kinds of shape signature can be used separately to obtain FDs, such as complex coordinates, the curvature function, the cumulative angular function, and the centroid distance, merging more than one kind leads to FDs with significantly different shape retrieval performance. The first step in computing the FDs is obtaining the shape boundary coordinates (x(t), y(t)), t = 0, 1, ..., N-1, where N is the number of boundary points. The boundary coordinates can be extracted using an 8-connectivity contour tracing technique during the preprocessing stage. Then, as shown in equation (2), the distance r(t) between each boundary point and the centroid (x_c, y_c) of the shape is calculated [19]:

r(t) = \sqrt{(x(t) - x_c)^2 + (y(t) - y_c)^2}    (2)

where x_c = \frac{1}{N}\sum_{t=0}^{N-1} x(t) and y_c = \frac{1}{N}\sum_{t=0}^{N-1} y(t).

Then, the Fourier descriptors (FDs) of the shape are derived from the coefficients of the discrete Fourier transform of r(t), given by equation (3):

a_n = \frac{1}{N}\sum_{t=0}^{N-1} r(t)\, \exp\left(\frac{-j 2 \pi n t}{N}\right)    (3)

Due to the similarity among ArSL representations, a large number of FDs is required to uniquely identify each shape.
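A compact sketch of equations (2) and (3) on a binary hand mask: trace the boundary, form the centroid-distance signature, and take its DFT. Two assumptions not stated in the paper: the OpenCV 4 findContours return signature, and the division of the magnitudes by |a_0| for scale invariance, which is borrowed from common FD practice.

```python
import cv2
import numpy as np

def fourier_descriptors(binary_img, n_descriptors=32):
    """Centroid-distance Fourier descriptors of the largest contour."""
    # OpenCV >= 4 return signature assumed (contours, hierarchy).
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea).squeeze()  # N x 2 points
    x = boundary[:, 0].astype(float)
    y = boundary[:, 1].astype(float)
    xc, yc = x.mean(), y.mean()                  # shape centroid
    r = np.sqrt((x - xc) ** 2 + (y - yc) ** 2)   # equation (2): r(t)
    a = np.fft.fft(r)                            # equation (3): DFT of r(t)
    # Magnitudes discard phase (rotation invariance); dividing by |a_0|
    # adds scale invariance -- an assumed normalization, not from the paper.
    return np.abs(a[1:n_descriptors + 1]) / np.abs(a[0])
```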
VI. CLASSIFICATION

The task of classification is to assign a new feature vector to one of a set of predefined categories in order to recognize the sign. Each category consists of a set of feature vectors obtained during the training phase from a number of training images. Classification mainly concentrates on finding the best match for the new vector among the set of reference feature vectors. KNN is one of the most commonly used methods in sign language recognition systems. It uses the feature vectors generated during the training phase to find the K nearest neighbors of the query in the feature space, and the query vector is classified by a majority vote of its neighbors, which are taken from a set of objects whose correct classification is known [20]. The Euclidean distance measure is used to calculate the difference between the query and target shape feature vectors and to return the approximate nearest neighbors.
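A self-contained sketch of the KNN vote over FD vectors with Euclidean distance; the choice k = 3 is arbitrary here, since the paper does not report its value of K.

```python
import numpy as np
from collections import Counter

def knn_classify(query, train_vectors, train_labels, k=3):
    """Label a query FD vector by majority vote among its k nearest
    training vectors under Euclidean distance."""
    dists = np.linalg.norm(train_vectors - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```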
VII. EXPERIMENTAL RESULTS

For our experiments, we used a single LifeCam VX-5500 camera to capture the image sequences. The camera acquires 12 frames per second at a resolution of 640 × 480 pixels (1.3-megapixel stills). The application was built using the Express edition of Visual C#, together with tools provided by the OpenCV library to create and edit the hand samples used by the Haar classifier. None of the training samples was used for testing. Instead, all samples captured during the execution of the system were used for this purpose, and those recognized with a rate above 90% were assigned as new training images in their specific classes. We selected fifteen samples from two different signers for each of the six signs used in the system, for a total of 180 signs. The collected signs varied in shape, scale, and brightness. As shown in Table I, out of the 180 samples used to test the system, only seventeen samples were misclassified, resulting in a recognition accuracy of 90.55%.
TABLE I. RECOGNITION RESULTS

Class        Error rate    Class        Error rate
Saad (ص)     1/30          Kaaf (ك)     0/30
Baa (ب)      5/30          Za'a (ز)     4/30
Laam (ل)     3/30          Ya'a (ي)     4/30

Total error = 17/180 = 9.45%    Accuracy rate = 90.55%
VIII. CONCLUSION
This paper presented a vision-based automatic sign language recognition system for Arabic letters. Experimental results showed that the system is capable of recognizing Arabic alphabets effectively with no need for additional hardware such as gloves or sensors. The proposed method uses predefined Haar classifiers to detect the hand's position and a set of processes to prepare the image for the feature extraction task. After normalizing the bounded image, skin detection is applied in the HSV color space. Transforming the images into the frequency domain using the Fourier transform allows every sign in the database to be represented by a single FD-based feature vector, which eases the matching process. This feature extraction scheme allowed the use of a simple classification technique, KNN. Classification using the proposed methodology achieved up to 90.55% recognition accuracy in real time. Further development will extend the proposed system into a complete Arabic sign language recognition system covering all alphabets, numbers, and the most common gestures, as well as enhance the feature extraction methodology in order to achieve a higher recognition rate.
REFERENCES

[1] L. Zhang, S. Li, X. Yuan and S. Xiang, "Real-time Object Classification in Video Surveillance Based on Appearance Learning," 2007.
[2] M. Maraqa and R. Abu-Zaiter, "Recognition of Arabic Sign Language (ArSL) Using Recurrent Neural Networks."
[3] I. Sandjaja and N. Marcos, "Sign Language Number Recognition," Fifth International Joint Conference on INC, IMS and IDC, pp. 1503-1508, 2009.
[4] Q. Wang, X. Chen, L.-G. Zhang, C. Wang and W. Gao, "Viewpoint invariant sign language recognition," Computer Vision and Image Understanding, Vol. 108, pp. 87-97, 2007.
[5] C. Charayaphan and A. Marble, "Image processing system for interpreting motion in American Sign Language," Journal of Biomedical Engineering, Vol. 14, pp. 419-425, 1992.
[6] J. Kramer and L. Leifer, "The Talking Glove: An Expressive and Receptive 'Verbal' Communication Aid for the Deaf, Deaf-Blind, and Nonvocal," SIGCAPH 39, pp. 12-15, Spring 1988.
[7] T. Takahashi and F. Kishino, "Gesture coding based in experiments with a hand gesture interface device," ACM SIGCHI Bulletin, Vol. 23, pp. 67-73, 1991.
[8] S. Al-Buraiky, "Arabic sign language recognition using an instrumented glove," Master's thesis, King Fahd University of Petroleum and Minerals, 2004.
[9] A. Wray, S. Cox, M. Lincoln and J. Tryggvason, "A formulaic approach to translation at the post office: reading the signs," Language & Communication, pp. 59-75, 2004.
[10] M. Mohandes, S. Quadri and M. Deriche, "Arabic Sign Language Recognition: an Image-Based Approach," 21st International Conference on Advanced Information Networking and Applications Workshops, 2007.
[11] O. Al-Jarrah and F. Al-Omari, "Improving gesture recognition in the Arabic sign language using texture analysis," Applied Artificial Intelligence, Vol. 21, Issue 1, January 2007.
[12] M. AL-Rousan, K. Assaleh and A. Tala'a, "Video-based signer-independent Arabic sign language recognition using hidden Markov models," Applied Soft Computing, Vol. 9, Issue 3, June 2009.
[13] http://note.sonots.com, visited 21 Dec. 2010.
[14] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, pp. I-511-I-518, 2001.
[15] P. Song, S. Winkler, S. Gilani and Z. Zhou, "Vision-Based Projected Tabletop Interface for Finger Interactions," Human-Computer Interaction, pp. 49-58, 2007.
[16] http://cgi.cse.unsw.edu.au, accessed 3 Dec. 2010.
[17] A. Albiol, L. Torres and E. J. Delp, "Optimum color spaces for skin detection," Proceedings of the International Conference on Image Processing, Vol. 1, pp. 122-124, 2001.
[18] D. Zhang and G. Lu, "An Integrated Approach to Shape Based Image Retrieval," Gippsland School of Computing and Information Technology, 2002.
[19] A. Folkers and H. Samet, "Content-based Image Retrieval Using Fourier Descriptors on a Logo Database," Quebec City, Canada, August 2002.
[20] T. Messer, "Static hand gesture recognition," Department of Informatics, 2009.