A Novel Approach to American Sign Language (ASL) Phrase Verification using Reversed Signing

Zahoor Zafrulla, Helene Brashear, Harley Hamilton, Thad Starner
School of Interactive Computing, College of Computing, Georgia Institute of Technology
{zahoor, brashear, harley.hamilton, thad}@cc.gatech.edu

Abstract

We propose a novel approach for American Sign Language (ASL) phrase verification that combines confidence measures (CMs) obtained from aligning forward sign models (the conventional approach) to the input data with the CMs obtained from aligning reversed sign models to the same input. To demonstrate our approach we use two CMs, the Normalized likelihood score and the Log-Likelihood Ratio (LLR). We perform leave-one-signer-out cross validation on a dataset of 420 ASL phrases obtained from five deaf children playing an educational game called CopyCat. The results show that for the new method the alignment selected for signs in a test phrase matches the ground truth significantly better than that of the traditional approach. Additionally, when a low false reject rate is desired, the new technique can provide better verification accuracy than the conventional approach.

1. Introduction

We have developed a game, CopyCat, for deaf signing children that requires American Sign Language (ASL) phrase verification to enable interaction. The game serves as a practice tool for deaf children and helps improve their working memory and sign language skills. The game apparatus is shown in Figure 1. A deaf child is seated facing a camera (Figure 1a) and wears colored gloves with 3-axis accelerometers attached (Figure 1b). The Flash game presents scenarios (Figure 1c), and the child is required to sign the appropriate phrase. If the signing is verified the game proceeds to the next stage. The current version of CopyCat is an extension of a previous game developed by Brashear et al. [3] and Lee et al. [7]. The game has a 19-sign vocabulary and 59 different phrases. The phrase lexicon follows the grammar:

Figure 1: CopyCat game apparatus: (a) Kiosk (b) Gloves with accelerometers (c) Flash game

    [adjective] subject predicate [adjective] object

The corresponding phrase for Figure 1c is ALLIGATOR BEHIND WALL. With increasing difficulty the scenarios may require the child to sign four- or even five-sign phrases. For three-sign phrases the adjectives are optional. Four-sign phrases require only the second adjective, whereas five-sign phrases require both. In CopyCat, the game scene directly determines the null hypothesis for the verification task. For the scenario presented in Figure 1c our hypothesis is that the input is a representation of the phrase [GREEN] ALLIGATOR BEHIND [BLUE] WALL. Signing data is collected by a push-to-sign method where the child clicks on the start button (yellow circle in Figure 1c) to begin signing and clicks the button again to stop signing. We then force-align the input with the hypothesis phrase and employ a confidence measure (CM) on the individual signs to determine if a sign is accepted or rejected. If all signs in the phrase are accepted then the phrase is verified. For force-alignment we prune the word lattice using a constrained grammar that corresponds to the null hypothesis phrase. In principle, there is then a single search path through the word lattice that produces an alignment for the various signs in the phrase. However, as we describe in Section 4.1, the deaf children may use several acceptable variations of the same sign, which results in more than one search path being explored to align the null hypothesis with the input.
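As an illustration, the constrained grammar for the scenario in Figure 1c could be written as follows in HTK-style notation, which GT2K uses [16, 19] (this particular rendering is our sketch, not the paper's exact grammar file):

    $phrase = [ GREEN ] ALLIGATOR BEHIND [ BLUE ] WALL;
    ( $phrase )

Here the square brackets mark the optional adjectives, so the only lattice paths are the four variants of the hypothesis phrase.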

In this paper we employ a novel approach to generate an additional CM that helps us make a more informed decision about (i) accepting or rejecting a sign and (ii) choosing between two alternative alignments for the same sign. We introduce the notion of using reversed signing, i.e., reversing the temporal order of the input signing for both training and testing, in addition to the regular process (forward training and testing). For the rest of the paper we will refer to the regular process as forward pass training/testing and to the reversed signing method as reverse pass training/testing. Our initial motivation for employing the reverse pass was to use it solely as an alternative to the forward pass in order to avoid false starts made by the children. We observed that on several occasions children would make false starts, committing several mistakes at the beginning of the phrase but ultimately signing the correct phrase. For example, a child might sign SNAKE BEHIND SPIDER ALLIGATOR ON WALL where the correct phrase was ALLIGATOR ON WALL. The first three signs are incorrect, but they could be rejected easily if we use a wild card in the restricted grammar of the reverse pass. The restricted grammar would be WALL ON ALLIGATOR [<ANY>], where <ANY> denotes any sign in the vocabulary (including a garbage model), the angle brackets denote one or more repetitions and the square brackets denote optional items [19]. The other solution is to use an unrestricted grammar (i.e., <ANY>), but this approach results in an explosion of search paths through the word lattice and reduced verification performance. During experimentation we found that rather than just using the reverse pass, verification performance could be improved by combining CMs from the reverse pass and the forward pass.

We have experimented with two types of CMs, namely the Normalized likelihood score and the Log-likelihood ratio (LLR). The LLR test is employed as a typical solution to speech Utterance Verification (UV), formulated as a statistical hypothesis test [6, 11, 13]. If an utterance U has been recognized as a word W, then the LLR is given as follows [6, 20]:

    LLR = log[ p(U | H0) / p(U | H1) ]                                (1)

where:
    H0 (null hypothesis): U is truly a representation of W
    H1 (alternative hypothesis): U is a representation of something other than W

The null hypothesis H0 is accepted if LLR > β, where β is the critical threshold. In this paper, we generate results by first applying the CM during forward pass testing and then compare them with the results obtained by combining the CMs from the forward pass and the reverse pass. To summarize the contributions of this paper:

1. We introduce the notion of using reverse pass training and testing to provide an additional CM that can be combined with the forward pass CM to make a more informed decision about (i) accepting or rejecting a sign and (ii) choosing between two alternative alignments for the same sign.

2. We show that combining the forward pass and reverse pass CMs can lead to a gain in verification performance in some cases; more importantly, the chosen alignment for the signs matches the ground truth better than the forward pass alignment.
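As a minimal sketch of the decision rule in Equation 1 (the function and argument names are ours; in our system the two likelihoods come from HMM alignment scores):

    import math

    def accept_h0(p_u_h0: float, p_u_h1: float, beta: float) -> bool:
        # Accept the null hypothesis iff the log-likelihood ratio
        # exceeds the critical threshold beta.
        llr = math.log(p_u_h0) - math.log(p_u_h1)
        return llr > beta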

2. Related Work

Sign language verification remains relatively unexplored compared to recognition, particularly at the phrase level. The SignTutor system developed by Aran et al. [1] verifies isolated signs in a two-stage process. In the first stage a general HMM is used to perform recognition and select one candidate class along with a cluster of signs that were confused with this class during a previous cross-validation process. In the second stage the earlier likelihood is combined with the likelihood obtained from a more dedicated model to make the final decision. There has also been some work on sign spotting, which is related to verification. Yang, Sclaroff, and Lee [17] have proposed a method to spot signs from a continuous stream of data using a conditional random field (CRF) based threshold model. Their method first detects and then recognizes signing patterns; it is able to spot signs from continuous data with 87.0% accuracy and recognize isolated signs with 93.5% accuracy.

Hidden Markov Models (HMMs) are popular for speech recognition [6, 13] as well as for sign language recognition [2, 3] since they provide a powerful architecture for building statistical models of temporally varying, limited, and noisy data. In the past, researchers have used HMMs to model signing data obtained from various kinds of sensors, ranging from single camera systems [12] to data gloves [5] and motion capture systems [15]. Gao et al. [5] have used data gloves and 3D position trackers to develop a Chinese Sign Language recognition system that achieved a word recognition accuracy of 91.9% on 1500 test sentences with a vocabulary of 5113 signs. It has been shown that ASL recognition can be significantly improved by combining data from two different sensors, such as cameras and accelerometers [4, 9].

Confidence measures (CMs) have been used in speech recognition for nearly two decades [8, 14, 18]. Rose et al. [11] were the first to formulate the utterance verification problem as a statistical hypothesis test and proposed the use of the likelihood ratio test. Sukkar and Lee [13] expanded on this work and presented a framework for discriminative utterance verification. We refer the reader to Jiang [6], who provides an extensive survey of work in speech recognition employing confidence measures.

3. Verification

For a given test phrase Φ, verification is performed independently on each sign s1, s2, ..., sn of the phrase. If all the signs have passed the verification criteria then we say that Φ has been verified. We use two confidence measures to define our verification criteria.

First, we use the Normalized likelihood score as a CM. We will refer to this as CM1, which is defined as follows:

    CM1 = log[L(Oi | si)] − (µi − γσi)                                (2)

where:
    log[L(Oi | si)]   the normalized log likelihood score of observation Oi given sign si
    µi                the mean of log[L(Oi | si)] obtained from training
    σi                the standard deviation of log[L(Oi | si)] obtained from training
    γ                 parameter to scale σi

The verification criterion is:

    V(1): CM1 > 0                                                     (3)

Second, we use the Log-likelihood Ratio as another CM; we will refer to this as CM2. For CM2 we use Equation (7) of Sukkar and Lee [13] and follow their notation:

    CM2 = log[L(Oi | si)]
          − log{ [1/(N−1)] Σ_{n=1, n≠i}^{N} exp(α log[L(Oi | sn)]) }^{1/α}     (4)

where:
    L(Oi | sn)   the likelihood score of observation Oi given sign sn
    N            size of the vocabulary
    α            a constant

If we compare Equations 1 and 4 we can see that L(Oi | si) models p(U | H0) and the geometric mean in the second term models p(U | H1). The geometric mean can be viewed as a measure of the likelihood of an anti-sign model of si [13]. The verification criterion is:

    V(2): CM2 > β                                                     (5)

where β is the threshold. Next we define CM3 as the sum of the forward and reverse pass values of CM1. Similarly, we define CM4 as the sum of the forward and reverse pass values of CM2:

    CM3 = CM1_f + CM1_r                                               (6)
    CM4 = CM2_f + CM2_r                                               (7)

The verification criteria are:

    V(3): CM3 > 0                                                     (8)
    V(4): CM4 > 0                                                     (9)

Figure 2 outlines the general procedure for applying the verification criteria V(3) and V(4). The signs in the test phrase Φ are verified if the combined CM is > 0, and an alignment is selected based on the maximum of the forward and reverse pass CMs.

    (CMf, Af) ← {forward pass confidence measures and alignments for signs in Φ}
    (CMr, Ar) ← {reverse pass confidence measures and alignments for signs in Φ}
    for all signs s do
        CM_s ← CMf_s + CMr_s
        if CM_s > 0 then
            accept s
            if CMf_s ≥ CMr_s then
                choose Af_s as the alignment for s
            else
                choose Ar_s as the alignment for s
            end if
        else
            reject s
        end if
    end for

Figure 2: Procedure for applying the verification criteria V(3) or V(4) and choosing alignments for the signs in Φ
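The procedure in Figure 2 can be sketched in Python as follows (a hypothetical rendering with our own names; the per-sign CMs and alignments are assumed to come from the forward and reverse force-alignments, with reverse pass boundaries already mapped back to forward time):

    def verify_phrase(signs, cm_fwd, align_fwd, cm_rev, align_rev):
        # Apply V(3)/V(4): accept each sign iff its combined CM > 0 and
        # keep the alignment from whichever pass scored higher; the
        # phrase is verified only if every sign is accepted.
        chosen = {}
        for s in signs:
            if cm_fwd[s] + cm_rev[s] <= 0:
                return False, {}  # one rejected sign rejects the phrase
            chosen[s] = align_fwd[s] if cm_fwd[s] >= cm_rev[s] else align_rev[s]
        return True, chosen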

4. Experiment

4.1. Data Collection

We collected signing data from 5 deaf students, aged 6-9 years, at a nearby elementary school. The CopyCat game was played in "Wizard of Oz" mode, in which a human observer played the role of the verifier. A total of 420 ASL phrases were collected, which were then hand-labeled to provide sign boundaries for training. For this work we collected signing data only from right-hand-dominant deaf children in order to eliminate the variation that would occur by including left-hand-dominant signers. However, 3 of the 19 signs in our vocabulary still showed consistent variations in our dataset and were signed in two different but acceptable ways. To accommodate these variations we increased our vocabulary to 22 signs.

4.2. Features

Our features combine vision and accelerometer data. Video frames are captured at 20 fps, and the accelerometers, which have a range of +2g to -2g, are sampled at 40 Hz. Vision features are obtained by tracking the child's eyes and the colored gloves he wears while playing the game. The corresponding accelerometer features are obtained by matching the accelerometer sample that is nearest in time to the video frame. For a complete list of features generated for each hand see Table 1.

    Type                Description
    Blob                second moment shape descriptors (length of major and minor axes, eccentricity, orientation of major axis)
    Hand Shape          shading-based features obtained by performing PCA on concatenated histograms of V (from HSV) from a 4x4 grid of the extracted hand region
    2D Image Motion     dx and dy of the blob center
    Acceleration        x, y & z acceleration values and frequency domain representation of each axis
    Pose (2D geometry)  angle formed between the blob center and the horizontal passing through the midpoint between the eyes

Table 1: Feature types
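The nearest-in-time pairing of the 40 Hz accelerometer stream with the 20 fps video frames can be sketched as follows (a sketch under the assumption that both streams carry timestamps; names are ours):

    import bisect

    def pair_nearest(frame_times, accel_times, accel_samples):
        # For each video frame timestamp, return the accelerometer
        # sample closest in time; accel_times must be sorted.
        paired = []
        for t in frame_times:
            i = bisect.bisect_left(accel_times, t)
            neighbors = [j for j in (i - 1, i) if 0 <= j < len(accel_times)]
            best = min(neighbors, key=lambda j: abs(accel_times[j] - t))
            paired.append(accel_samples[best])
        return paired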

4.3. Training and Testing Using Forward and Reversed Signed Phrases

We trained left-to-right four-state Hidden Markov Models (HMMs) with one skip transition for each of the 22 signs. The skip occurs from state one to state three. The Georgia Tech Gesture Toolkit (GT2K) [16] was used for training and testing. In addition to the regular forward pass training process we trained sign models by reversing the input features (reverse pass training). The N-th feature vector becomes the 1st, the (N−1)-th the 2nd, and so on. The hand-labeled sign boundaries are then translated to correspond to the reversed features, and the HMM topology is changed accordingly, with the skip transition now occurring from state two to state four.
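The reversal and boundary translation can be sketched as follows (our own rendering; a phrase is assumed to be a list of per-frame feature vectors with hand-labeled inclusive (start, end) frame indices per sign):

    def reverse_phrase(features, labels):
        # Reverse the temporal order of a phrase for reverse pass
        # training. Frame i maps to frame N-1-i, so a sign spanning
        # (start, end) maps to (N-1-end, N-1-start), and the sign
        # order within the phrase flips as well.
        n = len(features)
        rev_features = features[::-1]
        rev_labels = [(sign, n - 1 - end, n - 1 - start)
                      for (sign, start, end) in reversed(labels)]
        return rev_features, rev_labels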

Our intuition for using the reverse pass is as follows. Model parameters for both the forward and reverse models are estimated iteratively using the Baum-Welch method [10]. Baum-Welch does not guarantee the same results given different orders of input, and it is unlikely to produce the same values for reversed versus forward input. The children's signing is highly variable, and sometimes the end of a sign has much less variance than the beginning, or vice versa. Thus, depending on whether the models are trained on the forward or reverse data, alignment of the models' states can vary widely, and the models can be of better or worse quality depending on the variance observed at the beginning or end of the sign. A similar artifact might be observed during Viterbi decoding. In theory, the Forward algorithm should pick an optimal path (assuming that the problem is first order Markovian, which may not hold here). However, in practice, pruning is needed due to memory and processor limits. If a sign has high variance at the beginning, all paths through the Viterbi lattice could be of approximately the same low probability, and promising paths may be pruned inopportunely. With reversed input, however, the "correct" paths should have significantly higher probability values than the incorrect paths and therefore avoid pruning. Thus, the correct paths may be better preserved in one direction of Viterbi decoding than in the other. We hypothesize that this situation could directly impact the CM for each sign in the forward and reverse directions. We can take advantage of this fact and combine the measures in the manner already described in Figure 2.

Leave-one-out cross-validation was performed by training on data from 4 students and testing on data from the 5th student. Our focus is on phrase level verification; first, the test phrase Φ is force-aligned with the null hypothesis; then we check whether each sign in the phrase has passed the verification criteria outlined in Section 3. In the CopyCat game the current scene (see Figure 1c) directly determines the null hypothesis to be used. During data collection we stored the index of the game scene along with the phrase sample, which allowed us to obtain the null hypothesis for the test phrase during offline testing. Table 2 lists the number of training and test phrases used in leave-one-out cross-validation for each student. The "x 2" indicates that a second simulated test set of the same size, but containing incorrect signing, was used to determine the false alarm rate. To simulate incorrect signing we chose phrases for comparison from the test student's own naturally signed phrase examples that had one or more signs different from the phrase currently being verified. It was our goal to select as many comparison phrases as possible that differed by only one sign, in order to maximize the difficulty of the false alarm testing.

Figure 3: Distribution of simulated sign errors (number of phrases vs. number of sign errors) for each student

    Student    #Training    #Testing
    1          330          90 x 2
    2          328          92 x 2
    3          380          40 x 2
    4          315          105 x 2
    5          327          93 x 2

Table 2: Training and testing split for leave-one-student-out cross-validation

    Example 1.   test phrase    ALLIGATOR BEHIND WALL
                 error phrase   SNAKE BEHIND WALL
    Example 2.   test phrase    ALLIGATOR BEHIND WALL
                 error phrase   ALLIGATOR ON BLUE WALL

Table 3: Examples of one-sign-error phrases (errors are shown in bold)

Table 3 lists two examples of comparison phrases for the phrase depicted in Figure 1c. The first example is straightforward, with ALLIGATOR being replaced by SNAKE; however, in the second example BLUE is not an error, since the grammar used by CopyCat allows optional adjectives for 3-sign phrases and in Figure 1c BLUE is, in fact, the color of the wall. For the complete distribution of comparison phrases chosen for each student see Figure 3. With the exception of Student 5, the majority of phrases had 2 sign differences.

5. Results and Discussion

We obtained leave-one-student-out cross-validation results by applying the four verification criteria outlined in Section 3. For CM1 and CM3 the value of γ was varied between -5 and 15. For CM2 and CM4 we set the value of α to 0.5 and varied the value of β between -70 and 100.

               V(1) vs V(3)            V(2) vs V(4)
    Student    Accmax    Winner        Accmax    Winner
    1          76%       V(3)          86%       V(4)
    2          65%       V(1)          71%       V(4)
    3          86%       V(3)          88%       V(4)
    4          82%       tie           94%       V(2)
    5          69%       V(3)          81%       V(2)

Table 4: Head-to-head wins, V(1) vs V(3) and V(2) vs V(4)

Figure 4 shows the ROC plots of False Alarms vs False Rejects for each student. We see that V(2) and V(4) in general perform better than V(1) and V(3). This result is not surprising given that, according to Neyman-Pearson theory, if the exact densities L(U|H0) and L(U|H1) are known then the Likelihood Ratio Test (LRT) is the best available test, giving the least false alarm rate for a given false rejection rate. However, in practice the probability density functions are not easily obtained, but they can be approximated as in the case of Equation 4 [13]. Table 4 lists the winning verification criteria in terms of the maximum achieved accuracy (Accmax) for each student. V(3) scores 3 wins and ties once with V(1), whereas V(4) and V(2) seem equally matched, taking 3 and 2 wins respectively. However, since the number of students is small we will reserve judgement about V(4)'s performance until further analysis is conducted. From this point forward, we will restrict our discussion to comparing V(2) and V(4). Figure 5 shows accuracy plots that provide a better perspective of how V(4) performs against V(2). The blue and green shaded regions should allow the reader to make better connections between the three plots. Looking at Figure 5b, if the false reject rate is to be maintained below 20%, we can clearly see a significant advantage in choosing V(4) over V(2). However, the corresponding false alarm rates are above 20%. If one were to limit the false alarm rates to below 20% (Figure 5a) the advantage is minor, and at the same time the false reject rates become prohibitively high.

Figure 4: ROC plots for leave-one-student-out cross-validation ((a)-(e): Sentence Verification for Students 1-5; (f): Overall; axes: False Rejection (%) vs False Alarms (%); legend: V(1), V(2), V(3), V(4))

From the perspective of the CopyCat game we would like to keep the false alarm rate as low as possible to avoid any negative impact on the student's language development. Keeping β at -20, as seen in Figure 5c, gives an equal error rate of 20%, which is a good compromise between false alarms and false rejects. However, with β ≥ -20 there appears to be no significant advantage in terms of accuracy for V(4) over V(2). Strictly speaking, with the likelihood ratio test we should not consider β < 0, since in general the log-likelihood ratio will be ≥ 0 when the observation sequence matches the null hypothesis and negative otherwise. So where does this leave V(4)? We show with further analysis that there are indeed advantages to using V(4), less in terms of accuracy but more in terms of how well the chosen alignments match the hand-labeled ground truth boundaries for the signs.

Interestingly, on analyzing the true positive cases we found that the sentences verified by V(4) and V(2) were not all the same. At β = 0, Table 5 gives the exact numbers of "All" phrases verified, the "Common" ones, and the phrases "Exclusive"ly chosen by V(4) and V(2). Overall there are 420 test cases across the five students. V(4) accepts 278 phrases, and V(2) accepts 275 phrases. We see that 259 of these are common to both, 16 are exclusively chosen by V(2), and 19 exclusively by V(4). To measure the distance between the ground truth and the test case alignment we define a new cost function on the basis of the labels in Figure 6:

    d = (ds + de) / (l1 + l2)                                         (10)
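Equation 10 can be sketched as follows, under our reading of Figure 6 (not reproduced here): ds and de are taken to be the absolute offsets between the hypothesized and ground-truth start and end boundaries, and l1, l2 the lengths of the two segments:

    def alignment_error(gt, hyp):
        # Alignment distance d of Equation 10 for one sign; gt and hyp
        # are (start, end) frame indices of the ground-truth and chosen
        # alignments (our interpretation of the Figure 6 labels).
        ds = abs(hyp[0] - gt[0])
        de = abs(hyp[1] - gt[1])
        l1 = gt[1] - gt[0]
        l2 = hyp[1] - hyp[0]
        return (ds + de) / (l1 + l2)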

Recall that according to the decision rule in Figure 2, either the forward pass or the reverse pass alignment may be chosen, depending on which of the confidence measures is the higher positive value. This rule means that the alignments for the signs in phrases chosen by V(4) can differ significantly from those of V(2), even for those phrases that are common to both.

Figure 5: V(2) vs V(4) accuracy: (a) Accuracy vs False Alarms (b) Accuracy vs False Rejects (c) Accuracy vs β

                 V(2)                          V(4)
    All          |P_V(2)| = 275                |P_V(4)| = 278
    Common             |P_V(2) ∩ P_V(4)| = 259
    Exclusive    |P_V(2) − P_V(4)| = 16        |P_V(4) − P_V(2)| = 19

Table 5: True positive cases for V(2) and V(4)

Figure 6: Alignment distance (defines the labels ds, de, l1 and l2 used in Equation 10)

Figure 7: Mean alignment error (V(2) vs V(4)) over the All, Common and Exclusive phrase sets

Figure 7 shows the mean alignment error computed using Equation 10. We can clearly see that in all three cases the phrases chosen by V(4) are significantly better aligned with the ground truth. Particularly in the "Exclusive" case the alignment performance of V(4) is far superior to that of V(2).

6. Conclusion

In this paper we have introduced a novel approach to sign language phrase verification that utilizes models of reversed signing to provide an additional confidence measure during testing. We have shown that if a low false rejection rate is desired, the new method can provide better verification accuracy than forward pass verification. The biggest advantage of the new method is that the alignments chosen for the signs match the ground truth more closely. Most previous methods for recognition/verification only care whether an alignment can be generated, not about the correctness of that alignment. By comparing the performance of the verification methods in terms of alignment errors we have tried to address an issue that is mostly overlooked, and we expect that future research efforts will be dedicated to addressing it.

References

[1] O. Aran, I. Ari, L. Akarun, B. Sankur, A. Benoit, A. Caplier, P. Campr, A. Carrillo, and F. Fanarda. SignTutor: An interactive system for sign language tutoring. IEEE MultiMedia, (1):81-93, January 2009.
[2] B. Bauer, H. Hienz, and K. Kraiss. Video-based continuous sign language recognition using statistical methods. Pages 463-466, Vol. II, 2000.
[3] H. Brashear, K.-H. Park, S. Lee, V. Henderson, H. Hamilton, and T. Starner. American Sign Language recognition in game development for deaf children. In ASSETS '06, pages 79-86, New York, NY, USA, 2006.
[4] H. Brashear, T. Starner, P. Lukowicz, and H. Junker. Using multiple sensors for mobile sign language recognition. In ISWC '03, pages 45-52, Washington, DC, USA, October 2003. IEEE Computer Society.
[5] G. Fang, W. Gao, and D. Zhao. Large-vocabulary continuous sign language recognition based on transition-movement models. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 37(1):1-9, January 2007.
[6] H. Jiang. Confidence measures for speech recognition: A survey. Speech Communication, 45(4):455-470, 2005.
[7] S. Lee, V. Henderson, H. Hamilton, T. Starner, H. Brashear, and S. Hamilton. A gesture-based American Sign Language game for deaf children. In CHI '05, pages 1589-1592, New York, NY, USA, 2005. ACM Press.
[8] L. Mathan and L. Miclet. Rejection of extraneous input in speech recognition applications, using multi-layer perceptrons and the trace of HMMs. In ICASSP '91, pages 93-96, vol. 1, April 1991.
[9] R. McGuire, J. Hernandez-Rebollar, T. Starner, V. Henderson, H. Brashear, and D. Ross. Towards a one-way American Sign Language translator. In AFGR '04, pages 620-625, Washington, DC, USA, May 2004. IEEE Computer Society.
[10] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, pages 257-286, 1989.
[11] R. Rose, B. Juang, and C. Lee. A training procedure for verifying string hypotheses in continuous speech recognition. In ICASSP '95, volume 1, pages 281-284, May 1995.
[12] T. Starner and A. Pentland. Visual recognition of American Sign Language using hidden Markov models. In AFGR '95, pages 189-194, 1995.
[13] R. Sukkar and C.-H. Lee. Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition. IEEE Transactions on Speech and Audio Processing, 4(6):420-429, November 1996.
[14] R. Sukkar and J. Wilpon. A two pass classifier for utterance rejection in keyword spotting. In ICASSP '93, volume 2, pages 451-454, April 1993.
[15] C. Vogler and D. Metaxas. ASL recognition based on a coupling between HMMs and 3D motion analysis. In ICCV '98, pages 363-369, 1998.
[16] T. Westeyn, H. Brashear, A. Atrash, and T. Starner. Georgia Tech Gesture Toolkit: supporting experiments in gesture recognition. In ICMI '03, pages 85-92, New York, NY, USA, 2003. ACM Press.
[17] H. Yang, S. Sclaroff, and S. Lee. Sign language spotting with a threshold model based on conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(7):1264-1277, July 2009.
[18] S. Young. Detecting misrecognitions and out-of-vocabulary words.
[19] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland. The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK, 2006.
[20] D. Yu, Y. C. Ju, and A. Acero. An effective and efficient utterance verification technology using word n-gram filler models. In Proceedings of Interspeech 2006 - ICSLP: 9th International Conference on Spoken Language Processing, Pittsburgh, PA, USA, 2006.