Person-Independent 3D Sign Language Recognition

Jeroen F. Lichtenauer (1), Gineke A. ten Holt (1,2), Marcel J.T. Reinders (1), and Emile A. Hendriks (1)

(1) Information and Communication Theory Group, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, the Netherlands
(2) Human Information Communication Design, Delft University of Technology, Landbergstraat 15, 2628 CE, Delft, the Netherlands
{j.f.lichtenauer, g.a.tenholt, m.j.t.reinders, e.a.hendriks}@tudelft.nl

Abstract. In this paper, we present a person-independent 3D system for judging the correctness of a sign. The system is camera-based, using computer vision techniques to track the hands and extract features. 3D co-ordinates of the hands and other features are calculated from stereo images. The features are then modeled statistically, and automatic feature selection is used to build the classifiers. Each classifier is meant to judge the correctness of one sign. We tested our approach using a 120-sign vocabulary and 75 different signers. Overall, a true positive rate of 96.5% at a false positive rate of 3.5% is achieved. The system’s performance in a real-world setting largely agreed with human expert judgement.

1 Introduction

Sign languages are natural languages that emerge in Deaf communities and possess their own grammars and vocabularies. For (pre-lingually) deaf children, sign language is the only language that can be acquired naturally, and as such it is important for their development [1]. However, most deaf children are born into hearing families [1]. This means that their parents often have to learn sign language themselves first, and that the amount and quality of natural language available to a deaf child is poor compared to what hearing children receive.

We aim to build an interactive learning environment (ELo) [2] for young deaf children to practise their sign language vocabulary. With such a system available at home or in the classroom, children would have an extra source of sign language input and extra opportunities to practise signing and receive feedback on their signs. ELo consists of several modules. In one module, the child is asked to make a certain sign, and the system gives feedback on the correctness of that sign. To realize this module, it was necessary to build a sign language recognizer that can judge the correctness of a sign. This recognizer is the subject of this paper.

The recognizer's purpose imposes several requirements. It must work in real time and be person-independent. It must be mobile and work in different surroundings (at school, at home). It must be vision-based, because we do not want to encumber the children with sensors or markers. And it must

deal with variation and sloppiness. Since ELo’s goal is vocabulary training, the recognizer only needs to handle isolated signs.

Previous work in the field of automatic sign language recognition includes approaches using Hidden Markov Models (HMMs) [3–5] and various machine learning techniques [6–8]. The recognition rates of these early systems were typically around 80-90%, and they were trained and tested on single signers (except for [6], who used six signers). More recent projects, using HMMs representing whole words [9, 12] or parts of words [10, 11], achieved better results: 85-98%. However, none of them provide person-independent test results. [13] achieved a recognition rate of 92% with whole-word HMMs, but this dropped to 84% when six signers were used. Similarly, [14] achieved 98% accuracy with one signer, but only 55% with multiple signers. Clearly, the variation in sign execution between persons remains a problem for sign recognition. Current attempts to remedy this problem include adapting basic sign models to specific signers [14, 15] and gathering an appropriate sign language corpus [16]. All projects described used medium-sized vocabularies (10-262 signs), except [13], who worked with 5,113 signs. Some projects were vision-based; others ([4, 6, 7, 11, 13]) worked with gloves and/or trackers.

For ELo, we need person-independent sign recognition: different children must be able to work with it. It must be noted, however, that our aim is slightly different from that of the aforementioned projects: instead of distinguishing between a set of signs, we want to judge the correctness of a sign. This means that we want to distinguish each sign from “anything else”. We therefore create a one-class classifier for each sign in our vocabulary. Its purpose is to take an input sign and judge whether it was the expected sign or not. This makes our task both easier and harder than simple discrimination. Easier, because we know which sign we expect; harder, because we must be able to exclude anything else, even movements we have never seen before.

In the next section, we give an overview of our system. The subsequent sections discuss the system’s components in detail. Recognition results are given in section 6, and section 7 presents the discussion.

2 System Setup

Figure 1(a) gives an overview of the physical setup of the system. The child is seated in front of a touchscreen, through which he or she interacts with the system. Above the screen, stereo cameras are placed to record the child’s signs. To control the environment, the system is set up inside a cube-shaped tent. This ensures that lighting, background and distance to the cameras are controlled, which makes it possible to use the system in different locations. Indirect lighting is provided by shining four 11 W lamps (with a total light output equivalent to a 240 W bulb) onto the white wall behind the touchscreen. The computer running the system is outside the tent. There is room for an adult supervisor at the side. Since skin colour is used to track the hands, the child must wear long-sleeved clothes. Figure 1(b) shows the setup in reality.


Fig. 1. (a) Overview of the system. The child is seated at a table behind a touchscreen. His/her signs are recorded by stereo cameras. Indirect lighting is provided by shining light onto the white wall. There is room for an adult supervisor. A tent (made to look like a play castle on the outside) encloses the system, ensuring no interference from other light sources/persons. (b) The setup in reality.

In figure 2, the components of the recognition system are shown schematically. Signs are recorded with two calibrated digital cameras (Allied Vision Technologies ’Guppy’), at 25 fps and a resolution of 640 x 480 pixels. Currently, the start and end of a sign must be indicated by putting the hands in a fixed position on the table top. The hands and head are found and tracked using a skin colour model and various tracking techniques, which are discussed in the next section. From the tracked hands, several properties are measured, such as the position, size and angle of the hand blobs in both cameras. From these properties, a set of features can be calculated, among which are the 3D co-ordinates of the hands. Different examples of a sign must then be synchronized. These steps are described in section 4. After that, a classifier can be trained using feature selection, as described in section 5.

Fig. 2. Flow diagram of the sign recognition system. Input from the two cameras is combined into 3D features in the feature extraction step.

3 Image Analysis

The image processing operations used to measure 3D hand locations can be divided into two layers: single-camera tracking, followed by disparity refinement. Processing in this order makes computing a complete disparity map unnecessary, which saves a significant amount of redundant computation, since disparity measurements are only needed for the hands and face.


Fig. 3. Image processing example. (a) shows the left camera image with the back-projected 3D hand positions as squares, whose size represents the estimated depth. (b) shows the motion segmentation. (c) shows the skin segmentation, where a buffered face image is used to remove skin pixels belonging to the face.

3.1 Single-camera Tracking

The hands and face are found around their previous locations by a combination of blob tracking and template searching. This is done separately for both cameras (2D), but the depth from the previous time frame is used as prior information on hand size. When possible and necessary, tracking is automatically (re-)initialized by assigning the skin blobs to hands and head according to their position.

Blob Tracking finds the blob whose center of gravity is closest to the previous hand location. Blobs are connected components of a skin color segmentation. As long as no occlusion occurs and the segmentation is reliable, blob tracking is both fast and robust. To get a reliable skin segmentation, we use our adaptive model described in [17]. The method fits a 3-part piecewise linear model to the positive samples in RGB space. The model is robust against intensity offset and ambient lighting color, and a model estimated from one person applies to a large range of other skin colors, depending on lighting conditions and skin color difference. In a semi-automatic initialization procedure, skin and other (non-skin) samples are collected from a camera image. These samples are used to build the skin color model and to find thresholds on the distributions. For that, the opposite

corners of a few rectangles have to be indicated manually: one containing the inner face (positive samples) and two containing all skin regions of the face and hands, respectively. The negative samples are all pixels outside the indicated rectangles.

The initialized model provides a skin likelihood for any RGB color tuple. However, simply thresholding this likelihood results in many false positive skin detections. Instead, two different segmentation thresholds are applied: a high (H) and a low (L) threshold. The H segmentation contains few false positives, but many misses. L covers almost all skin area, but also contains many false positives in the background and the clothes. False positives are reduced to a minimum by using the positive detections of L only around areas with positive detections of H. This is usually done using hysteresis thresholding. To limit the computation time spent on dilations, we reduced this to only one big dilation after the first threshold. Sporadic false positives in H are removed by a density filtering F_d that sets a lower threshold on the number of positive pixels in a local neighborhood around each positive pixel, using the integral image. The final skin segmentation is obtained by:

S_s = C{ D(F_d(H)) ∩ L }    (1)

where ∩ denotes a logical AND, D a dilation and C a morphological closing to connect falsely detached segments.

For computational efficiency, the H and L thresholds are applied off-line to all possible RGB tuples C = [C_R, C_G, C_B]^T ∈ {0, ..., 255}^3 and stored in lookup tables T_H(C_R, C_G, C_B) and T_L(C_R, C_G, C_B), respectively. This also makes it possible to combine the likelihood model (a simplified generalization of reality) with a histogram of the positive and negative initialization samples. Values negative in T_L that coincide with a high number of positive samples are added to T_L to reduce false negatives; values positive in T_H that coincide with a high number of negative samples are removed from T_H to reduce false positives. To reduce the data size for effective caching, RGB space is quantized into 64x64x64 color bins and the boolean table values are packed into 32-bit words, resulting in 64 kB of data altogether. To further reduce false negatives, T_H is applied to a larger image size (320x240) and the result is saved to a 160x120 segmentation in which one pixel is positive if at least one of the four corresponding pixels in the larger image is positive.

Template Tracking finds the local maximum of the correlation with a template copied from the hand location in the previous frame. In the template search, the template value differences are weighted by a Gaussian function after limiting them to a maximum of 20, to reduce the effect of outlier and background pixels inside the template and search area. The template search is automatically adapted to the situation. The search grid scale is linearly dependent on the distance of each hand to the camera in the previous frame, and the grid size (number of points) is reduced significantly if no motion is detected at the previous hand location. Furthermore, only grid points within the skin segmentation

are considered. When motion is detected at the previous hand location, the search is further reduced to grid points in areas with both skin and motion. Motion areas are segmented by a threshold on the local sum of absolute frame differences, using the integral image method. Figure 3(b) shows a motion segmentation example. The noise threshold for motion segmentation is determined in the same initialization procedure as the skin color model (paragraph 3.1). Because it is very cumbersome and impractical to obtain an image containing no moving persons at all (especially with a wide-angle camera), the 50% most still regions are used for setting the threshold. This assumes that at least 50% of the image contains no motion (only noise), which is usually the case in a normal situation where only one person sits in front of the camera.

It is very difficult to track a hand in front of the face, especially when the hand changes shape. Therefore, the search area is further reduced by face segmentation. In each video frame where no hand is near the head area, the area of the head is copied from the gray-scale image. The pixels in each new frame that are similar enough to this buffered face image are removed from the skin segmentation, resulting in a face-less skin segmentation image that is used to reduce the tracking search space. Figure 3(c) shows a face-less skin segmentation example.
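As an illustration only, the two-threshold segmentation of Eq. (1) could be sketched as follows. The boolean lookup tables (here `lut_h` and `lut_l`) are assumed to have been built beforehand from the initialization samples, and the window size, count threshold and dilation amount are placeholder values, not the settings of the actual system.

```python
import numpy as np
from scipy import ndimage

def skin_segmentation(img, lut_h, lut_l,
                      density_win=7, density_min=10, dilate_iter=5):
    """Sketch of Eq. (1): Ss = C( D(Fd(H)) AND L ).

    img   : HxWx3 uint8 RGB image
    lut_h : 64x64x64 boolean table, strict (high) skin threshold
    lut_l : 64x64x64 boolean table, loose (low) skin threshold
    """
    # Quantize RGB into 64x64x64 bins and look up both segmentations.
    q = img >> 2                                  # 256 levels -> 64 bins
    H = lut_h[q[..., 0], q[..., 1], q[..., 2]]
    L = lut_l[q[..., 0], q[..., 1], q[..., 2]]

    # Density filter Fd: keep H-pixels with enough positive neighbours.
    counts = ndimage.uniform_filter(H.astype(np.float32), size=density_win)
    Fd = H & (counts * density_win**2 >= density_min)

    # One large dilation D instead of full hysteresis thresholding.
    D = ndimage.binary_dilation(Fd, iterations=dilate_iter)

    # Combine with the loose segmentation and close small gaps (C).
    Ss = ndimage.binary_closing(D & L, structure=np.ones((3, 3)))
    return Ss
```

With the tables precomputed, the per-pixel work reduces to two lookups plus cheap morphology, in line with the caching-oriented table layout described above.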

Combined Blob/Template Tracking. The results of blob and template tracking are combined depending on the situation. When a hand blob is free from the other hand and outside the head/hair area, the blob center is considered most reliable. It is averaged with the template search result to get a more precise estimate, but only if the two are close enough; otherwise, only the blob center is used. When a hand blob is merged with the other hand blob, or close to the head/hair, only the template search result is used. When the two hand blobs are merged and their difference in depth is large, only the hand closest to the camera is tracked, while the other is assumed to be still.
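The combination rules themselves are simple branching logic. A minimal sketch, with hypothetical occlusion flags and an assumed agreement threshold, might look like:

```python
import numpy as np

def combine_estimates(blob_center, template_center,
                      blob_is_free, near_head, max_gap=15.0):
    """Sketch of the blob/template combination rules (2D, one camera).

    blob_center, template_center : (x, y) estimates or None
    blob_is_free : blob is not merged with the other hand's blob
    near_head    : blob overlaps the head/hair area
    max_gap      : assumed agreement threshold in pixels
    """
    if blob_is_free and not near_head and blob_center is not None:
        if (template_center is not None and
                np.linalg.norm(np.subtract(blob_center, template_center)) < max_gap):
            # Both trackers agree: average for a more precise estimate.
            return tuple(np.mean([blob_center, template_center], axis=0))
        return blob_center          # otherwise trust the blob centre
    return template_center          # occlusion: fall back on template search
```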

3.2 Disparity Refinement

For each single-camera tracking result, the stereo disparity is measured by a coarse-to-fine block search of the located hand patch along the epipolar curve (a distorted line) in the other camera image, with a range slightly wider than the maximal expected displacement from the previous 3D location. If the single-camera tracking results are good and the disparities estimated from left to right and from right to left are correct, they should be very close to each other. In that case, the disparities are averaged to get a more precise and stable estimate of the hand location. If the two results do not correspond, the result closest to the previous 3D location of the hand is used. If the result is physically impossible (too far away or moving too fast), it is ignored and the previous 3D location and templates are retained. The refined 3D hand locations are projected back to camera coordinates to facilitate tracking in the next frame.
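A simplified sketch of the left/right consistency check and fallback logic is given below; the triangulation function, the disparity agreement tolerance and the motion limit are illustrative assumptions, not the actual calibrated values.

```python
import numpy as np

def refine_disparity(d_lr, d_rl, prev_xyz, triangulate,
                     agree_tol=1.0, max_jump=0.25):
    """Sketch of the disparity consistency check.

    d_lr, d_rl  : disparities estimated left->right and right->left
    prev_xyz    : previous 3D hand location
    triangulate : function mapping a disparity to a 3D point (assumed given)
    agree_tol   : assumed tolerance for the two searches to "agree"
    max_jump    : assumed maximum plausible displacement between frames
    """
    if abs(d_lr - d_rl) < agree_tol:               # the two searches agree:
        xyz = triangulate(0.5 * (d_lr + d_rl))     # average for stability
    else:                                          # otherwise keep the one
        cand = [triangulate(d_lr), triangulate(d_rl)]   # closest to the
        xyz = min(cand, key=lambda p: np.linalg.norm(np.subtract(p, prev_xyz)))
    if np.linalg.norm(np.subtract(xyz, prev_xyz)) > max_jump:
        return np.asarray(prev_xyz)                # physically implausible: keep old
    return np.asarray(xyz)
```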

Table 1. Feature types extracted for classification

left/right hand coordinates                h_{l/r}(t) = [X_{l/r}(t), Y_{l/r}(t), Z_{l/r}(t)]^T
left/right hand motion                     S{ ||d h_{l/r}(t)/dt||, c_ḣ }
left/right hand acceleration               S{ ||d² h_{l/r}(t)/dt²||, c_ḧ }
left/right sideways orientation            θ_{S,l/r}(t) = arcsin( dX_{l/r}(s)/ds )
left/right upward orientation              θ_{U,l/r}(t) = arcsin( dY_{l/r}(s)/ds )
left/right forward orientation             θ_{F,l/r}(t) = arcsin( dZ_{l/r}(s)/ds )
left/right hand motion curvature           κ̃_{l/r}(t) = S{ κ_{l/r}(t), c_κ }
left/right hand motion curvature change    S{ dκ̃_{l/r}(t)/dt, c_κ̇ }
left/right hand size change                S{ dB_{l/r}(t)/dt, c_Ḃ }

4 Feature Extraction

Feature extraction for each sign consists of two steps: converting the measured data into relevant feature types, and time warping to obtain fixed-length, synchronized feature vectors.

4.1 Feature types

The measurements and feature types obtained for each video frame are shown in table 1, where X, Y, Z are the horizontal, vertical and depth co-ordinates, respectively (the median face location is taken as the origin of the hand coordinates), t is the time frame number, s is the arc length of the hand motion path, and B is the hand blob size in pixels. No hand shape features other than size change could be robustly extracted from the skin blobs. Several features are mapped with a sigmoid function S(f, c):

S(f, c) = 1 / (1 + exp(−f/c))    (2)

where c is a scaling parameter that determines where the sigmoid flattens out. In the time derivative features, the sigmoid mapping acts as a soft threshold to obtain invariance to signer speed. For the curvature κ̃, it reduces a logarithmic, infinite-range measurement to a limited feature range.
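As a small example, the hand-motion feature from Table 1 (the sigmoid-mapped speed) could be computed as follows; the scaling constant `c_speed` is an arbitrary placeholder, not the value used in our system.

```python
import numpy as np

def sigmoid(f, c):
    """Eq. (2): soft threshold / range compression."""
    return 1.0 / (1.0 + np.exp(-f / c))

def hand_motion_feature(h, c_speed=0.15):
    """Sketch of the hand-motion feature: sigmoid of the speed ||dh/dt||.

    h : Tx3 array of 3D hand coordinates over time (one hand)
    """
    speed = np.linalg.norm(np.diff(h, axis=0), axis=1)   # frame-to-frame speed
    return sigmoid(speed, c_speed)
```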

4.2 Time Warping

The features are aligned with a fixed-length feature model by a time warping procedure. This is done using ‘Statistical Dynamic Time Warping’ (SDTW), first introduced by Bahlmann and Burkhardt [18]. The difference between SDTW and normal DTW is that instead of comparing two signals using a fixed distance measure, SDTW compares a new signal t = [t_1, ..., t_{N_t}] to a sequence of statistical feature models R = [R_1, ..., R_{N_R}]. The distance function for time frame i of a gesture t then depends on the time frame (or state) j of the model R.


Fig. 4. SDTW example. (a) shows how Statistical Dynamic Time Warping (SDTW) maps a gesture (upper) onto the feature model (lower). The signals have been shifted vertically for visualization. (b) When the mapping is done for all signs, we get a distribution of values (dashed lines) for each time instance of the feature model. Three distributions are shown. This example shows only one feature type (y-position of the right hand), but the SDTW match is made using all properties combined, and distributions are modeled for all feature types at all model frames.

This makes SDTW more robust against variation than normal DTW. The applied distance function is the negative log probability, based on a Gaussian model R_j = {μ_j, Σ_j}:

d(t_i, R_j) = 1/2 ( ln|2πΣ_j| + (t_i − μ_j)^T Σ_j^{-1} (t_i − μ_j) )    (3)

where Σ_j is the covariance matrix and μ_j the mean of model-frame j. The optimal time warping Φ* is found by minimizing:

− ln p(t, Φ|R) = Σ_{n=1}^{N_Φ} d(t_{φ_t(n)}, R_{φ_R(n)})    (4)

where Φ = {φ_t(1), ..., φ_t(N_Φ), φ_R(1), ..., φ_R(N_Φ)} are the steps of the path through the 2D correspondence matrix of the time frames of t and R, constrained by transitions [(φ_t(n+1) − φ_t(n)), (φ_R(n+1) − φ_R(n))] ∈ {[0, 1], [1, 0], [1, 1]}, corresponding to horizontal, vertical and diagonal steps, respectively. Note that we have left out transition probabilities, as they did not improve the result. Equation 4 is minimized efficiently using the Viterbi algorithm. The SDTW model R is trained on a set of examples by iteratively warping all training samples and recomputing each μ_j and Σ_j from the aligned observations until convergence, starting from an initial model R_0. When multiple frames of t are mapped onto the same frame of R, the respective feature vectors are averaged in the final warped signal. Figure 4(a) gives an example of synchronisation through SDTW; figure 4(b) shows examples of R_j = {μ_j, Σ_j} at three time points in R.
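A compact sketch of the alignment in Eqs. (3) and (4) is shown below. For brevity it uses diagonal covariances and plain dynamic programming with backtracking, whereas the model described above uses a full covariance matrix Σ_j per model frame; the iterative model training is not shown.

```python
import numpy as np

def sdtw_align(t, mu, var):
    """Sketch of SDTW alignment (Eqs. 3-4) with diagonal Gaussians.

    t   : Nt x F observed feature sequence
    mu  : NR x F model-frame means
    var : NR x F model-frame variances
    Returns the total cost and the warping path as (i, j) pairs.
    """
    Nt, NR = len(t), len(mu)
    # d(t_i, R_j): Gaussian negative log-likelihood (Eq. 3, diagonal case)
    d = 0.5 * (np.log(2 * np.pi * var)[None, :, :]
               + (t[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :]).sum(axis=2)

    # Accumulate costs with the allowed transitions [0,1], [1,0], [1,1].
    D = np.full((Nt, NR), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(Nt):
        for j in range(NR):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i > 0 else np.inf,                    # vertical
                       D[i, j - 1] if j > 0 else np.inf,                    # horizontal
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)      # diagonal
            D[i, j] = d[i, j] + prev

    # Backtrack the optimal path Phi*.
    i, j, path = Nt - 1, NR - 1, [(Nt - 1, NR - 1)]
    while (i, j) != (0, 0):
        steps = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: D[s])
        path.append((i, j))
    return D[-1, -1], path[::-1]
```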

5 Sign Classification

Since not all features of a time-warped sign are relevant for classification, a feature selection procedure determines which features to use. The selected set of features is classified with a classifier that assumes that all features have been perfectly aligned.

5.1 Feature Selection

A feature f_j(m) of type m (see table 1), corresponding to the normalized time frame j, is selected for classification only if the middle 50% of its distribution over the training examples of the correct sign (positive examples) has an overlap of less than 25% with the distribution of the training examples of incorrect signs (negative examples).
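A rough sketch of this overlap test, with the quartile computation made explicit, could look like the following; it is an illustration of the stated rule, not the exact implementation.

```python
import numpy as np

def select_feature(pos_values, neg_values, max_overlap=0.25):
    """Sketch of the feature selection rule.

    A feature is selected if the middle 50% of its positive-class
    distribution contains at most `max_overlap` of the negative-class
    distribution.
    """
    q25, q75 = np.percentile(pos_values, [25, 75])   # middle 50% of positives
    neg = np.asarray(neg_values)
    overlap = np.mean((neg >= q25) & (neg <= q75))   # fraction of negatives inside
    return overlap < max_overlap
```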

5.2 Feature Classifier

After feature selection, a relatively large number of features still remains (around 500), and it is difficult to obtain a large multi-signer training set (the variation of a single signer does not generalize well to others). Because of the curse of dimensionality, we assume independence between features. The classification is based on the same measure as equation 3, but with a warped signal t̂ and an independent variance per feature type:

ln p(t̂_j(m) | R_j(m)) = −1/2 ( ln(2πσ_j²(m)) + (t̂_j(m) − μ_j(m))² / σ_j²(m) )    (5)

However, instead of computing a total log likelihood as the sum of the feature log likelihoods, the feature likelihood distributions are first converted into partial uniform functions:

q(t̂_j(m), R_j(m)) = 1 if ln p(t̂_j(m) | R_j(m)) ≥ T_j(m) − T_g, and 0 otherwise    (6)

where T_g is the gauge parameter that determines the operating point of the final classifier and T_j(m) is the threshold that accepts 90% of the positive training data for a particular feature at T_g = 0. By using a piece-wise uniform likelihood function, all outliers are penalized equally, no matter how far they lie from the mean feature value. Furthermore, the flat top makes it possible to accept sloppy but completely correct signs, while rejecting incorrect signs that are very similar to a subset of the feature models (e.g. incomplete signs). The classifier output is generated by:

Q(t, R) = Σ_{j=1}^{N_R} Σ_{m=1}^{N_m} s_j(m) q(t̂_j(m), R_j(m))    (7)

where s_j(m) is 1 for selected features and 0 otherwise, and N_m is the number of feature types, equal to 25 (see table 1). A sign is classified by:

C(t, R) = correct if Q(t, R) ≥ T_C, incorrect if Q(t, R) < T_C    (8)

where T_C is fixed to the value that classifies 50% of the positive training set correctly at T_g = 0 (the median of Q).
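Putting Eqs. (5)-(8) together, a minimal sketch of the per-sign classifier might look as follows; the thresholds T_j(m), T_g and T_C are assumed to be given, determined on the training data as described above.

```python
import numpy as np

def classify_sign(t_hat, mu, var, T, s, Tg, TC):
    """Sketch of the one-class sign classifier (Eqs. 5-8).

    t_hat  : NR x Nm warped feature values of the input sign
    mu, var: NR x Nm per-feature means and variances of the sign model
    T      : NR x Nm per-feature log-likelihood thresholds (accepting 90%
             of the positive training data at Tg = 0)
    s      : NR x Nm boolean mask of selected features
    """
    # Eq. (5): per-feature Gaussian log-likelihood
    loglik = -0.5 * (np.log(2 * np.pi * var) + (t_hat - mu) ** 2 / var)
    # Eq. (6): piecewise-uniform acceptance per feature
    q = (loglik >= T - Tg).astype(int)
    # Eq. (7): count accepted, selected features
    Q = np.sum(s * q)
    # Eq. (8): final decision against the fixed threshold TC
    return "correct" if Q >= TC else "incorrect"
```

Because q(·) is a hard indicator, Q simply counts how many selected features fall inside their tolerance band, which is what makes the classifier insensitive to the magnitude of individual outliers.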

6 Results

Our vocabulary consisted of 120 signs from the standard lexicon of Dutch Sign Language. We recorded these signs from 75 different adult persons (all right-handed), giving us 75 examples per class, a total of 9,000 signs. We trained and tested our system using 5-fold cross-validation. In each cycle, we used 60 positive examples for training and 15 others for testing. As for the negative examples, in each cycle we used 96 signs (of all 75 persons) for training and the other 23 (also of all 75 persons) in the test set. This was to ensure that the negative (incorrect) examples in the test set had not been seen in training, so that we could test our ability to reject movements we had never seen before.

Figure 5(a) shows the results in the form of a confusion matrix. The rows are the 120 detectors, the columns the 120 classes of test sign. The intensity of the cells shows how often the detector judged the test sign ‘correct’. Ideally, this figure would be black along the diagonal and white everywhere else. We tested the system at different settings of the tolerance T_g (as opposed to T_C). We made an ROC curve for each individual detector and averaged them. The result is shown in figure 5(b). If we want our recognizer to detect, for example, 96.5% of the correct signs, then 3.5% of incorrect signs will also be accepted. For comparison, we included the ROC curve of a standard HMM approach on our dataset (whole-word HMMs with 40 states, comparable to those used by [14]). Our system clearly outperforms the HMM approach.

The recognizer was also tested in a real-world situation as part of ELo. ELo was set up at a school for deaf and hard-of-hearing children, and ten children worked with it in eight 15-minute sessions over a period of four weeks. During the test, the recognizer processed the children’s signs in real time and judged their correctness. The recordings of three children were also shown to a sign language instructor experienced in working with young children. She was asked to evaluate the correctness of the children’s signs, and her judgement was then compared to the judgement of the recognizer. 78 signs were tested this way. The instructor judged 66 signs correct; of these, the recognizer judged 60 correct. Of the 12 signs found incorrect by the instructor, the recognizer judged 6 incorrect.

7 Discussion

In this paper, we presented a person-independent sign recognition system. Because we want the system to judge the correctness of a sign, we built a set of

Fig. 5. (a) Confusion matrix of the sign recognition. Rows represent the one-class classifiers for each sign, columns are the classes of the test signs. The colour of the cells indicates the number of times a detector judged a test sign of that class as correct. (b) Average ROC curve of the sign recognition at varying tolerance levels. At an operating point of 96% true positives, about 3% of false positives will occur. For comparison, a standard HMM approach (whole word HMMs with Bakis topology and 40 states) was also used. The HMM ROC curve was created by varying the recognition threshold.

one-class classifiers, one for each class. Tests show that the system not only works well for adults, scoring significantly better than a standard HMM approach, but also for the target group, young deaf children, even though it was trained on adults only.

Confusion between signs often arises when signs differ only in handshape. To deal with such pairs, we need to collect more detailed information on the handshape. Compared to a human expert, the recognizer appears to be too strict for correct signs. Because of the small number of incorrect signs in the test, it is difficult to draw general conclusions about these. However, the instructor will reject some signs based on incorrect handshape, and this is information the recognizer does not have, which may explain the incorrect acceptances. Larger tests are necessary to draw reliable conclusions about the recognizer’s performance on incorrect signs.

By changing the tolerance parameter T_g, the recognizer can be made less strict, so that its judgement of correct signs conforms to that of a human expert. However, it is possible that human experts are too accepting in some cases, as a form of positive reinforcement (rewarding the attempt instead of judging the result). In these circumstances, it may be preferable to let the recognition device retain its own consistent measure of acceptability, rather than copying human teachers too closely. The recognizer can of course maintain different tolerance settings for different age groups. Within a group, however, it would maintain a fixed threshold, against which progress can be measured accurately. To achieve this, however, the recognizer’s ‘blind spot’ for handshapes must be remedied.

References

1. Schermer, G., Fortgens, C., Harder, R., De Nobel, E.: De Nederlandse Gebarentaal. Van Tricht, Twello (1991)
2. Spaai, G.W.G., Fortgens, C., Elzenaar, M., Wenners, E., Lichtenauer, J.F., Hendriks, E.A., de Ridder, H., Arendsen, J., Ten Holt, G.A.: A computer program for teaching active and passive sign language vocabulary to severely hearing-impaired and deaf children (in Dutch). Logopedie en Foniatrie 80, 42–50 (2004)
3. Grobel, K., Assam, M.: Isolated sign language recognition using hidden Markov models. In: IEEE Int. Conf. on Systems, Man and Cybernetics, pp. 162–167. IEEE (1997)
4. Liang, R.H., Ouhyoung, M.: A real-time continuous gesture recognition system for sign language. In: 3rd Int. Conf. on Face & Gesture Recognition, pp. 558–565. IEEE Computer Society, Los Alamitos (1998)
5. Starner, T., Weaver, J., Pentland, A.: Real-time American sign language recognition using desk and wearable computer based video. IEEE TPAMI 20, 1371–1375 (1998)
6. Waldron, M., Kim, S.: Isolated ASL sign recognition system for deaf persons. IEEE Transactions on Rehabilitation Engineering 3, 261–271 (1995)
7. Kadous, W.: Machine recognition of Auslan signs using PowerGloves: Towards large-lexicon recognition of sign language. In: Workshop on the Integration of Gesture in Language and Speech, pp. 165–174 (1996)
8. Holden, E.J., Owens, R., Roy, G.: Adaptive fuzzy expert system for sign recognition. In: Int. Conf. on Signal and Image Processing, pp. 141–146 (1999)
9. Zieren, J., Kraiss, K.F.: Non-intrusive sign language recognition for human-computer interaction. In: IFAC-HMS Symposium (2004)
10. Bauer, B., Kraiss, K.F.: Towards an automatic sign language recognition system using subunits. In: Goos, G., Hartmanis, J., van Leeuwen, J. (eds.) GW 2001. LNCS, vol. 2298, pp. 123–173. Springer, Heidelberg (2001)
11. Vogler, C., Metaxas, D.: Handshapes and movements: Multiple-channel American sign language recognition. In: Carbonell, J.G., Siekmann, J. (eds.) GW 2003. LNCS, vol. 2915, pp. 247–258. Springer, Heidelberg (2004)
12. Bowden, R., Windridge, D., Kadir, T., Zisserman, A., Brady, M.: A linguistic feature vector for the visual interpretation of sign language. In: Kanade, T. et al. (eds.) 8th ECCV. LNCS, vol. 3021, pp. 390–401. Springer, Heidelberg (2004)
13. Chen, Y., Gao, W., Fang, G., Yang, C., Wang, Z.: CSLDS: Chinese sign language dialog system. In: IEEE Int. Workshop on Analysis and Modeling of Faces and Gestures, pp. 236–237. IEEE (2003)
14. von Agris, U., Schneider, D., Zieren, J., Kraiss, K.F.: Rapid signer adaptation for isolated sign language recognition. In: Conf. on Computer Vision and Pattern Recognition Workshop, p. 159. IEEE Computer Society, Los Alamitos (2006)
15. Wang, C., Chen, C., Gao, W.: Generating data for signer adaptation. In: Int. Workshop on Gesture and Sign Language based Human-Computer Interaction (2007)
16. von Agris, U., Kraiss, K.F.: Towards a video corpus for signer-independent continuous sign language recognition. In: Int. Workshop on Gesture and Sign Language based Human-Computer Interaction (2007)
17. Lichtenauer, J., Hendriks, E., Reinders, M.: A self-calibrating chrominance model applied to skin color detection. In: Int. Conf. on Computer Vision Theory and Applications (2007)
18. Bahlmann, C., Burkhardt, H.: The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping. IEEE TPAMI 26, 299–310 (2004)