Full Body Person Identification Using the Kinect Sensor

Virginia O. Andersson
Center for Technological Development
Federal University of Pelotas
Pelotas, RS, Brazil
Email: [email protected]

Ricardo M. Araujo
Center for Technological Development
Federal University of Pelotas
Pelotas, RS, Brazil
Email: [email protected]

Abstract—Identifying individuals using biometric data is an important task in surveillance, authentication and even entertainment. The task is more challenging when it must be performed without physical contact and at a distance. Analyzing video footage of individuals for patterns is an active area of research aimed at fulfilling this goal. We describe results on classifiers trained to identify individuals from data collected from 140 subjects walking in front of a Microsoft Kinect sensor, which allows tracking 3D points representing a subject's skeleton. From this data we extract anthropometric and gait attributes to be used by the classifiers. We show that anthropometric features are more important than gait features, but using both allows for higher accuracies. Additionally, we explore how different numbers of subjects and numbers of available examples affect accuracy, providing evidence of how effective the proposed methodology can be in different scenarios.

I. INTRODUCTION

Biometric identification systems are present in several applications, from access restriction and surveillance tasks to e-commerce, entertainment services and profile customization [1]. These systems can be classified as active or passive. Active systems require the subject to interact with an interface or collection mechanism; fingerprint and iris biometric systems are examples. Passive systems, on the other hand, do not require subject interaction or contact with an interface for data extraction; the subject may even be unaware that biometric identification is taking place. Examples of passive biometric systems are voice and face recognition. A more recent trend is to use full-body measurements to perform passive identification, which makes it harder for an individual to mask the observed attributes. The measurement of body parts (anthropometry) and gait analysis are examples of techniques that leverage this approach. Passively extracting body measurements and gait features typically requires video processing and complex image analysis. The process often involves extracting an individual's skeleton - a set of key points that can be useful for identification purposes (e.g. body joints). At the end of 2010, Microsoft introduced the Kinect sensor for its X-Box 360 video game console, which allows a user to control the console using gestures. For this purpose, the sensor and its API automatically extract a skeleton from the user or users standing in front of the sensor. Such automatic extraction has the potential to simplify the

process of extracting useful anthropometric and gait features for biometric identification. In this paper, we improve on previous work and investigate the usefulness of data provided by a Kinect sensor to perform biometric identification using a combination of anthropometric and gait data from a novel and extensive data set. We captured skeleton data of 140 individuals walking in front of the sensor and trained three classifiers with the data, aiming at automatically identifying each person from the attributes provided by the sensor. We tested the accuracy of the classifiers using data extracted from individuals' gaits, along with anthropometric information (i.e. body part lengths). The main contribution of this paper is to provide results on gait and anthropometric recognition over a novel and extensive data set composed of features extracted from a Kinect sensor, taking into account different group sizes and numbers of available examples. This data set uses free-cadence walks, thus removing the interference and artificiality introduced by the use of treadmills, and contains far more subjects than most previous approaches - e.g. [2]-[4]. It also contributes a detailed methodology to extract, process and use the captured data.

II. RELATED WORK

Anthropometry as a biometric system was first proposed by Alphonse Bertillon, of the French police, in the mid-19th century. The technique became popularly known as Bertillonage and consisted of taking measurements of several body parts and using them to identify offenders [5]. This method was soon replaced by the analysis and storage of fingerprints, which proved more versatile [6]. Using human gait as a biometric feature was motivated by evidence that individuals describe unique patterns during their gait cycles [7], [8]. Human gait has more than 20 distinct components composing the identity pattern [7]. In [8], gait analysis was restricted to the sagittal (a vertical plane that divides the human body into right and left parts) rotations of the hips, knees and ankles during the gait cycle, due to limitations of computer vision techniques at the time. Later, works such as [9] presented evidence that these lower-segment joints concentrate the main features that account for the individuality of gaits. Spatiotemporal and kinematic parameters of the gait cycle are among the most important components of human gait [8], [9].

Spatiotemporal parameters are those described as a function of time, such as the time spent executing a gait cycle, the velocity of displacement and the duration of the gait cycle's intermediate stages. In [9], the authors used the stride length, gait cycle phase, gait cycle duration and speed of the individual as spatiotemporal parameters. Kinematic parameters are generally associated with the angles described by the body segments' joints during gait; these angles are measured from one segment to another. Approaches to extracting and analyzing human gait can be classified as (i) model-based, where the gait is described by gait theory fundamentals and reconstructed through a model (e.g. a stick figure) that is fitted to the person in every frame of the gait sequence; the spatiotemporal and kinematic parameters are then extracted from the model during the gait cycle; and (ii) model-free, where features based on the movement behavior, shape and silhouette of the individual are used as attributes for gait pattern recognition, without explicitly considering gait fundamentals [10]. Model-free approaches include the concept of Gait Energy Images [11]. In [4], Gait Energy Volumes were proposed as an extension, and a database containing depth information on 15 subjects walking towards the Kinect sensor at two different speeds was used to test the methodology. Perfect accuracy was obtained for this small number of subjects. In [12], the authors proposed the Depth Gradient Histogram Energy Image to improve identification when many more subjects are involved, reporting high accuracy (81%-92%); however, the database used contains only very few gait cycles due to a fixed sensor placement. This approach therefore focused on generic object-motion characteristics, without considering gait signature information. In [3], the Kinect sensor was also used to capture videos of 10 individuals executing two different actions: walking and running.
They proposed an approach to train a Support Vector Machine capable of differentiating between the two actions. They also proposed an identity classifier using anthropometric measures and motion patterns of twenty 3D joints provided by a Kinect sensor. The classifier achieved 93% accuracy for action recognition and 90% for identity recognition. These high accuracies are possibly due to the small number of subjects used. In [2], only anthropometric attributes were used to train classifiers. The authors captured 8 subjects walking in front of the sensor and trained a model to identify individual subjects using only static measurements of body parts. Even though encouraging results were reported (upwards of 90%), the data set was too small for strong conclusions to be drawn. Compared to previous approaches, the present paper makes use of a more comprehensive data set (140 users), extends the attributes to include model-based dynamic gait information and provides more in-depth experiments on the data.

III. METHODOLOGY

A. The Kinect Sensor

The Kinect sensor is a human movement tracker that does not require special markers or direct contact with the subject.


Fig. 1. Semi-circular path used for subjects’ walks, showing the Kinect sensor at the center equipped with a spinning dish to allow for tracking.

The sensor and its API simplify the steps of capturing the video stream, pre-processing images and applying computer vision algorithms to create a skeleton model of a subject. The sensor is equipped with an RGB camera and a depth sensor composed of an infrared light emitter and an infrared-sensitive camera. This hardware is paired with a software library called the NUI (Natural User Interface) API, which retrieves data from the sensors and controls the Kinect device. In particular, the NUI Skeleton API is responsible for providing detailed information about the location, position and orientation of individuals in front of the sensor. This information is provided to the application as a set of three-dimensional points called "skeleton points". These points approximate the main joints of the human body and the actual position of the individual in front of the sensor. The API provides data in the form of frames, each containing an array of all the points extracted at the moment of capture. The sensor is able to provide up to 30 frames per second. The Kinect sensor used to capture the subjects was the 2010 model for the X-Box 360, and the SDK was version 1.0 beta2.

B. Capturing Methodology

In order to capture the movement of individual subjects, volunteers walked in front of the sensor while data was being recorded. Subjects performed a semi-circular trajectory, as illustrated in Fig. 1. A spinning dish was used to move the Kinect sensor so that it followed the person during the walk. This combination of trajectory and the Kinect's panning camera movement allows several gait cycles to be captured per individual without the distortions caused by subjects moving in or out of the sensor's field of view. Each subject executed a round-trip, free-cadence walk, starting on the left of the sensor and walking clockwise to its right. The subject then returned, walking counterclockwise and stopping at the initial point (Figure 1). Each subject performed the round trip 5 times, generating 5 walk samples containing the 3D points of the tracked joints.
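For illustration, the captured data can be represented with a structure like the following sketch. The type and field names are our own, chosen for clarity; they are not the NUI API's actual types:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A skeleton point is a 3D coordinate relative to the sensor.
Point3D = Tuple[float, float, float]

@dataclass
class SkeletonFrame:
    timestamp: float            # seconds since the capture started
    joints: Dict[str, Point3D]  # e.g. "hip_center", "knee_left", ...

@dataclass
class WalkSample:
    subject_id: str             # anonymized identifier
    frames: List[SkeletonFrame] # delivered at up to 30 frames per second

# Each subject contributes 5 round-trip walks, i.e. a list of
# 5 WalkSample objects per subject.
```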

The volunteers were recruited at the university campus. The majority of subjects were college students, aged between 17 and 35 years. Each subject who agreed to participate provided their gender, height and weight for further analysis and experiments. Subjects wore light clothing, since the captures were conducted during summer. The captures took place in an empty classroom during the day, under mostly artificial lighting. A total of 140 individuals were captured using the proposed methodology (95 men and 45 women). In most cases, each individual generated about 500 to 600 frames and completed between 6 and 12 gait cycles per walk. A database was created containing all the information provided by the sensor for each captured frame, along with the subject's anonymized identification, gender, height and weight.

C. Skeletal Joint Smoothing

Raw skeleton data often presents noise in the joint positions due to errors in the tracking process. To reduce this noise we applied an Auto-Regressive Moving Average (ARMA) filter [13] with 4 past and 4 future frames, totaling a window of size 8, set in an ad hoc fashion by observing a visual reconstruction of walks before and after filtering. This filter was applied to all walk samples before use. Figure 2 shows an example of the result of applying the ARMA filter to raw data.

Fig. 2. Raw skeleton angles and ARMA-filtered skeleton angles. The filter is a central moving average filter, where the output is a weighted average of N = 4 past and M = 4 future inputs.

D. Examples and Attributes

An example is composed of attributes extracted from each walk. These attributes are divided into two sets: gait attributes and anthropometric attributes. In what follows we describe how each attribute is defined. The full data set, along with raw data, is available at http://ricardoaraujo.net/kinect/.

1) Gait Attributes: Model-based gait analysis draws on human gait theory to extract parameters from the walk. The angles described by the joints of the hips, knees and ankles, known as kinematic parameters, were calculated for each captured frame using the pendulum model proposed in [8] and depicted in Figure 3 (b). Furthermore, we calculate the foot angle described in [7] and depicted in Figure 3 (c), and the spatiotemporal parameters described in [9]: the step length, the stride length (or "gait cycle size"), the cycle time and the velocity. As shown in Figure 3 (b), the angle θ is formed during a gait cycle between the thigh segment and a projection of the hip 3D coordinates. The angle γ is defined between the lower leg and the projection of the knee; α is the ankle rotation angle formed by the foot segment and the ankle projection; and the foot angle β is formed by the opening of the foot in relation to the axis of the heel. These angles describe periodic curves during a walk, which can exhibit characteristics useful for biometric recognition [5].
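The smoothing and joint-angle steps can be sketched as follows. This is a minimal illustration with our own function names; a uniform-weight centered moving average stands in for the paper's ARMA filter, whose exact weights are not specified here:

```python
import numpy as np

def smooth_centered(series, past=4, future=4):
    """Centered moving-average filter applied to one joint coordinate
    over time: each output sample averages `past` preceding inputs, the
    current input and `future` following inputs (uniform weights here;
    the paper's ARMA filter may weight them differently). Edges use a
    shrunken window."""
    series = np.asarray(series, float)
    out = np.empty_like(series)
    for i in range(len(series)):
        lo, hi = max(0, i - past), min(len(series), i + future + 1)
        out[i] = series[lo:hi].mean()
    return out

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) between segments b->a and b->c,
    e.g. the knee rotation angle between the thigh (knee->hip) and the
    lower leg (knee->ankle); inputs are 3D skeleton points."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
```

Applying `smooth_centered` independently to each coordinate of each joint, and then `joint_angle` per frame, yields the smoothed angle curves of Figure 2.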

The periodic curves generated by the lower joint angles are composed of flexion and extension phases, visually noticeable as peaks (flexion) and valleys (extension) [7], as depicted in Figure 4. The arithmetic mean and standard deviation of the flexion peaks and extension valleys were computed in order to characterize each individual's curves. Lower and higher flexion peaks and extension valleys were treated separately, generating a mean and standard deviation for each high and low phase. Each lower joint was considered independently of the others, generating equally independent attributes. Spatiotemporal parameters were calculated based on the step length and the frame rate of the Kinect sensor. The step length was obtained by averaging the highest values of the distance between the right and left heels. In addition, we use as attributes the stride length (Eq. 1), the average stride length over all n strides (Eq. 2), the cycle time (Eq. 3) and the velocity (Eq. 4). A total of 60 gait attributes were defined.

strideLength = 2 × stepLength    (1)

avgStrideLength = (Σ_{i=1}^{n} strideLength_i) / n    (2)

cycleTime = avgStrideLength / 30    (3)

velocity = avgStrideLength / cycleTime    (4)
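Taken literally as printed, Eqs. (1)-(4) can be implemented as follows. This is a sketch: the function and argument names are ours, and the per-cycle step lengths are assumed to come from the peak heel-separation measurements described above:

```python
def spatiotemporal_attributes(step_lengths, fps=30):
    """Compute the spatiotemporal gait attributes of Eqs. (1)-(4) from
    per-cycle step lengths in meters (the averaged peak right/left heel
    separations). The constant 30 in Eq. (3) is the Kinect frame rate."""
    stride_lengths = [2 * s for s in step_lengths]          # Eq. (1)
    avg_stride = sum(stride_lengths) / len(stride_lengths)  # Eq. (2)
    cycle_time = avg_stride / fps                           # Eq. (3)
    velocity = avg_stride / cycle_time                      # Eq. (4)
    return {"strideLength": stride_lengths,
            "avgStrideLength": avg_stride,
            "cycleTime": cycle_time,
            "velocity": velocity}
```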

2) Anthropometric Measurements: For each captured frame, the lengths of several body segments, shown in Figure 5, were calculated using the Euclidean distance between joints, in a similar fashion to the methodology employed in [2]. The subject's height was defined as the sum of the neck length, the upper and lower spine lengths, and the average lengths of the left and right hips, thighs and lower legs. The mean and standard deviation of each body segment length and of the height over all frames of a walk were calculated. Measurements beyond two standard deviations from the mean were attributed to noise and discarded. The recalculated means for each part were used as attributes, totaling 20 anthropometric attributes.
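A minimal sketch of these per-segment measurements and the two-standard-deviation trimming (function names are ours; frames are assumed to map joint names to 3D points):

```python
import numpy as np

def segment_length(frames, j1, j2):
    """Per-frame Euclidean distance between two tracked joints."""
    return np.array([np.linalg.norm(np.asarray(f[j1]) - np.asarray(f[j2]))
                     for f in frames])

def trimmed_mean(lengths):
    """Discard measurements beyond two standard deviations of the mean
    (attributed to tracking noise) and return the recalculated mean,
    as done for each anthropometric attribute."""
    lengths = np.asarray(lengths, float)
    mu, sd = lengths.mean(), lengths.std()
    kept = lengths[np.abs(lengths - mu) <= 2 * sd]
    return kept.mean() if kept.size else mu
```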

Fig. 3. Angles tracked to compose gait attributes.

Fig. 5. Tracked joints (circles) and body segments used as attributes.

E. Classifiers

We consider three machine learning algorithms to train classifiers: K-Nearest Neighbors (KNN), a Multilayer Perceptron (MLP) and a Support Vector Machine (SVM). Parameters for each algorithm were set by systematically varying their values to maximize the resulting accuracy under 10-fold cross-validation [14]. All attributes provided by the sensor are real-valued and were normalized before use by mapping their values to the range [-1, 1]. KNN was set to K = 5 with the Manhattan distance as the distance metric; each neighbor was weighted by 1/di, where di is its distance. For the MLP, we only considered networks with a single hidden layer, with the number of hidden units set to 40 and a sigmoidal activation function for all units. Training was performed using the Backpropagation algorithm [15] with momentum set to 0.2, learning rate to 0.3 and a maximum of 1000 epochs. The SVM was trained using the Sequential Minimal Optimization (SMO) algorithm [16], using a polynomial kernel and C = 100.0.
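A sketch of these three classifiers, assuming scikit-learn (the paper does not name its toolkit, so these are approximations of the stated settings, and the function name is ours):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

def build_classifiers():
    """The three models with the paper's stated settings, approximated
    in scikit-learn. Attributes are first normalized to [-1, 1]."""
    def scaler():
        return MinMaxScaler(feature_range=(-1, 1))
    knn = make_pipeline(scaler(), KNeighborsClassifier(
        n_neighbors=5, metric="manhattan", weights="distance"))
    mlp = make_pipeline(scaler(), MLPClassifier(
        hidden_layer_sizes=(40,), activation="logistic", solver="sgd",
        momentum=0.2, learning_rate_init=0.3, max_iter=1000))
    svm = make_pipeline(scaler(), SVC(kernel="poly", C=100.0))
    return {"KNN": knn, "MLP": mlp, "SVM": svm}
```

Each pipeline scales features and classifies; fitting on the extracted walk attributes and subject labels reproduces the setup described above.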

Fig. 4. Hip, knee and ankle rotation angles, in degrees, from the right leg of a subject, corresponding to a complete gait cycle. The valleys and peaks are separated in low and high, corresponding to extension and flexion phases described in [7].

We use 10-fold cross-validation to validate the trained models, i.e. the data set was randomly partitioned into 10 subsets and training was performed ten times, each time leaving one partition out of the training process to be used for testing; the reported accuracies are the averages of these ten executions. When required, statistical significance tests are performed using a Wilcoxon signed-rank test [17] and the resulting p-value is shown.
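The significance comparison between two classifiers can be sketched as follows, using SciPy's paired Wilcoxon signed-rank test over per-fold accuracies (the function name and the example accuracies are ours, not the paper's numbers):

```python
from scipy.stats import wilcoxon

def significance(acc_a, acc_b):
    """Paired Wilcoxon signed-rank test over per-fold accuracies of two
    classifiers (ten values each from 10-fold cross-validation).
    Returns the two-sided p-value."""
    _, p_value = wilcoxon(acc_a, acc_b)
    return p_value
```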

TABLE I. CLASSIFIERS' ACCURACY USING DIFFERENT ATTRIBUTES.

Classifier   Gait    Anthropometric   All
SVM          62.9%   84.7%            86.3%
KNN          59.5%   85.4%            87.7%
MLP          59.2%   79.7%            84.7%

IV. RESULTS

A. Classifiers' Accuracy

All classifiers displayed roughly the same performance, as measured by overall accuracy. Table I summarizes the results over all subjects and different combinations of attributes. Overall, we observe that gait attributes alone lead to poor performance for all classifiers, with the SVM showing slightly better accuracy, though not highly statistically significant (p = 0.032), when compared to KNN. Using only anthropometric attributes allows for much improved accuracies. The MLP shows statistically significantly worse performance than KNN and SVM, while KNN responds best. Finally, using both gait and anthropometric attributes provides the best accuracies, with the MLP again trailing a bit behind and the KNN showing the best overall performance (p = 0.048 when compared to the SVM).

Fig. 6. Classifiers' accuracy for different group sizes.

Fig. 7. Accuracy for different numbers of training examples per individual.

The comparatively slight increase in accuracy when combining both types of attributes shows that gait attributes do not provide much value beyond anthropometric attributes, evidence that the two are somewhat correlated. It is clear that the latter are responsible for most of the response, with gait attributes contributing only an average of 3.3 percentage points. Nonetheless, gait attributes do make a measurable contribution and by themselves are reasonably useful (much better than random) for person identification.

B. Group Size

While the results in Table I are reasonable and useful for a number of applications (possibly excluding strict authentication), they may seem worse than results presented in similar previous works - e.g. [2], where upwards of 98% accuracy was reported, but for only 9 subjects. One key aspect missing from these previous works is an account of how accuracy varies with the size of the group being identified. Figure 6 shows how accuracy evolves as subjects in increasingly large groups must be identified. Each point (except for 140) is the average over 10 groups of the same size drawn randomly from the complete data set. For very small groups, accuracies are close to 98%, steadily converging to the results seen in Table I. Additionally, we observe that KNN actually performs worse for small groups, only becoming better than SVM and MLP for groups larger than 20 individuals.

C. Number of Examples

For each individual we captured 5 complete walks, leading to 5 examples per subject. An important question is how the number of examples affects the observed performances, since this has a direct impact on how long, or how often, a subject must be observed for proper classification. To better understand this, we varied the training set size by limiting

the number of examples per individual and testing on the remaining examples. Figure 7 shows the results, where each point is the average over all possible combinations of training and testing examples. Accuracy increases quickly with the number of examples for all classifiers; the MLP suffers the most with fewer examples. The plot also hints that all classifiers converge in performance as more examples become available, and that adding even more examples could lead to further improvement - an indication that the examples still contain a considerable amount of variation, attributable to sensor noise, even after averaging over several frames and removing outliers. Nonetheless, a pattern of diminishing returns is evident. A similar pattern appears when applying the same methodology to anthropometric or gait attributes only.

D. Relevant Attributes

Not all attributes contribute equally to the observed results. We applied a correlation-based feature subset selection

TABLE II. ATTRIBUTES SELECTED BY CORRELATION-BASED FEATURE SUBSET SELECTION AND THE CHANGE IN ACCURACY, IN PERCENTAGE POINTS, WHEN EACH IS REMOVED FROM THE SET.

Attribute                  Average Accuracy Change
Stride Length (gait)       -4.4
Right Foot                 -3.2
Right Hand                 -2.9
Neck                       -2.6
Upper Spine                -1.9
Right Shoulder             -1.7
Left Hip                   -1.7
Height                     -1.4
Right Leg                  -1.3
Left Angle Peaks (gait)    -0.7
Left Forearm               -0.6
Right Thigh                -0.6
Left Thigh                  0.0
Left Leg                    0.0
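The per-attribute impact reported in Table II can be estimated with a leave-one-attribute-out procedure like the following sketch (a stand-in KNN classifier and our own function names, not the paper's exact protocol):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def removal_impact(X, y, names, cv=10):
    """For each attribute, the change in cross-validated accuracy when
    that attribute alone is removed (negative values mean accuracy
    drops, as in Table II)."""
    clf = KNeighborsClassifier(n_neighbors=5, metric="manhattan",
                               weights="distance")
    base = cross_val_score(clf, X, y, cv=cv).mean()
    impact = {}
    for j, name in enumerate(names):
        Xj = np.delete(X, j, axis=1)  # drop attribute j only
        impact[name] = cross_val_score(clf, Xj, y, cv=cv).mean() - base
    return impact
```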

[18] to the data, and the resulting subset performs only an average of 0.1 percentage points below the case where all attributes are used. This subset contains only 14 attributes (out of a total of 80) and is shown in Table II, along with the change in accuracy when each attribute is removed (averaged over all classifiers). Twelve of the selected attributes are anthropometric and only 2 are related to gait. This reinforces our previous observation that gait features are largely correlated with anthropometric features but less reliable for identification. Nonetheless, Stride Length, a gait attribute, provides the largest drop in accuracy when individually removed from this set.

V. CONCLUSIONS

We presented an investigation into the use of skeleton points tracked by and retrieved from a Microsoft Kinect sensor to compose a person identification system that uses gait and anthropometric information. A large data set of walking subjects was created from the raw 3D joint data retrieved using the Kinect NUI Skeleton API. A total of 140 individuals were captured walking in semi-circular trajectories, allowing for longer walks compared to previous data sets. Anthropometric measurements were calculated for each individual and a model-based approach was proposed to extract a set of gait attributes. By training different classifiers, we showed evidence that KNN is suitable for the task of person identification, even for a large number of subjects. SVM showed comparable performance and did better when only gait attributes were available. MLP performed consistently worse, but not by a large margin. Both outperform KNN for smaller groups (N < 20). Considering the two types of attributes used, we showed that anthropometric attributes are more useful for identification than gait attributes; however, adding a few gait attributes improves the classifiers' overall performance. The large data set allowed testing the performance of the classifiers for varying group sizes. As expected, accuracy drops as more subjects are being identified. KNN showed worse performance for small groups when compared to SVM and MLP. As important as the size of the group, the number

of examples available per subject was also shown to be an important factor in obtaining high accuracies. This is evidence of the high amount of variation present in the data provided by the sensor and the need for better approaches to reduce noise. Finally, our general approach showed better results than previous work when using very large groups. Future work includes applying better filters to the data, along with different gait attributes, to improve accuracy. The latter is still a major challenge in gait analysis, especially when considering model-based attributes. Finally, given the decisive influence of group size and number of examples on the resulting accuracies, it becomes necessary to systematically replicate previous methodologies over the same data set and observe how each scales with these parameters.

ACKNOWLEDGMENT

This work is supported by CNPq (Brazilian National Research Council) through grant number 477937/2012-8.

REFERENCES

[1] L. Wang, "Some issues of biometrics: technology intelligence, progress and challenges," International Journal of Information Technology and Management, vol. 11, no. 1/2, p. 72, 2012.
[2] R. M. Araujo, G. Graña, and V. Andersson, "Towards skeleton biometric identification using the microsoft kinect sensor," in the 28th Annual ACM Symposium. New York, New York, USA: ACM Press, 2013, pp. 21–26.
[3] B. C. Munsell, A. Temlyakov, C. Qu, and S. Wang, "Person identification using full-body motion and anthropometric biometrics from kinect videos," in ECCV Workshops (3), ser. Lecture Notes in Computer Science, A. Fusiello, V. Murino, and R. Cucchiara, Eds., vol. 7585. Springer, 2012, pp. 91–100.
[4] S. Sivapalan, D. Chen, S. Denman, S. Sridharan, and C. Fookes, "Gait energy volumes and frontal gait recognition using depth images," in IJCB, A. K. Jain, A. Ross, S. Prabhakar, and J. Kim, Eds. IEEE, 2011, pp. 1–6.
[5] G. G. Harrap, Alphonse Bertillon: Father of Scientific Detection. New York: Abelard-Schuman, 1956.
[6] A. K. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 1, pp. 4–20, Jan. 2004.
[7] M. P. Murray, A. B. Drought, and R. C. Kory, "Walking patterns of normal men," The Journal of Bone and Joint Surgery, vol. 46, pp. 335–360, 1964.
[8] D. Cunado, M. S. Nixon, and J. N. Carter, "Automatic extraction and description of human gait models for recognition purposes," Computer Vision and Image Understanding, vol. 90, no. 1, pp. 1–41, Apr. 2003.
[9] J.-H. Yoo and M. S. Nixon, "Automated markerless analysis of human gait motion for recognition and classification," ETRI Journal, vol. 33, no. 2, pp. 259–266, Apr. 2011.
[10] H. Ng, H.-L. Tong, W. H. Tan, and J. Abdullah, "Improved gait classification with different smoothing techniques," International Journal on Advanced Science, Engineering and Information Technology, vol. 1, no. 3, pp. 242–247, 2011.
[11] J. Han and B. Bhanu, "Individual recognition using gait energy image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 2, pp. 316–322, 2006.
[12] M. Hofmann and S. Bachmann, "2.5D gait biometrics using the depth gradient histogram energy image," in Proceedings of the IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2012, pp. 399–403.
[13] M. Azimi, "Skeleton joint smoothing white paper," Microsoft Inc., Tech. Rep., 2012. [Online]. Available: http://msdn.microsoft.com/en-us/library/jj131429.aspx
[14] T. Mitchell, Machine Learning, 1st ed. New York, NY, USA: McGraw-Hill, Inc., 1997.
[15] S. Haykin, Neural Networks and Learning Machines, 3rd ed. Prentice Hall, Nov. 2008.
[16] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, pp. 185–208.
[17] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, Dec. 1945.
[18] M. A. Hall, "Correlation-based feature subset selection for machine learning," Ph.D. dissertation, University of Waikato, Hamilton, New Zealand, 1998.