Online Action Recognition by Template Matching

Xin Zhao 1,2,3, Sen Wang 2,3, Xue Li 2, and Hao Lan Zhang 1

1 NIT, Zhejiang University, 1 Xuefu Road, Ningbo, China
2 School of ITEE, The University of Queensland, Australia
3 The Australian E-Health Research Centre, CSIRO, Australia

Abstract. Human action recognition from video has attracted great attention from various communities due to its wide range of applications. The human skeleton, which represents the human body as dots and lines, is regarded as an effective way to analyze human movements, and depth cameras have recently made skeleton tracking practical. Based on skeleton extraction and template matching, we develop a system for online human action segmentation and recognition in this paper. We propose a method to generate action templates that can represent intra-class variations, and we adopt an efficient subsequence matching algorithm for online processing. The experimental results demonstrate the effectiveness and efficiency of our system.

Keywords: Action recognition, Subsequence matching, Depth-camera.

1 Introduction

In this paper, we present a system that recognizes indoor human actions from video and records these actions as text in real time. This technique can be readily applied in smart homes and nursing care. For example, abnormal actions of elderly people can be detected in real time and monitored or displayed on their children's cell phones.

In our system, we use human skeletons tracked from video to recognize actions. The skeleton of a human body is represented as dots and lines: dots represent human joints, and lines represent body parts. As Johansson [1] suggested, the skeleton by itself is sufficient to distinguish different types of human actions. The major advantage of the skeleton is that it can be used to analyze human movements at a detailed level [2]. Furthermore, the recent introduction of cost-effective depth cameras and the related motion capturing technique [3] enables efficient tracking of the human skeleton. This has brought on a new trend of research on skeleton-based action recognition [4].

In human action recognition from videos, four kinds of intra-class variation may affect effectiveness: viewpoint, anthropometry, execution rate, and personal style. Viewpoint variation describes the relationship between the actor and the viewpoint of the camera. Anthropometry variation is related to the size of the actor, which refers to human physical attributes and does not change with human movements. Execution rate variation is related to the temporal variability caused by the actor or by differing camera frame rates. Personal style must also be considered, as people may perform the same action in different styles.

Furthermore, online recognition from an unsegmented stream is another challenge. Because streaming data does not provide pre-segmented instances, we need a way to segment the right number of frames for recognition. There are two traditional approaches to segmentation. The first uses a fixed-size sliding window [4] [5]: each segment is treated as one instance, and machine learning techniques are then used for classification. Unfortunately, a fixed-size sliding window performs poorly under execution rate variation. The second approach matches pre-defined action templates against the stream. Each template represents one type of action; each matched subsequence is treated as one instance and assigned the same action label as the corresponding template. Sakurai et al. [6] proposed an algorithm for efficient subsequence matching in streams and showed its potential for human action recognition. However, they simply chose one manually segmented instance as the template of an action type; such a template cannot represent personal style variation.

In this paper, we develop a system for online human action recognition. We propose a method to generate action templates that can represent all four kinds of intra-class variation. The algorithm proposed in [6] is adopted for efficient subsequence matching.

G. Huang et al. (Eds.): HIS 2013, LNCS 7798, pp. 269–272, 2013. © Springer-Verlag Berlin Heidelberg 2013

2 Methodology

Our system can be divided into three parts: skeleton preprocessing, which generates features for recognition; template learning, which obtains templates for matching; and subsequence matching, which produces the recognition results, i.e., an action label for each frame in the stream.

Skeleton preprocessing. In this system, we use the OpenNI skeleton model developed by PrimeSense and the Kinect depth camera released by Microsoft. The skeleton of the human body is a tree structure. Each node in the tree represents one joint position associated with 3 coordinates (x, y, z), and each edge between two connected nodes represents one body part. There are 14 joints and 13 body parts in the OpenNI skeleton model. The root joint is specified as the "spine" joint s. The body axis a is defined as the line from "spine" to the midpoint of "left shoulder" l and "right shoulder" r, i.e., a = ((l + r)/2 − s) / ||(l + r)/2 − s||. The body orientation o is the normal vector of the triangle with vertices "spine", "right shoulder", and "left shoulder", i.e., o = ((s − l) × (l − r)) / ||(s − l) × (l − r)||, where the symbol '×' denotes the cross product of vectors.

For human action recognition, joint positions should be transformed into joint angles, which are viewpoint invariant and anthropometry invariant. First, we transform joint positions into the normalized coordinate system, in which s = (0, 0, 0), a = (0, 1, 0), and o = (0, 0, −1) after the transformation. Then we transform joint positions into joint angles. Assume p and q are two connected joints.
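The body axis, body orientation, and normalizing transform defined above can be sketched directly from those formulas. The following is a minimal illustration; the function names and the dictionary-of-joints representation are ours, not from the paper:

```python
import numpy as np

def body_frame(s, l, r):
    """Body axis a and body orientation o from the 'spine' (s),
    'left shoulder' (l) and 'right shoulder' (r) joint positions."""
    a = (l + r) / 2 - s
    a = a / np.linalg.norm(a)          # a = ((l+r)/2 - s) / ||(l+r)/2 - s||
    o = np.cross(s - l, l - r)
    o = o / np.linalg.norm(o)          # o = ((s-l) x (l-r)) / ||(s-l) x (l-r)||
    return a, o

def normalize_skeleton(joints, s, l, r):
    """Translate and rotate all joint positions so that, afterwards,
    s = (0,0,0), a = (0,1,0) and o = (0,0,-1)."""
    a, o = body_frame(s, l, r)
    x = np.cross(a, -o)                # third axis completing the frame
    x = x / np.linalg.norm(x)
    # rows of R are the target axes expressed in camera coordinates,
    # so R maps a -> (0,1,0) and o -> (0,0,-1)
    R = np.stack([x, a, -o])
    return {name: R @ (p - s) for name, p in joints.items()}
```

Note that a lies in the shoulder-spine plane while o is normal to it, so the three rows of R are mutually orthogonal unit vectors and R is a proper rotation.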


p is the parent of q, and the joint angle of q is (q − p) / ||q − p||. The angle of the root joint is omitted. For simplicity, we use "skeleton data" to refer to both joint positions and joint angles in this paper.
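As a sketch, the per-frame joint-angle feature might be assembled as follows; the helper names and the explicit edge list are illustrative, not from the paper:

```python
import numpy as np

def joint_angle(p, q):
    """Unit direction vector of body part p -> q (p is the parent joint);
    dividing by the length makes the feature anthropometry invariant."""
    v = q - p
    return v / np.linalg.norm(v)

def frame_features(joints, tree_edges):
    """Concatenate the joint angles of all (parent, child) edges of the
    skeleton tree; the root joint's angle is omitted."""
    return np.concatenate([joint_angle(joints[p], joints[q])
                           for p, q in tree_edges])
```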

Template learning. We learn one template from pre-segmented instances of one action type. A template consists of two sequences: an average sequence, which represents the standard movement, and a deviation sequence, which represents personal style variation. We compute the optimal alignment of each instance with an initial template using the Dynamic Time Warping (DTW) distance, in order to eliminate execution rate variation; the initial template is one randomly selected instance. According to these optimal alignments, the instances are locally stretched or contracted: time stretching is handled by duplicating columns, while time contraction is resolved by averaging the columns. The aligned instances together with the initial template are then used to compute the template: the average over these instances is the template's average sequence, and the deviation over these instances is its deviation sequence. The left curves in Figure 1 illustrate action templates. The green skeletons are the average ones, and the red range on each joint illustrates the deviations. From top to bottom, the templates shown are the actions "kick with right leg" ("kickSideR" for short), "wave", and "walk".

Subsequence matching. To match a template against the unsegmented stream, we use a variant of DTW. Let X = (x1, x2, ..., xn, ...) be the skeleton data stream, where xn is the most recent skeleton data and n increases with every new time-tick. T = (A; B) is the template of one action type. We aim to identify the optimal subsequence of X ending at the current time-tick n: among all subsequences of X ending at n, the DTW distance between the template and the optimal subsequence is the minimum. The stream must be processed in an online fashion. Inspired by [6], we compute the optimal subsequence for each template at every time-tick in the stream.
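The template-learning step above can be sketched as follows, assuming a plain O(mn) DTW; the function names and the use of NumPy are ours, since the paper does not specify an implementation:

```python
import numpy as np

def dtw_path(A, B):
    """Plain DTW between sequences A (m x d) and B (n x d); returns the
    optimal warping path as a list of (i, j) index pairs."""
    m, n = len(A), len(B)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], m, n                    # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def align_to_template(inst, template):
    """Warp `inst` onto the time axis of `template`: stretching duplicates
    frames, contraction averages the frames mapped to one template column."""
    cols = [[] for _ in range(len(template))]
    for i, j in dtw_path(inst, template):
        cols[j].append(inst[i])
    return np.array([np.mean(c, axis=0) for c in cols])

def learn_template(instances):
    """Average and deviation sequences over DTW-aligned instances; the
    initial template is the first instance (randomly chosen in the paper)."""
    init = instances[0]
    aligned = [align_to_template(x, init) for x in instances[1:]] + [init]
    stack = np.stack(aligned)
    return stack.mean(axis=0), stack.std(axis=0)   # (A, B)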
At time-tick n, the frame is identified as the action type of corresponding template which obtains minimal distance and the distance is smaller than the given threshold . Figure 1 shows an example of template matching from stream. Axis X represents time-ticks of skeleton data stream, and axis Y represents the DTW distances between optimal subsequences and templates. The recognized results are shown on top, which consist with the skeleton data stream. The red lines represent the threshold .
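The per-tick matching described above can be sketched with a SPRING-style recurrence in the spirit of [6]. This simplified version reports whenever the distance at the last template column drops below ε, and omits the full algorithm's logic for deferring reports until a match can no longer improve; the class and variable names are ours:

```python
import numpy as np

class SpringMatcher:
    """Streaming subsequence DTW: for each template column, keep the best
    distance of a warping path ending there and that path's start tick."""

    def __init__(self, template, eps):
        self.T = np.asarray(template)    # average sequence of one template
        self.eps = eps                   # matching threshold (epsilon)
        n = len(self.T)
        self.d = np.full(n, np.inf)      # best distance ending at each column
        self.s = np.zeros(n, dtype=int)  # start tick of that path
        self.t = 0

    def update(self, x):
        """Consume one frame x; return (start_tick, dist) if the optimal
        subsequence ending now matches the template under eps, else None."""
        self.t += 1
        cost = np.linalg.norm(self.T - x, axis=1)
        d_prev, s_prev = self.d, self.s
        d = np.empty_like(d_prev)
        s = np.empty_like(s_prev)
        for j in range(len(self.T)):
            if j == 0:
                # a warping path may start fresh at the current tick
                choices = [(0.0, self.t), (d_prev[0], s_prev[0])]
            else:
                choices = [(d[j - 1], s[j - 1]),
                           (d_prev[j], s_prev[j]),
                           (d_prev[j - 1], s_prev[j - 1])]
            best, start = min(choices)
            d[j], s[j] = cost[j] + best, start
        self.d, self.s = d, s
        if d[-1] <= self.eps:
            return self.s[-1], d[-1]
        return None
```

One matcher is kept per template; at each tick the frame is labeled by the template whose matcher reports the smallest distance below ε.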

3 Outcomes

Datasets. We captured one skeleton stream performed by one subject. It contains 1000 frames and three types of actions: "kickSideR", "wave", and "walk". We chose 3 instances of each action type to learn its template. We captured another skeleton stream, performed by a different subject, for evaluation. This stream also consists of 1000 frames and contains 6 "kickSideR", 6 "wave", and 43 "walk" instances. The ground truth was manually labeled.

Fig. 1. Example of template matching from stream.

Effectiveness. The classification result of one instance is decided by the majority of its frames with the same label. The threshold ε is set to 2.5. As shown in Figure 1, all instances are correctly classified.

Efficiency. We performed our experiments on an Intel i7 860 CPU with 4 GB of RAM, using Matlab hybridized with parts of C code. With our approach, more than 30 frames can be processed per second.

Acknowledgments. The work reported in this paper is partially supported by the Ningbo Natural Science Foundation (No. 2012A610025, No. 2012A610060), the Ningbo Soft Science Grant (No. 2012A10050), the Ningbo International Cooperation Grant (No. 2012D10020), and the National Natural Science Foundation of China (No. 71271191, No. 61272480).
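The majority-vote rule used for instance-level classification can be written in a few lines (a trivial sketch; the function name is ours):

```python
from collections import Counter

def classify_instance(frame_labels):
    """An instance's label is the label carried by the majority of its
    per-frame recognition results."""
    return Counter(frame_labels).most_common(1)[0][0]
```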

References

1. Johansson, G.: Visual motion perception. Scientific American (1975)
2. Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: A review. CSUR 43, 16 (2011)
3. Shotton, J., Fitzgibbon, A.W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: CVPR (2011)
4. Fothergill, S., Mentis, H.M., Kohli, P., Nowozin, S.: Instructing people for training gestural interactive systems. In: CHI (2012)
5. Zhao, X., Li, X., Pang, C., Wang, S.: Human action recognition based on semi-supervised discriminant analysis with global constraint. Neurocomputing (2012)
6. Sakurai, Y., Faloutsos, C., Yamamuro, M.: Stream monitoring under the time warping distance. In: ICDE (2007)