Human Action Recognition in Videos Using Hybrid Motion Features

Si Liu 1,2, Jing Liu 1, Tianzhu Zhang 1, and Hanqing Lu 1

1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 China-Singapore Institute of Digital Media, 119615, Singapore

Abstract. In this paper, we present hybrid motion features to improve action recognition in videos. The features are composed of two complementary components derived from different views of motion information. On one hand, a period feature is extracted to capture global motion in the time domain. On the other hand, enhanced histograms of motion words (EHOM) are proposed to describe local motion information. Each word is represented by the optical flow of a frame, the correlations between words are encoded into the transition matrix of a Markov process, and its stationary distribution is extracted as the final EHOM. Compared to the traditional Bag of Words representation, EHOM preserves not only the relationships between words but also, to some extent, the temporal information in videos. We show that by integrating local and global features, we obtain improved recognition rates on a variety of standard datasets.

Keywords: Action recognition, Period, EHOM, Optical flow, Bag of words, Markov process.

1 Introduction

With the widespread use of digital cameras for public visual surveillance, digital multimedia processing has received increasing attention during the past decade. Human action recognition is becoming one of the most important topics in computer vision, with applications in areas such as surveillance, video retrieval, and human-computer interaction. Successful extraction of good features from videos is crucial to action recognition. Ke et al. [1] extend 2D box features to 3D spatio-temporal volumetric features. Recently, Sun et al. [2] propose to model spatio-temporal context information in a hierarchical way. Among all the proposed features, a large family directly describes motion. For example, Bobick and Davis [3] develop the temporal template, which captures both motion and shape. Laptev [4] extracts motion-based space-time features; this representation views human actions as motion patterns. Zhang et al. [5] propose the Motion Context (MC), which captures the distribution of motion words and thus summarizes local motion information in a rich 3D MC descriptor. These motion-based approaches have proven successful for action recognition.

S. Boll et al. (Eds.): MMM 2010, LNCS 5916, pp. 411–421, 2010. © Springer-Verlag Berlin Heidelberg 2010


Acknowledging the discriminative power of motion features, we propose to combine a period feature and enhanced histograms of motion words (EHOM) to describe the motion in a video. Given the large variation in realistic videos, our features are also easier to extract than 3D volumes, trajectories, or spatio-temporal interest points.

Period Features: Periodic motion occurs often in human actions. For example, running and walking can be seen as periodic actions in the leg region. Therefore, a variety of methods use period features to perform action recognition. Cutler and Davis [6] compute an object's self-similarity as it evolves in time; for periodic motion, the self-similarity measure is also periodic, and they apply time-frequency analysis to detect and characterize the periodic motion. In addition, Liu et al. [7] also classify periodic motions.

Optical Flow Features: Efros et al. [8] recognize the actions of small-scale figures using features derived from optical flow measurements in a spatio-temporal volume for each stabilized human figure. Fathi and Mori [9] build mid-level motion features from low-level optical flow information. Ali and Shah [10] propose a set of kinematic features derived from the optical flow. All of these achieve good results.

Hybrid Features: We argue that period and optical flow are complementary for action recognition, for two main reasons. First, optical flow only captures the motion between two adjacent frames and is therefore inherently local, while the period captures global motion in the time domain. For example, suppose we want to differentiate walking from jogging. Because they produce quite similar optical flow, it is difficult to distinguish them based on optical flow features alone; yet the period feature can separate them easily, because when somebody jogs, his or her legs move faster. Second, period information is weak in several actions such as bending. However, the optical flow of bending, with its forward and rising components, is quite discriminative. To exploit this synergy, we use hybrid features consisting of both period features (capturing global motion) and optical flow features (capturing local motion) to develop an effective recognition framework.

2 Overview of Our Recognition System

The main components of the system are illustrated in Fig. 1. We first produce a figure-centric spatio-temporal volume (see Fig. 1(a)) for each person. It can be obtained by running any detection/tracking algorithm over the input sequence and constructing a fixed-size window around the detected person. Afterwards, we divide every frame of the spatio-temporal volume into m × n blocks to make the proposed algorithm robust to noise and efficient to compute. By doing this, we also implicitly maintain spatial information in the frame when constructing features.

Fig. 1. The framework of our approach: (a) a figure-centric spatio-temporal cuboid is built from the input video; (b) a video-based period feature is extracted; (c) frame-based optical flow is quantized into an optical-flow bag and summarized as a video-based EHOM; (d) the hybrid motion features are fed to an SVM classifier.

As a result, we get m × n smaller spatio-temporal cuboids, each consisting of the blocks at the corresponding location in every frame (Fig. 1(b)). Sec. 3 addresses quasi-period extraction from each cuboid to describe the global motion in the time domain; the features of all the cuboids are concatenated to form the period feature of the video. Sec. 4 introduces the EHOM feature extraction. Specifically, each frame's optical flow is first assigned a label by the k-means clustering algorithm; based on these labels, a Markov process is used to encode the dynamic information (Fig. 1(c)). Then the hybrid features are constructed and fed into a multi-class SVM classifier (Fig. 1(d)). The experimental results are reported in Sec. 5, and conclusions are given in Sec. 6.
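To make the block division concrete, the following minimal sketch (our illustration, not the authors' code) cuts a figure-centric volume of shape (T, H, W) into the m × n spatio-temporal cuboids described above; the array layout and grid size are assumptions.

```python
# Sketch: split a figure-centric spatio-temporal volume into m x n cuboids.
import numpy as np

def split_into_cuboids(volume, m, n):
    T, H, W = volume.shape
    bh, bw = H // m, W // n
    cuboids = []
    for i in range(m):
        for j in range(n):
            # All blocks at grid position (i, j) across every frame.
            cuboids.append(volume[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw])
    return cuboids  # m * n cuboids, each of shape (T, bh, bw)
```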

3 Period Feature Extraction

Based on the spatio-temporal cuboids obtained by dividing the original video, our frequency-extraction approach is appearance-based, similar to [11]. Fig. 2 shows the block diagram of the module. First, we use probabilistic PCA (pPCA) [12] to detect the maximum spatially coherent changes over time in the object's appearance. Spatially correlated input data are grouped together; unlike pixel-wise approaches, pPCA considers these pixels as one physical entity, which makes the method robust to noise.


Fig. 2. Block diagram of the period extraction module: a figure-centric cuboid is analyzed by pPCA and frequency analysis to produce $f_{est}$ and $per_{est}$.

The final output is a combination of two indicators: the estimated period $f_{est}$ and the degree of periodicity $per_{est}$. Next, we describe the pPCA phase and the frequency-analysis phase.

pPCA for Robust Periodicity Detection: Let $X^{D \times N} = [x_1\ x_2\ \ldots\ x_N]$ represent the input video, with $D$ the number of pixels in one frame and $N$ the number of frames. The rows of an aligned image frame are concatenated to form the column $x_n$. The optimal linear reconstruction $\hat{X}$ of the data is given by $\hat{X} = WU + \bar{X}$, where $W^{D \times Q} = [w_1\ w_2\ \ldots\ w_Q]$ is the set of orthonormal basis vectors, the principal-component matrix $U^{Q \times N}$ is a set of $Q$-dimensional vectors of unobserved variables (see Fig. 3(b)), and $\bar{X}$ is the set of mean vectors $\bar{x}$. The eigenvalues are collected in $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$, obtained from the eigenvalue decomposition of the covariance matrix $S$ of the input data $X$: $S = V \Lambda V^T$. The dimension $Q$ is selected by setting the maximum percentage of variance to be retained in the reconstructed matrix $\hat{X}$.
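As an illustration of this step, the sketch below extracts principal-component time series from a cuboid using plain PCA as a stand-in for pPCA; the 90% variance threshold follows the setting reported later, but the code and variable names are ours, not the authors'.

```python
# Sketch: PCA stand-in for pPCA on a cuboid of shape (N, H, W).
# Returns the time series u_q of the retained components and the relative
# variance weights lambda_q^* used later to weight the spectra.
import numpy as np

def pca_time_series(cuboid, retained_variance=0.9):
    N = cuboid.shape[0]
    X = cuboid.reshape(N, -1).astype(np.float64)    # one row per frame
    Xc = X - X.mean(axis=0)
    # SVD of the centred data; rows of Vt are the basis vectors w_q.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S ** 2 / max(N - 1, 1)
    ratio = np.cumsum(var) / var.sum()
    Q = int(np.searchsorted(ratio, retained_variance)) + 1
    U = Xc @ Vt[:Q].T                     # (N, Q): column q is u_q over time
    weights = var[:Q] / var.sum()         # lambda_q^*
    return U, weights
```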

Frequency Analysis: The periodogram is a typical non-parametric frequency-analysis method that estimates the power spectrum from the Fourier transform of the autocovariance function. We choose the modified periodogram from the non-parametric class:
\[ P_q(f) = \frac{1}{N}\Big|\sum_{n=0}^{N-1} w(n)\, x(n)\, \exp(-jn2\pi f)\Big|^2, \]
where $N$ is the frame length, $w(n)$ is the window used, and $x(n)$ is the principal-component vector $u_q^T$ from the pPCA (see Fig. 3(b)). By weighting each spectrum $P_q(f)$ with the relative percentage $\lambda_q^*$ of retained variance and summing, a combined spectrum is obtained:
\[ \bar{P}(f) = \sum_{q=1}^{Q} \lambda_q^* P_q(f), \qquad \lambda_q^* = \frac{\lambda_q}{\sum_{d=1}^{D} \lambda_d}. \]
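A possible implementation of the weighted spectrum is sketched below, assuming the `U` and `weights` outputs of the PCA sketch above; the Hanning window matches the choice reported in the experiments, while the sampling rate is illustrative.

```python
# Sketch: variance-weighted modified periodogram P_bar(f) over the retained
# principal-component time series.
import numpy as np
from scipy.signal import periodogram

def weighted_spectrum(U, weights, fs=25.0):
    P_bar = None
    for q in range(U.shape[1]):
        # Modified periodogram of u_q with a Hanning window.
        f, Pq = periodogram(U[:, q], fs=fs, window="hann")
        P_bar = weights[q] * Pq if P_bar is None else P_bar + weights[q] * Pq
    return f, P_bar
```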

In order to detect the dominant frequency component in the spectrum $\bar{P}(f)$ (see Fig. 3(c)), we first detect the peaks and the local minima that define the peaks' supports. Peaks with a frequency lower than $f_s/N$ are discarded, with $f_s$ the sampling rate of the video and $N$ the frame length. Afterwards, starting from the lowest found frequency to the highest, each peak is checked against the others for its harmonicity.

Fig. 3. (a) The spatio-temporal cuboid of running, denoted in red. (b) The first two principal components of the cuboid. (c) The weighted spectrum of running; peaks are denoted in red and their supports in green.

We require that a fundamental frequency should have a higher peak than its harmonics, and a tolerance of $f_s/N$ is used in the matching process. We select the group $k$ with the highest total energy to represent the dominant frequency component in the data. The total energy is the sum of the areas between the left and right supports $E(\cdot)$ of the fundamental frequency peak $f_k^0$ and its harmonics $f_k^i$:
\[ f_{est} = \arg\max_{f_k^0}\Big(E(f_k^0) + \sum_i E(f_k^i)\Big). \qquad (1) \]

The estimated frequency $f_{est}$ in Fig. 3(c) is 120 mHz, which means that the motion repeats itself every 8.33 frames. Note that no matter whether the data are periodic or not, as long as there are some minor peaks in the spectrum $\bar{P}(f)$, the above method may still give a frequency estimate. We therefore compare the energy of all peaks found in $\bar{P}(f)$ with the total energy to separate the two cases:
\[ per_{est} = \frac{\sum_{k=1}^{K} E_{\Delta}(f_k)}{\sum_f \bar{P}(f)}, \qquad (2) \]

where $K$ is the number of peaks detected and $E_{\Delta}(f_k)$ is the area of the triangle formed by the peak and its left and right supports. Note that the peak supports should have zero energy for the spectrum of a periodic signal; by using only the triangle area in the numerator of Eq. (2), we assign a lower $per_{est}$ value to quasi-periodic signals. The obtained $per_{est}$ and $f_{est}$ are then concatenated to generate the period component of the hybrid feature.
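The following sketch estimates $f_{est}$ and $per_{est}$ from the weighted spectrum using off-the-shelf peak picking; it simplifies the harmonic-grouping step (taking the strongest peak instead of the best harmonic group), so it illustrates the idea rather than reproducing the exact procedure.

```python
# Sketch: dominant-frequency and periodicity-degree estimates from P_bar(f).
import numpy as np
from scipy.signal import find_peaks

def period_feature(f, P_bar, fs=25.0, num_frames=None, per_threshold=0.4):
    n = num_frames or 2 * (len(P_bar) - 1)          # frame length N
    df = f[1] - f[0]
    # Peaks plus their left/right bases (the peaks' "supports").
    peaks, props = find_peaks(P_bar, prominence=1e-12)
    keep = f[peaks] >= fs / n                       # drop peaks below f_s / N
    peaks = peaks[keep]
    if peaks.size == 0:
        return 0.0, 0.0                             # (f_est, per_est)
    lb, rb = props["left_bases"][keep], props["right_bases"][keep]
    # Triangle area between each peak and its supports, as in eq. (2).
    tri = 0.5 * (f[rb] - f[lb]) * P_bar[peaks]
    per_est = float(tri.sum() / (P_bar.sum() * df))
    # Simplification: the strongest peak stands in for the harmonic group.
    f_est = float(f[peaks[np.argmax(tri)]]) if per_est >= per_threshold else 0.0
    return f_est, per_est
```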


Fig. 4. Block diagram of the EHOM extraction module: optical flow extraction, visual-word generation, Markov process, EHOM.

4 Enhanced Histograms of Motion Words Extraction

As motion frequency is a global and thus coarse description of motion, we adopt a local and finer motion descriptor, optical flow, as a complement. Fig. 4 shows the block diagram of the module. First, we extract the optical flow of every frame. Then we generate a codebook by clustering all optical flow features in the training dataset. We could then directly compute the histogram of word occurrences over the entire video sequence, but by doing so the time-domain information would be lost. For action recognition, however, the dynamic properties of object components are essential, e.g., for the actions of standing up or an airplane taking off. That is why we go one step further and combine an optical-flow-based Bag of Words representation with a Markov process [13] to obtain the EHOM. It is independent of the length of the video and simultaneously maintains both the dynamic information and the correlations between words. To the best of our knowledge, we are the first to consider the relationships between motion words in action recognition.

The Lucas-Kanade algorithm [14] is employed to compute the optical flow for each frame. The optical flow field $F$ is then split into horizontal and vertical components, $F_x$ and $F_y$. These two non-negative channels are then blurred with a Gaussian and normalized; they are used as the optical-flow motion features of each frame. Blurring the optical flow reduces the influence of noise and small spatial shifts in the figure-centric volume. For each frame, the optical flow features of all blocks are concatenated into a longer vector.

Next, we represent a video sequence as a Bag of Words in which a "word" corresponds to a "frame" and a "document" corresponds to a "video sequence". Specifically, given the optical flow vector of every frame in the video, we construct a visual vocabulary with the k-means algorithm and then assign each frame to the closest vocabulary word (using the Euclidean distance). In Fig. 5(a), different colors indicate that the corresponding frames are assigned to different visual words. As mentioned, we go one step further than Bag of Words by modeling the relationships between motion words with a Markov process. Before going into details, we present some basic definitions for Markov chains. A Markov chain [15] is a sequence of random observed variables with the Markov property; it is a powerful tool for modeling the dynamic properties of a system. The Markov stationary distribution, associated with an ergodic Markov chain, offers a compact and effective representation of a dynamic system.
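To make the per-frame motion-word pipeline concrete, here is an illustrative sketch; it uses OpenCV's dense Farneback flow as a readily available stand-in for Lucas-Kanade, assumes grayscale uint8 frames, and all parameter values (blur size, grid, vocabulary size) are assumptions rather than the authors' settings.

```python
# Sketch: per-frame optical-flow features and k-means motion-word labels.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def frame_flow_features(frames, m=5, n=5):
    feats = []
    for prev, cur in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Horizontal/vertical channels made non-negative, then blurred.
        fx, fy = np.abs(flow[..., 0]), np.abs(flow[..., 1])
        fx = cv2.GaussianBlur(fx, (9, 9), 2.0)
        fy = cv2.GaussianBlur(fy, (9, 9), 2.0)
        # Average each channel over an m x n grid of blocks and concatenate.
        pooled = [cv2.resize(c, (n, m), interpolation=cv2.INTER_AREA).ravel()
                  for c in (fx, fy)]
        v = np.concatenate(pooled)
        feats.append(v / (np.linalg.norm(v) + 1e-8))
    return np.asarray(feats)

def motion_word_labels(train_feats, feats, k=100):
    codebook = KMeans(n_clusters=k, n_init=10).fit(train_feats)
    return codebook.predict(feats)        # one codeword label per frame
```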


Fig. 5. Construction of EHOM: (a) frames assigned to different visual words; (b) visual-word transition diagram; (c) visual-word occurrence matrix; (d) Markov stationary features.

Theorem 4.1. Any ergodic finite-state Markov chain is associated with a unique stationary distribution (row) vector $\pi$ such that $\pi P = \pi$.

Theorem 4.2. 1) The limit $A = \lim_{n \to \infty} A_n$ exists for all ergodic Markov chains, where
\[ A_n = \frac{1}{n+1}\left(I + P + \ldots + P^n\right). \qquad (3) \]
2) Each row of $A$ is the unique stationary distribution vector $\pi$.

Hence, when the ergodicity condition is satisfied, we can approximate $A$ by $A_n$. To further reduce the approximation error when using a finite $n$, $\pi$ is calculated as the column average of $A_n$. For consecutive frames in a fixed-length time window, with codebook labels $F$ and $F'$, we translate the sequential relations between these labels into a directed graph, similar to the state diagram of a Markov chain (Fig. 5(b)). Here we get $K$ vertices corresponding to the $K$ codewords, and weighted edges corresponding to the occurrences of each transition between words. We further establish an equivalent matrix representation of the graph (Fig. 5(c)) and perform row normalization on this matrix to arrive at a valid transition matrix $P$ of a Markov chain. Once we obtain the transition matrix $P$ and make sure it is associated with an ergodic Markov chain, we can use Eq. (3) to compute $\pi$ (Fig. 5(d)).
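A compact sketch of this construction is given below, under the assumption that transitions are counted between consecutive frame labels within each time window and that a small uniform smoothing term keeps the chain ergodic; both are our simplifications of the description above, not the authors' exact procedure.

```python
# Sketch: EHOM from a sequence of per-frame codeword labels.
import numpy as np

def ehom(labels, k=100, n=50, window=20, eps=1e-6):
    C = np.zeros((k, k))
    for t in range(len(labels) - 1):
        if (t + 1) % window != 0:                 # stay inside the time window
            C[labels[t], labels[t + 1]] += 1
    P = C + eps                                   # smoothing keeps P ergodic
    P /= P.sum(axis=1, keepdims=True)             # row-normalized transition matrix
    A, term = np.eye(k), np.eye(k)
    for _ in range(n):                            # A_n = (I + P + ... + P^n)/(n+1), eq. (3)
        term = term @ P
        A += term
    A /= (n + 1)
    return A.mean(axis=0)                         # column average approximates pi
```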

5 Experiments

Here we briefly introduce the parameters used in our experiments. In the period-feature extraction phase, the pPCA retained variance is 90% and a Hanning window is used for the periodogram.


If $per_{est}$ is less than 0.4, in other words, if the signal is not periodic, we set the corresponding $f_{est}$ to zero. In the EHOM extraction phase, the vocabulary size is set to 100 and we use $n = 50$ to estimate $A$ by $A_n$; the length of the time window is 20 frames. For classification, we use a support vector machine (SVM) classifier with an RBF kernel. We apply PCA to reduce the dimension of the period feature so that it matches the dimension of the EHOM. To demonstrate the effectiveness of our hybrid feature, we test our algorithm on two human action datasets: the KTH human motion dataset [16] and the Weizmann human action dataset [17]. For each dataset, we perform leave-one-out cross-validation: in each run, we hold out the videos of one person as test data and use the remaining videos for training.
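As a sketch of this classification step (not the authors' code), the snippet below reduces the concatenated period features with PCA to the EHOM dimension, concatenates the two parts, and trains an RBF-kernel SVM; all names are illustrative.

```python
# Sketch: hybrid feature construction and multi-class SVM training.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def train_hybrid_classifier(period_feats, ehom_feats, labels):
    d = min(ehom_feats.shape[1], period_feats.shape[1], period_feats.shape[0])
    pca = PCA(n_components=d).fit(period_feats)
    X = np.hstack([pca.transform(period_feats), ehom_feats])
    clf = SVC(kernel="rbf", gamma="scale").fit(X, labels)  # multi-class by default
    return clf, pca
```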

5.1 Evaluating Different Components in the Hybrid Feature

We will show that both components of our proposed hybrid feature are quite discriminative. The period features of the six activities in the KTH database are illustrated in Fig. 6. We can see that the bottom three actions have different frequencies in the leg regions (denoted by red ellipses). Specifically, $f_{running} > f_{jogging} > f_{walking}$, where $f$ stands for the frequency of the leg region, which conforms to intuition. Fig. 7 compares our proposed EHOM with the traditional BoW representation and shows that better results are achieved by considering correlations between motion words. The next experiment demonstrates the benefit of combining the period and EHOM features. Fig. 8 shows the classification results for the period features, the EHOM features, and their combination; the average accuracies are 80.49%, 89.38%, and 93.47%, respectively. The EHOM component achieves a better result than the period component, and the hybrid feature is more discriminative than either component alone.

Fig. 6. The frequencies of different actions in the KTH database.


Method   Mean Accuracy
BoW      87.25%
EHOM     89.38%

Fig. 7. Comparison between BoW and EHOM on the KTH dataset.

Method           Mean Accuracy
Period feature   80.49%
EHOM feature     89.38%
Hybrid feature   93.47%

Fig. 8. Mean accuracy of the different features on the KTH dataset.

5.2 Comparison with the State-of-the-Art

Experiments on the Weizmann Dataset: The Weizmann human action dataset contains 93 low-resolution video sequences showing 9 different people, each performing 10 different actions. We tracked and stabilized the figures using the background subtraction masks that come with the dataset. Fig. 9(a) shows some sample frames of the dataset, and the confusion matrix of our results is shown in Fig. 9(b). Our method achieves 100% accuracy.

Experiments on the KTH Dataset: The KTH human motion dataset contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping). Each action is performed several times by 25 subjects in four different conditions: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors.

Fig. 9. Results on the Weizmann dataset: (a) sample frames; (b) confusion matrix using 100 codewords. The matrix is purely diagonal: each of the ten actions (bend, jack, jump, pjump, run, side, skip, walk, wave1, wave2) is recognized with accuracy 1.0 (overall accuracy = 100%).

Method               Mean Accuracy
Ali and Shah [10]    87.70%
Fathi and Mori [9]   90.50%
Laptev et al. [18]   91.8%
Our method           93.47%

Fig. 10. Mean accuracy of different methods on the KTH dataset.


                boxing  handclapping  handwaving  jogging  running  walking
boxing            .98       .02          .00        .00      .00      .00
handclapping      .03       .96          .01        .00      .00      .00
handwaving        .01       .04          .95        .00      .00      .00
jogging           .00       .00          .00        .84      .13      .03
running           .00       .00          .00        .09      .91      .00
walking           .00       .00          .00        .02      .03      .95

Fig. 11. Results on the KTH dataset: (a) sample frames; (b) confusion matrix (rows: ground-truth action, columns: predicted action) using 100 codewords (overall accuracy = 93.47%).

Representative frames of this dataset are shown in Fig. 11(a). Note that a person may move in different directions within a video of the KTH database [16], so we divide each video into several segments according to the person's moving direction. Since most previously published results assign a single label to each video, we also report per-video classification on the KTH dataset; the per-video label is acquired by majority voting over the segment predictions. The confusion matrix on the KTH dataset is shown in Fig. 11(b). Most of the confusion occurs among the last three actions: running, jogging, and walking. We compare our results with the current state of the art in Fig. 10; our results outperform the other methods. One reason for the improvement is the complementarity of the period and EHOM components in our feature; another is that combining the Bag of Words representation with a Markov process preserves the correlations between words and, to some extent, the temporal information.
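The per-video label described above can be obtained with a simple majority vote over the segment predictions, e.g.:

```python
# Sketch: majority voting over the predicted labels of a video's segments.
from collections import Counter

def video_label(segment_predictions):
    return Counter(segment_predictions).most_common(1)[0][0]
```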

6 Conclusion

In this paper, we propose an efficient feature for human action recognition. The hybrid feature is composed of two complementary ingredients: the period component captures global motion in the time domain, while the EHOM component describes local motion information. When generating the EHOM, we integrate the Bag of Words representation with a Markov process to relax the requirement on the duration of videos and to maintain the dynamic information. Experiments confirm the complementary roles of the two components. The proposed algorithm is simple to implement, and experiments demonstrate its improved performance compared with state-of-the-art algorithms on the task of action recognition.


Since we have already achieved good results on benchmark datasets under controlled settings, we plan to test our algorithm in more complicated settings, such as movies, in the future.

References

1. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV (2005)
2. Sun, J., Wu, X., Yan, S., Chua, T., Cheong, L., Li, J.: Hierarchical spatio-temporal context modeling for action recognition. In: CVPR (2009)
3. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 257–267 (2001)
4. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV (2003)
5. Zhang, Z., Hu, Y., Chan, S., Chia, L.-T.: Motion context: A new representation for human action recognition. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 817–829. Springer, Heidelberg (2008)
6. Cutler, R., Davis, L.S.: Robust real-time periodic motion detection, analysis, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000)
7. Liu, Y., Collins, R., Tsin, Y.: Gait sequence analysis using frieze patterns. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 657–671. Springer, Heidelberg (2002)
8. Efros, A., Berg, A., Mori, G., Malik, J.: Recognizing action at a distance. In: ICCV (2003)
9. Fathi, A., Mori, G.: Action recognition by learning mid-level motion features. In: CVPR (2008)
10. Ali, S., Shah, M.: Human action recognition in videos using kinematic features and multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2008)
11. Pogalin, E., Smeulders, A.W.M., Thean, A.H.C.: Visual quasi-periodicity. In: CVPR (2008)
12. Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61 (1999)
13. Li, J., Wu, W., Wang, T., Zhang, Y.: One step beyond histograms: Image representation using Markov stationary features. In: CVPR (2008)
14. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: DARPA Image Understanding Workshop (1981)
15. Breiman, L.: Probability. Society for Industrial and Applied Mathematics (1992)
16. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: A local SVM approach. In: ICPR (2004)
17. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: ICCV (2005)
18. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR (2008)