Ordered Trajectories for Human Action Recognition with Large Number of Classes

O. V. Ramana Murthy (1), Roland Goecke (1,2)

(1) Vision & Sensing, Human-Centred Technology Research Centre, ESTeM, University of Canberra
(2) IHCC, CECS, Australian National University

[email protected], [email protected]

Abstract

Recently, a video representation based on dense trajectories has been shown to outperform other human action recognition methods on several benchmark datasets. The trajectories capture the motion characteristics of different moving objects in the spatial and temporal dimensions. In dense trajectories, points are sampled at uniform intervals in space and time and then tracked using a dense optical flow field over a fixed length of L frames (optimally 15), overlapping across the entire video. However, amongst these base (dense) trajectories, a few may continue for longer than duration L, capturing motion characteristics of objects that may be more valuable than the information from the base trajectories. Thus, we propose a technique that searches for trajectories with a longer duration and refer to these as ‘ordered trajectories’. Experimental results show that ordered trajectories perform much better than the base trajectories, both standalone and when combined. Moreover, the uniform sampling of dense trajectories does not discriminate objects of interest from the background or other objects. Consequently, a lot of information is accumulated, which actually may not be useful. This can especially escalate when there is more data due to an increase in the number of action classes. We observe that our proposed trajectories remove some background clutter, too. We use a Bag-of-Words framework to conduct experiments on the benchmark HMDB51, UCF50 and UCF101 datasets containing the largest number of action classes to date. Further, we also evaluate three state-of-the-art feature encoding techniques to study their performance on a common platform.

Keywords: Action recognition, dense trajectories, large scale classification, Fisher vector, bag-of-words, SVM

1. Introduction

Human action recognition requires the modelling of the coordinated motions – the actions – of different parts of the human body and their interaction with other objects or persons nearby. Offline applications include event detection, e.g. injury detection in team sport matches; efficient video retrieval, e.g. spotting a person in news footage; and video indexing. Online applications include automatic surveillance in public places, shopping malls, hospitals etc.; interactive applications, e.g. video games; robotics and more. In early research, human action recognition focussed on classifying only very few (6–11) action classes, which mostly involved motion of the whole body without much interaction with nearby objects or people. Experimental validations were mainly carried out on videos collected in very controlled environments, such as in a laboratory, staged by a single person on several occasions. However, with the widespread use of the internet these days, users are uploading millions of videos on social networking sites such as YouTube and Facebook. This has created challenges for developing robust techniques for action recognition in real-world videos, even more so when considering a much larger number of action classes. Real-world videos can contain movements of the entire body or of only some specific regions, e.g. facial expressions or moving a limb, possibly repetitive, whole-body movements such as walking and running, or a number of sequences of body movements, such as walking in a queue or cross-walking at an intersection.

It is of interest to investigate how to adapt, generalise or fuse the existing techniques to model any kind of human action. In real-world videos, context (e.g. the environment / situation) and the interaction of the body with the context (e.g. objects / persons) are also important to correctly classify the actions being performed. In this paper, we focus on a feature representation scheme for large scale human action recognition in realistic videos. We select the local representation based Bag-of-Words (BoW) technique to test our proposed hypothesis. Firstly, interest points are detected at different spatio-temporal locations and scales for a given video. Then, local feature descriptors are computed in the spatio-temporal neighbourhood of the detected interest points, which capture shape (gradient) or motion (optical flow) or similar measurements describing the human action dynamics. We observe that trajectories obtained by a Kanade-Lucas-Tomasi (KLT) tracker [1], either densely sampled [2] or sampled in one of the variants [3, 4, 5, 6], have been consistently performing well on several benchmark action recognition datasets. Hence, we focus our current work on trajectories only. We conjecture that objects moving for a longer duration contain more discriminative information for action classification. Hence, we propose a scheme to select such trajectories from the base (dense) trajectories in a very computationally and memory efficient way. We refer to such selected trajectories as ‘ordered trajectories’. We study and present our results on the large scale action datasets HMDB51, UCF50 and UCF101 containing at

least 50 different action classes. All experiments are performed in a BoW framework using a Support Vector Machine (SVM) classifier. In this paper, we make the following contributions:

1. A technique that captures information of objects with longer duration trajectories.
2. A feature selection like approach that selects about half of the dense trajectories, yet delivers better performance than the original and several other trajectory variants.
3. Removal of a large number of trajectories related to background noise.

In the remainder of the paper, Section 2 contains a review of the latest advances in feature representations w.r.t. large scale action recognition. Section 3 describes the proposed framework. Section 4 details the local feature descriptors, codebook generation, classifier and datasets used to conduct our experiments. Section 5 presents and discusses the results obtained on the benchmark datasets. Finally, conclusions are drawn in Section 6.

2. Related Literature

Several techniques to detect interest points and construct local feature based representations exist in the literature. Laptev and Lindeberg [7] first proposed the usage of Harris 3D corners as an extension of the traditional Harris corner points. Later on, cuboid detectors obtained as local maxima of the response function of temporal Gabor filters on a video were proposed by Dollár et al. [8]. Willems et al. [9] proposed a Hessian interest point detector, which is a spatio-temporal extension of the Hessian saliency measure for blob detection in images. Wang et al. [10] proposed a dense sampling approach wherein the interest points are extracted at regular positions and scales in space and time. Spatial and temporal sampling are often done with 50% overlap. Further, Wang et al. [2] extend the dense sampling approach by tracking the interest points using a dense optical flow field. We focus particularly on trajectory based techniques of the last five years as they have been found to perform better than most other techniques on larger action recognition datasets, e.g. UCF50 and HMDB51, with 50 or more action classes. For the sake of presentation of our work, we review trajectory based techniques from three perspectives. Firstly, we review those works that propose a variant of trajectories [4, 5] and/or new local feature descriptors [3]. Secondly, we review those works which construct features in different scales/volumes obtained by dividing the video along height, width and time, and then aggregate these. This is akin to the popular Spatio-Temporal Pyramidal approach [11, 12, 13]. The trajectories and their variants can also be aggregated together. Thirdly, we review those works where some trajectories are selected from those obtained by the above approaches. All three categories of feature representation are still in practice and found to yield good results on different action recognition datasets. They are described below.

2.1. Trajectories and Variants

Uemura et al. [14] proposed human action recognition based on the KLT tracker and the SIFT descriptor. Multiple interest point detectors were used to provide a large number of interest points for every frame. Sun et al. [15] proposed a hierarchical structure to model spatio-temporal contextual information by matching SIFT descriptors between two consecutive frames. Actions were classified based on intra- and inter-trajectory statistics. Messing et al. [16] proposed an activity recognition model based on the velocity history of Harris 3D interest points (tracked with a KLT tracker). Matikainen et al. [17] proposed a model to capture the spatial and temporal context of trajectories, which were obtained by tracking Harris corner points in a given video using a KLT tracker. Wang et al. [2] proposed dense trajectories to model human actions. Interest points were sampled at uniform intervals in space and time, and tracked based on displacement information from a dense optical flow field. Kliper-Gross et al. [5] proposed Motion Interchange Patterns (MIP) for capturing local changes in motion trajectories. Based on dense trajectories [2], Jiang et al. [4] proposed a technique to model the object relationships by encoding pairwise dense trajectory codewords. Global and local reference points were adopted to characterise motion information with the aim of being robust to camera movements. Jain et al. [3] recently proposed another variant of dense trajectories, showing that significant improvement in action recognition can be achieved by decomposing visual motion into dominant (assumed to be due to camera motion) and residual motions (corresponding to the scene motions). Raptis [18] extracted spatio-temporal structures by forming clusters of dense trajectories to serve as candidates for the parts of each action. Peng et al. [19] developed a set of new descriptors: spatial context descriptors to capture complex spatial structures of appearance and motion, and temporal context descriptors to depict clear motion and appearance changes from successive patches. Both types of descriptors are computed from dense trajectories. Each technique has attempted to capture and model some type of characteristic information of objects moving in the video. However, none explicitly tried to capture the information of objects moving for a longer duration. We conjecture that valuable information for action recognition is contained in longer object trajectories.

2.2. Spatio-Temporal Pyramidal Approach

To encode spatial information within the bag-of-features representation, the Spatial Pyramidal approach was first proposed by Lazebnik et al. [12] for object classification in images, but has since also been successfully used in videos. Here, the video sequence is split into sub-volumes and a histogram is computed for each sub-volume. The final histogram is obtained by concatenating or accumulating all histograms of the sub-volumes. Illustrations are given in Figure 1. For example, when the original video is divided into three horizontal sub-volumes, as shown in Figure 1(b), three separate codebooks are constructed, one for each sub-volume. Using those dense trajectories that fall in a particular sub-volume, feature histograms are generated.

These feature histograms are concatenated to yield the overall feature vector for the video. In this case, the feature vector will be 3 times larger than the one obtained on full-volume dense trajectories. In another case, as can be seen in Figure 1(f) where the original video is divided into eight sub-volumes, the final vector will be 8 times larger than the one obtained by the full-volume dense trajectories.

Zhao et al. [20] divide each frame into cells, over which Dollár's features [8] are computed. Additionally, motion features from neighbouring frames are used in a weighted scheme, which takes into account the distance of the neighbour from the actual frame. A spatial-pyramid matching, similar to [12], is then applied to compute the similarity between frames. Finally, frames are classified individually and a voting scheme is used for action recognition in the video. Ullah et al. [13] used six spatial subdivisions of a video and computed local features and models for each subdivision. They found significant improvement in the recognition rate compared to using the original video. Wang et al. [6] compute features using dense trajectories in a Spatio-Temporal Pyramid approach and show the highest performance on most of the action recognition datasets. Zhu et al. [45] proposed a Generalized Pyramid Matching Kernel (GPMK) based on a multi-channel BoW representation constructed from dense trajectories. It is an extension to the spatial-temporal pyramid matching (STPM) kernel. Instead of the predefined and fixed weights used in STPM, their technique computes the channel weights of GPMK adaptively based on the kernel target alignment from training data. Duan et al. [21] proposed the Aligned Space-Time Pyramid Matching technique to effectively measure the distances between two videos that may be from different domains for the purpose of visual event recognition in videos. Each video clip is divided into space-time volumes over multiple levels and the pairwise distances between any two volumes are computed. Liu et al. [22] compute extensive pyramidal features (EPFs), which include the Gabor, Gaussian, and wavelet pyramids. These features were meant to encode the orientation, intensity, and contour information and, therefore, provide an informative representation of human poses. They further employ the AdaBoost algorithm to learn a subset of discriminative and representative features. A main drawback with these kinds of approaches is that they tend to be computationally expensive.

Figure 1: Spatio-temporal grids used by Wang et al. [6]

2.3. Interest Point Selection

Nowak et al. [23] showed by extensive experiments that uniform random sampling can provide performance comparable to dense sampling of interest points. Another independent study [24] showed that action recognition performance can be maintained with as little as 30% of the densely detected features. Chakraborty et al. [25] presented an approach for selecting robust STIP detectors by applying surround suppression combined with local and temporal constraints. They show that such a technique performs significantly better than the original STIP detectors. Shi et al. [26] showed that with a proper sampling density, state-of-the-art performance can be achieved by randomly discarding up to 92% of densely sampled interest points. Peng et al. [19] proposed a motion boundary based dense sampling strategy to select a few dense trajectories, while preserving the discriminative power. These studies have motivated us to search for a smaller, yet richer set of trajectories from the dense trajectories.


2.4. Moving Object Tracking Moving object tracking involves locating the objects of interest, maintaining their identities and building their trajectories throughout a given video sequence. One of the most successful techniques employed in this domain is tracking by data association. In this technique, initially a pre-learnt object detector is applied to detect the objects of interest in every frame. Then, linking of the detections is performed to obtain the trajectories of the objects of interest. The main challenges include missed detections, false alarms, inaccurate detections, occlusions, and similar appearance among multiple objects. Three popular strategies [27] in use to counter these challenges are: local data association, global data association, and hierarchical data association methods. The local data association strategy performs frame-by-frame tracking [28] based on local measures, such as cues from position, appearance, size, and colour of detections in neighbouring frames. Then, an algorithm is applied to match the detection responses. However, any noisy target detections affect the data association considerably and, thus, the local association methods are likely to lead to a drift in such situations. Global data association strategies utilise inference over multiple objects by seeking to resolve the drift problem over a longer period. For example, [29, 30] use the Hungarian algorithm [31] to simultaneously optimise all trajectories of moving objects. However, these approaches are computationally exponential, both in memory and time. The hierarchical association framework combines the local linking and global association methods for better tracking performance. These methods progressively connect the short detections or tracklets into longer ones. They typically split the data association into two separate optimisation problems: Linking detections locally in the lower stages and linking trajectories globally in the higher stages. For example, [32] proposed a


two-stage association method, which combines local and global trajectory association to track multiple objects through occlusions. A particle filter was used to generate a set of reliable trajectories in the local stage and a modified Hungarian algorithm was used to optimise the data association in the global stage. Similarly, [27] proposed a unified framework for automatically relearning from local to global information. The local-to-global trajectory models were used to link detections from consecutive frames into trajectories and also to link separated trajectories that belong to the same targets into long trajectories. Our proposed technique to find longer trajectories has close similarities to those employed in moving object tracking. Dense trajectories are used to obtain trajectories of moving objects at the local level. Cues based on the position (of the trajectories) are taken for linking these dense trajectories to obtain longer trajectories. The difficulties in directly adapting moving object tracking to human action recognition are discussed in Section 4.1. The novelties of the proposed technique compared to tracking are described in Section 5.3.

Figure 2: Base trajectories are extracted from a video. The proposed ordered trajectories are generated by searching through base trajectories for potential matches to other trajectories to create longer duration trajectories. Codebooks are constructed for each set of local feature descriptors of these trajectories. The one-vs-all approach is applied and the class with the highest score is used to predict the action label of a given video.

3. Overall Framework and Background

The overall layout of the proposed framework is shown in Figure 2. Firstly, base trajectories are detected. The proposed ordered trajectories are generated by searching through base trajectories for potential matches to other trajectories to create longer duration trajectories. Local descriptors – Motion Boundary Histograms (MBH), Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF) – are collected from the base trajectories that are found to constitute ordered trajectories. Only the Trajectory Shape descriptor is computed for the generated ordered trajectories. Feature vectors are constructed from these local descriptors using a feature encoding technique. These feature vectors are used to learn a classifier. Separate codebooks are constructed for each type of descriptor. We use dense trajectories, proposed by Wang et al. [2] (and their improved version [33]), in our work as the base trajectories, which we briefly summarise next.

3.1. Dense Trajectories

Firstly, points uniformly spaced over each frame in 8 spatial scales are sampled. This ensures that points are equally spread in all spatial positions and scales. By experimentation, [6] report that a sampling step size of W = 5 pixels yields good results over several benchmark datasets. Points in homogeneous regions are removed, by applying the criterion of Shi and Tomasi [34], if the eigenvalues of the auto-correlation matrix are very small. The threshold T set on the eigenvalues for each frame I is

T = 0.001 × max_{j ∈ I} min(λ^1_j, λ^2_j)    (1)

where (λ^1_j, λ^2_j) are the eigenvalues of point j in frame I. The sampled points are then tracked by applying median filtering over the dense optical flow field. For a given frame I_t, its dense optical flow field ω_t = (u_t, v_t) is computed w.r.t. the next frame I_{t+1}, where u_t and v_t are the horizontal and vertical components of the optical flow. A point p_t = (x_t, y_t) in frame I_t is tracked to another position p_{t+1} = (x_{t+1}, y_{t+1}) in frame I_{t+1} as follows

p_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + (M ∗ ω_t)|_{(x_t, y_t)}    (2)

where M is a 3 × 3 pixels median filtering kernel. The algorithm proposed by Farneback [35] is used to compute dense optical flow. To avoid drifting from their initial locations during the tracking process, tracking is performed on a fixed length of L frames at a time. Through experimentation [6], L = 15 frames is found suitable. In a post-processing stage, trajectories with sudden large displacements are removed.
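To make the tracking step concrete, the following is a minimal illustrative sketch (not the authors' released code) of one application of Eq. (2), assuming OpenCV's Farneback optical flow; the function name track_points and its parameters are ours.

```python
# Illustrative sketch of one tracking step of Eq. (2): Farneback dense optical
# flow, smoothed by a 3x3 median filter, propagates points from frame t to t+1.
import cv2
import numpy as np

def track_points(prev_gray, next_gray, points):
    """prev_gray, next_gray: uint8 grayscale frames; points: list of (x, y)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Median-filter each flow component (the kernel M in Eq. (2)).
    u = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 3)
    v = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 3)
    h, w = prev_gray.shape
    tracked = []
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h:
            tracked.append((x + u[yi, xi], y + v[yi, xi]))
    return tracked
```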

3.2. Local Feature Descriptors

Four kinds of local feature descriptors are computed on the neighbourhood of the points of the above trajectories: MBH, HOG, HOF and Trajectory Shape. Each descriptor captures some specific characteristics of the video content. HOG descriptors capture the local appearance, while HOF descriptors capture the changes in the temporal domain. The space-time volumes (spatial size 32 × 32 pixels) around the trajectories are divided into 12 equal-sized 3D grids (spatially, 2 × 2 grids and temporally, 3 segments). For computing HOG, gradient orientations are quantised into 8 bins. For computing HOF, 9 bins are used, with one more zero bin in comparison to HOG. Thus, the HOG descriptors are 96-dimensional and the HOF descriptors are 108-dimensional. The MBH descriptors are based on motion boundaries. These descriptors are computed by separate derivatives for the horizontal and vertical components of the optical flow. As MBH captures the gradient of the optical flow, constant camera motion is removed and information about changes in the flow field

(i.e. motion boundaries) is retained. An 8-bin histogram is obtained along each component of x and y. Both histogram vectors are normalised separately with their L2 norm, each becoming a 96-dimensional vector. In our experiments, we built separate codebooks for the MBH descriptors along x and y. The trajectory shape descriptor encodes local motion patterns. For a trajectory of given length L (number of frames) and containing a sequence of points p_t = (x_t, y_t), the trajectory shape is described in terms of a sequence of displacement vectors Δp_t = (p_{t+1} − p_t) = (x_{t+1} − x_t, y_{t+1} − y_t). The resulting vector is normalised by the sum of the displacement vector magnitudes:

T = (Δp_t, ..., Δp_{t+L−1}) / Σ_{j=t}^{t+L−1} ||Δp_j||    (3)

For L = 15 frames, a 30-dimensional trajectory shape descriptor is obtained. The dense trajectories code available online [2] is used in all experiments.
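The trajectory shape descriptor of Eq. (3) can be sketched in a few lines; this is an illustrative implementation under the assumption that a trajectory is given as an array of (x, y) points (the helper name trajectory_shape is ours).

```python
# Illustrative sketch of Eq. (3): displacement vectors normalised by the sum
# of their magnitudes (30-dimensional in the paper's L = 15 setting).
import numpy as np

def trajectory_shape(points):
    """points: (n, 2) array of (x, y) positions; returns a 2*(n-1)-dim vector."""
    pts = np.asarray(points, dtype=np.float64)
    disp = np.diff(pts, axis=0)                    # Delta p_t = p_{t+1} - p_t
    norm = np.sum(np.linalg.norm(disp, axis=1))    # sum of displacement magnitudes
    if norm < 1e-12:                               # guard against static points
        return np.zeros(disp.size)
    return (disp / norm).ravel()
```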

4. Proposed Ordered Trajectories

In this section, we first present our hypothesis and later describe the generation of the proposed ordered trajectories from the extracted base trajectories.

4.1. Hypothesis

Several trajectory based techniques exist to capture and model some type of characteristic information of objects moving in the video. For example, as shown in Figure 3, let trajectories A, B, C of fixed span L begin from frames i, (i + 1), (i + 2), respectively. Ordinarily, all these trajectories would have been considered when constructing the feature vector for the video. However, the following observations are made:

• Trajectories, such as A, that continue for a longer span or duration can be expected to capture the characteristic information of the primary object(s) motion.
• Trajectories, such as B, and to some extent C, that fade out slowly can be expected to capture information about secondary or background object(s) motion.

Figure 3: Trajectories A, B, C beginning from frames i, (i + 1), (i + 2), respectively.

We hypothesize that longer duration trajectories contain more valuable information than all the remaining trajectories put together (as these contain noise, which can affect the training process). However, computing trajectories of longer duration has the following issues that need to be addressed:

• Limitation of optical flow techniques that may not keep track of changes over long durations.
• With different length trajectories, the local descriptors computed around each of them would have different dimensions. For example, trajectories A, B, C with a span of L + 2, L, L + 1 would result in Trajectory Shape descriptor dimensions of 34, 30, and 32, respectively. Handling these in a BoW setting is not easy and straightforward.

We propose a technique to address these primary drawbacks. We hypothesize that any trajectory, even if it is of a longer duration (> L), should be constituted by base trajectories, i.e. trajectories of duration L. We can capture the information of moving objects with a longer duration if we can search for those base trajectories which constitute the longer duration trajectories. Since the local descriptors of each of these base trajectories are of fixed dimensions, they are directly used in computing feature vectors for the given video. Hence, we do not need to recompute the MBH, HOG, HOF descriptors for the trajectories of longer duration. For example, as seen in Figure 3, trajectories beginning from frame i can be compared with trajectories beginning from frame (i + 1) to find the trajectories that have the longer duration L + 1. Further, the proposed technique can be recursively applied. Base trajectories spanning over L frames can be compared with each other to obtain trajectories with span L + 1. These (L + 1)-length trajectories can be compared with each other to further generate trajectories with span L + 2 (results are shown in Section 5.2.2). The search for base trajectories constituting longer duration trajectories can be performed in a very simple and efficient manner. We only need a sequential matching of trajectories beginning at every two consecutive frames. In this way, we can collect all base trajectories very efficiently, even if the long duration trajectory has undergone major motion changes from where it began. We refer to such collected base trajectories as ordered trajectories. Moreover, as in Figure 3, we first compare trajectories beginning from frame i with trajectories beginning from frame (i + 1), then trajectories beginning from frame (i + 1) with trajectories beginning from frame (i + 2), and so on. In this way, trajectories beginning from frame i are effectively also matched with trajectories beginning from frame (i + 2). This can, hence, be treated as a more efficient way to match trajectories than other techniques such as clustering all trajectories together [18]. The two main stages involved in the generation of ordered trajectories – the matching stage and the generation stage – are now described in detail.

4.2. Search for a Matching Trajectory

The objective in this stage is to search for trajectories that have a matching trajectory (beginning from the immediate next frame). Consider a trajectory of points P_i = {p_i} with length L

beginning from frame i. Let the trajectories beginning from the immediate next frame i + 1 be denoted as Q^j_{i+1} = {q^j_{i+1}}. The objective can then be redefined as finding a trajectory in the set Q^j_{i+1} that matches the trajectory P_i. Several solutions exist in the literature for this objective. For example, this is essentially similar to a detection-to-track association problem in the moving object tracking domain. Cues such as temporal, kinematic, appearance and colour information of the object trajectories are used to find the matching trajectory. The trajectories in moving object tracking are usually obtained by tracking detected objects of interest. However, the dense trajectories obtain trajectories of any moving object using an optical flow algorithm (described in Section 3.1). Hence, dense trajectories also contain tracks of many non-interest objects; sometimes more than for the primary object itself. Thus, cues such as appearance and colour are not considered in our approach. Our objective is now refined as finding a matching trajectory Q^j_{i+1} such that tracks of non-interest objects are also minimised. We focus on the kinematic cue – the position of the trajectories.

Based purely on kinematics, two trajectories can be tested for a possible match in several ways. For example, probabilities capturing the nature of the position, velocity, orientation, etc. have been used to constitute the cost function and optimised by the Hungarian algorithm for object trajectory assignment in moving object tracking. However, it has to be recalled that the dense trajectories are computed at every interval gap of W = 5 pixels in (x, y) directions and in multiple scales, so that there are many trajectories related to one main object of interest, e.g. a human body. However, in the moving object tracking domain, the trajectory of a human body is considered as a whole. Hence, the kinematic based cost function should be chosen properly in the action recognition domain.

We now present our intuition in selecting the kinematic based cost function. The objective is to search for a trajectory Q^j_{i+1} that can match the trajectory P_i. Trajectories Q^j_{i+1} and P_i begin from frames i + 1 and i, respectively. The simplest cost function for two trajectories is a corresponding point-to-point distance given by

||(p_{i+1} − q^j_{i+1})² + (p_{i+2} − q^j_{i+2})² + ... + (p_{i+L−1} − q^j_{i+L−1})²||_2    (4)

An algorithm, such as the Hungarian algorithm, can be used to optimise this cost function. The trajectory Q^k_{i+1}, k ∈ j, which yields the lowest distance, can be considered as a match for the trajectory P_i. However, the number of dense trajectories is really large and the Hungarian algorithm can be expensive in time. Hence, we propose a simplified cost function and search algorithm as follows. The first term of Eq. 4 is taken as the simplified cost function, given as

||(p_{i+1} − q^j_{i+1})²||_2    (5)

We select the second point p_{i+1} = (x_{i+1}, y_{i+1}) from the trajectory P_i and compute its distance distance(p_{i+1}, q^j_{i+1}) to all first points of the trajectories Q^j_{i+1} (see Figure 4). The trajectory Q^k_{i+1}, k ∈ j, with the lowest distance is taken and compared with a threshold distance d_th to ascertain if it is a matching trajectory for P_i or not. This threshold distance d_th is fixed with the help of the spacing gap of the dense trajectories. Dense trajectories are computed with a spacing gap of W = 5 pixels in (x, y) directions. Hence, the threshold distance d_th is fixed as W√2 in all experiments. More details on the efficiency (in terms of computation time) of the proposed algorithm over the Hungarian algorithm are given in Section 5.3.

Figure 4: A trajectory of points P_i starting in frame i searches over all trajectories of points Q^j_{i+1} beginning in frame i + 1, based on the distance between the second point of trajectory P_i and the first point of the trajectories Q^j_{i+1}.

Algorithm 1: Generating ordered trajectories
Input: Base trajectories detected in a video
Output: Ordered trajectories for the video
1:  N ← number of frames in the video
2:  for i ← 1 to N − L − 1 do
3:      P_i ← trajectories beginning from frame i (J in number)
4:      Q_{i+1} ← trajectories beginning from frame i + 1 (K in number)
5:      Count1 ← 1
6:      for j ← 1 to J do
7:          for k ← 1 to K do
8:              dist[k] ← ||p^j_{i+1} − q^k_{i+1}||_2
9:          k* ← arg min_k dist[k]
10:         if dist[k*] < d_th then
11:             PQ_i[Count1] ← [p^j_i, mean(p^j_{i+1}, q^{k*}_{i+1}), ..., mean(p^j_{i+L−1}, q^{k*}_{i+L−1}), q^{k*}_{i+L}]
12:             Count1 ← Count1 + 1
13: return ordered trajectories PQ_i
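The matching stage can also be sketched in Python; this is an illustrative version of the inner step of Algorithm 1, assuming base trajectories are stored as NumPy arrays of (x, y) points (all names are ours, not the authors' code).

```python
# Illustrative sketch of the matching step of Algorithm 1: for a trajectory P
# starting at frame i, find a match among trajectories Q starting at frame i+1.
import numpy as np

W = 5                         # dense-trajectory sampling step (pixels)
D_TH = W * np.sqrt(2)         # threshold d_th used in all experiments

def find_match(P, Q_list):
    """P: array of trajectory points; Q_list: list of point arrays starting one
    frame later. Returns the index of the matching trajectory, or None."""
    if not Q_list:
        return None
    first_points = np.asarray([Q[0] for Q in Q_list])   # first point of each Q
    d = np.linalg.norm(first_points - P[1], axis=1)     # Eq. (5): second point of P
    k = int(np.argmin(d))
    return k if d[k] < D_TH else None
```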

This process is repeated for every L = 15 frames, i.e. for every trajectory P_i, i = 1, 2, ..., N − L − 1, where N is the total number of frames in the given video source. For some trajectories P_i, there may not exist any matching trajectory Q^j_{i+1}. The pseudocode for matching trajectories for an entire video is shown in

Algorithm 1. For each video, this computation is performed only once. All matching trajectories – P_i, Q^j_{i+1} – obtained are conjectured to constitute longer trajectories with duration L + 1.

4.3. Generation of an Ordered Trajectory

The objective of this stage is to compute the local descriptors – MBH, HOG, HOF and trajectory shape – of the longer trajectories obtained in Section 4.2. The MBH, HOG and HOF descriptors of the trajectories P_i and Q^j_{i+1} (which have been identified to constitute longer trajectories) are accumulated to construct feature vectors for MBH, HOG and HOF, respectively. These three types of descriptors are not recomputed. Only the trajectory shape local descriptor is recomputed for the longer duration trajectories, which is explained now. There are two steps involved in computing the trajectory shape of the longer trajectory. In the first step, the matching trajectories P_i and Q^j_{i+1} are merged to form a longer duration trajectory PQ_i = {pq_t}, t = i, ..., i + L, with duration L + 1. Merging is now explained with the help of Figure 5. Remember that trajectories are a sequence of points with length L, defined as P_i = {p_i} and Q^j_{i+1} = {q^j_{i+1}}. The first element of the longer duration trajectory, pq_i, consists of p_i = (x_i, y_i). The last element of the longer duration trajectory, pq_{i+L}, consists of q_{i+L} = (x_{i+L}, y_{i+L}). The second element, pq_{i+1}, consists of the mean of p_{i+1} and q_{i+1}. The third element, pq_{i+2}, consists of the mean of p_{i+2} and q_{i+2}, and so on.

Figure 5: The first point in the trajectory of frame i and the last point (i + L) in the trajectory of frame i + 1 are transferred as the first and last points of the new ordered trajectory, respectively. The remaining trajectory points are merged in an ordered manner by taking the mean of the intermediate elements of trajectories P_i and Q^j_{i+1}.

Once the sequence of points of the longer trajectory PQ_i is generated, the trajectory shape descriptor is computed using Eq. 3. As L = 16 now, the length of the trajectory shape descriptor is 32.
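The merging step of Figure 5 reduces to averaging the overlapping points; the following is an illustrative sketch under the same array representation as above.

```python
# Illustrative sketch of the merging step (cf. Figure 5): matched trajectories
# P (starting at frame i) and Q (starting at frame i+1) are combined into one
# ordered trajectory with one additional point.
import numpy as np

def merge_trajectories(P, Q):
    """P, Q: equal-length arrays of (x, y) points offset by one frame."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    middle = (P[1:] + Q[:-1]) / 2.0          # mean of the overlapping points
    return np.vstack([P[:1], middle, Q[-1:]])

# The trajectory shape descriptor (Eq. (3)) is then recomputed on the merged
# points, e.g. with the trajectory_shape() sketch above (32-dimensional in the
# paper's setting).
```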

4.4. Bag-of-Words Framework

Separate dictionaries are built for each descriptor. We accumulate descriptors from a video of a particular class and randomly select 25% of the total number of descriptors. In this way, we collect information for each class from the training dataset and a subset of 100,000 descriptors is selected randomly. This subset of descriptors is clustered to obtain the visual words.
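A codebook construction along these lines could look as follows; this is an illustrative sketch (using scikit-learn's KMeans) with the 100,000-descriptor subset from this section and the k = 4000 / 8-restart setting mentioned under Hard Assignment in Section 4.5; the function name is ours.

```python
# Illustrative sketch of codebook construction: sample ~25% of descriptors per
# class, cap the pool at 100,000 samples and cluster them with k-means.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors_per_class, n_sample=100_000, k=4000, seed=0):
    rng = np.random.default_rng(seed)
    pool = []
    for descs in descriptors_per_class:            # one (n_i, d) matrix per class
        if len(descs) == 0:
            continue
        idx = rng.choice(len(descs), size=max(1, len(descs) // 4), replace=False)
        pool.append(descs[idx])
    pool = np.vstack(pool)
    if len(pool) > n_sample:
        pool = pool[rng.choice(len(pool), size=n_sample, replace=False)]
    km = KMeans(n_clusters=k, n_init=8, random_state=seed).fit(pool)
    return km.cluster_centers_                     # the visual words
```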

4.5. Encoding

The next step in constructing a BoW representation is encoding a feature using the codebook. In the literature, there exist several encoding methods, such as Hard Assignment (HA), Soft Assignment [36], Sparse encoding [37] and Locality-constrained Linear encoding [38]. Most of these techniques are based on the count of descriptors belonging to different visual words. Most recently, the Vector of Locally Aggregated Descriptors (VLAD) and Fisher Vector (FV) encodings have been applied to action recognition datasets. These techniques are based on a measure determining how much a descriptor belongs to a particular (assigned) visual word. We apply the HA, VLAD and FV encoding schemes in our experiments. Brief details of each encoding technique are given below.

• Hard Assignment (HA): HA has been the foremost and one of the most popular techniques for action recognition. This technique computes the frequency of the visual words (of the dictionary) for a given video. In our experiments, dictionaries are constructed with k-means clustering. The number of visual words k is set to 4000, which was shown to give good results [10]. We initialised k-means 8 times and kept the result with the lowest error. After creating a codebook C = {µ_1, µ_2, ..., µ_k}, local descriptors are assigned to the closest visual word as

q : R^d → C ⊂ R^d    (6)
x ↦ q(x) = arg min_{µ ∈ C} ||x − µ||_2    (7)

where the norm operator ||·||_2 refers to the L2 norm.
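An illustrative sketch of Hard Assignment encoding (Eqs. (6)-(7)) follows; it is a straightforward nearest-word histogram, not an optimised implementation.

```python
# Illustrative sketch of Hard Assignment: each descriptor votes for its nearest
# visual word and a normalised frequency histogram is returned.
import numpy as np

def hard_assignment(descriptors, codebook):
    """descriptors: (n, d); codebook: (k, d); returns a k-bin histogram."""
    # Simple (memory-hungry) pairwise squared distances via broadcasting.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                    # q(x) = arg min ||x - mu||
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```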

• Vector of Locally Aggregated Descriptors (VLAD): This is a very recent technique applied successfully to action recognition by [3]. In this encoding, the difference between the descriptors and the closest visual word is collected as residual vectors. For each visual word of dimension d (the dimension of the local feature descriptor), a sub-vector v_i is obtained by accumulating the residual vectors as

v_i = Σ_{x : q(x) = µ_i} (x − µ_i)    (8)

The obtained sub-vectors are concatenated to yield a D-dimensional vector, where D = k × d. Further, a two-stage normalisation is applied. Firstly, the ‘power-law normalisation’ [39] is applied. It is a component-wise non-linear operation. Each component v_j, j = 1 to D, is modified as

v_j = |v_j|^α × sign(v_j)    (9)

where α is a parameter such that α ≤ 1. In all experiments, k = 256 and α = 0.2. Secondly, the vector is L2-normalised as v := v/||v|| to yield the VLAD vector.
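VLAD encoding (Eqs. (8)-(9)) can be sketched as follows, assuming a k-means codebook as above; the power-law exponent α = 0.2 follows the text.

```python
# Illustrative sketch of VLAD: accumulate residuals to the nearest visual word,
# apply power-law normalisation and L2-normalise the result.
import numpy as np

def vlad(descriptors, codebook, alpha=0.2):
    k, d = codebook.shape
    nearest = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)
    v = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[nearest == i]
        if len(assigned):
            v[i] = (assigned - codebook[i]).sum(axis=0)   # Eq. (8)
    v = v.ravel()
    v = np.sign(v) * np.abs(v) ** alpha                   # Eq. (9)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```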

Figure 6: Sample action classes. Top (HMDB51): (a) Clap, (b) Climb, (c) Dribble, (d) Sword, (e) Golf. Bottom (UCF101): (a) BaseballPitch, (b) Basketball, (c) BenchPress, (d) Biking, (e) CleanAndJerk.

• Fisher Vector (FV) Encoding: In this technique, the codebook is not obtained via k-means as used in the above two techniques. Instead, a Gaussian Mixture Model (GMM) is fitted to the selected subset (100,000) of descriptors from the training set. Let the parameters obtained from the GMM fitting be defined as θ = (µ_k, Σ_k, π_k; k = 1, 2, ..., K), where µ_k, Σ_k and π_k are the mean, covariance and prior probability of each distribution, respectively. The GMM associates each descriptor X_i to a mode k in the mixture with a strength given by the posterior probability

q_{ik} = exp[−(1/2) (X_i − µ_k)^T Σ_k^{−1} (X_i − µ_k)] / Σ_{t=1}^{K} exp[−(1/2) (X_i − µ_t)^T Σ_t^{−1} (X_i − µ_t)]    (10)

The mean (u_{jk}) and deviation (v_{jk}) vectors for each mode k are computed as

u_{jk} = (1 / (N √π_k)) Σ_{i=1}^{N} q_{ik} (x_{ji} − µ_{jk}) / σ_{jk}    (11)

v_{jk} = (1 / (N √(2 π_k))) Σ_{i=1}^{N} q_{ik} [((x_{ji} − µ_{jk}) / σ_{jk})² − 1]    (12)

where j = 1, 2, ..., D spans the local descriptor vector dimensions. The FV is then obtained by concatenating the vectors (u_{jk}) and (v_{jk}) for each of the K modes in the Gaussian mixture. Next, the FV is normalised by the ‘power-law normalisation’ defined in Eq. 9 with α = 0.5. Finally, the vector is L2-normalised as v := v/||v|| to yield the FV vector.
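The FV computation can be sketched as below for a diagonal-covariance GMM; the posterior follows Eq. (10) as written (i.e. without mixture-weight terms), and a practical system would rather use an optimised library implementation.

```python
# Illustrative sketch of Fisher Vector encoding (Eqs. (10)-(12)) for a GMM with
# diagonal covariances (sigmas hold per-dimension standard deviations).
import numpy as np

def fisher_vector(X, means, sigmas, priors):
    """X: (N, D); means, sigmas: (K, D); priors: (K,). Returns a 2*K*D vector."""
    N, D = X.shape
    K = len(priors)
    # Posterior of Eq. (10): softmax of -0.5 * Mahalanobis distances.
    log_p = np.stack([-0.5 * (((X - means[k]) / sigmas[k]) ** 2).sum(1)
                      for k in range(K)], axis=1)
    q = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    u, v = np.empty((K, D)), np.empty((K, D))
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        u[k] = (q[:, k, None] * diff).sum(0) / (N * np.sqrt(priors[k]))              # Eq. (11)
        v[k] = (q[:, k, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * priors[k]))  # Eq. (12)
    fv = np.concatenate([u.ravel(), v.ravel()])
    fv = np.sign(fv) * np.abs(fv) ** 0.5            # power-law with alpha = 0.5
    return fv / max(np.linalg.norm(fv), 1e-12)      # final L2 normalisation
```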

4.6. Classification

We use two types of classifiers according to the type of encoding technique used. For HA encoded features, we build a non-linear SVM (LIBSVM [40]) with an RBF-χ² kernel k(H_i^c, H_j^c) computed as

k(H_i^c, H_j^c) = (1/A) Σ_n (h_{in} − h_{jn})² / (h_{in} + h_{jn})    (13)

where H_i^c = {h_{in}} and H_j^c = {h_{jn}} are the frequency histograms of the visual word occurrences of the c-th descriptor for the i-th and j-th videos, and A is the average χ² distance between the training samples. For N types of descriptors, a multi-channel setup is used:

K(i, j) = exp(− (1/N) Σ_{c=1}^{N} k(H_i^c, H_j^c))    (14)

For VLAD and FV encoded features, all features are concatenated and a linear SVM (LIBLINEAR [41]) is used. The one-versus-all approach is applied in all cases and the class with the highest score is selected.
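A sketch of the multi-channel RBF-χ² kernel of Eqs. (13)-(14) is given below; A_c is assumed to be precomputed as the mean χ² distance of channel c over the training set, and the function names are ours.

```python
# Illustrative sketch of the multi-channel RBF-chi^2 kernel of Eqs. (13)-(14).
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(channels_i, channels_j, A):
    """channels_*: list of per-descriptor histograms; A: list of mean distances."""
    n = len(channels_i)
    s = sum(chi2_distance(hi, hj) / a for hi, hj, a in zip(channels_i, channels_j, A))
    return np.exp(-s / n)
```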

4.7. Datasets

The proposed technique is applied to three benchmark datasets: HMDB51 [42], UCF50 [43] and UCF101 [44]. HMDB51 contains 51 action categories. A complete list of actions and the data is available at http://serre-lab.clps.brown.edu/resources/HMDB/. Digitised movies, public databases such as the Prelinger archive, videos from YouTube and Google videos were used to create this dataset. For evaluation purposes, three distinct training and testing splits are specified in the dataset. These splits were built to ensure that clips from the same video are not used for both training and testing. For each action category in a split, 70 training and 30 testing clip indices are fixed so that they fulfil the 70/30 balance for each meta tag. UCF50 has 50 action classes. A complete list of actions and the data is available at http://crcv.ucf.edu/data/UCF50.php. This dataset consists of 6680 realistic videos collected from YouTube. As specified by [43], we evaluate the 25-fold group-wise cross-validation classification performance on this dataset. The UCF101 dataset is an extension of UCF50 with 101 action categories, also collected from realistic action videos, e.g. from YouTube. Three train-test splits were provided for consistency in reporting performance. Sample classes are shown in Figure 6.

5. Results and Discussions

In this section, the results obtained by applying the proposed ordered trajectories are presented.

5.1. Dense Trajectories as Base Trajectories

This section contains results obtained by taking dense trajectories [2] as base trajectories.

5.1.1. Performance of Ordered Trajectories

Results obtained by using the HA encoded ordered trajectories on the HMDB51, UCF50 and UCF101 datasets are presented in Table 1. The overall performance on the HMDB51 and UCF50 datasets is 47.3% and 85.5%, respectively, which is 0.7-1% (absolute) more than the traditional dense trajectories [2] computed at five different scales in space and time [6]. The results for the individual local feature descriptors show that only the trajectory shape descriptor performs better than the Spatio-Temporal Pyramidal approach, by 3-4% (absolute), due to the trajectory descriptor being computed on the matching dense trajectories. In the case of the other descriptors – MBH, HOG, HOF – we only retain the corresponding descriptors of matching trajectories.

Table 1: Ordered trajectory based recognition rates.
Approach                        Descriptor    UCF50    HMDB51    UCF101
Dense trajectories [2]          TrajShape     67.2%    28.0%     NA
                                MBH           82.2%    43.2%     NA
                                HOG           68.0%    27.9%     NA
                                HOF           68.2%    31.5%     NA
                                Combined      84.5%    46.6%     NA
Proposed ordered trajectories   TrajShape     72.4%    31.2%     47.1%
                                MBH           82.3%    41.0%     67.9%
                                HOG           66.2%    26.5%     51.4%
                                HOF           67.0%    30.9%     52.0%
                                Combined      85.5%    47.3%     72.8%

5.1.2. Minimising Background Clutter

The effect of the proposed ordered trajectories can be observed in Figure 7. The ordered trajectories have not lost the principal object (person). At the same time, some background clutter, e.g. the steps in (c) ‘Push up’ and the windows in (e) ‘Shoot bow’, has been filtered out by the proposed technique.

5.1.3. Ordered Trajectories on a Sample Video

For a sample video of 13s duration, 819KB file size and 320 × 240 pixels resolution, the number of dense trajectories extracted is shown in Table 2. The number of dense trajectories computed at the original scale in space and time, as shown in Figure 1(a), is 21,647. By the proposed technique, 11,657 ordered trajectories are obtained, which is about 50% of the actual number of dense trajectories. A natural question that may follow is “Why not compute dense trajectories directly for L = 16?”. Actually, dense trajectories are first detected at different scales in space. A trajectory that may have ended in one scale may have continued in the next frame on the upper or lower scale. Dense trajectories do not exclusively combine such trajectories. The number of dense trajectories computed for L = 16 and L = 17 is shown in Table 2. Although their number is slightly smaller than for L = 15, it is still nearly twice the number of the proposed ordered trajectories detected at L = 15.

Table 2: Number of trajectories for a sample video in different schemes.
Approach                                        Number of Trajectories
Reference dense trajectories [2] at L = 15      21,647
Dense trajectories at L = 16                    21,521
Dense trajectories at L = 17                    21,329
Proposed ordered trajectories at L = 15         11,657

5.1.4. Computation Time

We now give an idea of the time taken for computing dense and ordered trajectories. Our experiments have been carried out on a server with a 16-core Xeon processor (2.4 GHz) and 64 GB RAM. For a typical video containing 407 frames (13 s duration) and a frame size of 320 × 240 pixels, 23,748 dense (base) trajectories are detected, with a computation time of 2 min 15 s. The proposed technique to compute ordered trajectories from dense trajectories took only 0.6 s. The ordered trajectories detected thus were 13,060 in number, only about 50% of the actual dense trajectories. The code uses a single core only.

5.1.5. Influence of Different Encoding Techniques

The performance of ordered trajectories for the three different encoding techniques is shown in Table 3. Hard Assignment captures only frequency information of the visual words (of the codebook) for a given video. The VLAD encoding technique captures first order characteristics (deviations from the visual words). Fisher Vectors capture first and second order (covariance deviation) statistics. In a recent study on large scale image classification [39], Fisher Vectors have been found to perform best. The same can be observed from Table 3: Fisher Vectors perform best amongst the three encoding techniques. The proposed ‘ordered trajectories’ perform better than the base (dense) trajectories in all situations.

Table 3: Performance of ordered trajectories with different encodings.
Dataset   Approach         HA           VLAD         FV
HMDB51    Ordered traj.    47.3%        49.9%        53.8%
          Dense traj.      46.6% [2]    48.0% [3]    52.2% [33]
UCF50     Ordered traj.    85.5%        87.3%        89.4%
          Dense traj.      84.5% [2]    NA           88.6% [33]

HA encoded features use non-linear SVMs. From a computational point of view, learning non-linear SVMs takes nearly O(N²) to O(N³) operations, where N is the size of the training dataset. VLAD and FV encoded features use linear SVMs, whose learning cost is around O(N) only. Performance wise, we thus note that FV and VLAD encoding have an advantage over HA encoded features. This is particularly useful in datasets with a large number of action classes.

Figure 7: Original dense trajectories and corresponding ordered trajectories in a sample frame for (a) Fencing, (b) Pull up, (c) Push up, (d) Ride bike, (e) Shoot bow.

5.1.6. Comparison with Other Techniques

For a fair and consistent evaluation, the results are compared with the most recent studies in the literature. We also include a brief description of their methodology for ease of comparison. Studies were primarily divided based on the type of feature encoding employed.

1. HA based Encoding: The corresponding results are shown in Table 4. The performance of dense trajectories [2] on the UCF50 and HMDB51 datasets is 84.5% and 46.6%, respectively, as reported in [6]. The performance obtained by using the proposed ordered trajectories on these datasets is 85.5% and 47.3%, respectively. This is nearly 0.7-1% (absolute) better than for the actual dense trajectories. The results obtained by the Spatio-Temporal Pyramidal approach of dense trajectories are 85.6% and 48.3%, respectively. This is 0.1-1% (absolute) better than our proposed technique. However, it has to be noted that our results are obtained using only 5 channels while the Spatio-Temporal Pyramidal approach uses 30 channels. The length of the feature vectors in some of these channels can also be as high as 3-8 times that of those used in the original dense trajectories or the proposed ordered trajectories. Shi et al. 2013 [26] suggested that random selection of 10000 dense trajectories can yield 83.3% and 47.6%, respectively. With the proposed technique, we see that only less important information, such as on the background, is omitted, as shown in a few examples in Figure 7.

Table 4: Comparison with other HA encoded techniques.
Approach                       Brief description                                          UCF50    HMDB51    UCF101
Kuehne et al. 2011 [42]        C2 features inspired by visual cortex                      47.9%    22.8%     NA
Kliper-Gross et al. 2012 [5]   Motion interchange patterns                                72.6%    29.2%     NA
Khurram et al. 2012 [44]       Harris 3D corners                                          NA       NA        44.5%
Wang et al. 2011 [2]           Dense trajectories                                         84.5%    46.6%     NA
Jiang et al. 2012 [4]          Proposed dense trajectory variant + dense trajectories     NA       40.7%     NA
Wang et al. 2013 [6]           Dense trajectories in spatio-temporal pyramid              85.6%    48.3%     NA
Feng Shi et al. 2013 [26]      Random sampled 10000 dense trajectories                    83.3%    47.6%     NA
Liu et al. 2013 [22]           Pyramidal features                                         NA       49.3%     NA
Peng et al. 2013 [19]          Dense trajectories + spatial and contextual trajectories   NA       49.2%     NA
Zhu et al. [45]                Pyramid matching kernel over dense trajectories            NA       49.7%     NA
Proposed approach              Ordered trajectories                                       85.5%    47.3%     72.8% [46]

2. VLAD based Encoding: The corresponding results are shown in Table 5. Most recently, Jain et al. [3] employed the VLAD encoding on the HMDB51 dataset and reported a recognition rate of 52.1%. This can be due to two reasons: firstly, due to the combination of dense trajectory results and ω-trajectory (motion compensated dense trajectory) [3] results and, secondly, due to the VLAD encoding scheme. The same encoding scheme is used on the proposed ordered trajectory descriptors and included in the results in Table 5. Dense trajectories with VLAD encoding yield 48.0% [3] correct recognition, while ordered trajectories with VLAD encoding yield 49.9%. It can be observed that the results for the proposed ordered trajectories are 1.9% (absolute) better when VLAD encoding is used. Similar to the setup used in [3], we use the proposed ordered trajectories in conjunction with dense trajectories and obtain 53.3%.

Table 5: Comparison with other VLAD encoded techniques.
Approach               Brief description              HMDB51
Jain et al. 2013 [3]   Dense trajectories             48.0%
                       ω-trajectories                 47.7%
                       ω-traj. + dense traj.          52.1%
Proposed approach      Ordered (dense) traj.          49.9%
                       Ordered traj. + dense traj.    53.3%

5.2. Improved Dense Trajectories as Base Trajectories

This section presents results obtained by taking improved dense trajectories [33] as base trajectories. The corresponding results are shown in Table 6. Improved dense trajectories are an improved version of the dense trajectories, obtained by estimating camera motion through matching feature points between frames using SIFT descriptors and computing dense optical flow. The obtained matches are used to estimate a homography with RANSAC. Further, a human detector was used to separate motion due to humans from camera motion. The estimate is also used to cancel out possible camera motion from the optical flow. This technique has been shown to significantly improve the motion-based descriptors such as MBH and HOF. Oneata et al. [47] used only the MBH descriptors of improved trajectories in a spatio-temporal pyramid (6 volumes in total) with SURF descriptors from every tenth frame for action recognition. Both [33, 47] use FV encoding. Initially, we follow the experimental set-up described in [33] to compute the improved trajectories and obtain Fisher Vectors. Upon fusing (concatenating) the ‘ordered trajectories’ with the improved trajectories, we find that this results in better recognition rates than those reported for the improved trajectories alone, improving the recognition rates by about 1.5% (absolute). Details are shown in Table 6. As ordered trajectories are long duration trajectories, and improved trajectories are similar to the shorter base dense trajectories, we believe this supports our hypothesis that the information carried by the ordered trajectories is complementary. This result has been obtained using only the camera motion compensated improved trajectories, without any human detector.

Table 6: Comparison with improved dense trajectory based techniques.
Approach             Brief description                                                                  UCF50        HMDB51       UCF101
Oneata et al. [47]   Spatio-temporal pyramid of improved trajectories + SURF                           90.0%        54.8%        NA
Wang and Schmid      Improved trajectories without human detection                                      90.5% [33]   55.9% [33]   84.8% [48]
                     Improved trajectories with human detection                                         91.2% [33]   57.2% [33]   NA
                     Spatio-temporal pyramid of improved trajectories                                   NA           NA           85.9% [48]
Proposed approach    Ordered (from improved trajectories without human detection)                       91.8%        58.3%        85.3%
                     Ordered (dense) trajectories + improved trajectories (without human detection)     91.6%        58.5%        85.4% [49]
                     Ordered (improved) trajectories + improved trajectories (without human detection)  92.1%        58.8%        85.8%

5.2.1. On the Influence of L

The performance of the proposed ordered trajectories for different values of L is now discussed. The recognition rates obtained on the HMDB51 dataset for different values of L are shown in Table 7. Two observations are made. Firstly, the ordered trajectories are equal to or better than the improved dense trajectories for all values of L. This supports our hypothesis that longer duration (L + 1) trajectories are richer than base trajectories with duration L. Secondly, the best results are obtained for L = 15. This is in line with the observations made by Wang et al. [6] and their choice of L = 15 for general usage.

Table 7: Performance of trajectories for different values of L.
Approach                         L = 10    L = 15    L = 20    L = 25
Improved dense trajectories      55.4%     55.9%     56.7%     56.2%
Proposed ordered trajectories    55.6%     58.3%     56.9%     56.2%

5.2.2. Recursive Computation of Ordered Trajectories

In this section, we show some results that reinforce our hypothesis that trajectories of objects with longer duration (ordered trajectories) tend to contain richer information. All experiments for this section are conducted using improved dense trajectories without human detection [33]. We refer to OT_{L+1} as ordered trajectories of duration (L + 1) selected from improved dense trajectories; OT_{L+2} as ordered trajectories of duration (L + 2) selected from OT_{L+1}, and so on. Further, we define a ratio factor r_i as

r_i = N_{OT_{L+i}} / N_L    (15)

where N_{OT_{L+i}} is the number of ordered trajectories (OT_{L+i}) and N_L is the number of improved dense trajectories with length L.

Similar to a compression factor, a lower value of r_i tending towards 0 indicates a smaller number of ordered trajectories, yet yielding competitive performance to the base trajectories. The distribution of the ratios r_i (Eq. 15) for OT_{L+1}, OT_{L+2} and OT_{L+3} for the HMDB51 and UCF101 datasets is shown in Figures 8 and 9, respectively. At a first glance, it can be observed that in many videos, the number of ordered trajectories is about half the number of the base improved dense trajectories. And in some videos, the number of ordered trajectories is even less than half. This can be very useful, particularly in large data classifications. In those cases where only a few ordered trajectories are detected, we observe that the base trajectories are themselves few in number or detected in distant segments. Surely, in such cases the chance of finding longer duration trajectories is low. Moreover, in some classes such as ‘Playing flute’, where there is hardly any motion detected at all, the base trajectories are very scant.

Figure 8: Ratio r_i for Left: OT_{L+1}, Middle: OT_{L+2} and Right: OT_{L+3} for the HMDB51 dataset.

Figure 9: Ratio r_i for Left: OT_{L+1}, Middle: OT_{L+2} and Right: OT_{L+3} for the UCF101 dataset.

Further, it can be observed that the number of videos with a ratio r_i < 0.5 tends to increase (i.e. the number of longer duration trajectories decreases) when the proposed technique is applied recursively. For example, the number of OT_{L+3} is lower than the number of OT_{L+2}, which in turn is lower than the number of OT_{L+1}. However, ordered trajectories are observed to perform better than the actual baseline trajectories, as shown in Table 8. Moreover, when used in conjunction with base trajectories, they are complementary and an improvement in performance is observed in all three datasets. These experiments quantitatively support that objects with longer trajectories contain richer information than base trajectories of fixed length. The performance of OT_{L+2} is better than base trajectories by 0.1-1.9% (absolute), but slightly lower than OT_{L+1} by 0.3-0.5% (absolute). This indicates that our proposed selection of ordered trajectories – from base or ordered trajectories – is valid and indicative of the selection of a richer set of trajectories.

Table 8: Performance of OT_{L+1} and OT_{L+2}.
Approach                        HMDB51       UCF50        UCF101
Base (improved dense) traj.     55.9% [33]   90.5% [33]   84.8% [48]
OT_{L+1}                        58.3%        91.8%        85.3%
OT_{L+1} + base traj.           58.8%        92.1%        85.8%
OT_{L+2}                        57.8%        91.5%        84.9%
OT_{L+2} + base traj.           59.1%        92.0%        85.9%

5.3. Relation with Moving Object Tracking

Our hypothesis of searching for longer duration trajectories of moving objects has close similarities with the domain of moving object tracking. Moving object tracking comprises obtaining 'tracks' for salient objects of interest and later 'linking' them – in the event of occlusion, loss of tracking, etc. – to form longer object trajectories. However, our proposed technique is novel in two respects:

• the choice of 'tracks', and
• the choice of 'linking',

as discussed in turn below.

Table 9: Body part trajectory based recognition rates (source: [50]).

Approach              | UCF50 | UCF101 | HMDB51
Body parts            | 72.5% | 58.9%  | 29.1%
Body parts + traj.    | 72.7% | 60.3%  | 31.2%
Improved dense traj.  | 90.5% | 84.8%  | 55.9%

Generally, in moving object tracking, 'tracks' are obtained for objects that are initialised at the beginning or selected by suitable saliency/interest techniques. In human action recognition, the human body and its parts are usually the most salient objects of interest. In [50], Murthy et al. detect human body parts in each frame and link the detected parts across frames using the Hungarian algorithm [31] to yield longer trajectories, so-called body part trajectories; Fisher vector encoding was then used to construct the feature vectors. The resulting performance is shown in Table 9. On all three benchmark datasets, improved dense trajectories were found to perform better than body parts and their trajectories. Hence, our choice of dense trajectories as the base trajectories from which longer duration trajectories are found can be expected to perform better than general moving object tracking. Moreover, dense trajectories [6] have also been shown to perform better than KLT trajectories obtained by tracking 100 interest points per frame with the KLT tracker [1].

A popular technique for linking 'tracks' of moving objects is to choose a suitable cost function and an optimisation algorithm that minimises it. Cost functions are generally based on temporal, kinematic and appearance cues [29], colour histogram matching [51], or mean colour matching [52] of the tracks. Our cost function is based on kinematics only (object positions). As we cannot manually initialise any particular object as the primary object of interest, we do not include colour or appearance in the current version of our work on longer trajectories. In moving object tracking, by contrast, the primary objects/targets of interest are initialised in the first frame and/or in consecutive frames from detections formed by background subtraction or frame differencing, so appearance and colour features are very useful for locating and linking objects across consecutive frames in that domain, but not in the current application.
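For illustration, a minimal sketch of such a purely kinematic linking step is given below, using the Hungarian solver available in SciPy. The track fields, the 5-frame gap and the 20-pixel gating threshold are illustrative assumptions rather than parameters of our method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian/Munkres solver [31]

def link_tracks(ended, started, max_gap=5, max_dist=20.0):
    """Match tracks that end to tracks that start shortly afterwards, using only positions.
    ended/started: lists of dicts with 'last_frame'/'first_frame' and 'last_pt'/'first_pt' (x, y)."""
    cost = np.full((len(ended), len(started)), 1e6)            # large cost = forbidden pair
    for i, e in enumerate(ended):
        for j, s in enumerate(started):
            gap = s["first_frame"] - e["last_frame"]
            if 0 < gap <= max_gap:                             # temporal compatibility
                d = float(np.hypot(s["first_pt"][0] - e["last_pt"][0],
                                   s["first_pt"][1] - e["last_pt"][1]))
                if d <= max_dist:                              # spatial gating
                    cost[i, j] = d                             # kinematic (position-only) cost
    rows, cols = linear_sum_assignment(cost)                   # optimal one-to-one matching
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]

# Toy usage: one track ends at frame 30, another starts at frame 33 nearby -> they are linked.
ended = [{"last_frame": 30, "last_pt": (120.0, 85.0)}]
started = [{"first_frame": 33, "first_pt": (124.0, 88.0)}]
print(link_tracks(ended, started))   # -> [(0, 0)]
```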

For a typical video of 407 frames (duration 13 s) and the same computation conditions as in Section 5.1.4, 1004 body part interest points were detected in the static video frames. Using the Hungarian algorithm, 4915 body part trajectories were generated in 39.3 s with the technique proposed by Murthy et al. [50]. In contrast, the proposed search algorithm generated the longer duration trajectories from the original 23748 dense trajectories in only 0.6 s, demonstrating the computational efficiency of our algorithm.

5.4. Future Work

Recently, Basharat et al. [53] reported a considerable gain in computational efficiency by proposing a greedy assignment algorithm that is significantly faster than the Hungarian algorithm. In future work, we will investigate its feasibility for our application. Furthermore, local descriptors such as HOG are appearance based, so matching HOG descriptors across trajectories would effectively integrate appearance into the cost function; this will also be investigated in future work.
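For illustration only, the sketch below shows a generic greedy assignment that repeatedly commits the cheapest remaining (track, detection) pair; it follows the general idea of greedy matching rather than the specific algorithm of [53], and the cost threshold is an illustrative assumption.

```python
import numpy as np

def greedy_assign(cost, max_cost=20.0):
    """Greedy alternative to the Hungarian step: commit the cheapest remaining pair first.
    cost: (num_tracks, num_detections) matrix; returns a list of (track, detection) pairs."""
    if cost.size == 0:
        return []
    cost = cost.astype(float).copy()
    pairs = []
    while np.isfinite(cost).any():
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        if cost[i, j] > max_cost:          # nothing affordable left
            break
        pairs.append((int(i), int(j)))
        cost[i, :] = np.inf                # each track used at most once
        cost[:, j] = np.inf                # each detection used at most once
    return pairs

# Toy usage: two tracks, two detections.
c = np.array([[2.0, 9.0],
              [8.0, 3.0]])
print(greedy_assign(c))   # -> [(0, 0), (1, 1)]
```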

6. Conclusions

A technique for capturing the motion characteristics of objects that move for longer durations has been proposed. It computes base trajectories and selects from them the subset that persists beyond the fixed tracking length; these selected trajectories are termed ordered trajectories. Local descriptors computed along ordered trajectories are found to yield better performance than those obtained from the original base trajectories. Moreover, the proposed technique is also found to effectively remove information captured by the dense trajectories that does not belong to the main objects of interest.

References

[1] B. D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.
[2] H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Action recognition by dense trajectories, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[3] M. Jain, H. Jégou, P. Bouthemy, Better exploiting motion for better action recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[4] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, C.-W. Ngo, Trajectory-based modeling of human actions with motion reference points, in: European Conference on Computer Vision (ECCV), 2012, pp. 425–438.
[5] O. Kliper-Gross, Y. Gurovich, T. Hassner, L. Wolf, Motion Interchange Patterns for Action Recognition in Unconstrained Videos, in: European Conference on Computer Vision (ECCV), 2012, pp. 256–269.
[6] H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Dense trajectories and motion boundary descriptors for action recognition, International Journal of Computer Vision 103 (1) (2013) 60–79.
[7] I. Laptev, T. Lindeberg, Space-Time Interest Points, in: International Conference on Computer Vision (ICCV), 2003, pp. 432–439.
[8] P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behaviour recognition via sparse spatio-temporal features, in: IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[9] G. Willems, T. Tuytelaars, L. Van Gool, An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector, in: European Conference on Computer Vision (ECCV), 2008, pp. 650–663.
[10] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action recognition, in: British Machine Vision Conference (BMVC), 2009, pp. 124.1–124.11.
[11] I. Laptev, M. Marszałek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[12] S. Lazebnik, C. Schmid, J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006, pp. 2169–2178.
[13] M. M. Ullah, S. N. Parizi, I. Laptev, Improving Bag-of-Features Action Recognition with Non-Local Cues, in: British Machine Vision Conference (BMVC), 2010, pp. 95.1–95.11.
[14] H. Uemura, S. Ishikawa, K. Mikolajczyk, Feature tracking and motion compensation for action recognition, in: British Machine Vision Conference (BMVC), 2008, pp. 30.1–30.10.
[15] J. Sun, X. Wu, S. Yan, L.-F. Cheong, T.-S. Chua, J. Li, Hierarchical spatio-temporal context modeling for action recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 2004–2011.
[16] R. Messing, C. Pal, H. Kautz, Activity recognition using the velocity histories of tracked keypoints, in: International Conference on Computer Vision (ICCV), 2009, pp. 104–111.
[17] P. Matikainen, M. Hebert, R. Sukthankar, Representing pairwise spatial and temporal relations for action recognition, in: European Conference on Computer Vision (ECCV), 2010, pp. 508–521.
[18] M. Raptis, Discovering discriminative action parts from mid-level video representations, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1242–1249.
[19] X. Peng, Y. Qiao, Q. Peng, X. Qi, Exploring motion boundary based sampling and spatial-temporal context descriptors for action recognition, in: British Machine Vision Conference (BMVC), 2013.
[20] Z. Zhao, A. Elgammal, Human activity recognition from frame's spatiotemporal representation, in: International Conference on Pattern Recognition (ICPR), 2008, pp. 1–4.
[21] L. Duan, D. Xu, I. W.-H. Tsang, J. Luo, Visual event recognition in videos by learning from web data, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9) (2012) 1667–1680.
[22] L. Liu, L. Shao, X. Zhen, X. Li, Learning discriminative key poses for action recognition, IEEE Transactions on Cybernetics 43 (6) (2013) 1860–1870.
[23] E. Nowak, F. Jurie, B. Triggs, Sampling strategies for bag-of-features image classification, in: European Conference on Computer Vision (ECCV), 2006, pp. 490–503.
[24] E. Vig, M. Dorr, D. Cox, Space-Variant Descriptor Sampling for Action Recognition Based on Saliency and Eye Movements, in: European Conference on Computer Vision (ECCV), 2012, pp. 84–97.
[25] B. Chakraborty, M. B. Holte, T. B. Moeslund, J. González, Selective spatio-temporal interest points, Computer Vision and Image Understanding 116 (3) (2012) 396–410.
[26] F. Shi, E. Petriu, R. Laganière, Sampling strategies for real-time action recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[27] S. Zhang, J. Wang, Z. Wang, Y. Gong, Y. Liu, Multi-target tracking by learning local-to-global trajectory models, Pattern Recognition 48 (2) (2015) 580–590.
[28] B. Wu, R. Nevatia, Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors, International Journal of Computer Vision 75 (2) (2007) 247–266.
[29] A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, W. Hu, Multi-object tracking through simultaneous long occlusions and split-merge conditions, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2006, pp. 666–673.
[30] C. Stauffer, Estimating tracking sources and sinks, in: IEEE Computer Vision and Pattern Recognition Workshop, Vol. 4, 2003, pp. 35–35.
[31] J. Munkres, Algorithms for the Assignment and Transportation Problems, Journal of the Society for Industrial and Applied Mathematics 5 (1) (1957) 32–38.
[32] J. Xing, H. Ai, S. Lao, Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1200–1207.

[33] H. Wang, C. Schmid, Action Recognition with Improved Trajectories, in: International Conference on Computer Vision (ICCV), 2013.
[34] J. Shi, C. Tomasi, Good features to track, in: IEEE Computer Vision and Pattern Recognition (CVPR), 1994, pp. 593–600.
[35] G. Farnebäck, Two-frame motion estimation based on polynomial expansion, in: Scandinavian Conference on Image Analysis (SCIA), 2003, pp. 363–370.
[36] L. Liu, L. Wang, X. Liu, In defense of soft-assignment coding, in: International Conference on Computer Vision (ICCV), 2011, pp. 2486–2493.
[37] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1794–1801.
[38] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3360–3367.
[39] F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher kernel for large-scale image classification, in: European Conference on Computer Vision (ECCV), 2010.
[40] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1–27:27.
[41] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008) 1871–1874.
[42] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: A large video database for human motion recognition, in: International Conference on Computer Vision (ICCV), 2011.
[43] K. K. Reddy, M. Shah, Recognizing 50 Human Action Categories of Web Videos, Machine Vision and Applications 24 (2012) 971–981.
[44] K. Soomro, A. R. Zamir, M. Shah, UCF101: A Dataset of 101 Human Action Classes from Videos in the Wild, in: CRCV-TR-12-01, 2012.
[45] J. Zhu, Q. Zhou, W. Zou, R. Zhang, W. Zhang, A generalized pyramid matching kernel for human action recognition in realistic videos, Sensors 13 (11) (2013) 14398–14416.
[46] O. V. Ramana Murthy, R. Goecke, Ordered Trajectories for Large Scale Human Action Recognition, in: International Conference on Computer Vision Workshops (ICCV), 2013.
[47] D. Oneata, J. Verbeek, C. Schmid, Action and Event Recognition with Fisher Vectors on a Compact Feature Set, in: International Conference on Computer Vision (ICCV), 2013.
[48] H. Wang, C. Schmid, LEAR-INRIA submission for the THUMOS workshop, in: THUMOS: ICCV Challenge on Action Recognition with a Large Number of Classes, Winner, 2013.
[49] O. V. R. Murthy, R. Goecke, Combined Ordered and Improved Trajectories for Large Scale Human Action Recognition, in: THUMOS: ICCV Challenge on Action Recognition with a Large Number of Classes, Second runner-up, 2013.
[50] O. V. R. Murthy, I. Radwan, R. Goecke, Dense Body Part Trajectories for Human Action Recognition, in: International Conference on Image Processing (ICIP), 2014.
[51] R. Kaucic, A. Perera, G. Brooksby, J. Kaufhold, A. Hoogs, A unified framework for tracking through occlusions and across sensor gaps, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, 2005, pp. 990–997.
[52] T. Huang, S. Russell, Object identification: A Bayesian analysis with application to traffic surveillance, Artificial Intelligence 103 (1–2) (1998) 77–93.
[53] A. Basharat, M. Turek, Y. Xu, C. Atkins, D. Stoup, K. Fieldhouse, P. Tunison, A. Hoogs, Real-time multi-target tracking at 210 megapixels/second in wide area motion imagery, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2014, pp. 839–846.
