An Online Approach: Learning-Semantic-Scene-by-Tracking and Tracking-by-Learning-Semantic-Scene

Xuan Song†, Xiaowei Shao‡, Huijing Zhao†, Jinshi Cui†, Ryosuke Shibasaki‡ and Hongbin Zha†

† Key Laboratory of Machine Perception (MoE), Peking University, China
songxuan,zhaohj,cjs,[email protected]

‡ Center for Spatial Information Science, University of Tokyo, Japan
[email protected], [email protected]


Abstract


Learning the structure of a scene and tracking a large number of targets have both been active topics in computer vision in recent years, and both play a crucial role in surveillance, activity analysis, and object classification. In this paper, we propose a novel system that performs semantic scene learning and tracking simultaneously and makes them supplement each other in one framework. The trajectories obtained by tracking are utilized to continually learn and update the scene knowledge via online unsupervised learning. In turn, the learned scene knowledge is utilized to supervise and improve the tracking results. This "adaptive learning-tracking loop" not only performs robust tracking in high-density crowd scenes, dynamically updates the knowledge of the scene structure, and outputs semantic words, but also ensures that the entire process is completely automatic and online. We successfully applied the proposed system to a JR subway station in Tokyo, where it dynamically obtains the semantic scene structure and robustly tracks more than 150 targets at the same time.


Figure 1. How can we maintain correct tracking in a high-density scene? This is the JR subway station in Tokyo, and the data was obtained by eight single-row laser scanners. The green points are the background, the blue ones are the foreground, and the red ones show the positions of the single-row laser scanners. In this setting, each person is represented by several points. For more details about the experimental site, please refer to [32].

knowledge cannot be dynamically learned and updated in existing work. On the other hand, tracking is the basis of surveillance and plays a crucial role in any kind of monitoring task. However, it becomes especially challenging in a high-density crowd scene such as the one shown in Fig. 1. In addition, due to the real-time nature of many surveillance applications, it is very desirable to have a completely online and automatic system that runs robustly and requires little human intervention. Therefore, the purpose of this paper is to develop a system that not only performs semantic scene learning and tracking simultaneously, but also solves both problems mentioned above in one framework.

1. Introduction The surveillance task is to monitor the activities of persons in a scene, which requires low-level detection, tracking, and classification as well as high-level activity analysis. Both the low-level and high-level tasks can be improved with knowledge of the scene structure (e.g., crowd flows, dominant paths, entries and exits): for instance, "persons and cars move on different roads", "persons only appear/disappear at entries/exits", and "persons in a crowd flow can only follow the other people in it." A statistical scene model can provide a priori knowledge about where, when, and what types of activities occur. However, the scene

The overall system and the key idea of this research are depicted in Fig. 2. The tracking module provides us with a large number of trajectories. Thus, the knowledge of the scene structure (e.g., dynamic properties vs. static properties) can be dynamically learned and updated via an


Figure 2. Overview of the proposed system.

online unsupervised learning method. The learned statistical scene model in turn supervises the tracking module and makes its results increasingly accurate. This mode of co-operation between tracking and learning therefore becomes "an adaptive loop", which not only dynamically reflects changes in the scene structure, but also solves the tough problems encountered in tracking. Moreover, the entire process is completely online and automatic and requires no human intervention. The main contributions of this paper can be summarized as follows: (1) We develop a unified framework that couples semantic scene learning and tracking and makes them supplement each other. (2) We apply an online learning algorithm [26] to trajectory analysis, making this task online for the first time. (3) We are the first to apply an online system that can robustly track more than 150 targets at the same time in a real scene (a JR subway station in Tokyo). The remainder of this paper is structured as follows: in the following section, related work is briefly reviewed. Sections 3 and 4 provide the details of Learning-Semantic-Scene-by-Tracking and Tracking-by-Learning-Semantic-Scene. Experiments and results are presented in Section 5 and the paper is summarized in Section 6.

2. Related Work Multiple target tracking (MTT) has been studied extensively, and an in-depth review of the tracking literature can be found in a recent survey by Yilmaz et al. [30]. Typically, multi-target tracking is solved through data association [2], which includes methods with linear complexity [11] and methods with exponential complexity: the Multi-Hypothesis Tracker (MHT) [22], the Joint Probabilistic Data Association Filter (JPDAF) [2, 9, 21], Monte Carlo based JPDA algorithms (MC-JPDAF) [23, 25], and Markov chain Monte Carlo data association (MCMC-DA) [15, 31]. In addition, MTT encounters great difficulties when interactions or occlusions among targets frequently take place. Thus, researchers have also focused on how to model the interactions and solve the "merge/split" problems in MTT; representative publications include [4, 12, 16, 17, 19, 20, 29, 33]. However, most of the methods mentioned above

are difficult to apply to tracking hundreds of targets in a high-density crowd scene. For tracking a large number of targets in crowded environments, Betke et al. [3] propose two cluster-based data association approaches that are linear in the number of detections and tracked objects, but this method is difficult to use in human-based surveillance applications. Ali et al. [1] propose a floor-fields-based method for tracking persons in crowded scenes, which is closely related to our work. The main difference between their method and ours is that computing the dynamic floor field in [1] at a particular time period requires future information, so it is not a completely online approach. Moreover, the proposed system is not only a tracking system but also a semantic scene learning system, which is quite different from [1]. On the other hand, in recent years, learning semantic scenes based on trajectory analysis has received increasing attention in various surveillance applications; representative publications include [8, 13, 14, 18, 27, 28]. However, all these methods are batch methods, i.e., the clustering of trajectories is obtained after all the data has been collected, and the cluster structure cannot change as a function of time. Hence, they are difficult to apply in an online, real-time system. Therefore, in this paper, we propose a novel system that simultaneously outputs robustly tracked trajectories of pedestrians and semantic scene knowledge in both its dynamic (crowd flows) and static (paths, exits/entrances) aspects. To our knowledge, there has been no system that performs the two tasks simultaneously in a unified approach in an online and automatic manner.

3. Learning Semantic Scene by Tracking With the help of tracking, it is easy to obtain a large number of trajectories. We can use these trajectories to explore the knowledge of the scene at a specific time. Firstly, we cluster these trajectories online based on the different types of activities. The intuition is simple: as shown in Fig. 3(a), at a specific time in a subway station, a large number of persons who had just got off a train were walking together to catch another train, which would


Figure 3. Learning semantic scene by tracking. Once we obtained the tracking results (Fig. a), these trajectories were clustered into different crowd flows via online unsupervised learning (Fig. b). Then, we could learn the scene knowledge: (1) dynamic properties: the density distribution (Fig. c) and velocity distribution (Fig. d) of the crowd flow; (2) static properties: walk paths and sinks/sources (Fig. f). Note that the arrows in Fig. d show the principal orientation at each position. Please see the text for more details.

become a crowd flow and can be seen as one type of activity. Secondly, we extract the knowledge of the scene from these clusters. Scene knowledge contains two parts: dynamic properties (e.g., information about crowd flows) and static properties (e.g., walk paths and sinks/sources). This scene knowledge can greatly help the tracking, and the overall pipeline is illustrated in Fig. 3. In this section, we provide the details of these items.

3.1. Online clustering Problem Formulation and Overview: At time $t$, person $i$ is represented by $\mathbf{x}_i(t) = (x_i, y_i, v_i^x, v_i^y)$, where $(x_i, y_i)$ is the position and $(v_i^x, v_i^y)$ the velocity. With the help of the trackers, we obtain $N$ trajectories $\{L_i(t)\}_{i=1}^N$, where $L_i(t) = \{\mathbf{x}_i(t) : t = 1, \ldots, T\}$. We wish to cluster these trajectories into $n$ clusters $\{S_j(t)\}_{j=1}^n$ at a specific time $t$. In order to dynamically reflect changes in the scene information, all the clustering should be online. Inspired by Vidal [26], we consider each cluster $S_j(t)$ as a moving hyperplane. Thus, we model a union of $n$ hyperplanes in $\mathbb{R}^D$, where $S_j(t) = \{\mathbf{x} \in \mathbb{R}^D : \mathbf{b}_j^\top(t)\mathbf{x} = 0\}$, $j = 1, \ldots, n$, with $\mathbf{b}_j(t) \in \mathbb{R}^D$, as the zero set of a polynomial with time-varying coefficients estimated using normalized gradient descent. The hyperplane normals are then estimated from the derivatives of this polynomial at each trajectory. Lastly, the trajectories are grouped by clustering their associated normal vectors. Algorithm Details: Given a point $\mathbf{x}(t)$ in one of the hyperplanes $S_j(t)$, there is a vector $\mathbf{b}_j(t)$ normal to it such that $\mathbf{b}_j^\top(t)\mathbf{x}(t) = 0$. Thus, the following homogeneous polynomial of degree $n$ in $D$

variables must vanish at $\mathbf{x}(t)$:
$$p_n(\mathbf{x}(t), t) = (\mathbf{b}_1^\top(t)\mathbf{x}(t))(\mathbf{b}_2^\top(t)\mathbf{x}(t)) \cdots (\mathbf{b}_n^\top(t)\mathbf{x}(t)) = 0. \quad (1)$$
This homogeneous polynomial can be written as a linear combination of all the monomials of degree $n$ in $\mathbf{x}$, $\mathbf{x}^I = x_1^{n_1} x_2^{n_2} \cdots x_D^{n_D}$ with $n_1 + n_2 + \cdots + n_D = n$, as
$$p_n(\mathbf{x}, t) = \sum e_{n_1,\ldots,n_D}(t)\, x_1^{n_1} \cdots x_D^{n_D} = \mathbf{e}(t)^\top \mu_n(\mathbf{x}) = 0, \quad (2)$$
where $e_I(t) \in \mathbb{R}$ represents the coefficient of the monomial $\mathbf{x}^I$. The map $\mu_n : \mathbb{R}^D \to \mathbb{R}^{M_n(D)}$ is known as the Veronese map [10] of degree $n$, where $I$ is chosen in the degree-lexicographic order and $M_n(D)$ is the total number of independent monomials. Thanks to the polynomial equation (2), we can perform online hyperplane clustering by operating on the polynomial coefficients $\mathbf{e}(t)$ rather than on the normal vectors $\{\mathbf{b}_j(t)\}_{j=1}^n$, because $\mathbf{e}(t)$ does not depend on which hyperplane each $\mathbf{x}(t)$ belongs to. Therefore, at each time $t$, we seek an estimate $\hat{\mathbf{e}}(t)$ of $\mathbf{e}(t)$ that minimizes

$$f(\mathbf{e}(t)) = \frac{1}{N} \sum_{\kappa=1}^{t} \sum_{i=1}^{N} \left(\mathbf{e}(\kappa)^\top \mu_n(\mathbf{x}_i(\kappa))\right)^2. \quad (3)$$
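As an illustration of the Veronese embedding $\mu_n$ used in Eqs. (2)-(3), the following is a minimal sketch; the `veronese_map` helper and the chosen state dimensions are our own illustration, not the authors' code:

```python
import itertools
import math

import numpy as np

def veronese_map(x, n):
    """Stack all monomials x1^n1 * ... * xD^nD with n1 + ... + nD = n,
    in degree-lexicographic order (the Veronese map of degree n)."""
    D = len(x)
    exps = [e for e in itertools.product(range(n + 1), repeat=D) if sum(e) == n]
    exps.sort(reverse=True)  # degree-lexicographic order, x1^n first
    return np.array([np.prod(np.power(x, e)) for e in exps])

# For D = 4 (a state (x, y, vx, vy)) and n = 2 there are
# M_n(D) = C(n + D - 1, D - 1) = 10 independent monomials.
x = np.array([1.0, 2.0, 3.0, 4.0])
v = veronese_map(x, 2)
assert len(v) == math.comb(2 + 4 - 1, 4 - 1)  # 10
```

A coefficient vector $\mathbf{e}(t)$ of the same length then evaluates $p_n$ as a single dot product, which is what makes the recursive minimization of Eq. (3) practical.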

By using normalized gradient descent, we obtain the following recursive identifier
$$\hat{\mathbf{e}}(t+1) = \hat{\mathbf{e}}(t)\cos(\|\mathbf{v}(t)\|) + \frac{\mathbf{v}(t)}{\|\mathbf{v}(t)\|}\sin(\|\mathbf{v}(t)\|), \quad (4)$$
where the negative normalized gradient is computed as
$$\mathbf{v}(t) = -\beta\,\left(I_{M_n(D)} - \hat{\mathbf{e}}(t)\hat{\mathbf{e}}^\top(t)\right) \times \frac{\sum_{i=1}^{N} \left(\hat{\mathbf{e}}^\top(t)\mu_n(\mathbf{x}_i(t))\right)\mu_n(\mathbf{x}_i(t))/N}{1 + \beta \sum_{i=1}^{N} \|\mu_n(\mathbf{x}_i(t))\|^2/N}, \quad (5)$$
where $\beta > 0$ is a fixed parameter. Once we obtain the estimate of $\mathbf{e}(t)$, it is easy to estimate the normal vector to the hyperplane containing a trajectory point $\mathbf{x}(t)$ as
$$\hat{\mathbf{b}}(\mathbf{x}(t)) = \frac{D\mu_n^\top(\mathbf{x}(t))\,\hat{\mathbf{e}}(t)}{\|D\mu_n^\top(\mathbf{x}(t))\,\hat{\mathbf{e}}(t)\|}, \quad (6)$$

where $D\mu_n(\mathbf{x})$ is the Jacobian of $\mu_n$ at $\mathbf{x}$.
Now we have obtained the estimate $\hat{\mathbf{b}}(\mathbf{x}_i(t))$ of the normal to the hyperplane passing through each of the $N$ trajectories $\{\mathbf{x}_i(t) \in \mathbb{R}^D\}_{i=1}^N$ at each time instant. The next step is to cluster these normals into $n$ groups, which can be done by a recursive K-means algorithm. Essentially, we seek the unit normal vectors $\hat{\mathbf{b}}_j(t) \in \mathbb{S}^{D-1}$ and the group indicators $\phi_{ij}(t) \in \{0,1\}$ of trajectory $i$ to hyperplane $j$ by maximizing
$$f(\{\phi_{ij}(t)\}, \{\hat{\mathbf{b}}_j(t)\}) = \sum_{i=1}^{N} \sum_{j=1}^{n} \phi_{ij}(t)\,\left(\hat{\mathbf{b}}_j^\top(t)\hat{\mathbf{b}}(\mathbf{x}_i(t))\right)^2. \quad (7)$$
Vidal has proved that the recursive identifier (4)-(6) provides $L_2$-stable estimates of the parameters and that $n$ can be a variable number; for the details of this proof, please refer to [26]. Thus, the overall online clustering algorithm can be summarized as follows:

Online Clustering Algorithm
Input: $N$ persons' states $\{\mathbf{x}_i(t)\}_{i=1}^N$ at time $t$.
Output: The group indicators $\{\phi_{ij}(t) : i = 1, \ldots, N,\ j = 1, \ldots, n\}$ at time $t$, where $n$ is a variable group number.
Initialization:
1. Randomly choose $\{\hat{\mathbf{b}}_j(1)\}_{j=1}^n$ and $\hat{\mathbf{e}}(1)$.
For each time $t \ge 1$:
1. Update the coefficients of $\hat{p}_n(\mathbf{x}_i(t), t) = \hat{\mathbf{e}}(t)^\top \mu_n(\mathbf{x}_i(t))$ via Eqs. (4)-(5).
2. Solve for the normal vectors $\hat{\mathbf{b}}(\mathbf{x}_i(t))$ at the given trajectories via Eq. (6), $i = 1, \ldots, N$.
3. Cluster the normal vectors using K-means:
   While ($\phi_{ij}(t)$ has not converged) {
   (a) Set $\phi_{ij}(t) = 1$ if $j = \arg\max_{k=1,\ldots,n} (\hat{\mathbf{b}}_k^\top(t)\hat{\mathbf{b}}(\mathbf{x}_i(t)))^2$ and $0$ otherwise, $i = 1, \ldots, N$, $j = 1, \ldots, n$.
   (b) Set $\hat{\mathbf{b}}_j(t) = PCA([\phi_{1j}(t)\hat{\mathbf{b}}(\mathbf{x}_1(t)) \ldots \phi_{Nj}(t)\hat{\mathbf{b}}(\mathbf{x}_N(t))])$, $j = 1, \ldots, n$.
   }
   Set $\hat{\mathbf{b}}_j(t+1) = \hat{\mathbf{b}}_j(t)$.

3.2. Learning scene knowledge
Dynamic Properties: Once we obtain the clustering results, it is easy to estimate the spatial extent of each cluster. Each cluster $S_j(t)$ can be seen as a crowd flow, and we estimate its density and velocity distribution over its region. For cluster $S_j(t)$, the density distribution at position $(x, y)$ at time $t$ is estimated as
$$\mathfrak{D}_{S_j(t)}(x, y, t) = \sum_{L_i(t) \in S_j(t)} \sum_{(x_i, y_i) \in L_i(t)} \exp\left(-\|(x - x_i, y - y_i)\|^2/\eta_d\right), \quad (8)$$
where $\eta_d$ is a constant parameter. An example of the density distribution is shown in Fig. 3(c), where the color displays the density value. The velocity distribution is based on the principal component of the flow orientation and can be built as
$$\mathfrak{V}_{S_j(t)}(x, y, t) = \exp\left(-\langle (\hat{v}_x(x, y), \hat{v}_y(x, y)),\ (\cos(\alpha^*_{S_j(t)}(x, y)), \sin(\alpha^*_{S_j(t)}(x, y))) \rangle / \eta_v\right), \quad (9)$$
where $\langle \cdot, \cdot \rangle$ stands for the dot product, $\eta_v$ is a constant parameter, $(\hat{v}_x(x, y), \hat{v}_y(x, y))$ is the velocity expectation of cluster $S_j(t)$ at a particular position, and $\alpha^*_{S_j(t)}(x, y)$ is the principal component in the distribution of the flow orientation:
$$p(\alpha_{S_j(t)}(x, y)) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(\alpha_{S_j(t)}(x, y); \mu_m, \sigma_m), \quad (10)$$
where equation (10) is a Gaussian mixture model whose parameters $\pi_m$, $\mu_m$, $\sigma_m$ can be obtained through EM iterations. An example of equation (9) is shown in Fig. 3(d); the color denotes the speed, and the arrows display the principal orientation. The density and velocity distributions of each crowd flow reflect the dynamic properties of the scene. They can provide a motion prior for the targets that are in a particular crowd flow.
Static properties (semantic words): A particular scene should have some constant properties, such as dominant paths, exits, and entrances. We should output these semantic words. They are easy to obtain from the global density distribution (as shown in Fig. 3(e)). The global density distribution is similar to the crowd flow density, but it reflects properties of the whole scene and can be computed as
$$\mathfrak{D}_{global}(x, y, t) = \sum_{L_i(t) \in \Omega} \sum_{(x_i, y_i) \in L_i(t)} \exp\left(-\|(x - x_i, y - y_i)\|^2/\eta_d\right), \quad (11)$$


Figure 4. Tracking by learning semantic scene. We wanted to track target A in frame 200 (Fig. a). We detected that it was in the yellow crowd flow (Fig. b). Then the density distribution (Fig. c) and velocity distribution (Fig. d) of this crowd flow were computed. These two distributions gave prior knowledge about the motion of target A. Hence, we could easily obtain the tracking results of target A with their help (Fig. e).

where $\Omega$ is the set of all the trajectories we have obtained up to time $t$. As shown in Fig. 3(e)(f), as time proceeds, the dominant paths of the scene are easily extracted by thresholding the global density distribution. On the other hand, the exits and entrances of the scene are two very interesting scene properties, called sinks/sources. This scene knowledge can powerfully assist the tracking in dealing with the appearing or disappearing of targets. The sinks/sources can be easily detected from the global density distribution $\mathfrak{D}_{global}$. As shown in Fig. 3(f), the sinks/sources usually occur in regions of great change in the global density distribution after thresholding. Moreover, the direction of change must follow the principal orientation of the crowd flow. Hence, the sinks/sources can be found by a gradient search along the principal orientation of each crowd flow, and the obtained results are shown in Fig. 3(f). In summary, with the help of tracking, we learn online: (1) dynamic properties: the density distribution $\mathfrak{D}$ and velocity distribution $\mathfrak{V}$ of the crowd flows; (2) static properties (semantic words): dominant paths and sinks/sources. All of this knowledge is very helpful for high-level activity analysis and for low-level tracking or classification. In the next section, we utilize this information to dynamically supervise and improve the tracking results.
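The global density map of Eq. (11) and the path/sink extraction described above can be sketched as follows; the kernel bandwidth `eta_d` and both thresholds are illustrative assumptions, since the paper does not specify their values:

```python
import numpy as np

def global_density(trajectories, xs, ys, eta_d=4.0):
    """Eq. (11): accumulate a Gaussian kernel at every trajectory point.
    `trajectories` is a list of sequences of (x, y) points; `eta_d` is an
    illustrative bandwidth."""
    gx, gy = np.meshgrid(xs, ys)                # grid of query positions
    dens = np.zeros_like(gx, dtype=float)
    for traj in trajectories:
        for xi, yi in traj:
            dens += np.exp(-((gx - xi) ** 2 + (gy - yi) ** 2) / eta_d)
    return dens

def paths_and_sinks(dens, flow_dir, tau, grad_thresh):
    """Dominant paths by thresholding the density (Fig. 3(e)); candidate
    sinks/sources where the density gradient projected onto the crowd
    flow's principal orientation is large (Fig. 3(f))."""
    paths = dens > tau                          # dominant-path mask
    gy, gx = np.gradient(dens)                  # numerical density gradient
    directional = gx * flow_dir[0] + gy * flow_dir[1]
    sinks = np.abs(directional) > grad_thresh
    return paths, sinks
```

Thresholding `dens` recovers the dominant-path mask of Fig. 3(e); the directional-gradient test is one plausible realization of the gradient search along the flow's principal orientation.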

4. Tracking by Learning Semantic Scene Scene knowledge can greatly help the tracking. Firstly, a person in a particular crowd flow is strongly influenced by it, because he must follow the other persons in it. The density and velocity distributions can be used to describe this influence and supervise the independent tracking. Secondly, a birth/death probability, which depends on the gradient of the density distribution, is assigned to the targets and measurements. This probability helps the tracking easily deal with the appearing/disappearing of targets, so we can maintain correct tracking even when uncertain measurements frequently occur.

4.1. Tracking based on crowd flow
Consider the state $\mathbf{x}_t^i = (x_i, y_i, v_i^x, v_i^y)$ of person $i$ at time $t$ with its measurement $\mathbf{z}_t^i$, which is the set of foreground laser points after Mean-shift clustering [5]. We estimate the state as
$$\hat{\mathbf{x}}_t^i = \arg\max_{\mathbf{x}_t^i}\ p(\mathbf{x}_t^i \mid \mathbf{z}_t^i). \quad (12)$$
The posterior probability $p(\mathbf{x}_t^i \mid \mathbf{z}_t^i)$ can be computed by a Bayesian recursion as
$$p(\mathbf{x}_t^i \mid \mathbf{z}_t^i) = \gamma\, p(\mathbf{z}_t^i \mid \mathbf{x}_t^i) \int p(\mathbf{x}_t^i \mid \mathbf{x}_{t-1}^i)\, p(\mathbf{x}_{t-1}^i \mid \mathbf{z}_{t-1}^i)\, d\mathbf{x}_{t-1}, \quad (13)$$
where $\gamma$ is a normalization constant, $p(\mathbf{z}_t^i \mid \mathbf{x}_t^i)$ is the similarity between the target's state and the measurement, and $p(\mathbf{x}_t^i \mid \mathbf{x}_{t-1}^i)$ is the transition probability. Obviously, a crowd flow has a great influence on the persons in it. We can use the density and velocity distributions to describe this influence, and the transition probability of person $i$ in crowd flow $S_j(t)$ can be computed as
$$p_{S_j(t)}(\mathbf{x}_t^i \mid \mathbf{x}_{t-1}^i) = \mathfrak{D}_{S_j(t)}\, \mathfrak{V}_{S_j(t)}\, \mathfrak{W}, \quad (14)$$
where $\mathfrak{W}$ is the walking model, which can be a constant-velocity model, a second-order autoregressive model [24], or the "two feet model" [32]. Equation (14) is intuitive: as shown in Fig. 4, if the immediate crowd behavior is moving in a particular direction, a person in it will favor this direction with a high transition probability, and the motion transition will also follow the density and velocity distributions of this crowd flow. Hence, we can utilize a particle filter [7] to compute equations (12) and (13) and obtain the targets' states at each time.
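A minimal sketch of how the crowd-flow prior of Eq. (14) could enter one particle-filter step follows; `dens_fn` and `vel_fn` stand for evaluations of Eqs. (8)-(9), while the Gaussian likelihood, the constant-velocity walking model, and `sigma` are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, weights, measurement, dens_fn, vel_fn, sigma=0.5):
    """One particle-filter step with the crowd-flow transition prior of
    Eq. (14): propagate with a constant-velocity walking model, then
    re-weight by the density x velocity distributions and the likelihood.
    `particles` is an N x 4 array of states (x, y, vx, vy)."""
    particles = particles.copy()
    particles[:, :2] += particles[:, 2:]                 # walking model W: x += v
    particles += rng.normal(0.0, sigma, particles.shape)
    prior = dens_fn(particles) * vel_fn(particles)       # Eq. (14) factors D * V
    lik = np.exp(-np.sum((particles[:, :2] - measurement) ** 2, axis=1)
                 / (2 * sigma ** 2))                     # illustrative p(z|x)
    weights = weights * prior * lik
    weights /= weights.sum()
    return particles, weights, particles.T @ weights     # weighted state estimate
```

With flat `dens_fn`/`vel_fn` the step reduces to a plain bootstrap filter; a sharply peaked density or velocity map pulls the weights toward particles consistent with the crowd flow.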

4.2. Uncertain measurements
In fact, in our application, most tracking failures were caused by uncertain measurements. For instance, a new

Figure 5. The results of the proposed system. The first row shows our tracking results, the second the clustering results, and the third and fourth the incrementally learned density and velocity distributions. Please see our supplementary video for more details.

target is often incorrectly initialized due to a false alarm. On the other hand, the merge/split of measurements and non-detections due to occlusion often break the trajectories, because the targets cannot find matching measurements. We can easily deal with these problems with the help of the sink/source knowledge of the scene.
We assign a death probability $P^i_{death}$ to each target and a birth probability $P^i_{birth}$ to each measurement. If a target with a high death probability cannot find any matching measurement, it can be regarded as a target disappearing from the scene; otherwise, we keep tracking it even without a suitable match. Similarly, if a measurement with a high birth probability is not associated with any target, we initialize and track it as a newly appearing target; otherwise, we treat it as a false alarm. As shown in Fig. 3(c), the birth and death probabilities should depend on the gradient of the density distribution, and the direction of gradient descent/ascent must follow the principal direction of the crowd flow. Therefore, the two probabilities can be computed as
$$P^i_{birth} \propto \exp\left(\langle \nabla\mathfrak{D}_{S_j(t)}(x, y, t),\ \vec{v}_j/|\vec{v}_j| \rangle / \xi_1\right), \quad (15)$$
$$P^i_{death} \propto \exp\left(\langle \nabla\mathfrak{D}_{S_j(t)}(x, y, t),\ \vec{v}_j/|\vec{v}_j| \rangle / \xi_2\right), \quad (16)$$
where $\vec{v}_j$ is the principal direction of the crowd flow, and $\xi_1$, $\xi_2$ are constant parameters. In summary, the tracking benefits from the learned scene knowledge and also provides new results to update the scene knowledge. This mode of co-operation between tracking and learning therefore not only yields accurate tracking results and scene knowledge, but also ensures that the entire process is completely automatic and online.
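Equations (15)-(16) reduce to projecting the density gradient onto the flow's principal direction; a minimal sketch, with illustrative values for the constants $\xi_1$, $\xi_2$:

```python
import numpy as np

def birth_death_prob(grad_d, flow_dir, xi1=1.0, xi2=1.0):
    """Unnormalized birth/death scores of Eqs. (15)-(16): the density
    gradient at the target/measurement position projected onto the crowd
    flow's principal direction. xi1/xi2 are illustrative assumptions."""
    u = np.asarray(flow_dir, dtype=float)
    u /= np.linalg.norm(u)                       # v_j / |v_j|
    s = float(np.dot(grad_d, u))                 # <grad D, v_j/|v_j|>
    return np.exp(s / xi1), np.exp(s / xi2)      # (P_birth, P_death), up to scale
```

In this sketch a large projection yields high scores; in practice the scores would be normalized or thresholded before the matching test described above.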

5. Experiments and Results We applied our system to a real scene: the lobby of a JR subway station (about 60 m × 35 m). Eight single-row laser scanners (LMS291, produced by SICK) were utilized. They were set about 10 cm above the ground surface, performing horizontal scanning at a frequency of 37 fps. The observation model in equation (13) and the walking model in equation (14) were the same as in Song et al. [24]. The data selected for evaluation was from 7:30 am to 8:30 am, a quite busy time in Tokyo. In this section,


Figure 6. Quantitative comparison among the four methods. (a) shows the number of correctly tracked targets for the four methods over 3000 consecutive frames. (b) shows the target number in these frames.

we present our experimental results and perform a quantitative evaluation and comparison.

5.1. Results We found that when the cluster number $n > 6$, the computation became extremely heavy and made our system unstable. Hence, we set $n < 6$ for the entire evaluation. Fig. 5 shows examples of our results. The first row shows the tracking results, the second the clustering results, and the third and fourth the incrementally learned density and velocity distribution maps. From this figure, we can see that the persons were clearly clustered into different crowd flows. In addition, the density distribution map became increasingly clear and thus revealed the knowledge of the scene. Furthermore, from the velocity distribution, we can see that persons in a crowded place usually walk quite slowly. After 1090 frames, the dominant paths and sinks/sources of the scene were obtained as illustrated in Fig. 7. In fact, as the number of tracked frames increases, these results become more accurate.

5.2. Quantitative comparison A quantitative comparison was conducted among four methods: Song et al. [24], Cui et al. [6], a particle filter (PF) based tracker alone (no scene learning), and our method. We surveyed 3000 consecutive frames to evaluate the tracking performance of these methods in the high-density scene. The ground truth was obtained in a semi-automatic way (trackers + manual labeling). Tracking failures included missed targets, false locations, and identity switches, which can be computed automatically from the ground truth. The details are illustrated in Fig. 6, and the overall success rates of these methods are shown in Table 1. From Fig. 6, we can see that our method has the best

Algorithm       Success Rate
Song et al.     85.5%
Cui et al.      83.2%
PF only         75.2%
Our Method      91.3%

Table 1. Success rates of the four methods.

performance among all the methods in the high-density scene, and the scene knowledge provides about a 16% performance improvement. As illustrated in the tracking results of frame 990, the trajectories obtained by the tracker without scene learning were quite short and frequently broke. By contrast, with the help of the scene knowledge, our method can easily maintain long-term and robust tracking.
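The success-rate figures in Table 1 follow directly from the per-frame failure counts; a trivial sketch of the bookkeeping (a hypothetical helper, not the authors' evaluation code):

```python
def success_rate(failures_per_frame, targets_per_frame):
    """Fraction of correctly tracked targets over all evaluated frames.
    A failure is a missed target, a false location, or an identity switch."""
    return 1.0 - sum(failures_per_frame) / sum(targets_per_frame)

# e.g. 3 failures among 20 target-frames -> 0.85
assert abs(success_rate([1, 2], [10, 10]) - 0.85) < 1e-9
```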

6. Conclusion In this paper, we presented a novel online system that simultaneously performs semantic scene learning and tracking. Experimental results demonstrate its feasibility and robustness. The proposed system can easily be extended to many interesting applications, such as object classification, online abnormal activity detection, and intelligent transportation. In the future, we will try to add these modules to the proposed system.

7. Acknowledgements This work was supported in part by the East Japan Railway Company, NSFC Grant No. 60605001, NHTRDP 863 Grants No. 2007AA11Z225 and No. 2009AA012105, and NKBRPC Grant No. 2006CB303100. We especially thank the High-Level University Building Program of China for its support.


Figure 7. Obtained paths and sink/source after 1090 frames. The white color shows the path of the scene, and the rectangles show the sinks/sources.

References
[1] S. Ali and M. Shah. Floor fields for tracking in high density crowd scenes. Proc. ECCV, pages 1–14, 2008.
[2] Y. Bar-Shalom and T. E. Fortmann. Tracking and Data Association. New York: Academic Press, 1988.
[3] M. Betke, D. Hirsh, A. Bagchi, N. Hristov, and N. Makris. Tracking large variable numbers of objects in clutter. Proc. IEEE CVPR, pages 1180–1187, 2007.
[4] B. Bose, X. Wang, and E. Grimson. Multi-class object tracking algorithm that handles fragmentation and grouping. Proc. IEEE CVPR, pages 1550–1557, 2007.
[5] D. Comaniciu and P. Meer. Distribution free decomposition of multivariate data. Pattern Analysis and Applications, 2(1):22–30, 1999.
[6] J. Cui, H. Zha, H. Zhao, and R. Shibasaki. Fusion of detection and matching based approaches for laser based multiple people tracking. Proc. IEEE CVPR, pages 642–649, 2006.
[7] A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
[8] Z. Fu, W. Hu, and T. Tan. Similarity based vehicle trajectory clustering and anomaly detection. Proc. IEEE International Conference on Image Processing, pages 1133–1136, 2005.
[9] G. Gennari and G. Hager. Probabilistic data association methods in visual tracking of groups. Proc. IEEE CVPR, pages 876–881, 2004.
[10] J. Harris. Algebraic Geometry: A First Course. Springer-Verlag, 1992.
[11] H. Jiang, S. Fels, and J. Little. A linear programming approach for multiple object tracking. Proc. IEEE CVPR, pages 1380–1387, 2007.
[12] J. Sullivan and S. Carlsson. Tracking and labeling of interacting multiple targets. Proc. ECCV, pages 661–675, 2006.
[13] I. Junejo and H. Foroosh. Trajectory rectification and path modeling for video surveillance. Proc. IEEE ICCV, pages 230–237, 2007.
[14] I. Junejo, O. Javed, and M. Shah. Multi feature path modeling for video surveillance. Proc. IEEE Conference on Pattern Recognition, pages 383–386, 2004.
[15] Z. Khan, T. Balch, and F. Dellaert. MCMC data association and sparse factorization updating for real time multitarget tracking with merged and multiple measurements. IEEE Trans. on PAMI, 28(12):1960–1972, 2006.
[16] O. Lanz and R. Manduchi. Hybrid joint-separable multibody tracking. Proc. IEEE CVPR, pages 413–420, 2005.
[17] B. Leibe, K. Schindler, and L. Van Gool. Coupled detection and trajectory estimation for multi-object tracking. Proc. IEEE ICCV, pages 1110–1117, 2007.
[18] D. Makris and T. Ellis. Automatic learning of an activity-based semantic scene model. Proc. IEEE Conference on Advanced Video and Signal Based Surveillance, pages 183–188, 2003.
[19] P. Nillius, J. Sullivan, and S. Carlsson. Multi-target tracking - linking identities using Bayesian network inference. Proc. IEEE CVPR, pages 2187–2194, 2006.
[20] W. Qu, D. Schonfeld, and M. Mohamed. Real-time interactively distributed multi-object tracking using a magnetic-inertia potential model. Proc. IEEE ICCV, pages 535–540, 2005.
[21] C. Rasmussen and G. Hager. Probabilistic data association methods for tracking complex visual objects. IEEE Trans. on PAMI, 23(6):560–576, 2001.
[22] D. Reid. An algorithm for tracking multiple targets. IEEE Trans. Automatic Control, 24(6):843–854, 1979.
[23] D. Schulz, W. Burgard, D. Fox, and A. Cremers. People tracking with a mobile robot using sample-based joint probabilistic data association filters. International Journal of Robotics Research, 22(2):99–116, 2003.
[24] X. Song, J. Cui, X. Wang, H. Zhao, and H. Zha. Tracking interacting targets with laser scanner via on-line supervised learning. Proc. IEEE International Conference on Robotics and Automation, pages 2271–2276, 2008.
[25] J. Vermaak, S. Godsill, and P. Perez. Monte Carlo filtering for multi target tracking and data association. IEEE Trans. Aerospace and Electronic Systems, 41(1):309–332, 2005.
[26] R. Vidal. Online clustering of moving hyperplanes. Proc. Neural Information Processing Systems, pages 1433–1440, 2006.
[27] X. Wang, K. Ma, G. Ng, and E. Grimson. Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. Proc. IEEE CVPR, pages 1–8, 2008.
[28] X.
Wang, K. Tieu, and E. Grimson. Learning semantic scene models by trajectory analysis. Proc. ECCV, pages 110–123, 2006. M. Yang, T. Yu, and Y. Wu. Game-theoretic multiple target tracking. Proc. IEEE ICCV, pages 110–117, 2007. A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, pages 3–47, 2006. Q. Yu, G. Medioni, and I. Cohen. Multiple target tracking using spatio-temporal markov chain monte carlo data association. Proc. IEEE CVPR, pages 642–649, 2007. H. Zhao and R.Shibasaki. A novel system for tracking pedestrians using multiple single-row laser range scanners. IEEE Transactions on Systems, Man and Cybernetics, part A, pages 283–291, 2005. T. Zhao and R. Nevatia. Tracking multiple humans in complex situations. IEEE Trans. on PAMI, 7(1):1208–1221, 2004.