Pattern Recognition 37 (2004) 1025–1034
www.elsevier.com/locate/patcog

Pedestrian detection and tracking at crossroads

Chia-Jung Pai a, Hsiao-Rong Tyan b, Yu-Ming Liang c, Hong-Yuan Mark Liao c,*, Sei-Wang Chen d

a Department of Information and Computer Education, National Taiwan Normal University, Taiwan
b Institute of Information and Computer Engineering, Chung Yuan University, Taiwan
c Institute of Information Science, Academia Sinica, 128 Sinica Road, Sec 2, Nankang, Taipei 11529, Taiwan
d Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, Taiwan

Received 8 May 2003; accepted 7 October 2003

Abstract

This paper presents a system that performs pedestrian detection and tracking using vision-based techniques. A very important issue in the field of intelligent transportation systems is to prevent pedestrians from being hit by vehicles, and a great number of vision-based techniques have recently been proposed for this purpose. In this paper, we propose a vision-based method that combines a pedestrian model with the walking rhythm of pedestrians to detect and track walking pedestrians. By integrating spatial and temporal information grabbed by a vision system, we are able to develop a reliable system that can be used to prevent traffic accidents at crossroads. In addition, the proposed system can deal with the occlusion problem. Experimental results obtained on real world cases demonstrate that the proposed system is highly effective.
© 2003 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Pedestrian detection and tracking; Intelligent transportation system; Pedestrian model; Walking rhythm

1. Introduction

As the number of vehicles increases rapidly in the modern world, the number of deaths caused by traffic accidents has grown significantly in the past few years. It is mentioned in Ref. [1] that the two most frequent parties involved in injury-causing traffic accidents are passenger cars and pedestrians, respectively. In this paper, we focus on how to efficiently detect and then track pedestrians at crossroads so that drivers can be made more cautious. In many countries, a pedestrian crosswalk and a traffic signal are installed together at every crossroad. These facilities help pedestrians cross a street. A fixed signal duration, however, very often makes every traffic participant wait a long period of time. In some countries, so-called pelican crossings are used.

∗ Corresponding author. Tel.: +886-2-27883799; fax: +886-2-27824814. E-mail address: [email protected] (H.-Y.M. Liao).

At this kind of crossing, a pedestrian can change the traffic signal or extend the signal duration by himself/herself. However, it happens very often that the time is still insufficient for slow walkers such as the elderly or young children. Under these circumstances, an automated pedestrian detection system is indispensable. In recent years, many countries have devoted resources to the development of Intelligent Transportation Systems (ITS). Among the great number of ITS-related research efforts, the vision-based approach [1–17] has become one of the main streams because of its ability to capture richer information. Using the data captured by a vision system, many valuable traffic-related events can be extracted and then analyzed. Among the different types of traffic-related events, the detection of pedestrians is a very important one. Detecting pedestrians efficiently and effectively is not an easy task due to the articulated nature of the human body. The most commonly used technique for detecting pedestrians is direct classification. A common classification process needs either to extract features from targets or to find an appropriate template for every target.



Fig. 1. The system setup of our method: (a) the camera setup; and (b) an image taken by the camera.

For classification, neural networks are among the powerful tools that can be used to classify objects [6,10,15]. In Refs. [5,11], the wavelet transform is used to extract features, and classification is performed with a support vector machine. In Ref. [9], template matching is used to perform classification, but pedestrian templates derived from different viewpoints need to be predetermined. In addition to the above-mentioned approaches, a pedestrian model or the walking rhythm of a pedestrian can also be used for classification. The pedestrian model used in Ref. [2] assumes that the pedestrian body is symmetric when seen from the front or the back. In Ref. [8], Heikkilä and Silvén proposed to extract the contour of a target and encode it with a coding method; their pedestrian model represents a pedestrian by certain code sequences. In Refs. [4,7,14], the walking rhythm of a pedestrian is used as the classification feature. The walking rhythm is the swinging of a pedestrian's legs, which remains periodic while the pedestrian walks.

In this paper, we propose a vision-based approach that can identify pedestrians efficiently and automatically. First, we set a camera above the road (see Fig. 1). Once pedestrians are present, we can inform approaching vehicles of their presence through a communication technique. The proposed pedestrian detection system is useful not only in the transportation domain, but also in other domains, such as building security, parking area monitoring, and robot patrol systems.

The rest of this paper is organized as follows. In Section 2, we describe the process of foreground extraction. The moving target tracking process is then elaborated in Section 3. In Section 4, we describe how to perform pedestrian recognition. The experimental results are reported in Section 5, and the conclusion is drawn in Section 6.

2. Foreground extraction

Foreground extraction is a fairly important process for a pedestrian detection system because it influences the final detection results. There are many existing methods designed for foreground extraction, and the most famous ones are either stereo vision-based [9,10,15] or background subtraction-based [8,13,18].

Fig. 2. The illustration of foreground extraction: (a) – (h) are partial images needed to generate the background image (120 frames in total); (i) is the generated background image; (j) is an input image in the sequence; (k) shows extracted foreground pixels; (l) is the downsized image; (m) shows foreground objects; (n) is the result after performing shadow removal; and (o) is the result after performing contour extraction.

The time complexity of a stereo vision-based system is too high, so a background subtraction-based approach is often used to extract foreground objects. In our system, we generate the background first. The background is generated by the following iterative equation:

B_j(x,y) = B_{j-1}(x,y) + \frac{1}{n}\big(I_j(x,y) - B_{j-1}(x,y)\big),   (1)

where B_j(x,y) is the color value of background pixel (x,y) at time j, n is the number of frames used to generate the background, and I_j(x,y) is the color value of pixel (x,y) in frame j. The initial value of B_j(x,y) is equal to I_j(x,y). The physical meaning of Eq. (1) is to average n consecutive frames; an example is shown in Fig. 2(i). After the background is generated, the foreground pixels can be extracted by

P_j(x,y) = \begin{cases} 1 & \text{if } |I_j(x,y) - B(x,y)| > T_{obj}, \\ 0 & \text{otherwise}, \end{cases}   (2)

where T_{obj} is a predefined threshold. An example showing the foreground pixels is illustrated in Fig. 2(k). Once the pixels that belong to the foreground are extracted, they should be grouped into objects. However, due to the nature of image-processing operations, noisy pixels are easily included by mistake. Therefore, we apply one iteration of erosion to reduce their effect. The structuring element used for erosion is shown in Fig. 3. To speed up the whole process, the next step is to downsize each side of the foreground image to 1/n of its original length. A block is marked as 1 if the number of 1's it contains is greater than T_{block}; otherwise, the block is marked as 0. An example of the downsized image is shown in Fig. 2(l).
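To make the pipeline concrete, the following Python sketch implements the background update of Eq. (1), the thresholding of Eq. (2), one erosion pass, and the block downsizing. It is a minimal sketch, not the paper's implementation: the threshold values, the downsizing factor, and the 3x3 structuring element (standing in for the element of Fig. 3) are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import binary_erosion

T_OBJ = 30      # T_obj of Eq. (2); illustrative value
N_FRAMES = 120  # n of Eq. (1); 120 frames, as in Fig. 2
BLOCK = 4       # downsizing factor; illustrative value
T_BLOCK = 8     # T_block, minimum count of 1's per block; illustrative value

def update_background(B, I):
    """Running-average background update, Eq. (1)."""
    return B + (I.astype(np.float64) - B) / N_FRAMES

def extract_foreground(I, B):
    """Binary foreground mask via Eq. (2), followed by one erosion pass."""
    diff = np.abs(I.astype(np.float64) - B).sum(axis=2)  # color difference
    mask = diff > T_OBJ
    return binary_erosion(mask, structure=np.ones((3, 3), dtype=bool))

def downsize(mask):
    """Mark a BLOCK x BLOCK block as 1 if it holds more than T_BLOCK ones."""
    h, w = mask.shape
    m = mask[:h - h % BLOCK, :w - w % BLOCK]
    blocks = m.reshape(h // BLOCK, BLOCK, w // BLOCK, BLOCK)
    return blocks.sum(axis=(1, 3)) > T_BLOCK
```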


Fig. 3. The structuring element used for erosion and dilation.

For extracting the foreground objects, a connected component labeling algorithm is applied to the above-mentioned downsized image. After the labeling process, we map the labeled connected components back to the original foreground image. In addition, since an erosion operation was applied previously, one iteration of dilation has to be applied to these mapped connected components. The structuring element used for dilation is the same as the one used for erosion (Fig. 3). An example of the dilated connected components is illustrated in Fig. 2(m).

In the connected component labeling process, it is possible to "connect" some irrelevant shadows. Therefore, a shadow removal process has to be applied right after the connected component process. In Ref. [3], Prati et al. proposed a deterministic non-model-based shadow removal algorithm. According to their observation, a shadow can be detected by comparing the hue, saturation, and intensity of the foreground image against those of the background image. They devised the following shadow checking criterion [3]:

SP_j(x,y) = \begin{cases} 1 & \text{if } \alpha \le I_j^V(x,y)/B^V(x,y) \le \beta \\ & \quad \wedge\ \big(I_j^S(x,y) - B^S(x,y)\big) \le \tau_S \\ & \quad \wedge\ \big|I_j^H(x,y) - B^H(x,y)\big| \le \tau_H, \\ 0 & \text{otherwise}, \end{cases}   (3)

where I_j^H(x,y), I_j^S(x,y), and I_j^V(x,y) are the hue, saturation, and intensity of the foreground pixel located at (x,y) in frame j, and B^H(x,y), B^S(x,y), and B^V(x,y) are the hue, saturation, and intensity of the background pixel located at (x,y). Eq. (3) involves four thresholds: \alpha, \beta, \tau_S, and \tau_H. These thresholds are all determined by the illumination conditions. When SP_j(x,y) = 1, the pixel (x,y) of frame j belongs to a shadow. Using the criterion defined by Eq. (3), we can separate shadows from the foreground objects. An example showing the result after performing shadow removal is illustrated in Fig. 2(n); a sketch of this test is given below.

After shadow removal, the next step is to perform contour extraction on the previously extracted foreground objects. Since the extracted contour is not necessarily closed, we adopted the contour tracing algorithm proposed by Chen et al. [18] to accomplish the task. In Ref. [18], contour extraction is realized by scanning each row and keeping the leftmost and rightmost pixels that belong to the same object; a similar operation is applied to the columns in a vertical scan. Fig. 2(o) is an example showing the result after performing contour extraction.
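A minimal sketch of the shadow test of Eq. (3) follows, assuming OpenCV's HSV conversion. The numeric values for alpha, beta, tau_S, and tau_H are illustrative assumptions, since the paper determines them from the illumination conditions.

```python
import numpy as np
import cv2

ALPHA, BETA = 0.4, 0.9  # bounds on the intensity ratio I_V/B_V (assumed values)
TAU_S, TAU_H = 60, 50   # saturation and hue thresholds (assumed values)

def shadow_mask(frame_bgr, background_bgr, fg_mask):
    """Return the mask SP_j of foreground pixels classified as shadow, Eq. (3)."""
    I = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.int32)
    B = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV).astype(np.int32)
    ratio = I[..., 2] / np.maximum(B[..., 2], 1)           # I_V / B_V
    shadow = ((ALPHA <= ratio) & (ratio <= BETA) &         # darkened, not black
              ((I[..., 1] - B[..., 1]) <= TAU_S) &         # small saturation change
              (np.abs(I[..., 0] - B[..., 0]) <= TAU_H))    # small hue change
    return shadow & fg_mask                                # only foreground pixels
```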

Fig. 4. Moving object tracking. The dynamical graph matching process is shown in (a)–(f). Profile 2 is missing in (c), and the occlusion event happens in (d) and (e).

3. Moving target tracking

For the purpose of conducting efficient tracking, we adopt a dynamical graph matching approach [18] to fulfill our goal. Here, we use Fig. 4 to explain how this approach works. The upper half of Fig. 4(a)–(f) illustrates a sequence of consecutive frames that reflect the real situation of a pedestrian crossing. The lower half of Fig. 4(a)–(f), on the other hand, illustrates the bipartite graph that explains how dynamical graph matching works. A bipartite graph consists of two major components: the profiles and the real objects. For example, the bipartite graph shown in the lower half of Fig. 4(a) consists of four objects and four corresponding profiles. The four objects are obtained by executing the foreground extraction process, and the four profiles are the foreground objects' counterparts that are temporarily stored in the profile database. If a link exists between an object and a profile, they match each other. In the very beginning (before tracking), the bipartite graph is empty. Objects entering the field of view in subsequent frames are considered new profiles and are added into the graph. In Fig. 4(a), we add four profiles (2–5) into the graph. In Fig. 4(b), foreground processing detects objects 2–5. The link between profile 3 and object 2, and that between profile 2 and object 3, can then be established. A profile is deleted if there is no matched object in the subsequent image.


Therefore, profile 2 is deleted because its corresponding object is out of range; this situation is illustrated in Fig. 4(c). On the other hand, if there exists an object that cannot find a counterpart in the profile database, we have to add a new profile to the database. Fig. 4(f) illustrates a typical example: object 2 shown in Fig. 4(f) enters the field of view, so a new item, profile 6, is added to the database.

As to the matching process, the feature used for matching is color information. First, we quantize the color space to reduce the amount of data; the color image can be quantized into 64 × 3 or 16 × 3 levels. After quantization, the quantized color histograms of profiles and objects can be obtained. The Kullback–Leibler distance [18] is used as the dissimilarity measurement:

D(p(u), q(u)) = \sum_{u=1}^{n} p(u) \log \frac{p(u)}{q(u)},   (4)

where n is the number of quantization levels, p(u) is the quantized color histogram of a profile, and q(u) is the quantized color histogram of an object. If a profile has a matched object, its color histogram is updated; otherwise, the color histogram of an occluded profile remains unchanged.

Occlusion is usually an unavoidable event in the tracking phase, and it must be handled. When an occlusion event occurs, the number of objects in the field of view decreases. We assume that a tracked profile moves at a constant speed, so we can predict the current position of an unmatched profile. If the predicted position is outside the image, the corresponding profile is treated as missing. If there is another object at the predicted position, the profile is significantly occluded by this object. We label such a profile "occluding" so that we know it is occluded. An example of occlusion is shown in Fig. 4(d) and (e). A sketch of the histogram matching step is given below.
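The following sketch illustrates the matching step. The 64-level quantization comes from the text, while the smoothing constant, the greedy assignment, and the distance cutoff are our assumptions; the paper does not spell out its assignment procedure.

```python
import numpy as np

EPS = 1e-6  # smoothing so that empty bins do not make the KL distance blow up

def color_histogram(pixels_rgb, levels=64):
    """Quantized color histogram: `levels` bins per channel (levels x 3 in total)."""
    px = pixels_rgb.astype(np.int64)
    bins = [np.bincount(px[:, c] * levels // 256, minlength=levels)
            for c in range(3)]
    h = np.concatenate(bins).astype(np.float64) + EPS
    return h / h.sum()

def kl_distance(p, q):
    """Kullback-Leibler distance of Eq. (4) between histograms p and q."""
    return float(np.sum(p * np.log(p / q)))

def match_profiles(profiles, objects, max_dist=1.0):
    """Greedily link each profile to its nearest unclaimed object. Unmatched
    profiles keep their stored histograms, e.g. while they are occluded."""
    links, claimed = {}, set()
    for pid, p_hist in profiles.items():
        candidates = {oid: kl_distance(p_hist, q_hist)
                      for oid, q_hist in objects.items() if oid not in claimed}
        if candidates:
            best = min(candidates, key=candidates.get)
            if candidates[best] < max_dist:
                links[pid] = best
                claimed.add(best)
    return links
```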

4. Pedestrian recognition

In this section, we describe how to use a pedestrian model together with the walking rhythm of a human being to perform pedestrian recognition. To characterize the pedestrian model, we adopt the non-rigid motion of the body together with the width-to-height ratio of a human torso. The details of how these features are used in pedestrian recognition are discussed in the following.

4.1. Pedestrian model

In this work, we propose a pedestrian model based on the non-rigid body. The main feature used is the width-to-height ratio of the torso of a human being. Here, the shape of a pedestrian is characterized by an ellipse. We locate the pedestrian's torso by matching an ellipse, and the ratio of the major axis to the minor axis of the ellipse is restricted to a certain range.

Fig. 5. The contour maps of different moving objects. (a1)–(a6) show a moving vehicle within six consecutive frames. (b1)–(b6) show the corresponding contour maps of (a1)–(a6). (c1)–(c6) show a pedestrian. (d1)–(d6) show the corresponding contour maps of the pedestrian. (e1)–(e6) show a pedestrian crowd. (f1)–(f6) show the corresponding contour maps of the pedestrian crowd.

The ellipse can also be used to locate the feet, and the walking rhythm can thus be measured.

The main difference between a pedestrian and a vehicle is the style of motion. Usually, the movement of a vehicle obeys rigid body motion rules, whereas the movement of a pedestrian falls into the category of non-rigid body motion. When a pedestrian walks, the contour of his/her appearance varies significantly, especially in the lower half of the body. Fig. 5 illustrates the contour maps of a moving car, a moving pedestrian, and a moving pedestrian crowd. It is obvious that the contour variation of a pedestrian or a pedestrian crowd is much more significant than that of a moving vehicle (especially in the lower half). Therefore, we can check the lower half of a target to see if there is any non-rigid body motion. The checking process involves two steps. First, for the contour of a single pedestrian, we check the lower half of the body. Let the area occupied by the two feet of a pedestrian be A_{obj} and the area of the silhouette of the two feet be A_{sil}. It is noted that A_{sil} changes as a pedestrian walks.


Fig. 6. The process for determining the maximum inscribed ellipse: (a) the initial ellipse; (b) the distance computed to adjust the length of the vertical axis and the y coordinate of the ellipse's center; (c) the resultant ellipse after vertical adjustment; (d) the distance computed to adjust the length of the horizontal axis and the horizontal coordinate of the ellipse's center; and (e) the obtained maximum inscribed ellipse.

For a video sequence that shows a walking pedestrian, we can keep summing up the areas of the different A_{sil}'s. Let the ratio between the accumulated A_{sil} and A_{obj} be A_{ratio}. For a walking pedestrian, A_{ratio} is insignificant at the very beginning because only one A_{sil} in a frame is counted. As time goes by, the accumulated A_{sil} grows larger and larger and finally exceeds a predefined threshold. When this happens, we can confirm that the tracked target is a pedestrian. On the other hand, if the tracked region exhibits rigid body motion (such as a vehicle), then A_{ratio} remains nearly the same during the tracking process. Under this circumstance, we can identify it as a non-pedestrian.

For the case of a pedestrian crowd, the analysis is a little more complicated because the non-rigid body motion characterized by the silhouette of the feet of a pedestrian crowd is not as significant as that of a single pedestrian. Therefore, if only A_{ratio} is used, a pedestrian crowd could be mistakenly identified as a vehicle. We therefore propose to use Shannon's entropy, E_{ratio}, defined as follows [3]:

E_{ratio} = -\sum_{1 \le i \le N} p(i) \log p(i),   (5)

where N is the number of blocks covering a region of interest, and p(i) is the ratio between the number of contour points in block i and the total number of contour points of the corresponding object. Using E_{ratio}, we can determine more precisely whether a moving object is a rigid body or not. The determination of a cutoff threshold is described in the experiments. A sketch of the entropy computation is given below.
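The following sketch computes E_{ratio} over an N-block grid laid on the object's bounding box; the grid size is an assumption, since the paper does not state how the blocks are laid out.

```python
import numpy as np

def entropy_ratio(contour_mask, grid=(8, 8)):
    """Shannon entropy, Eq. (5), of the contour-point distribution over N blocks."""
    h, w = contour_mask.shape
    gy, gx = grid
    ys, xs = np.nonzero(contour_mask)          # coordinates of contour points
    by = np.minimum(ys * gy // h, gy - 1)      # block row of each point
    bx = np.minimum(xs * gx // w, gx - 1)      # block column of each point
    counts = np.bincount(by * gx + bx, minlength=gy * gx).astype(np.float64)
    p = counts / max(counts.sum(), 1.0)        # p(i): share of points in block i
    p = p[p > 0]                               # 0 log 0 is taken as 0
    return float(-(p * np.log(p)).sum())
```

A rigid target such as a vehicle concentrates its contour points in a few blocks and yields a low entropy, whereas a pedestrian crowd spreads them out; in Section 5, targets with E_{ratio} above a cutoff of 3 are treated as non-rigid.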


In what follows, we describe a method to locate the torso of a pedestrian. We assume the human body is always perpendicular to the road surface when a pedestrian walks. This assumption helps reduce the search space when an ellipse is adopted to characterize the torso. We aim to derive the maximum inscribed ellipse of a torso, with the ratio between its axes confined to a certain range. The procedure for determining the maximum inscribed ellipse of a human torso is illustrated in Fig. 6. At the very beginning, we initialize the whole process by providing an initial ellipse. The center of the ellipse is the center of the target object, and the height and width of the target object are taken as the long (major) axis and the short (minor) axis, respectively, of the initial ellipse (Fig. 6(a)). After the initialization process, the next step is to adjust the lengths of the long axis and the short axis. We start the discussion with the long (vertical) axis. We divide the initial ellipse into an upper half and a lower half. For every point located on the upper silhouette of the object, we calculate its vertical distance, d_{vu} (v means vertical and u represents upper), to the upper half ellipse. If d_{vu} is positive, the silhouette point is outside the ellipse; otherwise, it is inside the ellipse. Similarly, for every point located on the lower silhouette of the object, we calculate its vertical distance, d_{vl}, to the lower half ellipse. An example showing how d_{vl} and d_{vu} are measured is illustrated in Fig. 6(b). Having all the d_{vl}'s and d_{vu}'s, we can adjust the length of the long axis and the vertical center coordinate as follows:

A' = A + \big(\min\{d_{vu}\} + \min\{d_{vl}\}\big),   (6)

C'_y = C_y + \frac{\min\{d_{vu}\} + \min\{d_{vl}\}}{2},   (7)

where A is the length of the long axis of the initial ellipse and A' is the length after adjustment; C_y is the vertical coordinate of the center of the initial ellipse and C'_y is the new vertical center coordinate. Fig. 6(c) shows the result after executing the vertical adjustment. The adjustment in the horizontal direction is similar (Fig. 6(d)), and the result after executing both the vertical and the horizontal adjustments is shown in Fig. 6(e). The above-mentioned method only finds an approximate maximum inscribed ellipse, but the approximation is very close to the ideal one. Compared with using the Hough transform for the same task, the above-mentioned method is much faster. As to the ratio between the minor axis and the major axis of a torso, one has to provide its range. In our work, we set its upper bound and lower bound to 1 and 1/8, respectively. Since the maximum inscribed ellipse may accidentally cover the foot part of a pedestrian, we take data from an observation period instead of a single shot. During the observation period, the axis ratio of the detected ellipse should always stay within the predefined range. The case of a pedestrian crowd is explained later. A sketch of the vertical adjustment step is given below.
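Below is a minimal sketch of the vertical pass of this adjustment (Eqs. (6) and (7)); the horizontal pass is symmetric. The silhouette is assumed to be split beforehand into upper and lower point sets in image coordinates (y grows downward), and the sign conventions are our reading of Fig. 6.

```python
import numpy as np

def ellipse_y(cx, cy, a, b, x, upper):
    """Ordinate of the ellipse boundary at abscissa x, for the upper or lower
    half (a: vertical semi-axis, b: horizontal semi-axis)."""
    t = np.clip(1.0 - ((np.asarray(x, float) - cx) / b) ** 2, 0.0, None)
    dy = a * np.sqrt(t)
    return cy - dy if upper else cy + dy

def adjust_vertical(cx, cy, A, B, upper_pts, lower_pts):
    """One vertical adjustment: update the long axis A by Eq. (6) and the center
    ordinate C_y by Eq. (7). Distances are positive for points outside the ellipse."""
    a, b = A / 2.0, B / 2.0
    up = np.asarray(upper_pts, float)   # (x, y) points on the upper silhouette
    lo = np.asarray(lower_pts, float)   # (x, y) points on the lower silhouette
    d_vu = ellipse_y(cx, cy, a, b, up[:, 0], True) - up[:, 1]
    d_vl = lo[:, 1] - ellipse_y(cx, cy, a, b, lo[:, 0], False)
    shift = d_vu.min() + d_vl.min()
    return A + shift, cy + shift / 2.0  # Eq. (6) and Eq. (7)
```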



Fig. 7. The walking rhythm of a pedestrian: (a) the distance between feet; (b) the walking rhythm; and (c) the power spectrum of the walking rhythm.

4.2. Walking rhythm

Walking rhythm [4,7,14] is another important feature that can be used to detect the presence of pedestrians. In Ref. [14], Yasutomi et al. used a spatial–temporal approach to detect moving pedestrians. When a pedestrian walks, his/her legs swing, and the frequency of this swing in some sense reflects the habit of the person. It is mentioned in Ref. [14] that the step frequency of a normal pedestrian is from 1.5 to 2.5 steps/s. To detect the walking rhythm of a pedestrian, several features have to be measured. First, the distance between the two feet is required. Second, the change of this distance over time has to be determined. Fig. 7(a) and (b) show, respectively, the distance between the two feet at an instant and the change of this distance over time. The power spectrum can be derived by applying the Fourier transform or the maximum entropy method. Using an appropriate method, we are able to calculate the power at every frequency. The power spectrum corresponding to the walking rhythm shown in Fig. 7(b) is illustrated in Fig. 7(c). In Fig. 7(c), the power at frequency 0 is clearly higher than at other frequencies. We do not consider the power at frequency 0 meaningful, so it is ignored. The drawback of using the walking rhythm feature is that it takes some time to collect the data. However, it is a much more stable feature than the aforementioned pedestrian model. The procedure for computing the walking rhythm of a pedestrian can be divided into three steps:

Step 1: Set the power to 0 if it is lower than a predefined threshold; otherwise, the power remains unchanged.
Step 2: Sum up the power over the frequency range 1.5–2.5 steps/s, and calculate the power ratio by dividing this sum by the total power.
Step 3: If the power ratio derived from step 2 is higher than a predefined threshold (the pedestrian confidence level), the target is a pedestrian; otherwise, it is not.

The process of how to systematically determine the above-mentioned thresholds is discussed in the experiments. Basically, the walking rhythm feature becomes very useful when the pedestrian model test yields an uncertain result. A sketch of this three-step test is given below.
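A minimal sketch of the three-step test follows, using an FFT power spectrum of the foot-distance signal. The thresholds 0.25 and 0.1 are the values used in Section 5; whether the power cutoff is absolute or relative to the peak is not stated in the paper, so the relative form here is an assumption.

```python
import numpy as np

P_CUT = 0.25      # power-spectrum cutoff of step 1 (value from Section 5)
CONFIDENCE = 0.1  # pedestrian confidence level of step 3 (value from Section 5)
FPS = 30.0        # camera frame rate

def walking_rhythm_test(foot_distance):
    """Classify a track from its foot-distance series (roughly 2 s, >= 60 frames)."""
    s = np.asarray(foot_distance, dtype=np.float64)
    s = s - s.mean()                               # discard the frequency-0 power
    power = np.abs(np.fft.rfft(s)) ** 2
    freqs = np.fft.rfftfreq(len(s), d=1.0 / FPS)   # in steps/s (Hz)
    power[power < P_CUT * power.max()] = 0.0       # step 1: zero out weak power
    band = (freqs >= 1.5) & (freqs <= 2.5)         # step 2: 1.5-2.5 steps/s band
    total = power.sum()
    ratio = power[band].sum() / total if total > 0 else 0.0
    return ratio > CONFIDENCE                      # step 3: pedestrian if True
```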

4.3. The occlusion problem

In addition to the pedestrian model and the walking rhythm, occlusion is an important issue that should be addressed. Let the distance between the two feet of a pedestrian be F_l, and the horizontal distance from the head to one of the feet be F_{si} (i = 1 for the left foot and i = 2 for the right foot). Fig. 8(a) shows how F_l and F_{si} are defined. Since F_{si} depends on the position of the head, determining the head position is also a crucial issue. To determine the head position, we take the horizontal projection of the upper half of the body (the two feet are not included), as indicated in Fig. 8(b). From the projection we can easily identify the location of the head. Fig. 8(c) shows the changes of F_l, F_{s1}, and F_{s2} over time.

The above-mentioned measurements are all derived from a single pedestrian, but this is definitely not the case when more than one pedestrian is present and they mutually occlude each other. Obviously, when mutual occlusion exists between two pedestrians, there is no way to measure F_l and F_{si} correctly. Fig. 9(a) is an example showing two mutually occluded pedestrians; the solid line represents F_l without occlusion, and the dashed line represents F_l when mutual occlusion occurred. Apparently, under occlusion the measured width between the two feet is larger than the true value. To solve this problem, we analyze the relationship between F_l and F_{si} (for i = 1 or 2). From the experiments, we know that F_l is equal to the product of F_{si} and R_{Fi} (where i = 1 for the left foot and i = 2 for the right foot). R_{Fi} can be computed by dividing the sum of the F_l's within a sufficiently long period of time by the sum of the F_{si}'s within the same period. As to the peak/valley phase shift between F_l and F_{si}, we can resolve it by averaging the differences between corresponding peak/valley pairs. Fig. 8(d) shows the adjusted F_l and F_{si} after executing phase shift estimation and adjustment on the F_l and F_{si} shown in Fig. 8(c). With the measured F_{si} and R_{Fi}, one can then estimate F_l easily by multiplying F_{si} by R_{Fi}. Fig. 9(b) shows the F_l estimated by the aforementioned method. A sketch of this estimation is given below.
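The sketch below estimates F_l from F_{si} during occlusion, following the paper's relation F_l = F_{si} · R_{Fi}; the function names and the use of np.roll for the phase-shift compensation are our assumptions.

```python
import numpy as np

def estimate_rf(fl_series, fs_series):
    """R_Fi: the sum of F_l over an unoccluded observation window divided by
    the sum of F_si over the same window."""
    fl = np.asarray(fl_series, dtype=np.float64)
    fs = np.asarray(fs_series, dtype=np.float64)
    return fl.sum() / max(fs.sum(), 1e-9)

def estimate_fl(fs_series, rf, shift):
    """Estimate F_l(t) = F_si(t - shift) * R_Fi once F_l is corrupted by
    occlusion; `shift` (in frames) is the average peak/valley offset between
    the two signals."""
    fs = np.asarray(fs_series, dtype=np.float64)
    return np.roll(fs, shift) * rf
```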


Fig. 8. The relationship between F_l and the horizontal distance from head to foot, F_{si}: (a) the detected pedestrian and the definition of F_{si}; (b) the horizontal projection of the pedestrian used to locate the head position; (c) the relationship of F_l and F_{si} without considering R_F; and (d) the relationship of F_l and F_{si} with R_F multiplied.

Fig. 9. The result of the estimation. (a) The walking rhythm of two mutually occluded pedestrians (the solid line represents F_l when there is no occlusion, and the dashed line represents F_l when occlusion occurred). (b) The result after executing the estimation process.

Fig. 10. The distribution of A_{ratio}: (a) the A_{ratio} distribution of single pedestrians; and (b) the A_{ratio} distribution of vehicles.

The method mentioned above can be used to detect a single pedestrian. A pedestrian crowd, however, cannot pass the check of the pedestrian model. The term pedestrian crowd here means two or more pedestrians. A pedestrian crowd has properties similar to those of a vehicle. Because the width-to-height ratio of a pedestrian crowd's torso does not fall into the range predefined for the single-pedestrian case, we have to loosen the constraints of the pedestrian model in order to correctly detect a pedestrian crowd. A target that falls into the category of a pedestrian crowd should pass the non-rigid body test; this is the only test that we have to execute to distinguish a pedestrian crowd from a moving vehicle.

5. Experimental results

We have conducted a series of experiments to test the effectiveness of the proposed method. In the first part, we use the pedestrian model to identify "possible pedestrians". If a target passes the check of the pedestrian model, the walking rhythm test is then used to perform a finer check. Basically, the walking rhythm is able to distinguish a pedestrian from a non-pedestrian, and an identified "non-pedestrian" can be either a "pedestrian crowd" or a "vehicle". The experiments are divided into two parts: one recognizes single pedestrians, and the other classifies pedestrian crowds and vehicles. In what follows, we describe the experimental results for these two parts.

5.1. Recognition of single pedestrians

We used the pedestrian model and the walking rhythm approach to recognize single pedestrians. Forty-six video sequences with walking pedestrians were adopted to test our algorithms. Among the 46 test videos, 40 passed the non-rigid body test; six failed because the pedestrians in these videos wore clothes whose colors were similar to the road surface. Basically, this problem can be solved by using other types of sensors, such as a laser scanner or an electric eye; we shall address this issue in future work. To collect enough contour data, we spent 0.25 s gathering data, which is about 7.5 frames at a frame rate of 30 fps. For the parameter A_{ratio} described in Section 4, the threshold used to distinguish rigid body motion from non-rigid body motion was set to 0.4. This setting was based on analyzing the distributions derived from a large number of tests (Fig. 10(a) and (b)). Fig. 11 shows the experimental results after applying the pedestrian model test.


Fig. 12. (a)–(c) are examples of two pedestrians walking in opposite directions. (d)–(f) are examples of two pedestrians walking in the same direction but at different velocities. (g)–(i) are examples in which one pedestrian occludes the other all the way. The dashed line represents F_l during occlusion, and the solid line represents F_l when there is no occlusion.

Fig. 11. Results obtained by executing the pedestrian model test.

Fig. 11(a)–(j) show a sequence of successive frames. The left-hand side of each sub-figure is the original image frame, and the right-hand side shows the results after executing the pedestrian model test. The upper part of the right-hand side of each sub-figure is the resulting contour map, and the lower part is the located pedestrian torso. We used different colors to represent different things: green represents a shadow point, pink labels the torso, red indicates an unknown object, and blue represents a "possible pedestrian". After a "possible pedestrian" has been identified, the next step is to perform the walking rhythm test. The cutoff thresholds for the power spectrum and the pedestrian confidence level were set to 0.25 and 0.1, respectively. To correctly estimate the walking rhythm of a pedestrian, we observed at least four steps of each target. In other words, the observation lasted at least 2 s, involving at least 60 frames at a frame rate of 30 fps. Among the 40 "possible pedestrians" identified by the pedestrian model, 39 passed the walking rhythm test and were confirmed to be "pedestrians". The only target missed by the walking rhythm test was a pedestrian who was running

instead of walking. As to the occlusion problem mentioned in Section 4, our method was able to deal with it, and the results are shown in Fig. 12. In Fig. 12(a)–(c), two pedestrians walked in opposite directions; the curves under each sub-figure reflect whether occlusion occurred. In this case, occlusion happened in Fig. 12(b). Fig. 12(d)–(f) show two pedestrians walking in the same direction but at different speeds; here, occlusion occurred in Fig. 12(e). In Fig. 12(g)–(i), two pedestrians walked closely together all the time; therefore, they were identified as a pedestrian crowd instead of an occlusion case.

5.2. Classification of pedestrian crowds and vehicles

A pedestrian crowd and a vehicle can be distinguished by the non-rigid body test described in Section 4. We used the proposed contour map as the solution to this problem. As mentioned in Section 4, the measurement of a contour map can be accomplished by computing its corresponding entropy. The cutoff threshold of E_{ratio} used to distinguish a pedestrian crowd from a vehicle was determined statistically. We measured the E_{ratio} values of nine pedestrian crowds and 27 vehicles; their E_{ratio} distributions are illustrated in Fig. 13(a) and (b), respectively. From these statistics, we can easily set the cutoff threshold of E_{ratio} to 3. Based on the thresholds of E_{ratio} and A_{ratio}, the A_{ratio} values of 26 vehicles were larger than the corresponding threshold; the remaining one was a cyclist riding a bicycle.


Fig. 13. The distribution of E_{ratio}: (a) the E_{ratio} distribution of a number of pedestrian crowds; and (b) the E_{ratio} distribution of a number of vehicles.

Fig. 15. The contour map of a pedestrian crowd. The lower half of a pedestrian crowd does not have as many contour points as that of a single pedestrian, but its entropy value is still larger than that of a vehicle.

Fig. 14. Some contour maps of vehicles: (a) the side view of a motorcycle; (b) the back of a vehicle; (c) the front of a motorcycle driver; (d) a turning car; (e) the back of a vehicle; and (f) the side view of a bicycle rider. (a)–(d) are successful examples, while (e) and (f) are failure cases. (f) fails because the contour variation of the cyclist is very large; (e) fails because the vehicle's color is similar to that of the lane markings, so it is divided into several blocks.

Among the 26 qualified vehicles, 23 were successfully identified as "non-pedestrians". The three vehicles that did not pass the entropy test failed because of their colors, which were similar to the colors of the lane markings or the road surface. Fig. 14 illustrates some typical contour maps detected during the experiments. As to the experiments conducted on the nine pedestrian crowds, only one of the nine was identified as a "non-pedestrian". Some typical contour maps detected during these experiments are shown in Fig. 15.

6. Conclusion

We have proposed a vision-based system that detects and tracks pedestrians by combining a pedestrian model with the walking rhythm of pedestrians. The system is able to accomplish these jobs in an efficient and accurate manner. In addition, the occlusion problem, which is the most difficult issue in this task, has been addressed, and the preliminary tests demonstrate the potential of our method. The experimental results have demonstrated that the proposed system is indeed powerful in terms of effectiveness and efficiency.

References

[1] D.M. Gavrila, Sensor-based pedestrian protection, IEEE Intell. Syst. 16 (2001) 77–81.
[2] A. Broggi, M. Bertozzi, A. Fascioli, M. Sechi, Shape-based pedestrian detection, Proceedings of the IEEE Intelligent Vehicles Symposium, USA, 2000, pp. 215–220.
[3] A. Prati, I. Mikić, C. Grana, M.M. Trivedi, Shadow detection algorithms for traffic flow analysis: a comparative study, IEEE Conference on Intelligent Transportation Systems, Oakland, California, 2001, pp. 340–345.
[4] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, W.V. Seelen, Walking pedestrian recognition, IEEE Trans. Intell. Transp. Syst. 1 (3) (2000) 155–163.
[5] C. Papageorgiou, T. Poggio, Trainable pedestrian detection, Proceedings of ICIP, Kobe, Japan, 1999, pp. 25–28.
[6] C. Wöhler, J.K. Anlauf, Real-time object recognition on image sequences with the adaptable time delay neural network algorithm—applications for autonomous vehicles, Image Vision Comput. 19 (9–10) (2001) 593–618.
[7] H. Mori, N.M. Charkari, T. Matsushita, On-line vehicle and pedestrian detections based on sign pattern, IEEE Trans. Ind. Electron. 41 (4) (1994) 384–391.
[8] J. Heikkilä, O. Silvén, A real-time system for monitoring of cyclists and pedestrians, IEEE Workshop on Visual Surveillance, Fort Collins, Colorado, 1999, pp. 74–81.
[9] L.C. Fu, C.Y. Liu, Computer vision based object detection and recognition for vehicle driving, Proceedings of the IEEE International Conference on Robotics and Automation, Seoul, Korea, 2001, pp. 2634–2641.


[10] L. Zhao, C.E. Thorpe, Stereo- and neural network-based pedestrian detection, IEEE Trans. Intell. Transp. Syst. 1 (3) (2000) 148–154.
[11] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, T. Poggio, Pedestrian detection using wavelet templates, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, 1997, pp. 193–199.
[12] O. Masoud, N.P. Papanikolopoulos, A novel method for tracking and counting pedestrians in real-time using a single camera, IEEE Trans. Vehicular Technol. 50 (5) (2001) 1267–1278.
[13] O. Masoud, N.P. Papanikolopoulos, Robust pedestrian tracking using a model-based approach, IEEE Conference on Intelligent Transportation Systems, 1997, pp. 338–343.
[14] S. Yasutomi, H. Mori, S. Kotani, Finding pedestrians by estimating temporal-frequency and spatial-period of the moving objects, Robot. Autonom. Syst. 17 (1996) 25–34.
[15] U. Franke, D. Gavrila, S. Görzig, F. Lindner, F. Paetzold, C. Wöhler, Autonomous driving goes downtown, IEEE Intell. Syst. 13 (1998) 40–48.
[16] V. Philomin, R. Duraiswami, L. Davis, Pedestrian tracking from a moving vehicle, Proceedings of the IEEE Intelligent Vehicles Symposium, Dearborn, MI, USA, 2000, pp. 350–355.
[17] C.J. Pai, H.R. Tyan, Y.M. Liang, H.Y. Mark Liao, S.W. Chen, Pedestrian detection and tracking at crossroads, Proceedings of the International Conference on Image Processing, Barcelona, Spain, September 14–17, 2003.
[18] H.T. Chen, H.H. Lin, T.L. Liu, Multi-object tracking using dynamical graph matching, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, Vol. 2, 2001, pp. 210–217.

About the Author—CHIA-JUNG PAI received the B.Sc. and M.Sc. degrees in Information and Computer Education from National Taiwan Normal University in 1999 and 2002, respectively. In 2002, he joined Chunghwa Telecom Laboratories, Taiwan, Republic of China, as a Research Assistant. The main areas of his research include pattern recognition, image processing, and computer vision.

About the Author—HSIAO-RONG TYAN received the B.S. degree in electronic engineering from Chung-Yuan Christian University, Chung-Li, Taiwan, in 1984, and the M.S. and Ph.D. degrees in Computer Science from Northwestern University, Evanston, IL, in 1987 and 1992, respectively. She is an Associate Professor in the Department of Information and Computer Engineering, Chung-Yuan Christian University, Chung-Li, Taiwan, where she currently conducts research in the areas of computer networks, computer security, and intelligent systems.

About the Author—YU-MING LIANG received the B.S. and M.S. degrees in information and computer education from National Taiwan Normal University in 1999 and 2002, respectively. He is currently a research assistant in the Institute of Information Science, Academia Sinica, Taipei, Taiwan. His areas of research interest include pattern recognition, image processing, and computer vision.

About the Author—HONG-YUAN MARK LIAO received the B.S. degree in physics from National Tsing-Hua University, Hsin-Chu, Taiwan, in 1981, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University, Evanston, IL, in 1985 and 1990, respectively. He was a Research Associate with the Computer Vision and Image Processing Laboratory at Northwestern University during 1990–1991. In July 1991, he joined the Institute of Information Science, Academia Sinica, Taipei, Taiwan, as an Assistant Research Fellow. He was promoted to Associate Research Fellow and then Research Fellow in 1995 and 1998, respectively. From August 1997 to July 2000, he served as the Deputy Director of the institute. Currently, he is the Acting Director of the Institute of Applied Science and Engineering Research. He is also jointly appointed as a professor in the Computer Science and Information Engineering Department and the Computer and Information Science Department of National Chiao-Tung University. His current research interests include multimedia signal processing, wavelet-based image analysis, content-based multimedia retrieval, and multimedia protection. He is now the Managing Editor of the Journal of Information Science and Engineering. He is on the editorial boards of the IEEE Transactions on Multimedia, the International Journal of Visual Communication and Image Representation, Acta Automatica Sinica, and the Tamkang Journal of Science and Engineering. Dr. Liao was the recipient of the Young Investigators' Award from Academia Sinica in 1998, the Excellent Paper Award from the Image Processing and Pattern Recognition Society of Taiwan in 1998 and 2000, and the Paper Award from the same society in 1996 and 1999. He served as the Program Chair of the International Symposium on Multimedia Information Processing (ISMIP'97) and the Program Co-chair of the Second IEEE Pacific-Rim Conference on Multimedia (2001). He will serve as the Conference Co-chair of the 5th International Conference on Multimedia and Expo, to be held in Taipei in July 2004.

About the Author—SEI-WANG CHEN received the B.Sc. degree in Atmospheric and Space Physics and the M.Sc. degree in Geophysics from National Central University in 1974 and 1976, respectively, and the M.Sc. and Ph.D. degrees in Computer Science from Michigan State University, East Lansing, Michigan, in 1985 and 1989, respectively. From 1977 to 1983, he worked as a Research Assistant in the Computer Center of the Central Weather Bureau, Taiwan, Republic of China. In 1990, he was a researcher in the Advanced Technology Center of the Computer and Communication Laboratories at the Industrial Technology Research Institute, Hsinchu, Taiwan, Republic of China. From 1991 to 1994, he was an Associate Professor in the Department of Information and Computer Education at National Taiwan Normal University, Taipei, Taiwan, Republic of China. From 1995 to 2001, he was a full Professor in the same department. He is currently a Professor in the Graduate Institute of Computer Science and Information Engineering at the same university. His areas of research interest include neural networks, fuzzy systems, pattern recognition, image processing, and computer vision.