Person re-identification using appearance classification

Kheir-Eddine Aziz, Djamel Merad, and Bernard Fertil

LSIS - UMR CNRS 6168, 163, Avenue de Luminy, 13288 Marseille Cedex 9, France
[email protected], {djamal.merad,bernard.fertil}@univmed.fr
http://www.lsis.org/image

Abstract. In this paper, we present a person re-identification method based on appearance classification. It consists in comparing human silhouettes by characterizing and classifying each person's appearance (frontal and back) using the geometric distance between the detected head and the camera. The combination of a head detector, the orthogonal iteration algorithm for head pose estimation, and appearance classification is the novelty of our work. In this way, robustness against viewpoint, illumination and clothing changes is achieved. Our approach matches interest-point descriptors with a fast cross-bin metric. It applies to situations where the number of people varies continuously, considering multiple images for each individual.

Keywords: Person re-identification, head detection, head pose estimation, appearance classification, feature matching, cross-bin metric.

1 Introduction

Person re-identification is a crucial issue in multi-camera tracking scenarios where cameras with non-overlapping views are employed. Considering a single camera, tracking captures several instances of the same individual, providing a volume of frames. Re-identification consists in matching different volumes of the same individual coming from different cameras. In the literature, re-identification methods that focus solely on the appearance of the body are dubbed appearance-based methods and can be grouped in two sets. The first group is composed of single-shot methods, which model a person by analyzing a single image [5][14][15]. They are applied when tracking information is absent. The second group encloses multiple-shot approaches; they employ multiple images of a person (usually obtained via tracking) to build a signature [2][4][6][11]. In [2], each person is subdivided into a set of horizontal stripes. The signature is built from the median color value of each stripe accumulated over different frames. A matching between decomposable triangulated graphs, capturing the spatial distribution of local temporal descriptions, is presented in [4]. In [6], a signature composed of a set of SURF interest points collected over short video


sequences is employed. In [11], each person is described by local and global features, which are fed into a multi-class SVM for recognition. Other approaches simplify the problem by adding temporal reasoning on the spatial layout of the monitored environment in order to prune the candidate set to be matched [7], but these cannot be considered purely appearance-based approaches. In this paper, we present a novel multiple-shot person re-identification method based on classifying each person's appearance into two classes, frontal and back. A pre-processing step is integrated into the single-camera tracking phase to classify the appearance of the persons. Then, complementary aspects of the human body appearance are extracted from each class of images as a set of SIFT, SURF and Spin image interest points. For the matching, we use a robust and fast cross-bin metric based on the Earth Mover's Distance (EMD) variant presented in [12]. Our method reduces confusion between appearances due to illumination and viewpoint changes (see Figure 1). The rest of the paper is organized as follows. Sec. 2 details our approach. Several results are reported in Sec. 3 and, finally, conclusions are drawn in Sec. 4.

Fig. 1. Examples of interest-point matching for a given person seen from the same viewpoint (a, c) and from different viewpoints (b); matching is better with appearance classification.

2 The proposed method

The proposed system is illustrated in Figure 2. In the first phase, the appearance class is built for each detected and tracked person. In the second phase, the appearance classification of each person is performed by calculating the head-camera distance. In the third phase, features are accumulated from each appearance class. Finally, the signatures are matched using the cross-bin metric.

2.1 The appearance classification

The single-camera tracking output usually consists in a sequence of consecutive images of each individual in the scene. The appearance of a person is determined by calculating the distance between the detected head and the camera. For head detection, we adopt our method presented in [3] (see Figure 3). If this


Fig. 2. The person re-identification system.

distance decreases, we speak of a frontal pose; otherwise, of a back pose (see Figure 4-b). We choose to calculate the distance between heads and the camera because, in crowded environments, it is easier to detect people's heads than the people themselves. This advantage is discussed in [3].

Fig. 3. Example of head detection in a crowded environment [3].

To calculate this distance, we define the head coordinate system {Xh, Yh, Zh} and a planar head model. The four corners of the head model in the head coordinate system are called A, B, C and D, as shown in Figure 4-a. The size of the head model is assumed to be 20 cm × 20 cm. The second coordinate system is the camera coordinate system {Xc, Yc, Zc}; the camera is assumed calibrated. The last one is the image coordinate system {U, V}. The points a, b, c, d are the image-plane coordinates of the corresponding corners of the detected head. Computing the distance amounts to finding the rigid transformation (R, T) from the head frame H to the camera frame C (Equation (1)).

\[
\begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix} = [R \mid T] \begin{pmatrix} X_h \\ Y_h \\ Z_h \end{pmatrix} \qquad (1)
\]

We use the Orthogonal Iteration (OI) algorithm proposed by Lu et al. [10] to estimate the head pose. This algorithm uses an appropriate error function defined in the head's local reference frame. The error function is rewritten so as to admit an iteration based on the classical solution of a 3D pose estimation problem known as the absolute orientation problem. The algorithm gives accurate results and converges quickly.

Fig. 4. (a) Head pose estimation. (b) Different appearances of one person.

If the component Tz of the translation vector T decreases over time (see Figure 4-a and Algorithm 1), the person is in the frontal pose; otherwise the person is in the back pose. So that the comparison between two consecutive values of Tz is significant, we do not use every successive frame, but instead images sampled every half second (time-spaced images). In the case of a profile pose, the global appearance of a person does not change much compared to the frontal or back pose; therefore, we treat this pose as frontal or back.
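Under this rule, the per-frame labeling can be sketched as follows (a hypothetical helper; the Tz values are assumed to come from the pose estimates, sampled every half second as described above):

```python
def classify_appearance(tz_samples):
    """Label each pair of consecutive half-second samples by comparing
    their head-camera distances Tz: Tz decreasing means the person
    approaches the camera (frontal pose), otherwise back pose."""
    return ["frontal" if tz_samples[i] > tz_samples[i + 1] else "back"
            for i in range(len(tz_samples) - 1)]
```

For example, a person walking toward the camera and then turning away yields frontal labels followed by back labels.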

2.2 Descriptors

In the following, we briefly describe the SIFT [9], SURF [1] and Spin image [8] descriptors, which offer scale- and rotation-invariant properties. SIFT (Scale Invariant Feature Transform) descriptors are computed for normalized image patches with the code provided by Lowe [9]. A descriptor is a 3D histogram of gradient location and orientation, where location is quantized into a 4 × 4 grid and the gradient angle is quantized into eight orientations. The resulting descriptor has dimension 128. The SURF (Speeded Up Robust Features) descriptor [1] is 64-dimensional and also focuses on the spatial distribution of gradient information within the



Algorithm 1 Distance head-camera for one person
Require: internal parameters of the calibrated camera
Require: heads_list: list of the tracked heads for one person
Require: [Xh, Yh, Zh] = [(0, 20, 0); (20, 20, 0); (20, 0, 0); (0, 0, 0)]: the coordinates of the four corners of the head model in the head coordinate system
Require: [a, b, c, d]: the four corners of the detected head in the image plane
1: for i = 0 to heads_list.size do
2:   [Ri, Ti] ← OI([Xci, Yci, Zci], [Xh, Yh, Zh])
3:   [Ri+1, Ti+1] ← OI([Xci+1, Yci+1, Zci+1], [Xh, Yh, Zh])
4:   if Tzi > Tzi+1 then
5:     Appearance_Person_i ← Frontal pose
6:   else
7:     Appearance_Person_i ← Back pose
8:   end if
9: end for

interest point neighborhood; the interest points themselves can be localized by an interest point detector or on a regular grid. The Spin image is a histogram of quantized pixel locations and intensity values [8]. The intensity of a normalized patch is quantized into 10 bins, and a 10-bin normalized histogram is computed for each of five rings centered on the region. The dimension of the spin descriptor is 50.
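For concreteness, the spin-image descriptor just described (10 intensity bins × 5 rings = 50 dimensions) can be sketched in numpy. This is an illustrative sketch, assuming the patch intensities are normalized to [0, 1]; it is not the authors' implementation.

```python
import numpy as np

def spin_image_descriptor(patch, n_rings=5, n_bins=10):
    """50-dim spin-image descriptor of a normalized grayscale patch:
    a 10-bin intensity histogram for each of 5 concentric rings [8]."""
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    # assign each pixel to one of n_rings equal-width radial rings
    ring = np.minimum((r / (r.max() + 1e-9) * n_rings).astype(int), n_rings - 1)
    # quantize intensities (assumed in [0, 1]) into n_bins levels
    inten = np.minimum((patch * n_bins).astype(int), n_bins - 1)
    desc = np.zeros((n_rings, n_bins))
    for k in range(n_rings):
        hist = np.bincount(inten[ring == k], minlength=n_bins).astype(float)
        s = hist.sum()
        desc[k] = hist / s if s > 0 else hist  # per-ring normalization
    return desc.ravel()  # shape (n_rings * n_bins,) = (50,)
```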

2.3 Feature Matching

In general, we have two sets of pedestrian images: a gallery set A and a probe set B. Re-identification consists in associating each person of B with the corresponding person of A. This association depends on the content of the two sets: 1) each image represents a particular individual appearance (frontal or back); 2) both A and B may contain the same individual appearance (frontal or back). For the matching, several measures have been proposed for the dissimilarity between two descriptors. We divide them into two categories. Bin-by-bin dissimilarity measures only compare the contents of corresponding vector bins. Cross-bin measures also contain terms that compare non-corresponding bins; to this end, a cross-bin distance makes use of the ground distance dij, defined as the distance between the representative features of bin i and bin j. Predictably, bin-by-bin measures are more sensitive to the position of bin boundaries. The Earth Mover's Distance (EMD) [13] is a cross-bin distance that addresses this alignment problem. EMD is defined as the minimal cost that must be paid to transform one vector into the other, given a ground distance between the basic features aggregated into the vectors. Pele et al. [12] proposed a linear-time algorithm for the computation of an EMD variant with a robust ground distance for oriented gradients. Given two histograms P, Q, the EMD as defined by Rubner et al. [13] is:

\[
EMD(P, Q) = \min_{\{f_{ij}\}} \frac{\sum_{i,j} f_{ij} d_{ij}}{\sum_{i,j} f_{ij}}
\quad \text{s.t.} \quad
\sum_j f_{ij} \le P_i, \;\; \sum_i f_{ij} \le Q_j, \;\; \sum_{i,j} f_{ij} = \min\Big(\sum_i P_i, \sum_j Q_j\Big), \;\; f_{ij} \ge 0 \qquad (2)
\]
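As a sanity check on Eq. (2): for two normalized 1D histograms with ground distance d_ij = |i − j|, the minimization has a closed form, the L1 distance between cumulative histograms. This special case is illustrative only; the paper uses the linear-time variant of [12], not this formula.

```python
import numpy as np

def emd_1d(P, Q):
    """EMD of Eq. (2) in the special case of two normalized 1D
    histograms with ground distance |i - j|: it reduces to the L1
    distance between the cumulative histograms of P and Q."""
    return float(np.abs(np.cumsum(P) - np.cumsum(Q)).sum())
```

Moving one unit of mass across two bins costs 2, as expected from the ground distance.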

where fij denotes the flow: each fij represents the amount transported from the i-th supply to the j-th demand, and dij denotes the ground distance between bin i and bin j of the histograms. Pele et al. [12] proposed the variant:

\[
\widehat{EMD}_{\alpha}(P, Q) = \Big(\min_{\{f_{ij}\}} \sum_{i,j} f_{ij} d_{ij}\Big) + \Big|\sum_i P_i - \sum_j Q_j\Big| \cdot \alpha \max_{i,j}\{d_{ij}\} \qquad (3)
\]

It is common practice to use the L2 metric for comparing SIFT descriptors. This practice assumes that the SIFT histograms are aligned, so that a bin in one histogram is only compared to the corresponding bin in the other histogram. This is often not the case, due to quantization, distortion, occlusion, etc. The distance of [12] has three matching costs instead of two: zero cost for exactly corresponding bins, one for neighboring bins and two for farther bins and for the extra mass. Thus the metric is robust to small errors and outliers. \(\widehat{EMD}\) has two advantages over Rubner's EMD for comparing SIFT descriptors. First, the difference in total gradient magnitude between SIFT spatial cells is an important distinctive cue, which is ignored under Rubner's definition. Second, \(\widehat{EMD}\) is a metric even for non-normalized histograms. In our work, we use the same metric to match the SIFT, SURF and Spin image descriptors. Finally, the association of each person Pi with the corresponding person Pj is done by a voting approach: every interest point extracted from the set Pi is compared to all model points of person Pj, and a vote is added for each model containing a close enough descriptor. We only match interest points of the same appearance class. Eventually, the identification selects the model with the highest number of votes (see Figure 5). Two descriptors are matched if the first is the nearest neighbor of the second and the distance ratio between the first and second nearest neighbors is below a threshold.
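The voting scheme with the nearest-neighbor ratio test can be sketched as follows. The function and parameter names are hypothetical, and for brevity the sketch accepts an arbitrary descriptor distance where the paper uses the \(\widehat{EMD}\) metric; the brute-force nearest-neighbor search is also a simplification.

```python
import numpy as np

def match_and_vote(query_descs, model_descs, dist_fn, ratio=0.8):
    """Vote-based re-identification: each query descriptor votes for the
    model holding its nearest neighbor, provided the nearest/second-nearest
    distance ratio is below `ratio` (Lowe-style ratio test)."""
    votes = {name: 0 for name in model_descs}
    for q in query_descs:
        # score every model descriptor against the query descriptor
        scored = sorted((dist_fn(q, d), name)
                        for name, descs in model_descs.items() for d in descs)
        best_dist, best_name = scored[0]
        second_dist = scored[1][0]
        if best_dist < ratio * second_dist:  # accept unambiguous matches only
            votes[best_name] += 1
    return max(votes, key=votes.get)  # identity with the most votes
```

In practice the comparison is restricted to descriptors of the same appearance class (frontal with frontal, back with back), as described above.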

3 Experimental Results

To our knowledge, there is no available benchmark for the evaluation of multiple-shot, appearance-based person re-identification with non-overlapping cameras. We therefore conducted a first experimental evaluation of our method on an available series of videos showing persons recorded with three non-overlapping cameras. These videos include images of the same 9 persons seen by the three cameras under different viewpoints¹ (see Figure 6).

¹ http://kheir-eddine.aziz.perso.esil.univmed.fr/demo


Fig. 5. Person re-identification process.

The re-identification performance is evaluated with the precision-recall metric:

\[
\text{Precision} = \frac{\text{correct matches}}{\text{correct matches} + \text{false matches}}, \qquad \text{Recall} = \frac{\text{correct matches}}{\text{number of queries}} \qquad (4)
\]
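Eq. (4) translates directly into code (a trivial helper, named here only for illustration):

```python
def precision_recall(correct_matches, false_matches, n_queries):
    """Eq. (4): precision over returned matches, recall over all queries."""
    precision = correct_matches / (correct_matches + false_matches)
    recall = correct_matches / n_queries
    return precision, recall
```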

To evaluate the overall performance of our method, we select the set of persons detected by camera one as test data (queries) and the sets of persons detected by the other cameras as reference data (models). The model for each person was built with 8 images (4 frontal-appearance and 4 back-appearance images collected during tracking). Likewise, each query was built with 8 images (see Figure 5). Figure 7 evaluates the number of matched interest points. We observe that more interest points are matched with appearance classification, which shows that classifying appearances greatly reduces the impact of illumination and appearance changes. The resulting performance, computed on a total of 100 queries from camera 1, is illustrated in Figure 8-a, in which we vary the minimum number of matched points (15, 25, 35, 45, 55, 65) between query and model required to validate a re-identification. The precision is higher for the three descriptors


Fig. 6. Typical views of the persons used for the test. Lines 1, 2 and 3 are the sets of persons viewed by camera 1, camera 2 and camera 3, respectively.

Fig. 7. The number of matched interest points with and without appearance classification (with the SURF descriptor).

Fig. 8. (a) Precision-recall for our person re-identification method. (b) Precision vs. recall curves for person re-identification according to the method proposed in [6].

(SIFT, SURF, Spin image) than for the approach proposed by Hamdoun et al. in [6] (see Figure 8-b); there are fewer false matches, which are mainly due to similarity in appearance. For example, the legs of person 8 are similar to those of persons 5, 7 and 9 (see Figure 9), and part of the torso of person 1 is similar to the legs of persons P7 and P8. This problem could be avoided by using a 2D articulated body model.



Fig. 9. Confusion matrices for the person re-identification experiment with different descriptors: (a) SURF, (b) SIFT, (c) Spin image.

4 Conclusions

In this paper, we proposed a new person re-identification method based on appearance classification. It consists in classifying people into two appearance classes by calculating the geometric distance between their heads and the camera. Thanks to head detection by skeleton graph [3], calculating the distance between a person and the camera is easy even in crowded environments. Employing this classification, we obtained improved re-identification performance. As future work, we plan to extend our method with a 2D articulated body model to reduce confusion cases. Though at a preliminary stage, we are releasing the person database and our code.


References

1. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Computer Vision and Image Understanding 110(3), 346–359 (2008)
2. Bird, N.D., Masoud, O.T., Papanikolopoulos, N.P., Isaacs, A.: Detection of loitering individuals in public transportation areas. IEEE Transactions on Intelligent Transportation Systems 6(2), 167–177 (2005)
3. Merad, D., Aziz, K.E., Thome, N.: Fast people counting using head detection from skeleton graph. In: IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 151–156 (2010)
4. Gheissari, N., Sebastian, T.B., Hartley, R.: Person reidentification using spatiotemporal appearance. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1528–1535 (2006)
5. Gray, D., Tao, H.: Viewpoint invariant pedestrian recognition with an ensemble of localized features. In: Proceedings of the 10th European Conference on Computer Vision, pp. 262–275. Springer, Berlin, Heidelberg (2008)
6. Hamdoun, O., Moutarde, F., Stanciulescu, B., Steux, B.: Person re-identification in multi-camera system by signature based on interest point descriptors collected on short video sequences. In: ACM/IEEE International Conference on Distributed Smart Cameras, pp. 1–6 (2008)
7. Javed, O., Shafique, K., Rasheed, Z., Shah, M.: Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views. Computer Vision and Image Understanding 109(2), 146–162 (2008)
8. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(5), 433–449 (1999)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
10. Lu, C.P., Hager, G.D., Mjolsness, E.: Fast and globally convergent pose estimation from video images. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(6), 610–622 (2000)
11. Nakajima, C., Pontil, M., Heisele, B., Poggio, T.: Full-body person recognition system. Pattern Recognition 36(9), 1997–2006 (2003)
12. Pele, O., Werman, M.: A linear time histogram metric for improved SIFT matching. In: Proceedings of the 10th European Conference on Computer Vision: Part III, pp. 495–508. Springer, Berlin, Heidelberg (2008)
13. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2), 99–121 (2000)
14. Schwartz, W.R., Davis, L.S.: Learning discriminative appearance-based models using partial least squares. In: Brazilian Symposium on Computer Graphics and Image Processing, pp. 322–329 (2009)
15. Zheng, W., Gong, S., Xiang, T.: Associating groups of people. In: British Machine Vision Conference, pp. 1–6 (2009)