arXiv:1607.05369v1 [cs.CV] 19 Jul 2016

A Multi-task Deep Network for Person Re-identification

Weihua Chen1, Xiaotang Chen1, Jianguo Zhang2, Kaiqi Huang1

1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
2 Computing, School of Science and Engineering, University of Dundee

Abstract. Person re-identification (RID) focuses on identifying people across different scenes in video surveillance, and is usually formulated as either a binary classification task or a ranking task in current approaches. To the best of our knowledge, no existing work treats the two tasks simultaneously. In this paper, we take both tasks into account and propose a multi-task deep network (MTDnet) that jointly optimizes the two tasks for person RID. We show that the proposed architecture significantly boosts performance. Furthermore, good performance of any deep architecture requires a sufficiently large training set, a condition usually not met in person RID. To cope with this situation, we extend the MTDnet and propose a cross-domain architecture capable of using an auxiliary dataset to assist training on small target sets. In the experiments, our approach significantly outperforms previous state-of-the-art methods on almost all datasets, which clearly demonstrates its effectiveness.

Keywords: multi-task, person re-identification, deep learning

1 Introduction

Person re-identification (RID) is an important task in wide-area video surveillance. The key challenge is the large appearance variation, usually caused by significant changes in human body pose, illumination and camera view. Recently, deep learning approaches [1,2,3,4,5] have been successfully employed in person RID with significant performance gains, especially on large datasets such as CUHK03. Most deep learning methods [1,2,3] treat the problem as binary classification and adopt a classification loss (e.g. a softmax loss) to train their models. The core idea behind these approaches is to learn an identifiable feature for each pair on which to base the classification. Though they manage to maximize the overall classification accuracy over image pairs, there is no mechanism to guarantee that positive pairs hold shorter within-pair distances than negative pairs, a property preferred for person RID. In these methods, only pairs of images are compared; the relative information between pairs is scarce. An example is shown in Fig. 1 (a). Cases 1 and 2 illustrate two projected distributions of scores obtained by trained binary classifiers over image pairs containing images from three persons (persons A, B and C).

Fig. 1. Problems in the two tasks. (a) Classification issue: the classification loss prefers a model with a lower misclassification rate, like Case 2, over Case 1. (b) Ranking issue: the appearance of the top-ranked images is more similar to the query image, while the true positive presents a much less similar appearance. (Best viewed in color; see main text for a detailed explanation.)

For each pair, the score underneath denotes the dissimilarity between its two images (e.g., 1 − softmax). Query:X indicates that an image from person X is used as the query image (the left image in a pair); for example, Query:A means an image from person A is the query. A green rectangle indicates a matched pair and a red rectangle a mismatched pair, both at the ground-truth level. In Case 1, it is evident that for each query image (w.r.t. one particular person) we get the correct rank-1 match, i.e. the two images within its positive pairs always hold a smaller dissimilarity value than those within its negative pairs. However, in this case it is very difficult for a classifier to determine a threshold that yields a low misclassification cost (e.g., fewer than two misclassified samples). On the contrary, in Case 2, where the vertical dashed line denotes the decision threshold learned by the classifier, the classifier has a lower misclassification rate. As a result, a binary classifier (e.g., trained with softmax) will favor Case 2 over Case 1, as the classification loss in Case 2 (one misclassified sample) is lower than that in Case 1. But for RID we prefer Case 1, which outputs correct rank-1 matches for all three persons (if each is queried), over Case 2, which contains a false rank-1 match (highlighted by an orange circle). Case 2 could potentially be rectified by a ranking loss. As person RID commonly uses the Cumulative Matching Characteristic (CMC) curve, which follows a rank-n criterion, for performance evaluation, some deep learning approaches [4] have begun to treat person RID as a ranking task, similar to image retrieval, and apply a ranking loss (e.g. a triplet loss). The main purpose is to keep positive pairs at shorter relative distances in the projected space. However, person RID differs from image retrieval in that it needs to identify the same person across different scenes (i.e., it is a task of predicting positive and negative pairs, focusing on identifiable feature learning, and a positive pair is not necessarily the most similar pair in appearance).

Ranking-based approaches are sensitive to their similarity measurements. Current measurements (e.g. the Euclidean distance in the triplet loss) focus on similarity to the query image in appearance. In the projection space obtained by a model trained with the triplet loss, it is very challenging to find a true positive that holds a less similar appearance, as shown in Fig. 1 (b). Current person RID approaches based on either the classification task or the ranking task thus have their own advantages and shortcomings. This paper proposes an alternative solution that better addresses person RID by taking on both tasks together: the ranking loss encourages a relative distance constraint, while the classification loss trains an identifiable feature for each pair for similarity measurement. We jointly optimize the two tasks in one deep network, in consideration of their positive correlation and complementarity. The architecture of the classification task in the proposed network also differs from other classification-based networks [1,2,3]. Ahmed et al. [2] present an improved deep architecture that learns both the feature extraction and the similarity measurement, and has shown the best results among deep-learning-based methods. However, both DeepReID [1] and Ahmed et al. [2] incorporate manual operations in their networks as prior knowledge. The effectiveness of such prior knowledge depends on the reliability of empirical knowledge, which partly limits the applicability of these knowledge-dependent networks. One purpose of this paper is to develop an approach capable of learning fully data-driven features without introducing prior knowledge (i.e., prior-knowledge free). In our classification task, joint feature maps are introduced to represent the relationship of paired person images. The joint feature maps are learned entirely from training data, whereas Ahmed's network [2] is based on the local positional difference of the two sets of feature maps; our approach therefore gives the network more freedom to learn the relationship between the two raw images. Deep learning approaches such as convolutional neural networks (CNNs) [6] benefit greatly from large-scale datasets (e.g., ImageNet). This is not the case in person RID: most current datasets with matched image pairs are of limited size, e.g., CUHK01 [7], VIPeR [8], iLIDS [9], PRID [10], which hinders attempts to maximize the learning potential of our proposed network on each of those datasets. This can be mitigated by using auxiliary datasets. However, the variations across camera views differ from dataset to dataset; as a consequence, the data of an auxiliary dataset cannot be used directly to learn the joint feature maps on small datasets. In this paper, the problem is treated as a semi-supervised cross-domain issue [11,12]. The target domain is the small dataset that contains only a few samples, and the source domain is an auxiliary dataset that is large enough for training CNN models. As person RID can be considered a binary classification problem, our purpose is to keep the samples of the same class in different domains closer.

A cross-domain architecture is further proposed to minimize the difference between the joint feature maps of the two datasets that belong to the same class of pairs (i.e., positive pairs and negative pairs), and to utilize the joint feature maps of the auxiliary dataset to fine-tune those of the small datasets during training. In this way, the joint feature maps of the small datasets are improved with the data of the auxiliary dataset, boosting the RID performance on smaller target sets. In summary, our contributions are four-fold: 1) a novel multi-task deep network for person re-identification, which jointly optimizes the classification and ranking losses (the source code and models are available at https://github.com/cwhgn/MTDnet); 2) effective joint feature maps introduced in the classification task without the need for prior knowledge; 3) a cross-domain architecture based on a contrastive loss over the joint feature maps for handling the challenge of limited training data; 4) a comprehensive evaluation of our methods on five datasets, showing superior performance over most state-of-the-art methods. The rest of the paper is organized as follows. Related work is introduced in Section 2. The proposed multi-task network, including the joint feature maps and the cross-domain architecture, is described in Section 3. The experimental results and conclusions are presented in Sections 4 and 5.

2 Related work

Most existing methods for person RID focus on either feature extraction [13,14,15] or similarity measurement [16,17,18,19,20,21,22,23]. Commonly used person image descriptors include color histograms [16,17,24], local binary patterns [16,17,24] and Gabor features [17], which show a certain robustness to variations in pose, illumination and viewpoint. For similarity measurement, many metric learning approaches have been proposed, such as Mahalanobis metric learning [16], locally adaptive decision functions [18], local Fisher discriminant analysis [20], correspondence structure learning [21] and cross-view quadratic discriminant analysis [22]. A few of them [24,25] learn a combination of multiple metrics. However, manually crafted features and designed metrics require empirical knowledge and are usually not optimal for coping with large intra-person variations. With the development of deep learning and the increasing availability of datasets, handcrafted features and metrics struggle to keep top performance, especially on datasets of a very large scale. Alternatively, deep learning has been applied to person RID to automatically learn features or metrics, or both [1,2,3,4,5]. Some of these works [4,5] consider person RID a ranking issue; for example, Ding et al. [4] use a triplet loss to capture the relative distances between images. Other approaches [1,2,3] tackle the problem from the classification side. For instance, Yi et al. [3] utilize a siamese convolutional neural network to train a data-driven feature representation and employ a cosine layer for matching. Li et al. [1] design a deep filter pairing neural network to solve the re-identification problem. Both employ a classification loss to train their models. Our network combines the two tasks (classification loss and ranking loss) and exploits their complementarity during training. The multi-task loss concept has recently been explored in Fast R-CNN [26], but on the different problem of object detection and with a different combination of losses: a classification loss and a regression loss. Moreover, even as a classification task, current networks [1,2,3] still contain domain-driven operations. DeepReID [1] divides images into several horizontal stripes before training. Ahmed et al. [2] provide an improved deep learning architecture in which most of the network is data-driven, but in its third layer a subtraction over small patches is used to compute cross-input neighborhood differences. The intuition that images should be compared by local positional differences is hypothetical prior knowledge, imposed manually. Instead, we seek to enable the network to learn how to compare the two images, rather than comparing them via a positional difference map; we demonstrate the superiority of this choice in our experiments. It is also worth noting that none of the works above in person RID seeks to solve the problem of learning a deep net on a small dataset, which is a typical case in person RID. This paper further addresses this issue by proposing a cross-domain deep architecture capable of learning across RID datasets.

3 The proposed network

In this section, the details of our network are described. Section 3.1 introduces the design of the multi-task network, Section 3.2 presents an analysis of the joint feature maps, and Section 3.3 presents the cross-domain architecture.

3.1 The multi-task network

The framework of our proposed multi-task network is shown in Fig. 2. It contains two parts: a classification block and a ranking block. Following current methods [4], this paper applies the ranking loss to co-reinforce the learning of the feature representation in the convolutional layers, whilst the classification architecture is used to identify whether an image pair belongs to the same person (i.e., a binary classification problem). Wang et al. [27] have shown that higher layers in a deep network capture semantic concepts on object categories, whereas lower layers encode more discriminative features that capture intra-class variations. In our model, we employ the classification loss to train identifiable features of each pair for the binary classification, and the ranking loss to co-train the discriminative features with relative distances. Therefore, in our network the ranking loss is placed after the lower layers, while the classification loss is applied after the top layer (we are aware that other designs are possible; investigating all of them is beyond the scope of this paper).

Fig. 2. The framework of the proposed multi-task deep network and the cross-domain architecture. Best viewed in color. Note that the partition of the multi-task network into a classification block and a ranking block is purely for clarity of explanation: the two blocks share the first two layers and are jointly optimized. The cross-domain architecture is used only when an auxiliary dataset is needed for training.

The ranking block is a triplet-input model. For each positive pair, we produce ten triplets (a positive pair plus a negative image: A1, A2, B2, where A, B denote person IDs and 1, 2 denote camera IDs). All these triplets constitute our training data. Each input triplet contains three images of size 3 × 224 × 224. The ranking task includes the two convolutional layers at the beginning of the network, which reinforce the learning of discriminative features. After these two layers, the three sets of feature maps, each of size 256 × 13 × 13, are sent to a triplet loss through a shared fully connected layer. The triplet loss being minimized is the same as in FaceNet [28]:

$L_{trp} = \sum_{i=1}^{N} \left[ \|f_{A_1} - f_{A_2}\|_2^2 - \|f_{A_1} - f_{B_2}\|_2^2 + \alpha \right]_+$   (1)

where α is a margin enforced between positive and negative pairs, N is the number of triplets, and f ∈ R^512 denotes the features of the three images input to the triplet loss.
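To make the loss concrete, the following is a minimal NumPy sketch of Eq. 1 over a batch of triplets; the function and variable names, the margin value, and the toy data are illustrative and not taken from the paper's released code.

```python
import numpy as np

def triplet_loss(f_a1, f_a2, f_b2, alpha=0.2):
    """Hinge-style triplet loss of Eq. 1 over a batch of triplets.

    f_a1, f_a2 : (N, 512) embeddings of the matched pair (same person,
                 cameras 1 and 2); f_b2 : (N, 512) embeddings of the
                 negative images. alpha is the margin (illustrative value).
    """
    d_pos = np.sum((f_a1 - f_a2) ** 2, axis=1)  # ||f_A1 - f_A2||^2
    d_neg = np.sum((f_a1 - f_b2) ** 2, axis=1)  # ||f_A1 - f_B2||^2
    return np.sum(np.maximum(0.0, d_pos - d_neg + alpha))  # [.]_+ summed over N

# Toy usage with random 512-d embeddings.
rng = np.random.default_rng(0)
f_a1, f_a2, f_b2 = (rng.normal(size=(10, 512)) for _ in range(3))
print(triplet_loss(f_a1, f_a2, f_b2))
```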

Minimizing the triplet loss preserves the relative-distance information between the input images. The feature maps learned with the triplet loss are also more discriminative for further matching, which eliminates noise and appearance variations from different camera views to some extent. In the classification part, the input to the third convolutional layer is a set of feature maps of an image pair. The three sets of aligned feature maps of size 256 × 13 × 13 from the ranking task are regrouped into two types of pairs, a positive pair and a negative pair. The feature maps of the two images with the same person ID, i.e. (A1, A2), are concatenated as the positive pair, while those of one image of the matched pair (A1) and the negative image (B2) are stacked to form the negative pair; in both cases the two images come from different camera views. The feature maps of each pair have size 512 × 13 × 13. The two pairs are fed through three further convolutional layers, one pair at a time. The feature maps learned by these layers are called the joint feature maps; they are derived from each input pair and encode the relationship between the two images. They are then sent to the fully connected layers to compute the similarity. The joint feature maps hold the identifiable information of the input image pair, and we use them to decide whether the pair shows the same person. The classification loss in our network is the binary logistic regression loss, the same as the binary softmax loss in [1,2]:

$L_{cls} = -\sum_{i=1}^{N} \left[ (1 - y) \log p_0 + y \log p_1 \right]$   (2)

where y ∈ {0, 1}: when the input pair is a positive pair (e.g. (A1, A2)), y = 1; conversely, y = 0 for a negative pair (e.g. (A1, B2)). p = {p0, p1} is the discrete probability distribution over the two categories.

Fig. 3. Visualization of some channels of a randomly chosen kernel of the third convolutional layer. The upper row shows channels selected randomly from the first 256 channels, which convolve the feature maps of the first image; the lower row shows the corresponding channels among the latter 256, which convolve those of the second image.

The multi-task loss L for each input triplet, jointly training ranking and classification, is:

$L = (1 - w) L_{trp} + 2w L_{cls}$   (3)

where w is the weight regulating the effect of the ranking and classification tasks, set to 0.5 in this paper. The factor 2 comes from the fact that one triplet produces two pairs (a positive and a negative). Our five convolutional layers extend the architecture of AlexNet [29], differing in that each kernel in the third convolutional layer has size 512 × 3 × 3 instead of the 256 × 3 × 3 used in AlexNet. In the training phase, the triplet loss optimizes the first two convolutional layers, while the classification loss simultaneously trains all five convolutional layers including those two. In other words, the kernels of the first two layers are jointly optimized by both losses to extract a discriminative feature for each image, while the remaining three layers are trained mainly by the classification loss to obtain an identifiable feature for image pairs for the binary person identification. In the test phase, only the classification architecture (including the first two layers) is used: the two input images are sent through the five convolutional layers and three fully connected layers, with the last layer predicting the goodness of match of the test pair.
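As a sketch of how Eqs. 2 and 3 combine during training, the snippet below assumes the standard log form of the binary logistic loss (the paper names it the binary logistic regression loss) and w = 0.5; the helper names are illustrative only.

```python
import numpy as np

def classification_loss(p, y):
    """Eq. 2: binary logistic loss over N pairs.
    p : (N, 2) softmax outputs (p0, p1); y : (N,) pair labels (1 = positive)."""
    eps = 1e-12  # numerical guard, not part of Eq. 2
    return -np.sum((1 - y) * np.log(p[:, 0] + eps) + y * np.log(p[:, 1] + eps))

def multi_task_loss(l_trp, l_cls, w=0.5):
    """Eq. 3: L = (1 - w) * L_trp + 2w * L_cls. The factor 2 accounts for
    each triplet yielding two pairs (one positive, one negative)."""
    return (1 - w) * l_trp + 2 * w * l_cls
```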

3.2 Analysis of joint feature maps

Ahmed et al. [2] make the assumption that the local difference of two images can be obtained from a positional subtraction over small patches, and add this as prior knowledge to their network via Eq. 4:

$K_i(x, y) = f_i(x, y)\,\mathbf{1}(5, 5) - \mathcal{N}[g_i(x, y)]$   (4)

where $\mathbf{1}(5, 5) \in \mathbb{R}^{5 \times 5}$ is a 5 × 5 matrix of ones, and $\mathcal{N}[g_i(x, y)] \in \mathbb{R}^{5 \times 5}$ is the 5 × 5 neighborhood of $g_i$ centered at (x, y).

In [2], the patch size is 5 × 5, $f_i(x, y)$ and $g_i(x, y)$ are corresponding patch pairs, and the 5 × 5 matrix $K_i(x, y)$ is the difference of the two 5 × 5 matrices. Our method instead attempts to learn the whole classification task purely from training data. As seen in Fig. 2, if this subtraction process is effective and can be learned from the training data, it should be reflected in the kernels learned in the third convolutional layer. As noted above, the kernels in the third convolutional layer have size 512 × 3 × 3: the first 256 channels of each kernel convolve the set of feature maps from the first image, while the latter 256 channels convolve those from the second image. The operation in our network corresponding to Eq. 4 is

$K_i(x, y) = \mathcal{N}[f_i(x, y)]\,k_i^{up}(3, 3) + \mathcal{N}[g_i(x, y)]\,k_i^{down}(3, 3)$   (5)

where $k_i^{up}$ is the first 256 channels of kernel $k_i$, and $k_i^{down}$ is the latter 256 channels.
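The split structure of Eq. 5 can be checked numerically: the response of one 512-channel kernel on the stacked maps equals the sum of the responses of its two 256-channel halves on the individual map sets. A minimal single-position check follows; the shapes match the paper's third conv layer, but the names and random data are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=(256, 3, 3))   # 3x3 neighborhood of the first image's maps
g = rng.normal(size=(256, 3, 3))   # same neighborhood of the second image's maps
k = rng.normal(size=(512, 3, 3))   # one kernel of the third conv layer
k_up, k_down = k[:256], k[256:]    # halves applied to f and g respectively

# Response at one spatial position: full stacked convolution...
full = np.sum(np.concatenate([f, g], axis=0) * k)
# ...equals the two-term decomposition of Eq. 5.
split = np.sum(f * k_up) + np.sum(g * k_down)
assert np.allclose(full, split)
```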

Comparing the two equations, Eq. 5 can behave like Eq. 4 as long as $k_i^{up}(3, 3)$ and $k_i^{down}(3, 3)$ are set to $\mathbf{1}(5, 5)$ and $-\mathbf{1}(5, 5)$ respectively. Fig. 3 visualizes some sample channels of one kernel of the third convolutional layer trained on the CUHK03 dataset. Most corresponding locations in the upper channels ($k^{up}$) and lower channels ($k^{down}$) have opposite colors, indicating that they hold opposite signs and that the kernel mainly performs a subtraction-like operation in Eq. 5. Thus our network has partly learned this prior knowledge on its own. The result implies that the effect of the learned operation is similar to a subtraction, suggesting that person RID on the CUHK03 dataset indeed calls for a subtraction-like operation. This does not mean that our network fails to learn anything more useful than a subtraction in a general sense; on the contrary, it demonstrates that our network is capable of learning the subtraction-like operation when necessary. The superior results over Ahmed's method [2] on other datasets in Section 4 indicate that what our network learns goes beyond the subtraction-like operation. Fig. 4 (a) and (b) visualize the joint feature maps in the fifth convolutional layer for a positive pair (Fig. 4 (a)) and a negative pair (Fig. 4 (b)). There are 256 joint feature maps for each pair, and we show them all. The negative pair has higher responses than the positive pair in most joint feature maps; the two pairs can easily be distinguished from their feature maps.

Fig. 4. (a) and (b): visualizations of the joint feature maps after the fifth convolutional layer for a positive pair (a) and a negative pair (b); brighter points indicate larger values. (c): feature responses of a positive pair and a negative pair after the second fully connected layer; the distributions are over the values of the 4096-dimensional features.

Fig. 4 (c) shows the feature responses after the second fully connected layer for a positive and a negative pair. The positive pair clearly holds much lower responses on average than the negative pair, which confirms the capability of the learned joint feature maps in discriminating positive from negative pairs.

3.3 Cross-domain architecture

For most person re-identification datasets, the amount of data is too small to train a deep model. The common remedy is to crop or mirror the images, which increases the number of samples. However, even with such augmentation, the total number of samples remains far below the requirements of deep learning. We treat this as a semi-supervised cross-domain issue. In cross-domain transfer, the assumption is that the two domains share the same task but have different data distributions; in image classification, for example, the two domains would share the same categories while the images exhibit different views or illumination. In our setting, the corresponding assumption is that two re-identification datasets should share the same similarity function, while variations caused by views or poses differ widely between the images of the two datasets. In Fig. 2, the relationship of two images is reflected by the joint feature maps. For two positive pairs from two different datasets, the learned similarity measures should ideally lead to the same prediction, i.e., both pairs are matched pairs. To achieve such a transfer, we propose to force the learned joint feature maps of positive pairs from the two datasets to be closer than those of negative pairs. The proposed cross-domain architecture, also shown in Fig. 2, utilizes a contrastive loss [30] to keep the two sets of joint feature maps of the same class of pairs as similar as possible during training. The label for the two pairs is defined as:

$label_p = label_a \odot label_b$   (6)

where $\odot$ denotes the XNOR operation, $label_a \in \{0, 1\}$ is the label of a pair from the source, $label_b \in \{0, 1\}$ is the label of a pair from the target, and $label_p$ is the result of the XNOR operation between the two. If the labels of the two pairs are the same (i.e., $label_a$ and $label_b$ agree), the contrastive loss keeps the two sets of joint feature maps close, and otherwise pushes them apart. The loss is as follows:

$L_{cts} = \sum_{i=1}^{N} \left[ \tfrac{1}{2}\, y\, d_w^2 + \tfrac{1}{2} (1 - y) \max(0, m - d_w)^2 \right], \quad d_w = \|F_a - F_b\|_2$   (7)

where y is the label of the two pairs after the XNOR operation, and $F_a$ and $F_b$ are the responses after the second fully connected layer for the two datasets. The training phase of the cross-domain architecture is also a multi-task process: the softmax loss and the triplet loss perform the re-identification task, while the contrastive loss keeps the two sets of joint feature maps of the same class in the two datasets as similar as possible. After training, only the model on the target dataset is retained for testing. The whole process can be considered another kind of fine-tuning using a cross-domain architecture: the joint feature maps learned on the auxiliary source dataset are used to fine-tune those on the smaller target sets during training and boost RID performance. It is worth noting that we do not force the feature maps of two completely different people, one from each dataset, to be similar; rather, we ensure that the way in which image pairs are compared (encoded by the learned weights on the joint feature maps) is similar and can be shared across the two datasets. That is the motivation for introducing the cross-domain architecture.
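A sketch of Eqs. 6 and 7 as we read them is given below; the margin value and all names are illustrative, and the loss follows the contrastive form of Chopra et al. [30].

```python
import numpy as np

def xnor_label(label_a, label_b):
    """Eq. 6: 1 if the source pair and the target pair share the same label."""
    return 1 - np.abs(label_a - label_b)  # XNOR for {0, 1} labels

def contrastive_loss(F_a, F_b, y, m=1.0):
    """Eq. 7: pull same-class pair representations together, push
    different-class ones beyond margin m (the value of m is ours).
    F_a, F_b : (N, D) responses after the second fully connected layer
    from the source and target networks; y : (N,) XNOR labels."""
    d_w = np.linalg.norm(F_a - F_b, axis=1)
    return np.sum(0.5 * y * d_w ** 2 +
                  0.5 * (1 - y) * np.maximum(0.0, m - d_w) ** 2)
```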

4 Experiments

We conduct two sets of experiments: 1) evaluating the proposed multi-task deep net (including its single-task variants), the joint feature maps and the cross-domain architecture; 2) comparing the proposed approach with the state of the art.

4.1 Setup

Implementation and protocol. Our methods are implemented using the Caffe framework [31]. All images are resized to 227 × 227 before being fed to the network. The learning rate is set to 10^-3 consistently across all experiments.

For all datasets, we horizontally mirror each image, increasing the dataset fourfold. We use a pre-trained AlexNet model (trained on the ImageNet dataset [29]) to initialize the kernel weights of only the first two convolutional layers, because the pre-trained model represents features of single images, not pairs: in the proposed network, the first two convolutional layers learn image features, while the remaining three convolutional layers extract the features of image pairs. Cumulative Matching Characteristic (CMC) curves are employed to measure RID performance, and we report single-shot results on all datasets. Dataset and settings. The experiments are conducted on one large dataset and four small datasets. The large dataset is CUHK03 [1], containing 13164 images of 1360 persons. We randomly select 1160 persons for training, 100 for validation and 100 for testing, following exactly the same setting as [1] and [2]. The four small datasets are CUHK01 [7], VIPeR [8], iLIDS [9] and PRID [10]. The CUHK01 [7] dataset contains 971 persons captured by two camera views; each person has two images from Camera A and two from Camera B. The VIPeR [8] dataset includes 632 individuals, each with two images from two cameras. The iLIDS [9] dataset consists of 119 individuals captured by eight cameras with different viewpoints; the number of images per individual varies from 2 to 8. The PRID [32] dataset is designed mainly for person re-identification; the images are captured by two surveillance cameras. Camera view A contains 385 individuals and camera view B contains 749, with 200 of them appearing in both views.

Fig. 5. Experimental results for each block of the proposed network. Rank-1 identification rates are shown in the legends.

For the small datasets with two cameras, we take the images of each individual from camera view A as probe images and randomly choose one image of the same individual from camera view B as the gallery image. For multi-camera datasets (more than two cameras), one image of each individual is selected as the gallery image and the remaining images of the same individual are used as probes. In each dataset, we randomly divide the individuals into two equal parts, one used for training and the other for testing. Specifically, for CUHK01, VIPeR, iLIDS and PRID the number of individuals used for training is 485, 316, 59 and 100 respectively, with the rest used for testing. In the PRID dataset, besides the 100 test individuals, there are another 549 people in the gallery. Note that for comparison purposes we also report results on CUHK01 under an additional setting: only 100 persons are randomly chosen for testing and the remaining 871 persons are used for training, the same setting as in [1] and [2], denoted CUHK01(p=100). For the CUHK01 [7] dataset we also present multi-shot results to compare with those state-of-the-art methods that report multi-shot results on this dataset; the comparison is shown in our supplementary materials.
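For reference, a single-shot CMC curve can be computed as in the sketch below; the scoring convention and names are placeholders rather than the paper's code, and each probe identity is assumed to appear exactly once in the gallery.

```python
import numpy as np

def cmc(scores, probe_ids, gallery_ids, max_rank=20):
    """scores : (num_probe, num_gallery) matching scores, higher = better
    match. Returns CMC accuracies at ranks 1..max_rank (single-shot:
    one gallery image per identity)."""
    order = np.argsort(-scores, axis=1)          # gallery indices, best first
    ranked_ids = gallery_ids[order]              # identities in ranked order
    hits = ranked_ids == probe_ids[:, None]      # True where the true match sits
    first_hit = hits.argmax(axis=1)              # rank position of the true match
    return np.array([(first_hit < r).mean() for r in range(1, max_rank + 1)])
```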

4.2 Results for the multi-task network

Multi vs. single task. Results of CMC at different ranks are shown in Fig. 5. The proposed multi-task network (Fig. 2) is denoted MTDnet; the weight w in Eq. 3 is set to 0.5. As MTDnet adopts the classification loss for testing, we also give results using the ranking loss for testing with the same model (denoted MTDtrp). The performance of MTDnet is clearly much better than that of MTDtrp, which implies that the identifiable features of each pair learned with the classification loss are more effective for person RID than the Euclidean distance of the triplet loss. Results of the single-task networks using only the triplet ranking loss (denoted MTDnet-rnk) or only the classification loss (denoted MTDnet-cls) are also provided. The MTDnet-cls network is trained by setting the weight w to 1 in the multi-task network, whereby the weight of the triplet loss becomes 0 while the joint feature maps still serve the classification loss. Similarly, the MTDnet-rnk network is obtained by training the model with only the triplet loss. Note that, for a fair comparison, the MTDnet-rnk architecture is expanded to five convolutional layers plus three fully connected layers as in AlexNet [29], instead of the two convolutional layers shown in Fig. 2, so that the number of layers in the two single-task networks is the same; the similarity of two images in MTDnet-rnk is computed with the Euclidean distance. On CUHK03, our multi-task network with the two tasks trained simultaneously achieves a rank-1 rate of 74.68% and is much better than either MTDnet-cls or MTDnet-rnk, which indicates the complementarity of the two tasks and the effectiveness of joint optimization. On all four small datasets, MTDnet consistently outperforms each of the two single-task nets (MTDnet-cls and MTDnet-rnk). Effect of joint feature maps. IDLA [2] is a classification-based deep learning approach that has shown the best results among deep-learning-based methods.

In IDLA [2], a manual subtraction for cross-input neighborhood differences is employed during matching, whereas MTDnet-cls instead introduces the joint feature maps. From Fig. 5 we can see that MTDnet-cls outperforms IDLA [2] on all datasets, which shows the advantage of our fully data-driven joint feature maps. Cross vs. single domain. We compare the cross-domain architecture (MTDnet-cross) with the original multi-task network (MTDnet) on the four small datasets. In this experiment, CUHK03 is the source-domain dataset and each of the four small datasets is the target domain, so knowledge is transferred from CUHK03 to each small dataset. The original multi-task network (MTDnet) is applied directly on the small datasets but initialized with the model trained on CUHK03. In the cross-domain architecture, both the target-domain network and the source-domain network are initialized with the model trained on CUHK03, and in the test phase only the target-domain network is used to compute the results. The relevant results are shown in Fig. 5: almost all results of the cross-domain architecture are better than those of the single-domain architecture, which demonstrates its effectiveness.

4.3 Comparison with the state of the art

We compare our approach with 26 methods, namely those that report results on at least one of the five datasets or under one of the two settings on CUHK01. Specifically, the compared methods include: KISSME [16], local Fisher discriminant analysis (LF) [20], LADF [18], saliency matching (SalMatch) [19], mid-level filters (mFilter) [14], SCNCD [13], kernel-based metric learning (kML) [24], CSL [21], XQDA [22], JRL [5], MTL-LORAE [15], MLAPG [23], deep metric learning (DML) [3], IDLA [2], RDC [4], SLME [25], ITML [33], large margin nearest neighbor (LMNN) [34], metric learning to rank (RANK) [35], logistic distance metric learning (LDM) [36], SDALF [37], eSDC [38], FPNN [1], PCCA [39] and PRDC [40]. Our performance on CUHK03 comes from our multi-task network, and the results on the other four small datasets are obtained with our cross-domain architecture. All results are shown in Fig. 6. FPNN [1], IDLA [2] and DML [3] are deep methods based on the classification loss, while RDC [4] and JRL [5] are based on the ranking loss; most of these methods are in the top performance group among all methods considered. The result of MTDnet is better than all the approaches above in all cases, which further confirms that jointly optimizing the two losses has a clear advantage over a single loss. In terms of the rank-1 identification rate, our multi-task network outperforms all existing person re-id algorithms on CUHK03, CUHK01, iLIDS and PRID. On the VIPeR dataset, our result is not better than, but comparable to, SLME [25]. SLME [25] is a metric ensemble method, while our approach uses only the trained deep feature and is not combined with any existing features or metrics; overall, our approach outperforms SLME [25] in most cases.

Fig. 6. CMC performance compared with state-of-the-art approaches. The higher the recognition rate, the better the performance. On most datasets, our method outperforms the state of the art.

5 Conclusion

In this paper, a multi-task network with joint feature maps has been proposed for person RID. It integrates the classification and ranking tasks in one network and takes advantage of their complementarity. The learned joint feature maps are fully data-driven, without any manual operation. For small target datasets, a cross-domain architecture has been further introduced to fine-tune the joint feature maps and improve performance. The proposed network outperforms almost all compared state-of-the-art methods on both large and small datasets.

References

1. Li, W., Zhao, R., Xiao, T., Wang, X.: DeepReID: Deep filter pairing neural network for person re-identification. In: CVPR. (2014) 152–159
2. Ahmed, E., Jones, M., Marks, T.K.: An improved deep learning architecture for person re-identification. In: CVPR. (2015)
3. Yi, D., Lei, Z., Li, S.Z.: Deep metric learning for practical person re-identification. In: ICPR. (2014)
4. Ding, S., Lin, L., Wang, G., Chao, H.: Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48(10) (2015) 2993–3003
5. Chen, S.Z., Guo, C.C., Lai, J.H.: Deep ranking for person re-identification via joint representation learning. arXiv preprint arXiv:1505.06821 (2015)
6. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (1998)
7. Li, W., Zhao, R., Wang, X.: Human reidentification with transferred metric learning. In: ACCV. (2012) 31–44
8. Gray, D., Brennan, S., Tao, H.: Evaluating appearance models for recognition, reacquisition, and tracking. In: Proc. IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS). Volume 3. (2007)
9. Zheng, W., Gong, S., Xiang, T.: Associating groups of people. In: BMVC. Volume 2. (2009) 6
10. Hirzer, M., Beleznai, C., Roth, P.M., Bischof, H.: Person re-identification by descriptive and discriminative classification. In: Image Analysis. Springer (2011) 91–102
11. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: ICML. (2015)
12. Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014)
13. Yang, Y., Yang, J., Yan, J., Liao, S., Yi, D., Li, S.Z.: Salient color names for person re-identification. In: ECCV. (2014) 536–551
14. Zhao, R., Ouyang, W., Wang, X.: Learning mid-level filters for person re-identification. In: CVPR. (2014) 144–151
15. Su, C., Yang, F., Zhang, S., Tian, Q., Davis, L.S., Gao, W.: Multi-task learning with low rank attribute embedding for person re-identification. In: ICCV. (2015) 3739–3747
16. Koestinger, M., Hirzer, M., Wohlhart, P., Roth, P.M., Bischof, H.: Large scale metric learning from equivalence constraints. In: CVPR. (2012) 2288–2295
17. Li, W., Wang, X.: Locally aligned feature transforms across views. In: CVPR. (2013) 3594–3601
18. Li, Z., Chang, S., Liang, F., Huang, T.S., Cao, L., Smith, J.R.: Learning locally-adaptive decision functions for person verification. In: CVPR. (2013) 3610–3617
19. Zhao, R., Ouyang, W., Wang, X.: Person re-identification by salience matching. In: ICCV. (2013) 2528–2535
20. Pedagadi, S., Orwell, J., Velastin, S., Boghossian, B.: Local Fisher discriminant analysis for pedestrian re-identification. In: CVPR. (2013) 3318–3325
21. Shen, Y., Lin, W., Yan, J., Xu, M., Wu, J., Wang, J.: Person re-identification with correspondence structure learning. In: ICCV. (2015)
22. Liao, S., Hu, Y., Zhu, X., Li, S.Z.: Person re-identification by local maximal occurrence representation and metric learning. In: CVPR. (2015) 2197–2206
23. Liao, S., Li, S.Z.: Efficient PSD constrained asymmetric metric learning for person re-identification. In: ICCV. (2015)
24. Xiong, F., Gou, M., Camps, O., Sznaier, M.: Person re-identification using kernel-based metric learning methods. In: ECCV. Springer (2014) 1–16
25. Paisitkriangkrai, S., Shen, C., van den Hengel, A.: Learning to rank in person re-identification with metric ensembles. In: CVPR. (2015)
26. Girshick, R.: Fast R-CNN. In: ICCV. (2015) 1440–1448
27. Wang, L., Ouyang, W., Wang, X., Lu, H.: Visual tracking with fully convolutional networks. In: ICCV. (2015) 3119–3127
28. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: CVPR. (2015) 815–823
29. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012) 1097–1105
30. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: CVPR. Volume 1. (2005) 539–546
31. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia. (2014) 675–678
32. Baltieri, D., Vezzani, R., Cucchiara, R.: 3DPeS: 3D people dataset for surveillance and forensics. In: Proceedings of the 2011 Joint ACM Workshop on Human Gesture and Behavior Understanding. (2011) 59–64
33. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning. In: ICML. (2007) 209–216
34. Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 10 (2009) 207–244
35. McFee, B., Lanckriet, G.R.: Metric learning to rank. In: ICML. (2010) 775–782
36. Guillaumin, M., Verbeek, J., Schmid, C.: Is that you? Metric learning approaches for face identification. In: ICCV. (2009) 498–505
37. Farenzena, M., Bazzani, L., Perina, A., Murino, V., Cristani, M.: Person re-identification by symmetry-driven accumulation of local features. In: CVPR. (2010) 2360–2367
38. Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: CVPR. (2013) 3586–3593
39. Mignon, A., Jurie, F.: PCCA: A new approach for distance learning from sparse pairwise constraints. In: CVPR. (2012) 2666–2672
40. Zheng, W., Gong, S., Xiang, T.: Person re-identification by probabilistic relative distance comparison. In: CVPR. (2011) 649–656