Person Re-Identification by Deep Joint Learning of Multi-Loss Classification

arXiv:1705.04724v2 [cs.CV] 23 May 2017

Wei Li, Xiatian Zhu, Shaogang Gong
Queen Mary University of London, London, UK
{w.li, xiatian.zhu, s.gong}@qmul.ac.uk

Abstract

Existing person re-identification (re-id) methods rely mostly on either localised or global feature representations alone. This ignores their joint benefit and mutual complementary effects. In this work, we show the advantages of jointly learning local and global features in a Convolutional Neural Network (CNN), aiming to discover correlated local and global features in different contexts. Specifically, we formulate a method for the joint learning of local and global feature selection losses designed to optimise person re-id when using only generic matching metrics such as the L2 distance. We design a novel CNN architecture for Jointly Learning Multi-Loss (JLML) of local and global discriminative feature optimisation, subject concurrently to the same re-id label information. Extensive comparative evaluations demonstrate the advantages of this new JLML model for person re-id over a wide range of state-of-the-art re-id methods on five benchmarks (VIPeR, GRID, CUHK01, CUHK03, Market-1501).

1 Introduction

Person re-identification (re-id) is the task of matching identity classes in detected person bounding box images across non-overlapping camera views over distributed open spaces. This is an inherently challenging task because a person's visual appearance may change dramatically across camera views at different locations, due to unknown changes in human pose, illumination, occlusion, and background clutter [Gong et al., 2014]. Existing person re-id studies typically focus on feature representation [Gray and Tao, 2008; Farenzena et al., 2010; Kviatkovsky et al., 2013; Zhao et al., 2013; Liao et al., 2015; Matsukawa et al., 2016a; Ma et al., 2017], matching distance metrics [Koestinger et al., 2012; Xiong et al., 2014; Zheng et al., 2013; Wang et al., 2014b; Paisitkriangkrai et al., 2015; Zhang et al., 2016; Wang et al., 2016b; Wang et al., 2016c; Wang et al., 2016d; Chen et al., 2017b], or their combination in a deep learning framework [Li et al., 2014; Ahmed et al., 2015; Wang et al., 2016a; Xiao et al., 2016; Subramaniam et al., 2016; Chen et al., 2017a]. Regardless, the overall objective is to obtain a view- and location-invariant (cross-domain) representation.

We consider that learning any matching distance metric is intrinsically learning a global feature transformation across domains (two disjoint camera views), therefore obtaining a "normalised" feature representation for matching. Most re-id features are hand-crafted to encode local topological and/or spatial structural information through different image decomposition schemes, such as horizontal stripes [Gray and Tao, 2008; Kviatkovsky et al., 2013], body parts [Farenzena et al., 2010], and patches [Zhao et al., 2013; Matsukawa et al., 2016a; Liao et al., 2015]. These localised features are effective for mitigating person pose and detection misalignment in re-id matching. More recent deep re-id models [Xiao et al., 2016; Wang et al., 2016a; Chen et al., 2017a; Ahmed et al., 2015] benefit from the availability of larger scale datasets such as CUHK03 [Li et al., 2014] and Market-1501 [Zheng et al., 2015], and from lessons learned on other vision tasks [Krizhevsky et al., 2012; Girshick et al., 2014]. In contrast to local hand-crafted features, deep models, in particular Convolutional Neural Networks (CNNs) [LeCun et al., 1998], intrinsically favour learning global feature representations, with few exceptions. They have been shown to be effective for re-id.

We consider that either local or global feature learning alone is suboptimal. This view is motivated by the human visual system, which leverages both global (contextual) and local (saliency) information concurrently [Navon, 1977; Torralba et al., 2006]. The intuition behind joint learning is to extract correlated complementary information in different contexts whilst satisfying the same learning constraint¹, therefore achieving more reliable recognition. To that end, we need to address a number of non-trivial problems: (i) the model learning behaviour in satisfying the same label constraint may differ at the local and global levels; (ii) any complementary correlation between local and global features is unknown and may vary among individual instances, and must therefore be learned and optimised consistently across data; (iii) people's appearance in public scenes is diverse in both patterns and configurations, which makes it challenging to learn correlations between local and global features for all appearances.

¹ In the person re-id context, the learning constraint refers to the image person identity label supervision.

This work aims to formulate a deep learning model for

jointly optimising local and global feature selections concurrently, so as to improve person re-id using only generic matching metrics such as the L2 distance. We explore a deep learning approach for its potential superiority in learning from large scale data [Xiao et al., 2016; Chen et al., 2017a]. For bounding box image based person re-id, we consider the entire person in the bounding box as the global scene context and the body parts of the person as local information sources; both are subject to the surrounding background clutter within a bounding box, and potentially also to misalignment and partial occlusion from bounding box detection. In this setting, we wish to discover and jointly optimise correlated complementary feature selections in the local and global representations, both subject to the same label constraint concurrently. Whilst the former aims to address pose/detection misalignment and occlusion through localised fine-grained saliency information, the latter exploits holistic coarse-grained context for more robust global matching. To that end, we formulate a deep two-branch CNN architecture, with one branch for learning localised feature selection (local branch) and the other for learning global feature selection (global branch). Importantly, the two branches are not independent but synergistically correlated and jointly learned concurrently. This is achieved by: (i) imposing inter-branch interaction between the local and global branches, and (ii) enforcing a separate objective loss function on each branch for learning independent discriminative capabilities, whilst being subject to the same class label constraint. Under such a balance between interaction and independence, we allow both branches to be learned concurrently, maximising their joint optimal extraction and selection of different discriminative features for person re-id. We call this model the Joint Learning Multi-Loss (JLML) CNN model. To minimise poor learning due to inherent noise and potential covariance, we introduce a structured feature selective and discriminative learning mechanism into both the local and global branches, subject to a joint sparsity regularisation.

The contributions of this work are: (I) We propose the idea of learning concurrently both local and global feature selections for optimising feature discriminative capabilities in different contexts whilst performing the same person re-id tasks. This is currently under-studied in the person re-id literature, to the best of our knowledge. (II) We formulate a novel Joint Learning Multi-Loss (JLML) CNN model for not only learning both global and local discriminative features in different contexts by optimising multiple classification losses on the same person label information concurrently, but also utilising their complementary advantages jointly in coping with local misalignment and optimising holistic matching criteria for person re-id. (III) We introduce a structured sparsity based feature selection learning mechanism for improving the robustness of multi-loss joint feature learning w.r.t. noise and data covariance between local and global representations. Extensive comparative evaluations demonstrate the superiority of the proposed JLML model over a wide range of existing state-of-the-art re-id models on five benchmark datasets: VIPeR [Gray and Tao, 2008], GRID [Loy et al., 2009], CUHK01 [Li et al., 2012], CUHK03 [Li et al., 2014], and Market-1501 [Zheng et al., 2015].

2 Related Works

The proposed JLML model considers learning both local and global feature selections jointly, so as to optimise their correlated complementary advantages. This goes beyond existing methods, which mostly rely on only one level of feature representation. Specifically, the JLML method is related to the saliency learning based models [Zhao et al., 2013; Wang et al., 2014a] in terms of modelling localised part importance. However, these existing methods consider only the patch appearance statistics within individual locations, with no global feature representation learning, let alone the correlation and complementary information discovery between local and global features as modelled by JLML. Whilst the more recent Spatially Constrained Similarity (SCS) model [Chen et al., 2016] and Multi-Channel Parts (MCP) network [Cheng et al., 2016] consider both levels of representation, the JLML model differs significantly from them: (i) The SCS method focuses on supervised metric learning, whilst JLML aims at joint discriminative feature learning and needs only generic metrics for re-id matching. Also, hand-crafted local and global features are extracted separately in SCS, without any inter-feature interaction and correlation learning involved, as opposed to the joint learning of global and local feature selections concurrently subject to the same supervision information in JLML. (ii) The local and global branches of the MCP model are supervised and optimised by a single triplet ranking loss, in contrast to the proposed multiple classification loss design (Sec. 3.2). Critically, this one-loss model learning is likely to impose a negative influence on the discriminative feature learning behaviour of both branches, due to potentially over-low per-branch independence and over-high inter-branch correlation. This may lead to suboptimal joint learning of local and global feature selections in model optimisation, as suggested by our evaluation in Section 4.3. (iii) In addition, JLML is capable of performing structured feature sparsity regularisation along with the multi-loss joint learning of local and global feature selections, providing additional benefits (Sec. 4.3). Whilst similar in spirit to the sparsity constraint on the supervised SCS metric learning, we instead perform sparse generic feature learning without the need for supervised metric optimisation. In terms of loss function, the HER model [Wang et al., 2016b] similarly does not exploit pairwise re-id labels, but defines a single identity label per training person for a regression loss (vs. the classification loss in JLML) based re-id feature embedding optimisation. Importantly, HER relies on pre-defined features (mostly hand-crafted local features), without the capability of jointly learning global and local feature representations and discovering their correlated complementary advantages as specifically designed in JLML. Also, the DGD model [Xiao et al., 2016] uses the classification loss for model optimisation. However, it considers only global feature representation learning with a one-loss classification, as opposed to the proposed joint global and local feature learning with multi-loss classification, concurrently subject to maximising the same person identity matching.

3 Model Design

3.1 Problem Definition

We assume a set of $n$ training images $\mathcal{I} = \{I_i\}_{i=1}^{n}$ with corresponding identity labels $\mathcal{Y} = \{y_i\}_{i=1}^{n}$. These training images capture the visual appearance of $n_{id}$ (where $y_i \in [1, \cdots, n_{id}]$) different people under non-overlapping camera views. We formulate a Joint Learning Multi-Loss (JLML) CNN model that aims to discover and capture concurrently complementary discriminative information about a person image from both local and global visual features of the image, in order to optimise person re-id under significant viewing condition changes across locations. This is in contrast to most existing re-id methods, which typically depend on either local or global features alone.

3.2 Joint Learning Multi-Loss

Figure 1: The Joint Learning Multi-Loss (JLML) CNN model architecture. (In the figure, the local branch slices the shared conv-1 feature maps into m stripes, Stripe 1 ... Stripe m, each feeding a Local stream followed by Pooling-256, FC-512 and FC-IDs layers under the $\ell_{1,2}$ sparsity regularisation on $\mathbf{W}_L$; the global branch feeds the whole feature maps through Pooling-512, FC-512 and FC-IDs layers under the $\ell_{2,1}$ sparsity regularisation on $\mathbf{W}_G$; both branches are supervised by the same ID class labels.)

The overall design of the proposed JLML model is depicted in Figure 1. The JLML model consists of a two-branch CNN network: (1) one local branch of m streams of an identical structure, with each stream learning the most discriminative local visual features for one of m local image regions of a person bounding box image; (2) one global branch responsible for learning the most discriminative global-level features from the entire person image. For concurrently optimising per-branch discriminative feature representations and discovering correlated complementary information between local and global feature selections, a joint learning scheme that subjects both local and global branches to the same identity label supervision is adopted, with two underlying principles:

(I) Shared low-level features. We construct the global and local branches on a shared lower conv layer, in particular the first conv layer², to facilitate inter-branch common learning. The intuition is that the lower conv layers capture low-level features such as edges and corners, which are common to all patterns in the same images. This shared learning is similar in spirit to multi-task learning [Argyriou et al., 2007], where the local and global feature learning branches are two related learning tasks. Sharing the low-level conv layer also reduces the model parameter size and therefore model overfitting risks. This is especially critical in learning person re-id models when labelled training data are limited.

(II) Multi-task independent learning subject to shared label constraints. To maximise the learning of complementary discriminative features from local and global representations, the remaining layers of the two branches are learned independently, subject to the given identity labels. That is, the JLML model aims to learn concurrently multiple identity feature representations, for different local image regions and for the entire image, all of which aim to maximise the same identity matching both individually and collectively. Independent multi-task learning aims to preserve both local saliency in feature selection and global robustness in image representation. To that end, the JLML model performs multi-task independent learning subject to shared identity label constraints by allocating each branch a separate objective loss function. By doing so, the per-branch learning behaviour is conditioned independently on the respective feature representation. We call this branch-wise loss formulation the Multi-Loss design.

² We found empirically no clear benefits from increasing the number of shared conv layers in our implementation.

Table 1: JLML-ResNet39. MP: Max-Pooling; AP: Average-Pooling; S: Stride; SL: Slice; CA: Concatenation; G: Global; L: Local.

Layer # | Layer   | Output Size         | Global Branch                     | Local Branch
1       | conv1   | 112×112             | 3×3, 32, S-2; 3×3 MP, S-2         | shared conv, then SL-4; 2×2 MP, S-1
9       | conv2_x | G: 56×56; L: 28×56  | [1×1, 32; 3×3, 32; 1×1, 64] ×3    | [1×1, 16; 3×3, 16; 1×1, 32] ×3
9       | conv3_x | G: 28×28; L: 14×28  | [1×1, 64; 3×3, 64; 1×1, 128] ×3   | [1×1, 32; 3×3, 32; 1×1, 64] ×3
9       | conv4_x | G: 14×14; L: 7×14   | [1×1, 128; 3×3, 128; 1×1, 256] ×3 | [1×1, 64; 3×3, 64; 1×1, 128] ×3
9       | conv5_x | G: 7×7; L: 4×7      | [1×1, 256; 3×3, 256; 1×1, 512] ×3 | [1×1, 128; 3×3, 128; 1×1, 256] ×3
1       | fc      | 1×1                 | 7×7 AP; 1×1, 512                  | 4×7 AP, CA-4; 1×1, 512
1       | fc      | 1×1                 | ID #                              | ID #
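To make the two-branch design concrete, the following is a minimal PyTorch-style sketch of the JLML forward structure. It is illustrative only: the paper's implementation is in Caffe, and the names and layer choices here (`JLMLSketch`, plain conv stand-ins for the residual stages, the channel widths) are simplifying assumptions rather than the exact JLML-ResNet39 configuration in Table 1.

```python
import torch
import torch.nn as nn

class JLMLSketch(nn.Module):
    """Illustrative two-branch JLML-style network: a shared first conv layer,
    a global branch over the whole feature map, and m local streams over
    horizontal stripes. Residual stages are stand-ins (plain convs)."""
    def __init__(self, num_ids, m=4, feat_dim=512):
        super().__init__()
        self.m = m
        self.shared = nn.Sequential(               # shared conv-1 (common low-level features)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1))
        self.global_stages = nn.Sequential(        # stand-in for the global residual stages
            nn.Conv2d(32, 512, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.local_stages = nn.ModuleList([        # one independent stream per stripe
            nn.Sequential(nn.Conv2d(32, 128, 3, stride=2, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1))
            for _ in range(m)])
        self.fc_global = nn.Linear(512, feat_dim)       # W_G lives in this fc layer
        self.fc_local = nn.Linear(m * 128, feat_dim)    # W_L fuses the m stripe features
        self.cls_global = nn.Linear(feat_dim, num_ids)  # per-branch ID classifiers
        self.cls_local = nn.Linear(feat_dim, num_ids)

    def forward(self, x):
        shared = self.shared(x)
        # Global branch: the whole shared feature map.
        g = self.fc_global(self.global_stages(shared).flatten(1))
        # Local branch: slice the shared map into m horizontal stripes.
        stripes = torch.chunk(shared, self.m, dim=2)
        l = torch.cat([net(s).flatten(1)
                       for net, s in zip(self.local_stages, stripes)], dim=1)
        l = self.fc_local(l)
        return g, l, self.cls_global(g), self.cls_local(l)
```

For example, `g, l, logits_g, logits_l = JLMLSketch(num_ids=751)(torch.randn(2, 3, 224, 224))` yields the two 512-D branch features and the two identity-logit heads.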

Network Construction. We adopt the Residual CNN unit [He et al., 2016] as the JLML building block due to its capacity for deeper model design whilst retaining a smaller model parameter size³. Specifically, we customise the ResNet50 architecture in both layer and filter numbers and design the JLML model as a 39-layer ResNet (JLML-ResNet39) tailored for re-id tasks. The configuration of JLML-ResNet39 is given in Table 1. Note that the ReLU rectification non-linearity [Krizhevsky et al., 2012] after each conv layer is omitted for brevity.

³ The choice of base network is independent of our JLML model design. Other types, e.g. GoogLeNet [Szegedy et al., 2015] or VGG-Net [Simonyan and Zisserman, 2015], can be readily applied in our model.

Feature Selection. To optimise JLML model learning robustness against noise and diverse data sources, we introduce a feature selection capability in JLML by structured sparsity induced regularisation [Kong et al., 2014; Wang et al., 2013]. Our idea is to have a competing-to-survive mechanism in feature learning that discourages irrelevant features whilst encouraging discriminative features concurrently in different local and global contexts, so as to maximise a shared identity matching objective. To that end, we sparsify the global feature representation with a group LASSO [Wang et al., 2013]:

$\ell_{2,1} = \|\mathbf{W}_G\|_{2,1} = \sum_{i=1}^{d_g} \|\mathbf{w}_g^i\|_2$    (1)

where $\mathbf{W}_G = [\mathbf{w}_g^1, \cdots, \mathbf{w}_g^{d_g}] \in \mathbb{R}^{c_g \times d_g}$ is the parameter matrix of the global branch feature layer, taking as input $d_g$-dimensional vectors from the previous layer and outputting a $c_g$-dimensional (512-D) feature representation. Specifically, with the $\ell_1$ norm applied on the $\ell_2$ norms of the $\mathbf{w}_g^i$, our aim is to learn (tune) feature dimension importance selectively, subject to both the sparsity principle and the identity label constraint simultaneously. Similarly, we also enforce a local feature sparsity constraint by an exclusive group LASSO [Kong et al., 2014]:

$\ell_{1,2} = \|\mathbf{W}_L\|_{1,2} = \sum_{i=1}^{c_l} \sum_{j=1}^{m} \|\mathbf{w}_{l,j}^i\|_1^2$    (2)

where

$\mathbf{W}_L = \begin{bmatrix} \mathbf{w}_{l,1}^{1\top} & \cdots & \mathbf{w}_{l,m}^{1\top} \\ \cdots & \cdots & \cdots \\ \mathbf{w}_{l,1}^{c_l\top} & \cdots & \mathbf{w}_{l,m}^{c_l\top} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_l^{1\top} \\ \cdots \\ \mathbf{w}_l^{c_l\top} \end{bmatrix}$    (3)

is the parameter matrix of the local branch feature layer, with $m \times d_l$ and $c_l$ (512) as the input and output dimensions ($m$ being the image stripe number). The $\mathbf{w}_{l,j}^i \in \mathbb{R}^{d_l \times 1}$ defines the parameter vector contributing to the $i$-th output feature dimension from the $j$-th local input feature vector, $j \in [1, 2, \cdots, m]$. In particular, the $\ell_{1,2}$ regulariser performs sparse feature selection for individual image regions as follows: (1) We perform feature selective learning at the local region level by enforcing the $\ell_1$ norm directly on $\mathbf{w}_{l,j}^i$, conceptually similar to the group LASSO at the global level. (2) We then apply a non-sparse smooth fusion with the $\ell_2$ norm to combine the effects of different local features weighted by the sparse $\mathbf{w}_{l,j}^i$. (3) Lastly, we exploit the $\ell_1$ norm again at the level of $\mathbf{w}_l^k$ ($k \in [1, 2, \cdots, c_l]$) to learn the local 512-D feature representation selection. Figure 2 shows our structured sparsity regularisations for both local and global feature selections.

Figure 2: Group sparsity regularisations on the fc layer parameter matrices ($\mathbf{W}_G$ for the global branch and $\mathbf{W}_L$ for the local branch) for selectively learning feature representations. Solid and dashed rectangles denote the $\ell_2$ norm and $\ell_1$ norm respectively.

Loss Function. For model training, we utilise the cross-entropy classification loss function for both global and local branches, so as to optimise person identity classification given training labels of multiple person classes extracted from pairwise labelled re-id datasets. Formally, we predict the posterior probability $\tilde{y}_i$ of image $I_i$ over the given identity label $y_i$:

$p(\tilde{y}_i = y_i \,|\, I_i) = \dfrac{\exp(\mathbf{w}_{y_i}^{\top}\mathbf{x}_i)}{\sum_{k=1}^{n_{id}} \exp(\mathbf{w}_k^{\top}\mathbf{x}_i)}$    (4)

where $\mathbf{x}_i$ refers to the feature vector of $I_i$ from the corresponding branch, and $\mathbf{w}_k$ the prediction function parameter of training identity class $k$. The training loss on a batch of $n_{bs}$ images is computed as:

$l = -\dfrac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \log p(\tilde{y}_i = y_i \,|\, I_i)$    (5)

Combined with the group sparsity based feature selection regularisations, we have the final loss functions for the global and local branch sub-networks:

$l_{global} = l + \lambda_{global} \|\mathbf{W}_G\|_{2,1}, \qquad l_{local} = l + \lambda_{local} \|\mathbf{W}_L\|_{1,2}$    (6)

where $\lambda_{global}$ and $\lambda_{local}$ control the balance between the identity label loss and the feature selection sparsity regularisation. We empirically set $\lambda_{local} = \lambda_{global} = 5 \times 10^{-4}$ by cross-validation in our evaluations.

Choice of Loss Function. Our JLML model learning deploys a classification loss function. This differs significantly from the contrastive loss functions used by most existing deep re-id methods, which are designed to exploit pairwise re-id labels defined by both positive and negative pairs, such as pairwise verification [Varior et al., 2016; Subramaniam et al., 2016; Ahmed et al., 2015; Li et al., 2014], triplet ranking [Cheng et al., 2016], or both [Wang et al., 2016a; Chen et al., 2017a]. Our JLML model training does not use any labelled negative pairs inherent to all person re-id training data; we extract identity class labels from only positive pairs. The motivations for our JLML classification loss based learning are: (i) Significantly simplified training data batch construction, e.g. random sampling with no notorious tricks required, as shown by other deep classification methods [Krizhevsky et al., 2012]. This makes our JLML model more scalable to real-world applications with very large training population sizes when available. It also eliminates the undesirable need for carefully forming pairs and/or triplets in preparing re-id training splits, as in most existing methods, due to the inherently imbalanced negative and positive pair size distributions. (ii) Visual psychophysical findings suggest that representations optimised for classification tasks generalise well to novel categories [Edelman, 1998]. We consider that re-id tasks are about model generalisation to unseen test identity classes given training data on independent seen identity classes. Our JLML model learning exploits this general classification learning principle beyond the strict pair-wise relative verification loss in existing re-id models.
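As a worked illustration of Eqs. (1), (2), (5) and (6), the sketch below computes the two regularised branch losses. It assumes `W_G` and `W_L` are the weight matrices of the two fc feature layers (as in the earlier sketch) and that the columns of `W_L` are ordered as m contiguous stripe blocks; these are assumptions for illustration, not the authors' Caffe implementation.

```python
import torch
import torch.nn.functional as F

def group_lasso_l21(W_G):
    """Eq. (1): sum of L2 norms over the d_g input-dimension columns of W_G."""
    return W_G.norm(p=2, dim=0).sum()            # W_G: (c_g, d_g)

def exclusive_lasso_l12(W_L, m):
    """Eq. (2): squared L1 norm of each w^i_{l,j} block, summed over output
    dimensions i and stripes j (assumes stripe-block column ordering)."""
    c_l, md_l = W_L.shape                        # W_L: (c_l, m * d_l)
    blocks = W_L.view(c_l, m, md_l // m)         # split columns into m stripe blocks
    return (blocks.abs().sum(dim=2) ** 2).sum()

def branch_losses(logits_g, logits_l, labels, W_G, W_L, m=4, lam=5e-4):
    """Eqs. (5)+(6): softmax cross-entropy per branch plus its sparsity term,
    with lambda_global = lambda_local = 5e-4 as in the paper."""
    l_global = F.cross_entropy(logits_g, labels) + lam * group_lasso_l21(W_G)
    l_local = F.cross_entropy(logits_l, labels) + lam * exclusive_lasso_l12(W_L, m)
    return l_global, l_local
```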

3.3 Model Training

We adopt the standard Stochastic Gradient Descent (SGD) optimisation algorithm [Krizhevsky et al., 2012] to perform the batch-wise joint learning of the local and global branches. Note that, with SGD, we can naturally synchronise the optimisation processes of the two branches by constraining their learning behaviours subject to the same identity label information at each update. This is likely to avoid representation learning divergence between the two branches and to help enhance the correlated complementary learning capability.
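A minimal sketch of one such synchronised update, reusing the hypothetical `JLMLSketch` and `branch_losses` names from the earlier sketches; summing the two branch losses is one natural way to realise the joint update, though the paper does not spell out this detail.

```python
import torch

# Hypothetical setup carried over from the earlier sketches.
model = JLMLSketch(num_ids=751)
optimiser = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    # One SGD update: both branches see the same identity labels,
    # so their optimisation is synchronised at every iteration.
    g, l, logits_g, logits_l = model(images)
    l_global, l_local = branch_losses(
        logits_g, logits_l, labels,
        W_G=model.fc_global.weight, W_L=model.fc_local.weight)
    loss = l_global + l_local          # joint multi-loss objective (one realisation)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```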

3.4 Re-Id by Generic Distance Metrics

Once the JLML model is learned, we obtain a 1,024-D joint representation by concatenating the local (512-D) and global (512-D) feature vectors (the fc layer in Table 1). For person re-id, we deploy this 1,024-D deep feature representation using only a generic distance metric, e.g. the L2 distance, without any camera-pair specific distance metric learning. Specifically, given a test probe image $I^p$ from one camera view and a set of test gallery images $\{I_i^g\}$ from other non-overlapping camera views: (1) We first compute their corresponding 1,024-D feature vectors by forward-feeding the images through the trained JLML model, denoted as $\mathbf{x}^p = [\mathbf{x}_g^p; \mathbf{x}_l^p]$ and $\{\mathbf{x}_i^g = [\mathbf{x}_g^g; \mathbf{x}_l^g]\}$. (2) We then L2-normalise the global and local features separately. (3) Lastly, we compute the cross-camera matching distances between $\mathbf{x}^p$ and each $\mathbf{x}_i^g$ by a generic matching metric, e.g. the L2 distance, and rank all gallery images in ascending order of their distances to the probe image. The probabilities of true matches of probe person images at Rank-1 and among the higher ranks indicate the goodness of the learned JLML deep features for person re-id tasks.
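The deployment steps (1)-(3) above reduce to feature extraction, separate L2 normalisation, and generic distance ranking. A sketch under the same assumptions as before:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(model, images):
    # Concatenate the separately L2-normalised global (512-D) and
    # local (512-D) fc features into a single 1,024-D representation.
    g, l, _, _ = model(images)
    return torch.cat([F.normalize(g, dim=1), F.normalize(l, dim=1)], dim=1)

@torch.no_grad()
def rank_gallery(model, probe, gallery):
    # Generic L2 matching: no camera-pair specific metric learning.
    x_p = extract_features(model, probe)        # (1, 1024)
    x_g = extract_features(model, gallery)      # (N, 1024)
    dists = torch.cdist(x_p, x_g).squeeze(0)    # L2 distances to the probe
    return torch.argsort(dists)                 # gallery indices, ascending
```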

4 Experiments

Datasets. For evaluation, we used five benchmark re-id datasets: VIPeR [Gray and Tao, 2008], GRID [Loy et al., 2009], CUHK01 [Li et al., 2012], CUHK03 [Li et al., 2014], and Market-1501 [Zheng et al., 2015]. Figure 3 shows some examples of person bounding box images from these datasets. The datasets were collected under different data sampling protocols in different environments: (a) VIPeR has one image per person per view, with low resolution under severe lighting change. (b) GRID provides one image per person per view, with additional images of 775 distracting persons, under very poor lighting in underground stations. (c) CUHK01 contains two images per person per view from a university campus. (d) CUHK03 consists of up to five images per person per view, obtained from both manually labelled and auto-detected person bounding boxes, with the latter posing a more challenging re-id task due to detection bounding box misalignment and occlusion. (e) Market-1501 has variable numbers of images per person per view, captured from a supermarket, with all bounding boxes automatically detected. These datasets present a wide range of re-id evaluation scenarios with different population sizes under different challenging viewing conditions (Table 2).

Figure 3: Example cross-view image pairs from the five re-id datasets: (a) VIPeR, (b) GRID, (c) CUHK01, (d) CUHK03, (e) Market.

Table 2: Settings of person re-id datasets. TS: Test Setting; SS: Single-Shot; MS: Multi-Shot; SQ: Single-Query; MQ: Multi-Query.

Dataset | Cams | IDs   | Train IDs | Test IDs | Labelled | Detected | TS
VIPeR   | 2    | 632   | 316       | 316      | 1,264    | 0        | SS
GRID    | 8    | 250   | 125       | 125      | 1,275    | 0        | SS
CUHK01  | 2    | 971   | 871/485   | 100/486  | 1,942    | 0        | SS/MS
CUHK03  | 6    | 1,467 | 1,367     | 100      | 14,097   | 14,097   | SS
Market  | 6    | 1,501 | 751       | 750      | 0        | 32,668   | SQ/MQ

Evaluation Protocol. We adopted the standard supervised re-id setting to evaluate the proposed JLML model (Sec. 4.1). The training/test data splits and testing settings of each dataset are given in Table 2. Specifically, on VIPeR, we split randomly the whole population (632 people) into two halves: one for training (316) and one for testing (316). We repeated 10 trials of random people splits and used the averaged results. On CUHK01, we considered two training/test splits: 485/486 [Liao et al., 2015] and 871/100 [Ahmed et al., 2015]. Again, we reported the results averaged over 10 random trials for either split. On GRID, the training/test split was 125/125, with 775 distractor people included in the test gallery. We used the benchmarking 10 people splits [Loy et al., 2009] and the averaged performance. On CUHK03, following [Li et al., 2014] we repeated 20 times the random 1260/100 training/test splits and reported the averaged accuracies under the single-shot evaluation setting. On Market-1501, we used the standard training/test split (750/751) [Zheng et al., 2015]. We used the cumulative matching characteristic (CMC) to measure re-id accuracy on all benchmarks; on Market-1501 we additionally used the recall measure of multiple true matches by mean Average Precision (mAP), i.e. first computing the area under the Precision-Recall curve for each probe, then calculating the mean of the Average Precision over all probes [Zheng et al., 2015].

Competitors. We compared the JLML model against 10 existing state-of-the-art methods as listed in Table 3. They range from hand-crafted and deep learning features to domain-specific distance metric learning methods. We summarise them into three categories: (A) hand-crafted features with domain-specific distance learning (metric); (B) deep learning features with domain-specific deep verification metric learning; (C) deep learning features with the generic non-learning L2 distance (metric).

Table 3: Person re-id method categorisation by features and metrics. Cat: Category; DL: Deep Learning; CPSL: Camera-Pair Specific Learning; DVM: Deep Verification Metric; DVM,L2: Ensemble of DVM and L2; CHS: Fusion of Colour, HOG, SILPT features.

Cat | Method                            | Feature                    | Metric
A   | XQDA [Liao et al., 2015]          | LOMO (hand-crafted)        | XQDA (CPSL)
A   | GOG [Matsukawa et al., 2016b]     | GOG (hand-crafted)         | XQDA (CPSL)
A   | NFST [Zhang et al., 2016]         | LOMO, KCCA (hand-crafted)  | NFST (CPSL)
A   | SCS [Chen et al., 2016]           | CHS (hand-crafted)         | SCS (CPSL)
B   | DCNN+ [Ahmed et al., 2015]        | DCNN+ (DL)                 | DVM (CPSL)
B   | X-Corr [Subramaniam et al., 2016] | X-Corr (DL)                | DVM (CPSL)
B   | MTDnet [Chen et al., 2017a]       | MTDnet (DL)                | DVM, L2
C   | S-CNN [Varior et al., 2016]       | S-CNN (DL)                 | L2 (generic)
C   | DGD [Xiao et al., 2016]           | DGD (DL)                   | L2 (generic)
C   | MCP [Cheng et al., 2016]          | MCP (DL)                   | L2 (generic)
C   | JLML (Ours)                       | JLML (DL)                  | L2 (generic)
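For reference, the sketch below computes the two measures used in the result tables that follow: the CMC curve and mAP over multiple true matches per probe. It is a simplified protocol (it assumes every probe has at least one true gallery match and omits the usual same-camera filtering):

```python
import numpy as np

def cmc_and_map(dist, probe_ids, gallery_ids, max_rank=20):
    """Simplified CMC and mAP. dist: (num_probes, num_gallery) distances."""
    num_probes = dist.shape[0]
    cmc = np.zeros(max_rank)
    aps = []
    for i in range(num_probes):
        order = np.argsort(dist[i])                   # gallery by ascending distance
        matches = gallery_ids[order] == probe_ids[i]  # boolean hit vector
        hits = np.flatnonzero(matches)                # 0-indexed ranks of true matches
        if hits[0] < max_rank:                        # CMC counts the first true match
            cmc[hits[0]:] += 1
        # AP: mean precision at each true-match rank (area under the P-R curve)
        precisions = (np.arange(len(hits)) + 1) / (hits + 1)
        aps.append(precisions.mean())
    return cmc / num_probes, float(np.mean(aps))
```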

Implementation. We used the Caffe framework [Jia et al., 2014] for our JLML model implementation. We started by pre-training the JLML model on ImageNet (ILSVRC2012). Subsequently, for CUHK03 or Market-1501, we used only their own training data for model fine-tuning, i.e. ImageNet → CUHK03/Market; for CUHK01, VIPeR, or GRID, we pre-trained JLML on CUHK03+Market (whole datasets) and then fine-tuned on their respective training images, i.e. ImageNet → CUHK03+Market → CUHK01/VIPeR/GRID. All input person images were resized to 224×224 pixels. For the local branch, following a coarse body part layout we evenly decomposed the whole shared convolutional feature maps (i.e. the entire image) into four (m = 4) horizontal stripe-regions. We used the same parameter settings (summarised in Table 4) for pre-training and training the JLML model on all datasets. We also adopted the stepped learning rate policy, i.e. dropping the learning rate by a factor of 10 every 100K iterations for JLML pre-training and every 20K iterations for JLML training. We utilised the L2 distance as the default matching metric, unless stated otherwise.

Table 4: JLML training parameters. BLR: base learning rate; LRP: learning rate policy; MOT: momentum; IT: iteration; BS: batch size.

Stage     | BLR  | LRP              | MOT | IT # | BS
Pre-train | 0.01 | step (0.1, 100K) | 0.9 | 300K | 32
Train     | 0.01 | step (0.1, 20K)  | 0.9 | 50K  | 32

4.1 Conventional Intra-Domain Re-Id Evaluations

We conducted extensive comparative evaluations on conventional supervised learning based person re-id tasks.

(I) Evaluation on CUHK03. Table 5 compares JLML against 8 existing methods on CUHK03. It is evident that JLML outperforms existing methods in all categories on both labelled and detected bounding boxes, surpassing the 2nd best performers DGD and X-Corr on labelled and detected images in Rank-1 by 7.9% (83.2-75.3) and 8.6% (80.6-72.0) respectively. X-Corr/GOG/JLML also suffer the least from auto-detection misalignment, indicating the robustness and competitiveness of the joint learning approach to mining complementary local and global discriminative features.

(II) Evaluation on Market-1501. We evaluated JLML against four existing models on Market-1501. Table 6 shows the clear performance superiority of JLML over all state-of-the-art methods, with more significant Rank-1 advantages over other methods than on CUHK03, giving 19.3% (85.1-65.8)

Table 5: CUHK03 evaluation. 1st/2nd best in red/blue.

Cat | Method | Labelled (R1 / R5 / R10 / R20) | Detected (R1 / R5 / R10 / R20)
A   | XQDA   | 55.2 / 77.1 / 86.8 / 83.1      | 46.3 / 78.9 / 83.5 / 93.2
A   | GOG    | 67.3 / 91.0 / 96.0 / -         | 65.5 / 88.4 / 93.7 / -
A   | NSFT   | 62.5 / 90.0 / 94.8 / -         | 54.7 / 84.7 / 94.8 / 95.2
B   | DCNN+  | 54.7 / 86.5 / 93.9 / 98.1      | 44.9 / 76.0 / 83.5 / 93.2
B   | X-Corr | 72.4 / 95.5 / - / 98.1         | 72.0 / 96.0 / - / 98.2
B   | MTDnet | 74.7 / 96.0 / 97.5 / 98.4      | - / - / - / -
C   | S-CNN  | - / - / - / -                  | 68.1 / 88.1 / 94.6 / -
C   | DGD    | 75.3 / - / - / -               | - / - / - / -
C   | JLML   | 83.2 / 98.0 / 99.4 / 99.8      | 80.6 / 96.9 / 98.7 / 99.2

(SQ) and 13.7% (89.7-76.0) (MQ) gains over the 2nd best S-CNN. This further validates the advantages of our joint learning of multi-loss classification for optimising re-id, especially when the re-id test population size increases (751 test people on Market-1501 vs. 100 on CUHK03).

Table 6: Market-1501 evaluation. 1st/2nd best in red/blue. All person bounding box images were auto-detected.

Cat | Method | Single-Query (R1 / mAP) | Multi-Query (R1 / mAP)
A   | XQDA   | 43.8 / 22.2             | 54.1 / 28.4
A   | SCS    | 51.9 / 26.3             | - / -
A   | NFST   | 61.0 / 35.6             | 71.5 / 46.0
C   | S-CNN  | 65.8 / 39.5             | 76.0 / 48.4
C   | JLML   | 85.1 / 65.5             | 89.7 / 74.5

(III) Evaluation on CUHK01. We compared our JLML model with 8 state-of-the-art methods on CUHK01. Table 7 shows that JLML clearly surpasses all compared models under both training/test splits, in both single- and multi-shot settings. Moreover, JLML outperforms in Rank-1 (76.7%) the best hand-crafted feature method NFST (R1 69.1%) when the training population size is small (486 people). When the training population size increases (871 people), JLML is even more effective than all deep competitors in exploiting the extra training classes to induce more identity-discriminative joint person features in distinct contexts. For example, JLML gains 5.8% (87.0-81.2) more Rank-1 than the 2nd best method X-Corr in single-shot re-id, improving on the 4.8% (69.8-65.0) gain under the 486/485 split. These results show the consistent superiority and robustness of the proposed JLML model over the existing methods.

Table 7: CUHK01 evaluation. 1st/2nd best in bold/typewriter.

Single-Shot Testing Setting:
Cat | Method | 871/100 split (R1 / R5 / R10 / R20) | 486/485 split (R1 / R5 / R10 / R20)
A   | GOG    | -                                   | 57.8 / 79.1 / 86.2 / 92.1
B   | DCNN+  | 65.0 / - / - / -                    | 47.5 / 71.6 / 80.3 / 87.5
B   | X-Corr | 81.2 / 97.3 / - / 98.6              | 65.0 / 89.7 / - / 94.4
B   | MTDnet | 78.5 / 96.5 / 97.5 / -              | -
C   | DGD    | -                                   | 66.6 / - / - / -
C   | MCP    | -                                   | 53.7 / 84.3 / 91.0 / 96.3
C   | JLML   | 87.0 / 97.2 / 98.6 / 99.4           | 69.8 / 88.4 / 93.3 / 96.3

Multi-Shot Testing Setting:
Cat | Method | 871/100 split (R1 / R5 / R10 / R20) | 486/485 split (R1 / R5 / R10 / R20)
A   | XQDA   | -                                   | 63.2 / 83.9 / 90.0 / 94.2
A   | GOG    | -                                   | 67.3 / 86.9 / 91.8 / 95.9
A   | NFST   | -                                   | 69.1 / 86.9 / 91.8 / 95.4
C   | JLML   | 91.2 / 98.4 / 99.2 / 99.8           | 76.7 / 92.6 / 95.6 / 98.1

(IV) Evaluation on VIPeR. We evaluated the performance of JLML against 8 strong competitors on VIPeR, a more challenging test scenario with fewer training classes (316 people) and lower image resolution. On this dataset, the best performers are hand-crafted feature methods (SCS and NFST) rather than deep models, in contrast to the tests on CUHK01, CUHK03, and Market-1501. This is due to: (i) the small training data being insufficient for effectively learning discriminative deep models with millions of parameters; and (ii) the greater disparity to CUHK03 in camera viewing conditions, which makes knowledge transfer less effective (see Implementation). Nevertheless, the JLML model remains the best among all deep methods, with or without deep verification metric learning. This validates the superiority and robustness of our deep joint global and local representation learning of

multi-loss classification given sparse training data. We attribute this property to JLML's capability of mining complementary features in different contexts, both handling local misalignment and optimising global matching.

Table 8: VIPeR evaluation. 1st/2nd best in red/blue.

Cat | Method | R1   | R5   | R10  | R20
A   | XQDA   | 40.0 | 68.1 | 80.5 | 91.1
A   | GOG    | 49.7 | -    | 88.7 | 94.5
A   | NFST   | 51.1 | 82.1 | 90.5 | 95.9
A   | SCS    | 53.5 | 82.6 | 91.5 | 96.7
B   | DCNN+  | 34.8 | 63.6 | 75.6 | 84.5
B   | MTDnet | 47.5 | 73.1 | 82.6 | 91.1
C   | MCP    | 47.8 | 74.7 | 84.8 | -
C   | DGD    | 38.6 | -    | -    | -
C   | JLML   | 50.2 | 74.2 | 84.3 | 91.6

(V) Evaluation on GRID. We compared JLML against 4 competing methods on GRID⁴. In addition to poor image resolution, poor lighting, and a small training size (125 people), GRID also has extra distractors in the testing population, therefore presenting a very challenging but realistic re-id scenario. Table 9 shows a significant superiority of JLML over existing state-of-the-art methods, with Rank-1 12.8% (37.5-24.7) better than the 2nd best method GOG, a 51.8% relative improvement. This demonstrates the unique and practically desirable advantage of JLML in handling more realistically challenging open-world re-id matching, where large numbers of distractors are usually present. It is worth pointing out that this step-change advantage in re-id matching rate on GRID is achieved by deep learning from only a limited number of training identity classes with highly imbalanced images sampled from 8 distributed camera views, e.g. 25 images from the 6th camera vs. 513 from the 5th camera. This imbalanced sampling directly results in not only scarce pairwise training data but also insufficient training samples for pairwise camera views, causing significant degradation in re-id performance for all pairwise supervised learning based models: XQDA, GOG, SCS, and X-Corr. In contrast, JLML is designed to avoid the need for pairwise labelled information in model learning by instead learning from multi-loss classification. Moreover, the joint learning of multi-loss classification benefits from concurrent local and global feature selections in different contexts, resulting in more robust and accurate re-id matching in a heterogeneous search space.

⁴ The GRID dataset has not been evaluated as extensively as others like VIPeR/CUHK01/CUHK03, although GRID provides a more realistic test setting with a large number of distractors in testing. One possible reason is the more challenging re-id setting imposed by GRID, which results in significantly poorer matching rates for all published methods (see http://personal.ie.cuhk.edu.hk/~ccloy/downloads_qmul_underground_reid.html), as also verified by our evaluation in Table 9.

Table 9: GRID evaluation. 1st/2nd best in red/blue.

Cat | Method | R1   | R5   | R10  | R20
A   | XQDA   | 16.6 | 33.8 | 41.8 | 52.4
A   | GOG    | 24.7 | 47.0 | 58.4 | 69.0
A   | SCS    | 24.2 | 44.6 | 54.1 | 65.2
B   | X-Corr | 19.2 | 38.4 | 53.6 | 66.4
C   | JLML   | 37.5 | 61.4 | 69.4 | 77.4

4.2 CNN Architecture Comparisons

We compared the proposed JLML-ResNet39 model with four seminal classification CNN architectures (AlexNet [Krizhevsky et al., 2012], VGG16 [Simonyan and Zisserman, 2015], GoogLeNet [Szegedy et al., 2015], and ResNet50 [He et al., 2016]) in model size and complexity. Table 10 shows that JLML has both the 2nd smallest model size (7.2 million parameters) and the 2nd smallest FLOPs count (1.54×10⁹), despite containing more streams (5 vs. 1 in all other CNNs) and more layers (39, more than all except ResNet50).

Table 10: Comparisons of model size and complexity. FLOPs: number of FLoating-point OPerations; PN: Parameter Number.

Model         | FLOPs     | PN (million) | Depth | Stream #
AlexNet       | 7.25×10⁸  | 58.3         | 7     | 1
VGG16         | 1.55×10¹⁰ | 134.2        | 16    | 1
ResNet50      | 3.80×10⁹  | 23.5         | 50    | 1
GoogLeNet     | 1.57×10⁹  | 6.0          | 22    | 1
JLML-ResNet39 | 1.54×10⁹  | 7.2          | 39    | 5

4.3 Further Analysis and Discussions

We further examined the component effects of our JLML model on Market-1501 in the following aspects.

(I) Complementary Benefits of Global and Local Features. We evaluated the complementary effects of our jointly learned local and global features by comparing their individual re-id performance against that of the joint features. Table 11 shows: (i) Either feature representation alone is competitive for re-id, e.g. the local JLML feature surpasses S-CNN (Table 6) in Rank-1 by 13.1% (78.9-65.8) (SQ) and 10.4% (86.4-76.0) (MQ), and in mAP by 18.3% (57.8-39.5) (SQ) and 20.0% (68.4-48.4) (MQ). (ii) A further performance gain is obtained from the joint feature representation, yielding a further 6.2% (85.1-78.9) (SQ) and 3.3% (89.7-86.4) (MQ) Rank-1 increase, and a 7.7% (65.5-57.8) (SQ) and 6.1% (74.5-68.4) (MQ) mAP boost. These results show the complementary advantages of jointly learning the local and global features in different contexts using the JLML model.

Table 11: Complementary benefits of global and local features.

Feature       | Single-Query (R1 / mAP) | Multi-Query (R1 / mAP)
JLML (Global) | 77.4 / 56.0             | 85.0 / 66.0
JLML (Local)  | 78.9 / 57.8             | 86.4 / 68.4
JLML (Joint)  | 85.1 / 65.5             | 89.7 / 74.5

(II) Importance of Branch Independence. We evaluated the importance of branch independence by comparing our MultiLoss design with a UniLoss design that merges the two branches into a single loss [Cheng et al., 2016]. Table 12 shows that the proposed MultiLoss design significantly improves the discriminative power of the global and local re-id features, e.g. with a Rank-1 increase of 9.0% (85.1-76.1) (SQ) and 6.0% (89.7-83.7) (MQ), and an mAP improvement of 13.3% (65.5-52.2) (SQ) and 11.7% (74.5-62.8) (MQ). This shows that branch independence plays a critical role in the joint learning of multi-loss classification for effective feature optimisation. One plausible reason is the negative effect of a single loss imposed on the learning behaviour of both branches, caused by the potential divergence of discriminative features in different contexts (local and global). This is shown by the significant performance degradation of both global and local features under the UniLoss model.

Table 12: Importance of branch independence.

Loss      | Feature | Single-Query (R1 / mAP) | Multi-Query (R1 / mAP)
UniLoss   | Global  | 58.3 / 31.7             | 70.4 / 43.2
UniLoss   | Local   | 46.3 / 26.3             | 58.0 / 34.0
UniLoss   | Full    | 76.1 / 52.2             | 83.7 / 62.8
MultiLoss | Global  | 77.4 / 56.0             | 85.0 / 66.0
MultiLoss | Local   | 78.9 / 57.8             | 86.4 / 68.4
MultiLoss | Full    | 85.1 / 65.5             | 89.7 / 74.5

(III) Benefits from Shared Low-Level Features. We evaluated the effects of interaction between the global and local branches introduced by the shared conv layer (common ground), by deliberately removing it and comparing the re-id performance. Table 13 shows the benefits of jointly learning low-level features in the common conv layer, e.g. improving Rank-1 by 1.9% (85.1-83.2) / 1.4% (89.7-88.3) and mAP by 2.4% (65.5-63.1) / 2.4% (74.5-72.1) for single-/multi-query re-id. This confirms a similar finding in the multi-task learning literature [Argyriou et al., 2007].

Table 13: Benefits from shared low-level features.

Model                   | Single-Query (R1 / mAP) | Multi-Query (R1 / mAP)
Without Shared Features | 83.2 / 63.1             | 88.3 / 72.1
With Shared Features    | 85.1 / 65.5             | 89.7 / 74.5

(IV) Effects of Selective Feature Learning. We evaluated the contribution of our structured sparsity based Selective Feature Learning (SFL) (Eq. (6)). Table 14 shows that the SFL mechanism brings additional re-id matching benefits, e.g. improving the Rank-1 rate by 1.7% (85.1-83.4) (SQ) and 1.0% (89.7-88.7) (MQ), and mAP by 1.7% (65.5-63.8) (SQ) and 1.6% (74.5-72.9) (MQ).

Table 14: Effects of selective feature learning (SFL).

Model       | Single-Query (R1 / mAP) | Multi-Query (R1 / mAP)
Without SFL | 83.4 / 63.8             | 88.7 / 72.9
With SFL    | 85.1 / 65.5             | 89.7 / 74.5

(V) Choice of Generic Matching Metrics. We evaluated the choice of generic matching distance for person re-id using the full JLML feature. Table 15 shows that L1 and L2 produce very similar and competitive re-id matching accuracies. This suggests the flexibility of the JLML model in adopting generic matching metrics.

Table 15: Effects of generic matching metrics.

Metric | Single-Query (R1 / mAP) | Multi-Query (R1 / mAP)
L1     | 84.9 / 65.3             | 89.2 / 74.6
L2     | 85.1 / 65.5             | 89.7 / 74.5

(VI) Effects of Body Parts Number. We evaluated the sensitivity of the local decomposition, i.e. the body parts number m. Table 16 shows that the decomposition into 4 body parts is the optimal choice, approximately corresponding to head+shoulder, upper body, upper leg, and lower leg (Figure 4).

Table 16: Effects of body parts number.

m  | Single-Query (R1 / mAP) | Multi-Query (R1 / mAP)
2  | 83.9 / 64.4             | 88.8 / 72.9
4  | 85.1 / 65.5             | 89.7 / 74.5
6  | 83.4 / 62.6             | 88.5 / 71.8
8  | 82.3 / 61.3             | 87.4 / 70.7
10 | 81.7 / 60.4             | 87.2 / 69.8

(VII) Complementary Effects between JLML Deep Features and Supervised Metric Learning. We evaluated the complementary effects of the JLML deep features and conventional supervised metric learning (XQDA [Liao et al., 2015], KISSME [Koestinger et al., 2012], and CRAFT [Chen et al., 2017b]). The results in Table 17 show that: (1) given strong deep learning features such as JLML, additional distance metric learning yields no further benefit from the same training data; (2) moreover, it may even suffer some adverse effects.

(VIII) Local Features vs. Global Features. A strength of the local features, compared to the global features, is their capability of mitigating misalignment and occlusion. This is inherently learned from data by the JLML local branch. Figure 5 shows the single-query re-id results on six randomly selected probe persons with misalignment and/or occlusion. It is evident that the local features achieve better re-id matching ranks than their global counterparts in most cases. This clearly demonstrates the robustness of local features against misalignment of, and occlusion within, a person bounding box.

(IX) Feature Extraction Time Cost. The average time for extracting the JLML feature is 2.75 milliseconds per image (364 images per second) on an Nvidia Pascal P100 GPU card.

Figure 4: Visualisation of the optimal body part decomposition.

Figure 5: Comparing the gallery true-match ranks of each probe image (single-query) with occlusion and/or misalignment by the local and global features. Each probe may have multiple true matches in the gallery; smaller numbers mean better ranking performance. (True-match ranks for the six probes, local vs. global: 1,2,3,8,9 vs. 1,2,8,10,24; 1,6,8,11,15 vs. 1,5,34,38,39; 1,5,34,38,39 vs. 1,7,76,81,88; 2,4,7,21,57,99 vs. 1,9,124,173,212; 1,32,122,409,460 vs. 1,144,163,948,960; 1,2,7,10,13,14 vs. 1,11,15,16,17.)

Table 17: Complementarity of JLML features and metric learning.

Metric | Single-Query (R1 / mAP) | Multi-Query (R1 / mAP)
KISSME | 82.1 / 61.4             | 87.5 / 70.2
XQDA   | 82.6 / 63.2             | 88.2 / 72.4
CRAFT  | 77.9 / 56.4             | - / -
L2     | 85.1 / 65.5             | 89.7 / 74.5

5 Conclusion

In this work, we presented a novel Joint Learning Multi-Loss (JLML) CNN model (JLML-ResNet39) for person re-identification feature learning. In contrast to existing re-id approaches that often employ either global or local appearance features alone, the proposed model is capable of extracting and exploiting both, maximising their correlated complementary effects by learning discriminative feature representations in different contexts subject to multi-loss classification objectives in a unified framework. This is made possible by the proposed JLML-ResNet39 architecture design. Moreover, we introduced a structured sparsity based feature selective learning mechanism to reduce feature redundancy and further improve the joint feature selections. Extensive comparative evaluations on five re-id benchmark datasets validate the advantages of the proposed JLML model over a wide range of state-of-the-art methods on both manually labelled and more challenging auto-detected person images. We also provided component evaluations and analysis of model performance to give insights into the model design.

References

[Ahmed et al., 2015] Ejaz Ahmed, Michael Jones, and Tim K. Marks. An improved deep learning architecture for person re-identification. In CVPR, 2015.
[Argyriou et al., 2007] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In NIPS, 2007.
[Chen et al., 2016] Dapeng Chen, Zejian Yuan, Badong Chen, and Nanning Zheng. Similarity learning with spatial constraints for person re-identification. In CVPR, 2016.
[Chen et al., 2017a] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. A multi-task deep network for person re-identification. In AAAI, 2017.
[Chen et al., 2017b] Ying-Cong Chen, Xiatian Zhu, Wei-Shi Zheng, and Jian-Huang Lai. Person re-identification by camera correlation aware feature augmentation. IEEE TPAMI, 2017.
[Cheng et al., 2016] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In CVPR, 2016.
[Edelman, 1998] Shimon Edelman. Representation is representation of similarities. Behavioral and Brain Sciences, 21(4):449–467, 1998.
[Farenzena et al., 2010] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, 2010.
[Girshick et al., 2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[Gong et al., 2014] Shaogang Gong, Marco Cristani, Shuicheng Yan, and Chen Change Loy. Person Re-Identification. Springer, January 2014.
[Gray and Tao, 2008] Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, 2008.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[Jia et al., 2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
[Koestinger et al., 2012] Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
[Kong et al., 2014] Deguang Kong, Ryohei Fujimaki, Ji Liu, Feiping Nie, and Chris Ding. Exclusive feature learning on arbitrary structures via l1,2-norm. In NIPS, 2014.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[Kviatkovsky et al., 2013] Igor Kviatkovsky, Amit Adam, and Ehud Rivlin. Color invariants for person reidentification. IEEE TPAMI, 35(7):1622–1634, 2013.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[Li et al., 2012] Wei Li, Rui Zhao, and Xiaogang Wang. Human reidentification with transferred metric learning. In ACCV, 2012.
[Li et al., 2014] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
[Liao et al., 2015] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
[Loy et al., 2009] Chen Change Loy, Tao Xiang, and Shaogang Gong. Multi-camera activity correlation analysis. In CVPR, 2009.
[Ma et al., 2017] Xiaolong Ma, Xiatian Zhu, Shaogang Gong, Xudong Xie, Jianming Hu, Kin-Man Lam, and Yisheng Zhong. Person re-identification by unsupervised video matching. Pattern Recognition, 65:197–210, 2017.
[Matsukawa et al., 2016a] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical gaussian descriptor for person re-identification. In CVPR, 2016.
[Matsukawa et al., 2016b] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical gaussian descriptor for person re-identification. In CVPR, 2016.
[Navon, 1977] David Navon. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9(3):353–383, 1977.
[Paisitkriangkrai et al., 2015] Sakrapee Paisitkriangkrai, Chunhua Shen, and Anton van den Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, 2015.
[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[Subramaniam et al., 2016] Arulkumar Subramaniam, Moitreya Chatterjee, and Anurag Mittal. Deep neural networks with inexact matching for person re-identification. In NIPS, 2016.
[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[Torralba et al., 2006] Antonio Torralba, Aude Oliva, Monica S. Castelhano, and John M. Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 113(4):766, 2006.
[Varior et al., 2016] Rahul Rama Varior, Mrinal Haloi, and Gang Wang. Gated siamese convolutional neural network architecture for human re-identification. In ECCV, 2016.
[Wang et al., 2013] Hua Wang, Feiping Nie, and Heng Huang. Multi-view clustering and feature learning via structured sparsity. In ICML, 2013.
[Wang et al., 2014a] H. Wang, S. Gong, and T. Xiang. Unsupervised learning of generative topic saliency for person re-identification. In BMVC, 2014.
[Wang et al., 2014b] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by video ranking. In ECCV, 2014.
[Wang et al., 2016a] Faqiang Wang, Wangmeng Zuo, Liang Lin, David Zhang, and Lei Zhang. Joint learning of single-image and cross-image representations for person re-identification. In CVPR, 2016.
[Wang et al., 2016b] Hanxiao Wang, Shaogang Gong, and Tao Xiang. Highly efficient regression for scalable person re-identification. In BMVC, 2016.
[Wang et al., 2016c] Hanxiao Wang, Shaogang Gong, Xiatian Zhu, and Tao Xiang. Human-in-the-loop person re-identification. In ECCV, 2016.
[Wang et al., 2016d] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person re-identification by discriminative selection in video ranking. IEEE TPAMI, 38(12):2501–2514, 2016.
[Xiao et al., 2016] Tong Xiao, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
[Xiong et al., 2014] Fei Xiong, Mengran Gou, Octavia Camps, and Mario Sznaier. Person re-identification using kernel-based metric learning methods. In ECCV, 2014.
[Zhang et al., 2016] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a discriminative null space for person re-identification. In CVPR, 2016.
[Zhao et al., 2013] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Unsupervised salience learning for person re-identification. In CVPR, 2013.
[Zheng et al., 2013] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Reidentification by relative distance comparison. IEEE TPAMI, 35(3):653–668, March 2013.
[Zheng et al., 2015] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.