
Iterated Support Vector Machines for Distance Metric Learning

arXiv:1502.00363v1 [cs.LG] 2 Feb 2015

Wangmeng Zuo, Member, IEEE, Faqiang Wang, David Zhang, Fellow, IEEE, Liang Lin, Member, IEEE, Yuchi Huang, Member, IEEE, Deyu Meng, and Lei Zhang, Senior Member, IEEE

Abstract—Distance metric learning aims to learn from the given training data a valid distance metric, with which the similarity between data samples can be more effectively evaluated for classification. Metric learning is often formulated as a convex or nonconvex optimization problem, but many existing metric learning algorithms become inefficient for large-scale problems. In this paper, we formulate metric learning as a kernel classification problem, and solve it by iterated training of support vector machines (SVMs). The new formulation is easy to implement, efficient in training, and tractable for large-scale problems. Two novel metric learning models, namely Positive-semidefinite Constrained Metric Learning (PCML) and Nonnegative-coefficient Constrained Metric Learning (NCML), are developed. Both PCML and NCML can guarantee the global optimality of their solutions. Experimental results on UCI dataset classification, handwritten digit recognition, face verification and person re-identification demonstrate that the proposed metric learning methods achieve higher classification accuracy than state-of-the-art methods while being significantly more efficient in training.

Index Terms—metric learning, support vector machine, kernel method, Lagrange duality, alternative optimization

• W. Zuo and F. Wang are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China (e-mail: [email protected]; [email protected]).
• D. Zhang and L. Zhang are with the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong (e-mail: [email protected]; [email protected]).
• L. Lin is with the School of Super-computing, Sun Yat-Sen University, Guangzhou, 510275, China (e-mail: [email protected]).
• Y. Huang is with NEC Laboratories China, Beijing, 100084, China (e-mail: huang [email protected]).
• D. Meng is with the Institute of Information and System Sciences, Faculty of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, 710049, China (e-mail: [email protected]).

Manuscript received XXX; revised XXX.



1 INTRODUCTION

Distance metric learning aims to train a valid distance metric which can enlarge the distances between samples of different classes and reduce the distances between samples of the same class [1]. Metric learning is closely related to k-Nearest Neighbor (k-NN) classification [2], clustering [3], ranking [4], [5], feature extraction [6] and the support vector machine (SVM) [7], and has been widely applied to face recognition [8], person re-identification [9], [10], image retrieval [11], [12], activity recognition [13], document classification [14], link prediction [15], etc.

One popular metric learning approach is Mahalanobis distance metric learning, which learns a linear transformation matrix L or a matrix M = L^T L from the training data. Given two samples x_i and x_j, the Mahalanobis distance between them is defined as:

$$ d_M^2(x_i, x_j) = \|L(x_i - x_j)\|_2^2 = (x_i - x_j)^T M (x_i - x_j). \tag{1} $$
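For concreteness, here is a minimal numpy sketch (variable names are ours, not from the paper) that evaluates (1) both through the linear transformation L and through M = L^T L:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
L = rng.standard_normal((d, d))           # linear transformation matrix
M = L.T @ L                               # induced PSD matrix M = L^T L
xi, xj = rng.standard_normal(d), rng.standard_normal(d)

diff = xi - xj
dist_via_L = np.sum((L @ diff) ** 2)      # ||L(xi - xj)||_2^2
dist_via_M = diff @ M @ diff              # (xi - xj)^T M (xi - xj)
assert np.isclose(dist_via_L, dist_via_M)  # the two forms in (1) agree
```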

To satisfy the nonnegative property of a distance metric, M should be positive semidefinite (PSD). According to which one of M and L is learned, Mahalanobis distance metric learning methods can be grouped into two categories. Methods that learn L, including neighborhood components analysis (NCA) [16], large margin components analysis (LMCA) [17] and neighborhood repulsed metric learning (NRML) [18], are mostly formulated as nonconvex optimization problems, which are solved by gradient descent based optimizers. Taking the PSD constraint into account, methods that learn M, including large margin nearest neighbor (LMNN) [19] and maximally collapsing metric learning (MCML) [20], are mostly formulated as convex semidefinite programming (SDP) problems, which can be optimized by standard SDP solvers [19], projected gradient [3], Boosting-like [21], or Frank-Wolfe [22] algorithms. Davis et al. [23] proposed an information-theoretic metric learning (ITML) model with an iterative Bregman projection algorithm, which does not need projections onto the PSD cone. Besides, the use of online solvers for metric learning has been discussed in [9], [24], [25].

On the other hand, kernel methods [26]–[31] have been widely studied in many learning tasks, e.g., semi-supervised learning, multiple instance learning, multitask learning, etc. Kernel learning methods, such as the support vector machine (SVM), exhibit good generalization performance. There are many open resources on kernel classification methods, and a variety of toolboxes and libraries have been released [32]–[38]. It is thus important to investigate the connections between metric learning and kernel classification and explore how to utilize the kernel classification resources in the research and development of metric learning methods.

In this paper, we propose a novel formulation of metric learning by casting it as a kernel classification problem, which allows us to effectively and efficiently learn distance metrics by iterated training of SVMs.


TABLE 1: Summary of main abbreviations

Abbreviation    Full Name
PSD             Positive semidefinite (matrix)
SDP             Semidefinite programming
k-NN            k-nearest neighbor (classification)
KKT             Karush-Kuhn-Tucker (condition)
SVM             Support vector machine
LMCA [17]       Large margin components analysis
LMNN [2]        Large margin nearest neighbor
NCA [16]        Neighborhood components analysis
MCML [20]       Maximally collapsing metric learning
ITML [23]       Information-theoretic metric learning
LDML [8]        Logistic discriminant metric learning
DML-eig [22]    Distance metric learning with eigenvalue optimization
PLML [39]       Parametric local metric learning
KISSME [9]      Keep it simple and straightforward metric learning
PCML            Positive-semidefinite constrained metric learning
NCML            Nonnegative-coefficient constrained metric learning

The off-the-shelf SVM solvers such as LibSVM [33] can be employed to solve the metric learning problem. Specifically, we propose two novel methods to bridge metric learning with the well-developed SVM techniques, and they are easy to implement. First, we propose a Positive-semidefinite Constrained Metric Learning (PCML) model, which can be solved by iterating between PSD projection and dual SVM learning. Second, by re-parameterizing the matrix M, we transform the PSD constraint into a nonnegative coefficient constraint and consequently propose a Nonnegative-coefficient Constrained Metric Learning (NCML) model, which can be solved by iterated learning of two SVMs. Both PCML and NCML have globally optimal solutions, and our extensive experiments on UCI dataset classification, handwritten digit recognition, face verification and person re-identification clearly demonstrate their effectiveness.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the PCML model and its optimization algorithm. Section 4 presents the model and algorithm of NCML. Section 5 presents the experimental results, and Section 6 concludes the paper. The main abbreviations used in this paper are summarized in Table 1.

2 RELATED WORK

Compared with nonconvex metric learning models [16], [17], [40], convex formulations of metric learning [2], [3], [20]–[22] have drawn increasing attention due to their desirable properties such as global optimality. Most convex metric learning models can be formulated as SDP or quadratic SDP problems. Standard SDP solvers, however, are inefficient for metric learning, especially when the number of training samples is large or the feature dimension is high. Therefore, a customized optimization algorithm needs to be developed for each specific metric learning model. For LMNN, Weinberger et al. developed an efficient solver based on sub-gradient descent and active set techniques [41]. In ITML, Davis et al. [23] suggested an iterative Bregman projection algorithm. Iterative projected gradient descent methods [3], [42] have been widely employed for metric learning, but they require an eigenvalue decomposition in each iteration. Other algorithms such as block-coordinate descent [43], smooth optimization [44], and Frank-Wolfe [22] have also been studied for metric learning. Unlike these customized algorithms, in this work we formulate metric learning as a kernel classification problem and solve it using off-the-shelf SVM solvers, which guarantees the global optimality and the PSD property of the learned M, and is easy to implement and efficient in training.

Another line of work aims to develop metric learning algorithms by solving the Lagrange dual problems. Shen et al. derived the Lagrange dual of an exponential-loss based metric learning model, and proposed a boosting-like approach, namely BoostMetric, where the matrix M is learned as a positive linear combination of rank-one matrices [21], [45]. MetricBoost [46] and FrobMetric [47], [48] were further proposed to improve the performance of BoostMetric. Liu and Vemuri incorporated two regularization terms in the duality for robust metric learning [49]. Note that BoostMetric [21], [45], MetricBoost [46], and FrobMetric [47] are proposed for metric learning with triplet constraints, whereas in many applications such as verification, only pairwise constraints are available in the training stage.

Several SVM-based metric learning approaches [50]–[53] have also been proposed. Using SVM, Nguyen and Guo [50] formulated metric learning as a quadratic semidefinite programming problem, and suggested a projected gradient descent algorithm. The formulations of the proposed PCML and NCML in this work are different from the model in [50], and they are solved via their dual problems with off-the-shelf SVM solvers. Brunner et al. [51] proposed a pairwise SVM method to learn a dissimilarity function rather than a distance metric. Different from [51], the proposed PCML and NCML learn a distance metric and the matrix M is constrained to be a PSD matrix. Do et al. [52] studied SVM from a metric learning perspective and presented an improved variant of SVM classification. Wang et al. [53] developed a kernel classification framework for metric learning and proposed two learning models which can be efficiently implemented by standard SVM solvers. However, they adopted a two-step greedy strategy to solve the models and neglected the PSD constraint in the first step. In this work, the proposed PCML and NCML models have different formulations from [53], and their solutions are globally optimal.


3 POSITIVE-SEMIDEFINITE CONSTRAINED METRIC LEARNING (PCML)

Denote by {(x_i, y_i) | i = 1, 2, ..., N} a training set, where x_i ∈ R^d is the ith training sample, and y_i is the class label of x_i. The Mahalanobis distance between x_i and x_j can be equivalently written as:

$$ d_M^2(x_i, x_j) = \operatorname{tr}\!\big(M^T (x_i - x_j)(x_i - x_j)^T\big) = \big\langle M, (x_i - x_j)(x_i - x_j)^T \big\rangle, \tag{2} $$

where M is a PSD matrix, ⟨A, B⟩ = tr(A^T B) is defined as the Frobenius inner product of two matrices A and B, and tr(·) stands for the matrix trace operator. For each pair of x_i and x_j, we define a matrix X_ij = (x_i − x_j)(x_i − x_j)^T. With X_ij, the Mahalanobis distance can be rewritten as d_M^2(x_i, x_j) = ⟨M, X_ij⟩.

3.1 PCML and Its Dual Problem

Let S = {(x_i, x_j) : the class labels of x_i and x_j are the same} be the set of similar pairs, and let D = {(x_i, x_j) : the class labels of x_i and x_j are different} be the set of dissimilar pairs. By introducing an indicator variable h_ij,

$$ h_{ij} = \begin{cases} 1, & \text{if } (x_i, x_j) \in D \\ -1, & \text{if } (x_i, x_j) \in S, \end{cases} \tag{3} $$

the PCML model can be formulated as:

$$ \min_{M, b, \xi}\ \frac{1}{2}\|M\|_F^2 + C\sum_{i,j}\xi_{ij} \quad \text{s.t. } h_{ij}\big(\langle M, X_{ij}\rangle + b\big) \ge 1 - \xi_{ij},\ \xi_{ij} \ge 0,\ \forall i,j,\quad M \succeq 0, \tag{4} $$

where ξ_ij denotes the slack variables, b denotes the bias, and ‖·‖_F denotes the Frobenius norm. The PCML model defined above is convex and can be solved using standard SDP solvers. However, the high complexity of general-purpose interior-point SDP solvers makes them suitable only for small-scale problems. To improve the efficiency, in the following we first analyze the Lagrange duality of the PCML model, and then propose an algorithm that iterates between SVM training and PSD projection to learn the Mahalanobis distance metric.

By introducing the Lagrange multipliers λ and a PSD matrix Y, the Lagrange dual of the problem in (4) can be formulated as:

$$ \max_{\lambda, Y}\ -\frac{1}{2}\Big\|\sum_{i,j}\lambda_{ij}h_{ij}X_{ij} + Y\Big\|_F^2 + \sum_{i,j}\lambda_{ij} \quad \text{s.t. } \sum_{i,j}\lambda_{ij}h_{ij} = 0,\ 0 \le \lambda_{ij} \le C,\ \forall i,j,\ Y \succeq 0. \tag{5} $$

Please refer to Appendix A for the detailed derivation of the dual problem. Based on the Karush-Kuhn-Tucker (KKT) conditions, the matrix M can be obtained by

$$ M = \sum_{i,j}\lambda_{ij}h_{ij}X_{ij} + Y. \tag{6} $$

The strong duality allows us to first solve the equivalent dual problem in (5) and then obtain the matrix M by (6). However, due to the PSD constraint Y ⪰ 0, the problem in (5) is still difficult to optimize.

Algorithm 1 Algorithm of PCML
Input: S = {(x_i, x_j) : the class labels of x_i and x_j are the same}, D = {(x_i, x_j) : the class labels of x_i and x_j are different}, and h_ij.
Output: M.
Initialize Y^(0), t ← 0.
repeat
  1. Update η^(t+1) with η_ij^(t+1) = 1 − h_ij ⟨X_ij, Y^(t)⟩.
  2. Update λ^(t+1) by solving the subproblem (7) using an SVM solver.
  3. Update Y_0^(t+1) = −Σ_{i,j} λ_ij^(t+1) h_ij X_ij.
  4. Update Y^(t+1) = U^(t+1) Λ_+^(t+1) (U^(t+1))^T, where Y_0^(t+1) = U^(t+1) Λ^(t+1) (U^(t+1))^T and Λ_+^(t+1) = max(Λ^(t+1), 0).
  5. t ← t + 1.
until convergence
M = Σ_{i,j} λ_ij^(t−1) h_ij X_ij + Y^(t−1).
return M

3.2 Alternative Optimization Algorithm

To solve the dual problem efficiently, we propose an optimization approach that updates λ and Y alternatively. Given Y, we introduce a new variable η with η_ij = 1 − h_ij⟨X_ij, Y⟩ = 1 − h_ij(x_i − x_j)^T Y (x_i − x_j), and the subproblem on λ can be formulated as:

$$ \max_{\lambda}\ -\frac{1}{2}\sum_{i,j}\sum_{k,l}\lambda_{ij}\lambda_{kl}h_{ij}h_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\eta_{ij}\lambda_{ij} \quad \text{s.t. } \sum_{i,j}\lambda_{ij}h_{ij} = 0,\ 0 \le \lambda_{ij} \le C,\ \forall i,j. \tag{7} $$

The subproblem (7) is a QP problem. We can define a kernel function of sample pairs as follows:

$$ K\big((x_i, x_j), (x_k, x_l)\big) = \langle X_{ij}, X_{kl}\rangle = \big((x_i - x_j)^T (x_k - x_l)\big)^2. \tag{8} $$

Substituting (8) into (7), the subproblem on λ becomes a kernel-based classification problem, and can be efficiently solved by using existing SVM solvers such as LibSVM [33].

Given λ, the subproblem on Y can be formulated as the projection of a matrix onto the convex cone of PSD matrices:

$$ \min_{Y}\ \|Y - Y_0\|_F^2 \quad \text{s.t. } Y \succeq 0, \tag{9} $$

where Y_0 = −Σ_{i,j} λ_ij h_ij X_ij. Through the eigendecomposition of Y_0, i.e., Y_0 = UΛU^T where Λ is the diagonal matrix of eigenvalues, the solution to the subproblem on Y can be explicitly expressed as Y = UΛ_+U^T, where Λ_+ = max(Λ, 0). Finally, the PCML algorithm is summarized in Algorithm 1.
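The following Python/numpy sketch illustrates the two building blocks of this iteration: the pair kernel in (8) and the PSD projection in (9). The λ-subproblem (7) is left to an external SVM/QP solver, represented here by a placeholder function, so this is a structural sketch under our own naming rather than the authors' implementation:

```python
import numpy as np

def pair_kernel(X, pairs_a, pairs_b):
    """Gram matrix of the pair kernel (8): K((xi,xj),(xk,xl)) = ((xi - xj)^T (xk - xl))^2."""
    Da = X[pairs_a[:, 0]] - X[pairs_a[:, 1]]   # differences xi - xj
    Db = X[pairs_b[:, 0]] - X[pairs_b[:, 1]]   # differences xk - xl
    return (Da @ Db.T) ** 2

def psd_projection(Y0):
    """Projection (9) of Y0 onto the PSD cone via eigendecomposition."""
    evals, U = np.linalg.eigh(Y0)
    return (U * np.maximum(evals, 0.0)) @ U.T  # U Lambda_+ U^T

def pcml(X, pairs, h, C=0.5, n_iter=20, svm_subproblem=None):
    """Skeleton of Algorithm 1; `svm_subproblem` is a placeholder that must
    return lambda solving (7) given the Gram matrix, the labels h, eta and C."""
    d = X.shape[1]
    K = pair_kernel(X, pairs, pairs)
    D = X[pairs[:, 0]] - X[pairs[:, 1]]        # rows are (xi - xj)^T
    Y = np.zeros((d, d))
    lam = np.zeros(len(pairs))
    for _ in range(n_iter):
        eta = 1.0 - h * np.einsum('pi,ij,pj->p', D, Y, D)   # eta_ij = 1 - h_ij <X_ij, Y>
        lam = svm_subproblem(K, h, eta, C)                  # solve (7) with an SVM/QP solver
        Y0 = -(D * (lam * h)[:, None]).T @ D                # Y0 = -sum_ij lambda_ij h_ij X_ij
        Y = psd_projection(Y0)
    return (D * (lam * h)[:, None]).T @ D + Y               # M from (6)
```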


3.3 Optimality Condition

As shown in [54], [55], the general alternating minimization approach will converge. By alternatively updating λ and Y, the proposed algorithm can reach the global optimum of the problems in (4) and (5). The optimality condition of the proposed algorithm can be checked by the duality gap in each iteration, which is defined as the difference between the primal and dual objective values:

$$ \mathrm{DualGap}_{\mathrm{PCML}}^{(n)} = \frac{1}{2}\big\|M^{(n)}\big\|_F^2 + C\sum_{i,j}\xi_{ij}^{(n)} - \sum_{i,j}\lambda_{ij}^{(n)} + \frac{1}{2}\Big\|\sum_{i,j}\lambda_{ij}^{(n)}h_{ij}X_{ij} + Y^{(n)}\Big\|_F^2, \tag{10} $$

where M^(n), ξ^(n), λ^(n), and Y^(n) are feasible primal and dual variables, and DualGap_PCML^(n) is the duality gap in the nth iteration. According to (6), we can derive that

$$ M^{(n)} = \sum_{i,j}\lambda_{ij}^{(n)}h_{ij}X_{ij} + Y^{(n)} = Y^{(n)} - Y_0^{(n)}. \tag{11} $$

As shown in Subsection 3.2, Y_0^(n) = U^(n) Λ^(n) (U^(n))^T and Y^(n) = U^(n) Λ_+^(n) (U^(n))^T, and hence M^(n) = U^(n) Λ_−^(n) (U^(n))^T, where Λ_−^(n) = Λ_+^(n) − Λ^(n). Thus, ‖M^(n)‖_F^2 can be computed by

$$ \big\|M^{(n)}\big\|_F^2 = \operatorname{tr}\!\big(M^{(n)T}M^{(n)}\big) = \operatorname{tr}\!\big(U^{(n)}\Lambda_-^{(n)}U^{(n)T}U^{(n)}\Lambda_-^{(n)}U^{(n)T}\big) = \operatorname{tr}\!\big(U^{(n)}\Lambda_-^{(n)2}U^{(n)T}\big) = \operatorname{tr}\!\big(\Lambda_-^{(n)2}\big). \tag{12} $$

Substituting (11) and (12) into (10), the duality gap of PCML can be obtained as follows:

$$ \mathrm{DualGap}_{\mathrm{PCML}}^{(n)} = C\sum_{i,j}\xi_{ij}^{(n)} - \sum_{i,j}\lambda_{ij}^{(n)} + \operatorname{tr}\!\big(\Lambda_-^{(n)2}\big). \tag{13} $$

Based on the KKT conditions of the PCML dual problem in (5), ξ_ij^(n) can be obtained by

$$ \xi_{ij}^{(n)} = \begin{cases} 0, & \forall\, \lambda_{ij}^{(n)} < C \\ \Big[1 - h_{ij}\big(\langle M^{(n)}, X_{ij}\rangle + b^{(n)}\big)\Big]_+, & \forall\, \lambda_{ij}^{(n)} = C, \end{cases} \tag{14} $$

where

$$ b^{(n)} = \frac{1}{h_{ij}} - \big\langle M^{(n)}, X_{ij}\big\rangle, \quad \forall\, 0 < \lambda_{ij}^{(n)} < C. \tag{15} $$

Please refer to Appendix A for the detailed derivation of ξ_ij^(n) and b^(n). The duality gap is always nonnegative and approaches zero when the primal problem is convex. Thus, it can be used as the termination condition of the algorithm. Fig. 1 plots the curve of the duality gap versus the number of iterations on the PenDigits dataset for PCML. One can see that the duality gap converges to zero in less than 20 iterations and our algorithm reaches the global optimum. In Algorithm 1, we adopt the following termination condition:

$$ \mathrm{DualGap}_{\mathrm{PCML}}^{(t)} < \varepsilon \cdot \mathrm{DualGap}_{\mathrm{PCML}}^{(1)}, \tag{16} $$

where ε is a small constant and we set ε = 0.01 in the experiment.

Fig. 1: Duality gap vs. number of iterations on the PenDigits dataset for PCML.
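As an illustration, the gap in (13) and the stopping rule (16) reduce to a few lines of numpy (a sketch with our own variable names; `eigvals_Y0` are the eigenvalues of Y_0^(n)):

```python
import numpy as np

def pcml_duality_gap(C, xi, lam, eigvals_Y0):
    """Duality gap (13): C*sum(xi) - sum(lambda) + tr(Lambda_-^2),
    where Lambda_- collects the magnitudes of the negative eigenvalues of Y0."""
    lam_minus = np.maximum(-eigvals_Y0, 0.0)
    return C * xi.sum() - lam.sum() + np.sum(lam_minus ** 2)

def converged(gap_t, gap_1, eps=0.01):
    """Termination condition (16)."""
    return gap_t < eps * gap_1
```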

3.4 Remarks

Warm-start: In the updating of λ, we adopt a simple warm-start strategy: we use the solution of the previous iteration as the initialization of the next iteration. Since the previous solution can serve as a good guess, warm-start results in a significant improvement in efficiency.

Construction of pairwise constraints: Based on the training set, we can introduce N^2 pairwise constraints in total. However, in practice we only need to choose a subset of pairwise constraints to reduce the computational cost. For each sample, we find its k nearest neighbors to construct similar pairs and its k farthest neighbors to construct dissimilar pairs. Thus, we only need 2kN pairwise constraints. By this strategy, we reduce the scale of pairwise constraints from O(N^2) to O(kN). Since k is usually a small constant (1∼3) in practice, the computational cost of metric learning is much reduced. A similar strategy for constructing pairwise or triplet constraints can be found in [2], [11]; a sketch of this construction is given after this subsection.

Computational Complexity: We use the LibSVM library for SVM training. The computational complexity of SMO-type algorithms [34] is O(k^2 N^2 d). For the PSD projection, the complexity of conventional SVD algorithms is O(d^3).
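Below is a sketch of the constraint construction described above, under our reading of the text: similar pairs from the k nearest same-class neighbors and dissimilar pairs from the k farthest different-class neighbors (the function and variable names are ours):

```python
import numpy as np

def build_pairs(X, y, k=3):
    """Return index pairs and labels h (+1 dissimilar, -1 similar), about 2kN in total."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    pairs, h = [], []
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        for j in same[np.argsort(d2[i, same])[:k]]:        # k nearest same-class neighbors
            pairs.append((i, j)); h.append(-1)             # similar pair (set S)
        for j in diff[np.argsort(d2[i, diff])[-k:]]:       # k farthest different-class neighbors
            pairs.append((i, j)); h.append(+1)             # dissimilar pair (set D)
    return np.array(pairs), np.array(h)
```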

4 NONNEGATIVE-COEFFICIENT CONSTRAINED METRIC LEARNING (NCML)

Given a set of rank-1 PSD matrices M_t = m_t m_t^T (t = 1, ..., T), a linear combination of M_t is defined as M = Σ_t α_t M_t, where α_t is the scalar combination coefficient. One can easily prove the following Theorem 1.

Theorem 1: If the scalar coefficients α_t ≥ 0, ∀t, the matrix M = Σ_t α_t M_t is a PSD matrix, where M_t = m_t m_t^T is a rank-1 PSD matrix.

Proof: Denote by u ∈ R^d an arbitrary vector. Based on the expression of M, we have:

$$ u^T M u = u^T\Big(\sum_t \alpha_t m_t m_t^T\Big)u = \sum_t \alpha_t u^T m_t m_t^T u = \sum_t \alpha_t \big(u^T m_t\big)^2. $$

Since (u^T m_t)^2 ≥ 0 and α_t ≥ 0, ∀t, we have u^T M u ≥ 0. Therefore, M is a PSD matrix.

4.1 NCML and Its Dual Problem

Motivated by Theorem 1, we propose to transform the PSD constraint in (4) by re-parameterizing the distance metric M, and develop a nonnegative-coefficient constrained metric learning (NCML) method to learn the PSD matrix M. Given the training data S and D, a rank-1 PSD matrix X_ij can be constructed for each pair (x_i, x_j). By assuming that the learned matrix should be a linear combination of the X_ij with the nonnegative coefficient constraint, the NCML model can be formulated as:

$$ \min_{M, b, \alpha, \xi}\ \frac{1}{2}\|M\|_F^2 + C\sum_{i,j}\xi_{ij} \quad \text{s.t. } h_{ij}\big(\langle M, X_{ij}\rangle + b\big) \ge 1 - \xi_{ij},\ \alpha_{ij} \ge 0,\ \xi_{ij} \ge 0,\ \forall i,j,\quad M = \sum_{i,j}\alpha_{ij}X_{ij}. \tag{17} $$

By substituting M with Σ_{i,j} α_ij X_ij, we reformulate the NCML model as follows:

$$ \min_{\alpha, b, \xi}\ \frac{1}{2}\sum_{i,j}\sum_{k,l}\alpha_{ij}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + C\sum_{i,j}\xi_{ij} \quad \text{s.t. } h_{ij}\Big(\sum_{k,l}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) \ge 1 - \xi_{ij},\ \alpha_{ij} \ge 0,\ \xi_{ij} \ge 0,\ \forall i,j. \tag{18} $$

By introducing the Lagrange multipliers η and β, the Lagrange dual of the primal problem in (18) can be formulated as:

$$ \max_{\eta, \beta}\ -\frac{1}{2}\sum_{i,j}\sum_{k,l}(\beta_{ij}h_{ij}+\eta_{ij})(\beta_{kl}h_{kl}+\eta_{kl})\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\beta_{ij} \quad \text{s.t. } \sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle \ge 0,\ 0 \le \beta_{ij} \le C,\ \forall i,j,\ \sum_{i,j}\beta_{ij}h_{ij} = 0. \tag{19} $$

Please refer to Appendix B for the detailed derivation of the dual problem. Based on the KKT conditions, the coefficient α_ij can be obtained by:

$$ \alpha_{ij} = \beta_{ij}h_{ij} + \eta_{ij}. \tag{20} $$

Thus, we can first solve the above dual problem, and then obtain the matrix M by

$$ M = \sum_{i,j}(\beta_{ij}h_{ij} + \eta_{ij})X_{ij}. \tag{21} $$

4.2 Optimization Algorithm

There are two groups of variables, η and β, in problem (19). We adopt an alternative optimization approach to solve them. First, given η, the variables β_ij can be solved as follows:

$$ \max_{\beta}\ -\frac{1}{2}\sum_{i,j}\sum_{k,l}\beta_{ij}\beta_{kl}h_{ij}h_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\delta_{ij}\beta_{ij} \quad \text{s.t. } 0 \le \beta_{ij} \le C,\ \forall i,j,\ \sum_{i,j}\beta_{ij}h_{ij} = 0, \tag{22} $$

where δ is the variable with δ_ij = 1 − h_ij Σ_{k,l} η_kl ⟨X_ij, X_kl⟩. Clearly, the subproblem on β is exactly the dual problem of SVM, and it can be efficiently solved by any standard SVM solver, e.g., LibSVM [33].

Given β, the subproblem on η can be formulated as follows:

$$ \min_{\eta}\ \frac{1}{2}\sum_{i,j}\sum_{k,l}\eta_{ij}\eta_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\eta_{ij}\gamma_{ij} \quad \text{s.t. } \sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle \ge 0,\ \forall i,j, \tag{23} $$

where γ_ij = Σ_{k,l} β_kl h_kl ⟨X_ij, X_kl⟩. To simplify the subproblem on η, we derive the Lagrange dual of (23) based on the KKT condition:

$$ \eta_{ij} = \mu_{ij} - h_{ij}\beta_{ij},\ \forall i,j, \tag{24} $$

where µ is the Lagrange dual multiplier. The Lagrange dual problem of (23) is formulated as follows:

$$ \max_{\mu}\ -\frac{1}{2}\sum_{i,j}\sum_{k,l}\mu_{ij}\mu_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\gamma_{ij}\mu_{ij} \quad \text{s.t. } \mu_{ij} \ge 0,\ \forall i,j. \tag{25} $$

Please refer to Appendix C for the detailed derivation. Clearly, problem (25) is a simpler QP problem than (23), and it can be efficiently solved by standard SVM solvers.

By alternatively updating µ and β, we can solve the NCML dual problem (19). After obtaining the optimal solutions of µ and β, the optimal solution of α in problem (18) can be obtained by

$$ \alpha_{ij} = \mu_{ij},\ \forall i,j. \tag{26} $$

We then have M = Σ_{i,j} α_ij X_ij. The NCML algorithm is summarized in Algorithm 2.

Algorithm 2 Algorithm of NCML
Input: Training set {(x_i, x_j), h_ij}.
Output: The matrix M.
Initialize η^(0) with small random values, t ← 0.
repeat
  1. Update δ^(t+1) with δ_ij^(t+1) = 1 − h_ij Σ_{k,l} η_kl^(t) ⟨X_ij, X_kl⟩.
  2. Update β^(t+1) by solving the subproblem (22) using an SVM solver.
  3. Update γ^(t+1) with γ_ij^(t+1) = Σ_{k,l} β_kl^(t+1) h_kl ⟨X_ij, X_kl⟩.
  4. Update µ^(t+1) by solving the subproblem (25) using an SVM solver.
  5. Update η^(t+1) with η_ij^(t+1) = µ_ij^(t+1) − h_ij β_ij^(t+1).
  6. t ← t + 1.
until convergence
M = Σ_{i,j} µ_ij^(t) X_ij.
return M
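As a quick numerical illustration of Theorem 1 and of the parameterization returned by Algorithm 2 (M = Σ µ_ij X_ij with nonnegative coefficients), the following sketch uses random placeholder data and coefficients of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))            # 30 samples in R^8
pairs = rng.integers(0, 30, size=(50, 2))   # 50 random index pairs (i, j)
alpha = rng.random(50)                      # nonnegative coefficients alpha_ij >= 0

D = X[pairs[:, 0]] - X[pairs[:, 1]]         # rows x_i - x_j
M = (D * alpha[:, None]).T @ D              # M = sum_ij alpha_ij (x_i - x_j)(x_i - x_j)^T

# Theorem 1: a nonnegative combination of rank-1 PSD matrices is PSD.
assert np.linalg.eigvalsh(M).min() >= -1e-9
```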


4.3 Optimality Condition

We check the duality gap of NCML to investigate its optimality condition. From the primal and dual objectives in (18) and (19), the NCML duality gap in the nth iteration is

$$ \mathrm{DualGap}_{\mathrm{NCML}}^{(n)} = \frac{1}{2}\sum_{i,j,k,l}\alpha_{ij}^{(n)}\alpha_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle + C\sum_{i,j}\xi_{ij}^{(n)} + \frac{1}{2}\sum_{i,j,k,l}\big(\beta_{ij}^{(n)}h_{ij}+\eta_{ij}^{(n)}\big)\big(\beta_{kl}^{(n)}h_{kl}+\eta_{kl}^{(n)}\big)\langle X_{ij}, X_{kl}\rangle - \sum_{i,j}\beta_{ij}^{(n)}, \tag{27} $$

where α_ij^(n) and ξ_ij^(n) are the feasible solutions to the primal problem, β_ij^(n) and η_ij^(n) are the feasible solutions to the dual problem, and DualGap_NCML^(n) is the duality gap in the nth iteration. As η_ij^(n) and µ_ij^(n) are the optimal solutions to the primal subproblem on η in (23) and its dual problem in (25), respectively, the duality gap of the subproblem on η is zero, i.e.,

$$ \frac{1}{2}\sum_{i,j}\sum_{k,l}\eta_{ij}^{(n)}\eta_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\eta_{ij}^{(n)}\gamma_{ij}^{(n)} + \frac{1}{2}\sum_{i,j}\sum_{k,l}\mu_{ij}^{(n)}\mu_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle - \sum_{i,j}\gamma_{ij}^{(n)}\mu_{ij}^{(n)} = 0. \tag{28} $$

As shown in (26), α_ij^(n) and µ_ij^(n) should be equal. We substitute (28) into (27) as follows:

$$ \mathrm{DualGap}_{\mathrm{NCML}}^{(n)} = C\sum_{i,j}\xi_{ij}^{(n)} - \sum_{i,j}\beta_{ij}^{(n)} + \sum_{i,j}\mu_{ij}^{(n)}\gamma_{ij}^{(n)}. \tag{29} $$

Based on the KKT conditions of the NCML dual problem in (19), ξ_ij^(n) can be obtained by

$$ \xi_{ij}^{(n)} = \begin{cases} 0, & \text{for all } \beta_{ij}^{(n)} < C \\ \Big[1 - h_{ij}\big(\sum_{k,l}\alpha_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle + b^{(n)}\big)\Big]_+, & \text{for all } \beta_{ij}^{(n)} = C, \end{cases} \tag{30} $$

where [z]_+ = max(z, 0), and b^(n) can be obtained by

$$ b^{(n)} = \frac{1}{h_{ij}} - \sum_{k,l}\alpha_{kl}^{(n)}\langle X_{ij}, X_{kl}\rangle = \frac{\delta_{ij}^{(n+1)}}{h_{ij}} - \gamma_{ij}^{(n)} \quad \text{for all } 0 < \beta_{ij}^{(n)} < C. \tag{31} $$

Please refer to Appendix B for the detailed derivation of ξ_ij^(n) and b^(n).

Fig. 2 plots the curve of the duality gap versus the number of iterations on the PenDigits dataset for NCML. One can see that the duality gap converges to zero in 15 iterations, and NCML reaches the global optimum. In the implementation of Algorithm 2, we adopt the following termination condition:

$$ \mathrm{DualGap}_{\mathrm{NCML}}^{(t)} < \varepsilon \cdot \mathrm{DualGap}_{\mathrm{NCML}}^{(1)}, \tag{32} $$

where ε is a small constant and we set ε = 0.01 in the experiment. Analogous to PCML, the updating of β and µ in NCML can be sped up by using the warm-start strategy. As shown in Fig. 2, the proposed NCML algorithm converges in 10∼15 iterations.

Fig. 2: Duality gap vs. number of iterations on the PenDigits dataset for NCML.

4.4 Remarks

Computational complexity: We use the same strategy as that in PCML to construct the pairwise constraints for NCML. In each iteration, NCML calls the SVM solver twice while PCML calls it only once. When the SMO-type algorithm [34] is adopted for SVM training, the computational complexity of NCML is O(k^2 N^2 d). One extra advantage of NCML lies in its lower computational cost with respect to d, which involves the computation of ⟨X_ij, X_kl⟩ and the construction of the matrix M. Since ⟨X_ij, X_kl⟩ = ((x_i − x_j)^T (x_k − x_l))^2, the cost of computing ⟨X_ij, X_kl⟩ is O(d). The cost of constructing the matrix M is less than O(kNd^2), and this operation is required only once after the convergence of β and µ.

Nonlinear extensions: Note that ⟨X_ij, X_kl⟩ = tr(X_ij^T X_kl) can be treated as an inner product of two pairs of samples: (x_i, x_j) and (x_k, x_l). Analogous to PCML, if we can define a kernel K((x_i, x_j), (x_k, x_l)) on (x_i, x_j) and (x_k, x_l), we can substitute ⟨X_ij, X_kl⟩ with K((x_i, x_j), (x_k, x_l)) to develop new linear or even nonlinear metric learning algorithms, and the Mahalanobis distance between any two samples x_m and x_n can be formulated as:

$$ (x_m - x_n)^T M (x_m - x_n) = \sum_{i,j}\alpha_{ij}K\big((x_i, x_j), (x_m, x_n)\big). \tag{33} $$

Another nonlinear extension strategy is to define a kernel k(x_i, x_j) on x_i and x_j. Since ⟨X_ij, X_kl⟩ = (x_i^T x_k − x_i^T x_l − x_j^T x_k + x_j^T x_l)^2, we can substitute ⟨X_ij, X_kl⟩ with (k(x_i, x_k) − k(x_i, x_l) − k(x_j, x_k) + k(x_j, x_l))^2 and formulate the Mahalanobis distance between x_m and x_n as:

$$ (x_m - x_n)^T M (x_m - x_n) = \sum_{i,j}\alpha_{ij}\big(k(x_i, x_m) - k(x_i, x_n) - k(x_j, x_m) + k(x_j, x_n)\big)^2. \tag{34} $$

That is to say, NCML allows us to learn nonlinear metrics for histograms and structural data by designing proper kernel functions and incorporating appropriate regularizations on α. Metric learning for structural data beyond vector data has recently received considerable research interest [5], [56], and NCML can provide a new perspective on this topic.

SVM solvers: Although our implementation is based on LibSVM, there are a number of well-studied SVM training algorithms, e.g., core vector machines [35], LaRank [36], BMRM [37], and Pegasos [38], which can be utilized for large scale metric learning. Moreover, we can refer to the progress in kernel methods [26]–[28] for developing semi-supervised, multiple instance, and multitask metric learning approaches.
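To make the second nonlinear extension concrete, here is a hedged sketch of the distance in (34). The Gaussian kernel is only an illustrative choice of k(·,·); the paper does not prescribe a specific kernel, and the function names are ours:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Illustrative kernel choice: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def kernel_metric_distance(xm, xn, X, pairs, alpha, kernel=rbf):
    """Nonlinear distance (34): sum_ij alpha_ij (k(xi,xm) - k(xi,xn) - k(xj,xm) + k(xj,xn))^2."""
    dist = 0.0
    for (i, j), a in zip(pairs, alpha):
        term = kernel(X[i], xm) - kernel(X[i], xn) - kernel(X[j], xm) + kernel(X[j], xn)
        dist += a * term ** 2
    return dist
```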

5 EXPERIMENTAL RESULTS

We evaluate the proposed PCML and NCML models for k-NN classification (k = 1) using 9 UCI datasets, 4 handwritten digit datasets, 2 face verification datasets and 2 person re-identification datasets. We compare PCML and NCML with the baseline Euclidean distance metric and 7 state-of-the-art metric learning models, including NCA [16], ITML [23], MCML [20], LDML [8], LMNN [2], PLML [39], and DML-eig [22]. On each dataset, if the partition of training set and test set is not defined, we evaluate the performance of each method by 10-fold cross-validation, and the classification error rate and training time are obtained by averaging over 10 runs of 10-fold cross-validation. PCML and NCML are implemented using the LibSVM toolbox. The source codes of NCA, ITML, MCML, LDML, LMNN, PLML, and DML-eig are available online, and we tune their parameters to get the best results. The code of each method is available at the following URLs:
1. LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
2. NCA: http://www.cs.berkeley.edu/~fowlkes/software/nca/
3. ITML: http://www.cs.utexas.edu/~pjain/itml/
4. MCML: http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
5. LDML: http://lear.inrialpes.fr/people/guillaumin/code.php
6. LMNN: http://www.cse.wustl.edu/~kilian/code/code.html
7. PLML: http://cui.unige.ch/~wangjun/
8. DML-eig: http://empslocal.ex.ac.uk/people/staff/yy267/software.html

5.1 Results on the UCI Datasets

We first use 9 datasets from the UCI Machine Learning Repository [57] to evaluate the proposed models. The information of the 9 UCI datasets is summarized in Table 2. On the Satellite, SPECTF Heart, and Letter datasets, the training set and test set are predefined. On the other datasets, we use 10-fold cross-validation to evaluate the metric learning models.

The proposed PCML and NCML methods involve only one hyper-parameter, i.e., the regularization parameter C. We simply adopt the cross-validation strategy to select C by investigating the influence of C on the classification error rate.
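The evaluation protocol above (1-NN classification with a learned metric, averaged over 10-fold cross-validation) can be sketched as follows. `learn_metric` is a placeholder for PCML or NCML training, and the transformation x → Lx with M = L^T L is one common way, assumed here, to plug a learned metric into a Euclidean k-NN classifier:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, learn_metric, n_splits=10, seed=0):
    """Average 1-NN error rate over a stratified 10-fold cross-validation."""
    errors = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        M = learn_metric(X[train_idx], y[train_idx])         # placeholder: PCML or NCML training
        evals, U = np.linalg.eigh(M)
        L = np.sqrt(np.maximum(evals, 0.0))[:, None] * U.T   # M = L^T L, so x -> L x
        knn = KNeighborsClassifier(n_neighbors=1)
        knn.fit(X[train_idx] @ L.T, y[train_idx])
        errors.append(1.0 - knn.score(X[test_idx] @ L.T, y[test_idx]))
    return float(np.mean(errors))
```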


TABLE 2: The UCI datasets used in our experiments.

Dataset            # of training samples   # of test samples   Feature dimension   # of classes
Breast Tissue                96                    10                   9                 6
Cardiotocography          1,914                   212                  21                10
ILPD                        525                    58                  10                 2
Letter                   16,000                 4,000                  16                26
Parkinsons                  176                    19                  22                 2
Satellite                 4,435                 2,000                  36                 6
Segmentation              2,079                   231                  19                 7
Sonar                       188                    20                  60                 2
SPECTF Heart                 80                   187                  44                 2

Fig. 3 shows the curves of classification error rate versus C for PCML and NCML on the SPECTF Heart dataset. The curves on the other datasets are similar. We can observe that when C < 1, the classification error rates of PCML and NCML are low and stable. When C is higher than 1, the classification error rates jump dramatically. Thus, we set C < 1 in our experiments.

Fig. 3: Classification error rate (%) versus C for PCML and NCML on the SPECTF Heart dataset.

APPENDIX A
THE DUAL OF PCML

Equation (37) implies the following relationship between λ, Y and M:

$$ M = \sum_{i,j}\lambda_{ij}h_{ij}X_{ij} + Y. \tag{43} $$

Substituting (37)∼(39) back into the Lagrangian, we get the following Lagrange dual problem of PCML:

$$ \max_{\lambda, Y}\ -\frac{1}{2}\Big\|\sum_{i,j}\lambda_{ij}h_{ij}X_{ij} + Y\Big\|_F^2 + \sum_{i,j}\lambda_{ij} \quad \text{s.t. } \sum_{i,j}\lambda_{ij}h_{ij} = 0,\ 0 \le \lambda_{ij} \le C,\ \forall i,j,\ Y \succeq 0. \tag{44} $$

As we can see from (43) and (44), M is explicitly determined by the training procedure, but b is not. Nevertheless, b can be easily found by using the KKT complementarity conditions in (39) and (42), which show that ξ_ij = 0 if λ_ij < C, and h_ij(⟨M, X_ij⟩ + b) − 1 + ξ_ij = 0 if λ_ij > 0. Thus we can simply take any training point, for which 0 < λ_ij < C, to compute b by

$$ b = \frac{1}{h_{ij}} - \langle M, X_{ij}\rangle. $$

APPENDIX B
THE DUAL OF NCML

For the NCML primal problem in (18), its Lagrangian can be defined as:

$$ L(\beta, \sigma, \nu, \alpha, b, \xi) = \frac{1}{2}\sum_{i,j,k,l}\alpha_{ij}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + C\sum_{i,j}\xi_{ij} - \sum_{i,j}\beta_{ij}\Big[h_{ij}\Big(\sum_{k,l}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) - 1 + \xi_{ij}\Big] - \sum_{i,j}\nu_{ij}\xi_{ij} - \sum_{i,j}\sigma_{ij}\alpha_{ij}, \tag{48} $$

where β, σ and ν are the Lagrange multipliers which satisfy β_ij ≥ 0, σ_ij ≥ 0 and ν_ij ≥ 0, ∀i,j. Converting the original problem to its dual problem needs the following KKT conditions:

$$ \frac{\partial L(\beta, \sigma, \nu, \alpha, b, \xi)}{\partial \alpha_{ij}} = 0 \;\Rightarrow\; \sum_{k,l}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle - \sum_{k,l}\beta_{kl}h_{kl}\langle X_{ij}, X_{kl}\rangle - \sigma_{ij} = 0, \tag{49} $$

$$ \frac{\partial L(\beta, \sigma, \nu, \alpha, b, \xi)}{\partial b} = 0 \;\Rightarrow\; \sum_{i,j}\beta_{ij}h_{ij} = 0, \tag{50} $$

$$ \frac{\partial L(\beta, \sigma, \nu, \alpha, b, \xi)}{\partial \xi_{ij}} = 0 \;\Rightarrow\; C - \beta_{ij} - \nu_{ij} = 0 \;\Rightarrow\; 0 \le \beta_{ij} \le C, \tag{51} $$

$$ h_{ij}\Big(\sum_{k,l}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) - 1 + \xi_{ij} \ge 0, \quad \xi_{ij} \ge 0, \quad \alpha_{ij} \ge 0, \quad \forall i,j, \tag{52} $$

$$ \beta_{ij} \ge 0, \quad \sigma_{ij} \ge 0, \quad \nu_{ij} \ge 0, \quad \forall i,j, \tag{53} $$

$$ \beta_{ij}\Big[h_{ij}\Big(\sum_{k,l}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big) - 1 + \xi_{ij}\Big] = 0, \quad \nu_{ij}\xi_{ij} = 0, \quad \sigma_{ij}\alpha_{ij} = 0, \quad \forall i,j. \tag{54} $$

Here we introduce a coefficient vector η, which satisfies σ_ij = Σ_{k,l} η_kl ⟨X_ij, X_kl⟩. Note that ⟨X_ij, X_kl⟩ is a positive definite kernel, so we can guarantee that every η corresponds to a unique σ, and vice versa. Equation (49) implies the following relationship between α, β and η:

$$ \alpha_{ij} = \beta_{ij}h_{ij} + \eta_{ij}, \quad \forall i,j. \tag{55} $$

Substituting (49)∼(51) back into the Lagrangian, we get the Lagrange dual problem of NCML as follows:

$$ \max_{\eta, \beta}\ -\frac{1}{2}\sum_{i,j,k,l}(\beta_{ij}h_{ij}+\eta_{ij})(\beta_{kl}h_{kl}+\eta_{kl})\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\beta_{ij} \quad \text{s.t. } \sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle \ge 0,\ 0 \le \beta_{ij} \le C,\ \forall i,j,\ \sum_{i,j}\beta_{ij}h_{ij} = 0. \tag{56} $$

Analogous to PCML, we can use the KKT complementarity conditions in (54) to compute b and ξ_ij in NCML. Equations (51) and (54) show that ξ_ij = 0 if β_ij < C, and h_ij(Σ_{k,l} α_kl ⟨X_ij, X_kl⟩ + b) − 1 + ξ_ij = 0 if β_ij > 0. Thus we can simply take any training data point, for which 0 < β_ij < C, to compute b by

$$ b = \frac{1}{h_{ij}} - \sum_{k,l}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle. \tag{57} $$

After obtaining b, we can compute ξ_ij by

$$ \xi_{ij} = \begin{cases} 0, & \forall\, \beta_{ij} < C \\ \Big[1 - h_{ij}\Big(\sum_{k,l}\alpha_{kl}\langle X_{ij}, X_{kl}\rangle + b\Big)\Big]_+, & \forall\, \beta_{ij} = C, \end{cases} \tag{58} $$

where the term [z]_+ = max(z, 0) denotes the standard hinge loss.

APPENDIX C
THE DUAL OF THE SUBPROBLEM ON η IN NCML

The subproblem on η is formulated as follows:

$$ \min_{\eta}\ \frac{1}{2}\sum_{i,j}\sum_{k,l}\eta_{ij}\eta_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\eta_{ij}\gamma_{ij} \quad \text{s.t. } \sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle \ge 0,\ \forall i,j, \tag{59} $$

where γ_ij = Σ_{k,l} β_kl h_kl ⟨X_ij, X_kl⟩. Its Lagrangian is:

$$ L(\mu, \eta) = \frac{1}{2}\sum_{i,j}\sum_{k,l}\eta_{ij}\eta_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\eta_{ij}\gamma_{ij} - \sum_{i,j}\mu_{ij}\sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle, \tag{60} $$

where µ is the Lagrange multiplier which satisfies µ_ij ≥ 0, ∀i,j. Converting the original problem to its dual problem needs the following KKT condition:

$$ \frac{\partial L(\mu, \eta)}{\partial \eta_{ij}} = 0 \;\Rightarrow\; \sum_{k,l}\eta_{kl}\langle X_{ij}, X_{kl}\rangle + \gamma_{ij} - \sum_{k,l}\mu_{kl}\langle X_{ij}, X_{kl}\rangle = 0. \tag{61} $$

Equation (61) implies the following relationship between µ, η and β:

$$ \eta_{ij} = \mu_{ij} - h_{ij}\beta_{ij}, \quad \forall i,j. \tag{62} $$

Substituting (61) and (62) back into the Lagrangian, we get the following Lagrange dual problem of the subproblem on η:

$$ \max_{\mu}\ -\frac{1}{2}\sum_{i,j}\sum_{k,l}\mu_{ij}\mu_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\gamma_{ij}\mu_{ij} - \frac{1}{2}\sum_{i,j}\sum_{k,l}\beta_{ij}\beta_{kl}h_{ij}h_{kl}\langle X_{ij}, X_{kl}\rangle \quad \text{s.t. } \mu_{ij} \ge 0,\ \forall i,j. \tag{63} $$

Since β is fixed in this subproblem, Σ_{i,j}Σ_{k,l} β_ij β_kl h_ij h_kl ⟨X_ij, X_kl⟩ remains constant in (63). Thus we can omit this term and have the following simplified Lagrange dual problem:

$$ \max_{\mu}\ -\frac{1}{2}\sum_{i,j}\sum_{k,l}\mu_{ij}\mu_{kl}\langle X_{ij}, X_{kl}\rangle + \sum_{i,j}\gamma_{ij}\mu_{ij} \quad \text{s.t. } \mu_{ij} \ge 0,\ \forall i,j. \tag{64} $$

REFERENCES

[1] A. Bellet, A. Habrard, and M. Sebban, "A survey on metric learning for feature vectors and structured data," arXiv:1306.6709, 2013.
[2] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," J. Mach. Learn. Res., vol. 10, pp. 207–244, 2009.
[3] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, "Distance metric learning with application to clustering with side-information," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2002, pp. 505–512.
[4] B. McFee and G. Lanckriet, "Metric learning to rank," in Proc. 27th Int. Conf. Mach. Learn. (ICML 2010), 2010, pp. 775–782.
[5] D. Lim, B. McFee, and G. R. Lanckriet, "Robust structural metric learning," in Proc. 30th Int. Conf. Mach. Learn. (ICML 2013), 2013, pp. 615–623.
[6] W. Zhang, Z. Lin, and X. Tang, "Learning semi-riemannian metrics for semisupervised feature extraction," IEEE Trans. Knowl. Data Eng., vol. 23, no. 4, pp. 600–611, 2011.
[7] Z. Xu, K. Q. Weinberger, and O. Chapelle, "Distance metric learning for kernel machines," arXiv:1208.3422, 2013.
[8] M. Guillaumin, J. Verbeek, and C. Schmid, "Is that you? metric learning approaches for face identification," in Proc. IEEE 12th Int. Conf. Comput. Vis. (ICCV 2009), 2009, pp. 498–505.
[9] M. Kostinger, M. Hirzer, P. Wohlhart, P. Roth, and H. Bischof, "Large scale metric learning from equivalence constraints," in Proc. 2012 IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR 2012), 2012, pp. 2288–2295.
[10] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith, "Learning locally-adaptive decision functions for person verification," in Proc. 16th IEEE Int. Conf. Comput. Vis. (ICCV 2013), 2013, pp. 3610–3617.
[11] S. Hoi, W. Liu, and S. Chang, "Semi-supervised distance metric learning for collaborative image retrieval," in Proc. 2008 IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR 2008), 2008, pp. 1–7.
[12] L. Yang, R. Jin, L. Mummert, R. Sukthankar, A. Goode, B. Zheng, S. C. H. Hoi, and M. Satyanarayanan, "A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 30–44, Jan. 2010.
[13] D. Tran and A. Sorokin, "Human activity recognition with metric learning," in Proc. 10th Eur. Conf. Comput. Vis. (ECCV 2008), 2008, pp. 548–561.
[14] G. Lebanon, "Metric learning for text documents," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 497–508, Apr. 2006.


[15] B. Shaw, B. Huang, and T. Jebara, “Learning a distance metric from a network,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2011, pp. 1899–1907. [16] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood components analysis,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2004, pp. 513–520. [17] L. Torresani and K. Lee, “Large margin component analysis,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2006, pp. 1385–1392. [18] J. Lu, X. Zhou, Y. Tan, Y. Shang, and J. Zhou, “Neighborhood repulsed metric learning for kinship verification,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 2, pp. 331–345, Feb. 2014. [19] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2005, pp. 1473–1480. [20] A. Globerson and S. Roweis, “Metric learning by collapsing classes,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2005, pp. 451–458. [21] C. Shen, J. Kim, L. Wang, and A. Hengel, “Positive semidefinite metric learning using boosting-like algorithms,” J. Mach. Learn. Res., vol. 13, pp. 1007–1036, 2012. [22] Y. Ying and P. Li, “Distance metric learning with eigenvalue optimization,” J. Mach. Learn. Res., vol. 13, pp. 1–26, 2012. [23] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon, “Informationtheoretic metric learning,” in Proc. 24th Int. Conf. Mach. Learn. (ICML 2007), 2007, pp. 209–216. [24] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Metric learning for large scale image classification: generalizing to new classes at near-zero cost,” in Proc. 2012 Eur. Conf. Comput. Vis. (ECCV 2012), 2012, pp. 488–501. [25] G. Checkik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, 2010. [26] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, 2006. [27] S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support vector machines for multiple-instance learning,” in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 561–568. [28] T. Evgeniou, F. France, and M. Pontil, “Regularized multi-task learning,” in Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2004, pp. 109–117. [29] E. Pekalska and B. Haasdonk, “Kernel discriminant analysis for positive definite and indefinite kernels,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 6, pp. 1017–1032, Jun. 2009. [30] S. Anand, S. Mittal, O. Tuzel, and P. Meer, “Semi-supervised kernel mean shift clustering,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1201–1215, Jun. 2014. [31] S. Maji, A. C. Berg, and J. Malik, “Efficient classification for additive kernel SVMs,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 66–77, Jan. 2013. [32] V. Vapnik, The nature of statistical learning theory. New York: Springer, 1995. [33] C. C. Chang and C. J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, pp. 1–27, 2011. [34] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999, pp. 185–208. [35] I. W. Tsang, J. T. Kwok, and P. M. Cheung, “Core vector machines: Fast SVM training on very large data sets,” J. 
Mach. Learn. Res., vol. 3, pp. 363–392, 2005. [36] A. Bordes, L. Bottou, P. Gallinari, and J. Weston, “Solving multiclass support vector machines with larank,” in Proc. 24th Int. Conf. Mach. Learn. (ICML 2007), 2007, pp. 89–96. [37] C. H. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan, “A scalable modular convex solver for regularized risk minimization,” in Proc. 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2007, pp. 727–736. [38] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: primal estimated sub-gradient solver for SVM,” Mathematical Programming, vol. 127, no. 1, pp. 3–30, 2011. [39] J. Wang, A. Woznica, and A. Kalousis, “Parametric local metric learning for nearest neighbor classification,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1610–1618.

[40] G. Niu, B. Dai, M. Yamada, and M. Sugiyama, “Informationtheoretic semi-supervised metric learning via entropy regularization,” in Proc. 29th Int. Conf. Mach. Learn. (ICML 2012), 2012, pp. 89–96. [41] K. Q. Weinberger and L. K. Saul, “Fast solvers and efficient implementations for distance metric learning,” in Proc. 25th Int. Conf. Mach. Learn. (ICML 2008), 2008, pp. 1160–1167. [42] R. Jin, S. Wang, and Y. Zhou, “Regularized distance metric learning: Theory and algorithm,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2009, pp. 862–870. [43] G.-J. Qi, J. Tang, Z.-J. Zha, T.-S. Chua, and H.-J. Zhang, “An efficient sparse metric learning in high-dimensional space via l1 penalized log-determinant regularization,” in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 841–848. [44] Y. Ying, K. Huang, and C. Campbell, “Sparse metric learning via smooth optimization,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2009, pp. 2214–2222. [45] C. Shen, J. Kim, L. Wang, and A. Hengel, “Positive semidefinite metric learning with boosting,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2009, pp. 1651–1659. [46] J. Bi, D. Wu, L. Lu, M. Liu, Y. Tao, and M. Wolf, “Adaboost on low-rank psd matrices for metric learning,” in Proc. 2011 IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR 2011), 2011, pp. 2617–2624. [47] C. Shen, J. Kim, and L. Wang, “A scalable dual approach to semidefinite metric learning,” in Proc. 2011 IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR 2011), 2011, pp. 2601–2608. [48] C. Shen, J. Kim, F. Liu, L. Wang, and A. Hengel, “Efficient dual approach to distance metric learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 2, pp. 394–406, 2014. [49] M. Liu and B. C. Vemuri, “A robust and efficient doubly regularized metric learning approach,” in Proc. 2012 Eur. Conf. Comput. Vis. (ECCV 2012), 2012, pp. 646–659. [50] N. Nguyen and Y. Guo, “Metric learning: A support vector approach,” in Proc. ECML/PKDD, 2008, pp. 125–136. [51] C. Brunner, A. Fischer, K. Luig, and T. Thies, “Pairwise support vector machines and their applications to large scale problems,” J. Mach. Learn. Res., vol. 13, pp. 2279–2292, 2012. [52] H. Do, A. Kalousis, J. Wang, and A. Woznica, “A metric learning perspective of SVM: on the relation of SVM and LMNN,” arXiv:1201.4714, 2012. [53] F. Wang, W. Zuo, L. Zhang, D. Meng, and D. Zhang, “A kernel classification framework for metric learning,” to be appear in IEEE Trans. Neural Netw. Learn. Syst. [54] I. Csiszar and G. Tusnady, “Information geometry and alternating minimization procedures,” Statistics and decisions, Supplement Issue, vol. 1, pp. 205–237, 1984. [55] A. Gunawardana and W. Byrne, “Convergence theorems for generalized alternating minimization procedures,” J. Mach. Learn. Res., vol. 6, pp. 2049–2073, 2005. [56] K. Huang, Y. Ying, and C. Campbell, “Generalized sparse metric learning with relative comparisons,” Knowledge and Inf. Syst., vol. 28, no. 1, pp. 25–45, 2011. [57] A. Frank and A. Asuncion, “UCI machine learning repository [http://archive.ics.uci.edu/ml],” 2010. [58] J. Demsar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, 2006. [59] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Univ. of Massachusetts, Tech. Rep., 2007. [60] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute and simile classifiers for face verification,” in Proc. 2009 IEEE Int. Conf. 
Comput. Vis. (ICCV 2009), 2009, pp. 365–372. [61] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004. [62] S. Gong, M. Cristani, S. Yan, and C. C. Loy, Person Re-identification. Springer, 2014. [63] D. Gray and H. Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in Proc. 2008 Eur. Conf. Comput. Vis. (ECCV 2008), 2008, pp. 262–275. [64] D. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino, “Custom pictorial structures for re-identification,” in Proc. British Machine Vision Conf., 2011. [65] X. Zhou, N. Cui, Z. Li, F. Liang, and T. Huang, “Hierarchical gaussianization for image classification,” in Proc. 12th IEEE Int. Conf. Comput. Vis. (ICCV 2009), 2009, pp. 1971–1977.