Similarity Learning for Nearest Neighbor Classification

Ali-Mustafa Qamar, Eric Gaussier
Laboratoire d’Informatique de Grenoble (LIG)
[email protected]

Jean-Pierre Chevallet, Joo Hwee Lim
Image, Perception, Access & Language
Institute for Infocomm Research (I2R)
[email protected]

Abstract

We propose in this paper an algorithm for learning a general class of similarity measures for kNN classification. This class encompasses, among others, the standard cosine measure, as well as the Dice and Jaccard coefficients. The algorithm we propose is an extension of the voted perceptron algorithm and allows one to learn different types of similarity functions (either based on diagonal, symmetric or asymmetric similarity matrices). The results we obtained show that learning similarity measures yields significant improvements on several collections, for two prediction rules: the standard kNN rule, which was our primary goal, and a symmetric version of it.

1. Introduction

The k Nearest Neighbor (kNN) rule is one of the oldest and simplest classification rules: for each new example x, compute its k nearest neighbors in a set of already classified examples, and assign x to the class which is the most represented in the set of nearest neighbors. The kNN rule has been studied by many researchers, from different communities. In the database community, for example, it is used to determine the instances closest to a given query point. In case-based reasoning, pattern recognition and machine learning, the kNN rule, because of its simplicity and good performance, is still heavily used for classification purposes. Nevertheless, several successful attempts have been made to improve the standard kNN rule by taking into account the geometry of the space in which the examples lie, e.g. by replacing the standard Euclidean distance with a Mahalanobis distance. These attempts have led to a line of research known as metric learning for kNN classification. Most works on metric learning for kNN classification have focused on distance learning (see for example [12, 2, 15, 16]). However, in many practical situations, similarities may be preferred over distances. This is typically the case when one is working on texts,


for which the cosine measure has been deemed more appropriate than standard distances like the Euclidean or Mahalanobis ones. Furthermore, several experiments, including the ones we report here, show that the cosine similarity should be preferred over the Euclidean distance on several non-textual collections (such as Iris, Wine and Balance). Being able to efficiently learn appropriate similarity measures, as opposed to distances, for kNN classification is thus of high importance for various collections. While several works have partially addressed this problem for different applications (see for example [1, 11, 9]), we know of no work which has fully addressed it in the context of kNN classification. This is exactly the goal of the present research.

The remainder of this paper is organized as follows. Section 2 formalizes the problem we are interested in, and defines the class of similarity functions we are targeting. Section 3 presents the algorithms used for training the similarity function and for predicting categories for new examples. Section 4 provides a theoretical justification of these algorithms, while section 5 provides an experimental validation. Lastly, we discuss the relation of our algorithms with related ones in section 6.

2. Formulation of the Problem

We focus here on the problem of learning a similarity for kNN classification. Let x and y be two examples in R^p. We consider similarity functions of the form:

    s_A(x, y) = \frac{x^T A y}{N(x, y)}    (1)

where ^T denotes the transpose, A is a (p × p) matrix and N(x, y) is a normalization which depends on x and y (this normalization is typically used to map the similarity function to a particular interval, such as [0, 1]). Equation 1 generalizes several standard similarity functions. For example, the cosine measure, widely used in text retrieval, is obtained by setting A to the identity matrix I, and N(x, y) to the product of the L2 norms of x and y.

The Dice coefficient is obtained, from presence/absence vectors (i.e. all coordinates of x and y are 0 or 1), by setting A to 2I, and N(x, y) to the sum of the L1 norms of x and y. Similarly, the Jaccard coefficient, again computed between presence/absence vectors, corresponds to A = I and N(x, y) = |x| + |y| − x^T y (|x| denotes the L1 norm). Furthermore, the fact that we do not impose any condition on A (apart from being square) allows one to consider both symmetric and asymmetric similarity functions, depending on the targeted task. For example, Bao et al. [1] make use of two asymmetric similarity functions, the Relative Frequency Model, which is an asymmetric version of the cosine, and the Inclusion Proportion Model, which is an asymmetric version of the Dice coefficient, and show that these asymmetric measures are better than their symmetric counterparts at retrieving partial copies of text documents.

The problem we address here is that of learning a similarity function of the above, general form from training data, to be used in kNN classification. Let (x^(1), c^(1)), ..., (x^(n), c^(n)) be a training set of n labeled examples with inputs x^(i) ∈ R^p and discrete (but not necessarily binary) class labels c^(i) (c^(i) represents the class of the ith example). We wish to learn a (p × p) matrix A that aims at optimizing kNN classification when the neighbourhood function is given by equation 1. To do so, we introduce, as in [16], for each x^(i), its k target neighbors, which are the k elements in c^(i) closest to x^(i) according to a base similarity measure. For example, one may be interested in learning a matrix A which generalizes the cosine similarity. In this case, the target neighbors will be defined according to the standard cosine similarity, and will not change during the learning of A. We will denote the target neighbors of x^(i) by y_l^(i), 1 ≤ l ≤ k. We can then formalize a notion of separability, capturing the fact that any example should be closer to its k target neighbors than to any other set of k examples.

Definition 1. Let S = ((x^(1), c^(1)), ..., (x^(n), c^(n))) be a training sequence of n vectors in R^p and let k be an integer. Let (y_1^(i), ..., y_k^(i)) be the k target neighbors of x^(i) in c^(i). Lastly, let c̄^(i) denote the complement of c^(i) in the category set. We say that S is separable with some margin γ > 0 iff there exists a (p × p) matrix A, with ‖A‖ = 1, such that:

    \forall i, \forall (z_1, \dots, z_k) \in \bar{c}^{(i)}: \quad \sum_{l=1}^{k} \left( s_A(x^{(i)}, y_l^{(i)}) - s_A(x^{(i)}, z_l) \right) \ge \gamma

‖A‖ represents the Frobenius norm of the matrix A. Of course, in practice, our data is not likely to be separable in the above sense. We can nevertheless define a measure of how close a matrix A is to separating the data with margin γ, as follows.

Definition 2. Let S = ((x^(1), c^(1)), ..., (x^(n), c^(n))) be a training sequence of n vectors in R^p, let A be a (p × p) matrix such that ‖A‖ = 1, and let γ > 0. We define the γ-related measure of example i as ǫ_i = max(0, γ − m_i), with:

    m_i = \sum_{l=1}^{k} s_A(x^{(i)}, y_l^{(i)}) - \max_{(z_1, \dots, z_k) \in \bar{c}^{(i)}} \sum_{l=1}^{k} s_A(x^{(i)}, z_l)

We then define the overall separation measure D_{A,γ} of S with respect to A and γ as:

    D_{A,\gamma} = \sqrt{\sum_{i=1}^{n} \epsilon_i^2}

If the data is separable with margin γ according to definition 1, then there exists A such that D_{A,γ} = 0. If A does not separate every example with margin γ, then D_{A,γ} > 0, with the property that the lower D_{A,γ}, the higher the capacity of A to separate S with margin γ. The notion of separation we are considering is relatively loose, as we do not directly impose, for example, that all target neighbors be among the k nearest neighbors of an example. Rather, we impose that any point be, globally, closer to k points from the same class than to k points from any other class. This simplification, also used in [16], allows one to avoid setting complex constraints on each target neighbor, while still retaining the idea behind kNN classification.
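
To make the objects introduced in this section concrete, the short Python sketch below computes the similarity of equation 1 under the cosine, Dice and Jaccard normalizations, as well as the γ-related measure ǫ_i and the separation measure D_{A,γ} of definition 2. It is an illustration only, not the authors' implementation: the function and variable names (sim_A, gamma_related_measure, impostor_sets, ...) are ours, the cosine normalization is assumed inside the separation measure, and the candidate k-tuples drawn from other classes are supplied by the caller.

import numpy as np

def sim_A(x, y, A, norm="cosine"):
    """Generalized similarity s_A(x, y) = x^T A y / N(x, y) (equation 1)."""
    num = x @ A @ y
    if norm == "cosine":                        # N(x, y) = ||x||_2 ||y||_2
        N = np.linalg.norm(x) * np.linalg.norm(y)
    elif norm == "dice":                        # presence/absence vectors, use A = 2I
        N = np.abs(x).sum() + np.abs(y).sum()   # N(x, y) = |x|_1 + |y|_1
    elif norm == "jaccard":                     # presence/absence vectors, use A = I
        N = np.abs(x).sum() + np.abs(y).sum() - x @ y
    else:
        raise ValueError(norm)
    return num / N

def gamma_related_measure(A, x_i, targets, impostor_sets, gamma):
    """epsilon_i = max(0, gamma - m_i) of definition 2, for one example x_i.
    targets: the k target neighbors y_l^(i); impostor_sets: candidate k-tuples
    (z_1, ..., z_k) taken from classes other than c^(i)."""
    s_target = sum(sim_A(x_i, y, A) for y in targets)
    s_impostor = max(sum(sim_A(x_i, z, A) for z in Z) for Z in impostor_sets)
    m_i = s_target - s_impostor
    return max(0.0, gamma - m_i)

def separation_measure(A, examples, gamma):
    """D_{A,gamma} = sqrt(sum_i epsilon_i^2); it is 0 iff A separates S with margin gamma."""
    eps = [gamma_related_measure(A, x, T, Zs, gamma) for (x, T, Zs) in examples]
    return float(np.sqrt(sum(e ** 2 for e in eps)))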

3. A Similarity Learning Algorithm

We provide in this section an algorithm to learn similarity functions of the form given by equation 1. This algorithm, which we will refer to as SiLA, is a variant of the voted perceptron algorithm proposed in [7] and used in [3]. It allows learning diagonal, symmetric or square matrices, depending on the final form of the similarity function one is interested in. The core of SiLA is an on-line update rule which iteratively improves the current estimate of A. The overall goal is to move target examples closer to their input point whenever the input point is closer to a set of differently labeled examples. We provide in section 4 a theoretical motivation for SiLA. In the remainder of the paper, we use kNN(A, x, s) to denote the k nearest neighbors of example x in class s when the similarity function is given by equation 1 with matrix A. For each example i, T(i) will denote the set of target neighbors of x^(i). The training algorithm we use is given below.

SiLA - Training

Input: training set ((x^(1), c^(1)), ..., (x^(n), c^(n))) of n vectors in R^p, number of epochs M; A_ml denotes the element of A at row m and column l
Output: list of weighted (p × p) matrices ((A_1, w_1), ..., (A_q, w_q))
Initialization: t = 1, A^(1) = 0 (null matrix), w_1 = 0

Repeat M times (epochs):
1. for i = 1, ..., n
2.   B(i) = kNN(A^(t), x^(i), c̄^(i))
3.   if Σ_{y ∈ T(i)} s_A(x^(i), y) − Σ_{z ∈ B(i)} s_A(x^(i), z) ≤ 0
4.     ∀(m, l), 1 ≤ m, l ≤ p: A_ml^(t+1) = A_ml^(t) + Σ_{y ∈ T(i)} f_ml(x^(i), y) − Σ_{z ∈ B(i)} f_ml(x^(i), z)
5.     w_{t+1} = 1
6.     t = t + 1
7.   else
8.     w_t = w_t + 1

When an input example is not separated from differently labeled examples, the current A matrix is updated by the difference between the coordinates of the target neighbors and those of the closest differently labeled examples (line 4 of the algorithm), which corresponds to a standard perceptron update. When the current estimate of A correctly classifies the input example under focus, its weight is increased by 1, so that the weights finally correspond to the number of examples correctly classified by A over the different epochs. The functions f_ml allow one to learn different types of matrices and hence different types of similarities: for a diagonal matrix, f_ml(x, y) = δ(m, l) x_m y_l / N(x, y) (with δ the Kronecker symbol); for a symmetric matrix, f_ml(x, y) = (x_m y_l + x_l y_m) / N(x, y); and for a square matrix, f_ml(x, y) = x_m y_l / N(x, y).

The weighted matrices provided by the above algorithm can be used to predict the class(es) to which a new example should be assigned. We consider two basic rules for prediction. The first one corresponds to the standard kNN rule, whereas the second one directly corresponds to the notion of separation we have introduced before, and is based on the consideration of the same number of examples in the different classes. The new example is simply assigned to the closest class, the similarity with a class being defined as the sum of the similarities between the new example and its k nearest neighbors in the class.
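
The following Python sketch mirrors the training loop above. It is a simplified illustration under our own naming and data layout (X as an n × p array, c as an array of labels, targets[i] as the indices of the k target neighbors of x^(i)), with the cosine normalization assumed; it should not be read as the authors' actual implementation.

import numpy as np

def f_update(x, y, N, kind):
    """Matrix of the f_ml(x, y) values used in line 4 of SiLA."""
    if kind == "diagonal":
        return np.diag(x * y) / N                         # delta(m, l) x_m y_l / N(x, y)
    if kind == "symmetric":
        return (np.outer(x, y) + np.outer(y, x)) / N      # (x_m y_l + x_l y_m) / N(x, y)
    return np.outer(x, y) / N                             # square (unconstrained) matrix

def sim(x, y, A):
    return (x @ A @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def sila_train(X, c, targets, k, M, kind="square"):
    """Returns the weighted matrix sequence ((A_1, w_1), ..., (A_q, w_q))."""
    X, c = np.asarray(X, dtype=float), np.asarray(c)
    n, p = X.shape
    A, w, out = np.zeros((p, p)), 0, []
    for _ in range(M):                                    # epochs
        for i in range(n):
            diff = np.flatnonzero(c != c[i])              # examples of the other classes
            B = diff[np.argsort([-sim(X[i], X[j], A) for j in diff])[:k]]
            s_T = sum(sim(X[i], X[j], A) for j in targets[i])
            s_B = sum(sim(X[i], X[j], A) for j in B)
            if s_T - s_B <= 0:                            # not separated: perceptron update
                out.append((A.copy(), w))                 # close the current (A_t, w_t) pair
                N = lambda j: np.linalg.norm(X[i]) * np.linalg.norm(X[j])
                A = A + sum(f_update(X[i], X[j], N(j), kind) for j in targets[i]) \
                      - sum(f_update(X[i], X[j], N(j), kind) for j in B)
                w = 1                                     # the new matrix starts with weight 1
            else:
                w += 1                                    # count correctly handled examples
    out.append((A.copy(), w))
    return out

Note that, as in the algorithm, a new (matrix, weight) pair is started after every update, so that each weight counts the examples correctly handled by the corresponding matrix over the different epochs.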

4. Analysis

We provide in this section performance bounds for the SiLA algorithm. These bounds, and the theorems they rely on, directly parallel the ones provided by [7] and used in [3]. To see the parallel between our work and the above-mentioned ones, first note that x^T A y / N(x, y) can be rewritten as:

    \frac{x^T A y}{N(x, y)} = \alpha^T \cdot \phi(x, y)

with (α, φ(x, y)) ∈ R^p × R^p when A is diagonal, and (α, φ(x, y)) ∈ R^{p^2} × R^{p^2} otherwise.

α can be seen as the vector equivalent of matrix A. The cosine similarity is obtained, with this representation, by setting α to the unit vector (α_m = 1, 1 ≤ m ≤ p) and φ_m(x, y) = x_m y_m / (‖x‖ ‖y‖). By setting φ to the tensor product between vectors x and y, one obtains a representation equivalent to the one with an unconstrained, square matrix A. By setting φ to the “symmetric product”, i.e. φ_ml(x, y) = (x_m y_l + x_l y_m) / N(x, y), one obtains a representation equivalent to the one with a symmetric matrix A.

The theorems justifying the use of the voted perceptron algorithm can be extended to SiLA, and we present them below. The justification of SiLA proceeds in three steps: (a) theorem 1 justifies the core on-line update of SiLA in the separable case, (b) theorem 2 provides a similar justification for the non-separable case, and (c) theorem 3 provides a justification for the batch version used for prediction. We omit the proofs of these theorems, which are mainly technical and parallel the ones given in [7].

Theorem 1 (separable case). For any training sequence S = ((x^(1), c^(1)), ..., (x^(n), c^(n))) separable with margin γ, for one iteration (epoch) of the (on-line) update rule of SiLA:

    \text{Number of mistakes} \le R^2 / \gamma^2

where R is a constant such that ∀i, ∀(z_1, ..., z_k) ∈ c̄^(i), ‖ Σ_{y ∈ T(i)} φ(x^(i), y) − Σ_{n=1}^{k} φ(x^(i), z_n) ‖ ≤ R.

Theorem 1 implies that, if the data is separable, then the update rule of SiLA makes a number of mistakes bounded above by a quantity which depends on the margin of the data (the larger the margin, the fewer mistakes made). The more general case where the data is not separable is covered by theorem 2, which makes use of the measure D_{A,γ} (or equivalently D_{α,γ} with the new representation) introduced in definition 2.

Theorem 2 (non-separable case). For any training sequence S = ((x^(1), c^(1)), ..., (x^(n), c^(n))), for one iteration (epoch) of the (on-line) update rule of SiLA:

    \text{Number of mistakes} \le \min_{\alpha, \gamma} \frac{(R + D_{\alpha,\gamma})^2}{\gamma^2}

where R is a constant such that ∀i, ∀(z_1, ..., z_k) ∈ c̄^(i), ‖ Σ_{y ∈ T(i)} φ(x^(i), y) − Σ_{n=1}^{k} φ(x^(i), z_n) ‖ ≤ R, and the min is taken over α and γ such that ‖α‖ = 1, γ > 0.

This theorem implies that, provided the data is close to being separable, the update rule of SiLA converges in a finite number of steps, and makes a number of mistakes bounded by a quantity which is smaller when the separation of the data (as measured by D) is better. However, we are interested here not only in the convergence of the update rule (which corresponds to an on-line version of the algorithm), but also in the convergence of the batch version used for prediction. The following theorem both proves this convergence and shows that the batch version is able to generalize well, i.e. to behave adequately on test (unseen) data. This theorem is based on the on-line to batch conversion studied in [10].

Theorem 3 (generalization). Assume all examples are generated i.i.d. at random. Let E be the expected number of mistakes that the update rule of SiLA makes on a randomly generated sequence of m + 1 examples. Then, given m random training examples, the expected probability that the deterministic leave-one-out conversion of this algorithm makes a mistake on a randomly generated test instance is at most 2E / (m + 1).

The deterministic leave-one-out conversion of the training version of SiLA corresponds to the weighted sum (A = Σ_{l=1}^{q} w_l A_l) used in the prediction rules given above. One can find in [5] a study of similar on-line to batch conversions, showing that it may be beneficial to weigh down (or even forget) the matrices (or vectors) learned in the first iterations of the on-line algorithm. That is, instead of basing the prediction on the complete sequence ((A_1, w_1), ..., (A_q, w_q)), one can base it on, say, the last r elements. We use this strategy in our experiments.
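
As an illustration of how the learned sequence is used at prediction time, the sketch below (again our own simplification, with hypothetical function names) forms the weighted sum A = Σ_l w_l A_l, optionally restricted to the last r pairs as discussed above, and applies the two prediction rules of section 3 with a cosine-style normalization.

import numpy as np

def combine(sequence, r=None):
    """Deterministic conversion: A = sum_l w_l A_l, optionally over the last r pairs only."""
    pairs = sequence if r is None else sequence[-r:]
    return sum(w * A for A, w in pairs)

def knn_predict(x, X, c, A, k):
    """Standard kNN rule with the learned similarity: majority class among the k nearest neighbors."""
    c = np.asarray(c)
    sims = np.array([(x @ A @ y) / (np.linalg.norm(x) * np.linalg.norm(y)) for y in X])
    nearest = np.argsort(-sims)[:k]
    labels, counts = np.unique(c[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def sknn_predict(x, X, c, A, k):
    """Symmetric rule: assign x to the class whose k nearest neighbors are most similar to x."""
    c = np.asarray(c)
    sims = np.array([(x @ A @ y) / (np.linalg.norm(x) * np.linalg.norm(y)) for y in X])
    scores = {s: np.sort(sims[c == s])[-k:].sum() for s in np.unique(c)}
    return max(scores, key=scores.get)

In the experiments reported in section 5, the number of epochs and the value of r are chosen on a validation set.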

5. Experimental Validation

We used four different datasets to assess SiLA. The first three, part of the UCI database ([6]), are Iris, Wine and Balance. These are standard collections which have been used by different research communities (machine learning, pattern recognition, statistics). Even though kNN has traditionally been used on these collections with the Euclidean distance (or with a Mahalanobis distance learned from the data, as in [16, 4]), we show here that the cosine should be preferred to the Euclidean distance on all of them. The Iris Plant data set consists of 150 examples, 4 attributes and 3 classes, each class being composed of 50 instances. In all our splits, we used 120 examples for training and validation, and 30 for testing. The Wine Recognition data set consists of 178 examples, 13 attributes and 3 unbalanced classes. The attributes represent the

constituents found in each of the three types of wines. In all our splits, we used 143 examples for training and validation, and 35 for testing. The Balance Scale data set is composed of 625 examples, 4 attributes and 3 classes. In all our splits, we used 500 examples for training and validation, and 125 for testing. The fourth collection we use is the 20-newsgroups data set, which is composed of posted articles from 20 newsgroups and contains approximately 20,000 documents. We used the 18828 version, in which the cross-postings have been removed and only the “From” and “Subject” headers are kept. The Rainbow package [13] was used to tokenize the data set, each document being represented by the weighted word counts of the 20,000 most common words. We then performed singular value decomposition using SVDLIBC (available at http://tedlab.mit.edu/~dr/svdlibc/), which reduced the dimension to 200. Many of the resulting documents were zero vectors and were subsequently removed, which reduced the number of documents to 4561; 2281 documents were used for training and validation, while 2280 documents were used in the test phase.

Because of their relatively small size, we used, for the three UCI data sets, 5-fold nested cross-validation to learn the weighted matrix sequence ((A_1, w_1), ..., (A_q, w_q)), and to determine the number of iterations to be used as well as the elements in the sequence to be retained. As mentioned before, it may be beneficial to retain only the last r elements in the sequence ((A_1, w_1), ..., (A_q, w_q)). The training part of the data is used to learn the sequence, whereas the validation part is used to determine when to stop iterating (i.e. setting the number of epochs) and, finally, the best value of r. In both cases, we used a greedy strategy which consists in generating many hypotheses and selecting the best one, namely the one which maximizes the classification accuracy on the validation set. Figure 1 shows the performance obtained with different numbers of iterations. As one can see, the performance increases with the number of iterations, up to a certain point where a plateau is reached and the performance remains stable. Note however that, due to the scale used on the x-axis of figure 1, small variations in the performance cannot be displayed, but are nevertheless observed in practice.

We used, in our experiments, the two prediction rules described in section 3. The first one is the standard kNN rule, in which the classification is based on the k nearest neighbors, while the second one, termed SkNN (‘S’ standing for symmetric), relies on the difference of similarity between the k nearest neighbors from the same class and the k nearest neighbors from other classes (one can find in [14] a different version of a symmetric kNN rule, in which one considers not only the k nearest neighbors of a given example x, but also the points for which x is a nearest neighbor). As the cosine similarity is certainly the most widely used similarity, we restricted ourselves, in our experiments, to the cosine similarity and its generalization provided by equation 1.
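
As a side note on the 20-newsgroups preprocessing described above, a roughly comparable pipeline can be assembled with scikit-learn. This is a substitution on our part (the original experiments used the Rainbow package and SVDLIBC, and kept only the “From” and “Subject” headers), so the sketch below should be read as an approximation rather than a reproduction of the exact setup.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Load the 20-newsgroups collection (only an approximation of the 18828 version
# used in the paper, which keeps just the "From" and "Subject" headers).
data = fetch_20newsgroups(subset="all", remove=("footers", "quotes"))

# Weighted word counts restricted to the 20,000 most common words.
vec = TfidfVectorizer(max_features=20000)
X = vec.fit_transform(data.data)

# Singular value decomposition down to 200 dimensions.
svd = TruncatedSVD(n_components=200, random_state=0)
X_red = svd.fit_transform(X)

# Remove documents reduced to (near) zero vectors, as done in the paper.
keep = np.linalg.norm(X_red, axis=1) > 1e-10
X_red, y = X_red[keep], np.asarray(data.target)[keep]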

Figure 1. Evolution of the performance (accuracy, for k=1 and k=3) with respect to the number of iterations (Wine collection).

We thus obtain four kNN-based methods that we wish to compare:
(a) the standard kNN rule with the cosine similarity, which we will refer to as kNN-cos;
(b) the standard kNN rule with the similarity, based on the cosine and corresponding to equation 1, learned with SiLA, which we will refer to as kNN-A;
(c) the symmetric prediction rule (see section 3) with the cosine similarity, which we will refer to as SkNN-cos;
(d) the symmetric prediction rule with the similarity, also based on the cosine and corresponding to equation 1, learned with SiLA, which we will refer to as SkNN-A.

Unless otherwise stated, we used a binary version of SiLA, in which a sequence of weighted matrices is learned for each class (in a one-vs-the-others scenario), and assessed the quality of a given method with its average accuracy (i.e. the accuracy averaged over the different classes). For the two methods based on SiLA, the number of iterations and the weighted matrices to be retained for prediction were determined on the validation sets. For the three UCI data sets, we used 5-fold nested cross-validation, as mentioned above. For each class, we averaged the results obtained on the different splits to get an accuracy per class, and averaged these accuracies over all the classes to get the global accuracy.

Table 1 gives the results we obtained. For each method, we show only the best results obtained over different values of k. The main conclusions that can be drawn from table 1 are:

1. On each collection, the best results are obtained with the methods based on SiLA; for Iris these results are equivalent to the ones obtained with the standard kNN rule with the cosine similarity. Furthermore, the values obtained for the standard deviation (given in parentheses in table 1) show that all the methods are stable on the collections considered.

2. Using SiLA with a base cosine similarity and the standard kNN rule (kNN-A) yields very good results on all four collections; the symmetric counterpart (SkNN-A) performs slightly better on the Balance and Wine collections (gaining respectively 0.4% and 0.2%), but significantly worse on News, the two methods being on par on Iris.

3. The cosine similarity yields results which are always above those of the Euclidean distance, for all the values of k we considered. This result justifies the use of the cosine similarity, instead of the Euclidean distance, on our collections.

We finally compared our results with the ones obtained with several recent and concurrent approaches, namely the Maximally Collapsing Metric Learning (MCML) method presented in [8], the Large Margin Nearest Neighbor (LMNN) method of [16], the Information Theoretic Metric Learning (ITML) method of [4], and a multiclass version of SVMs. For the comparison, we rely on a multiclass version of our algorithm, using the accuracy as the performance measure. We also focus on the three UCI collections, which are the three collections common to the different studies on metric learning for kNN classification. As there is no preprocessing involved with the three UCI collections, we directly rely on the results reported in previous works for the different methods. Table 2 summarizes the results of the comparison (the best results are marked with an asterisk). As one can note from table 2, the method based on SiLA outperforms the other methods on two out of three collections. For Balance, this is due to the conjunction of two facts: (a) the cosine similarity is more appropriate than the Euclidean distance, and (b) SiLA further improves the base cosine results. For Iris, this is only due to the use of the cosine similarity, as the results for SiLA (kNN-A) are the same as the ones directly obtained with the cosine. It is interesting to note, however, that, on this collection, the results reported in [16] show that LMNN degrades the performance of its base classifier (i.e. a standard kNN with the Euclidean distance). The other algorithms seem to avoid overtraining in this case. The results for Wine are interesting as this is the only collection on which both LMNN and ITML clearly outperform the other approaches. Contrary to the other collections, Wine has completely different scales for the different attributes, so that one attribute tends to dominate the comparison between two examples. SiLA (like MCML and the multiclass SVM) does not seem to be able to appropriately weigh down this attribute, whereas both LMNN and ITML can.

           kNN-euclidean   kNN-cos          kNN-A            SkNN-cos         SkNN-A
Balance    0.911           0.959 (0.0133)   0.979 (0.0104)   0.969 (0.0122)   0.983 (0.0115)
Wine       0.840           0.905 (0.0358)   0.916 (0.0351)   0.909 (0.0432)   0.916 (0.0422)
Iris       0.978           0.987 (0.0178)   0.987 (0.0178)   0.982 (0.0166)   0.987 (0.0178)
News       -               0.929            0.947            0.907            0.902

Table 1. Overall results obtained on the four collections (standard deviations in parentheses).

           SiLA (kNN-A)   MCML    LMNN     ITML     Multiclass SVM
Balance    0.952*         0.925   0.916    0.920    0.922
Wine       0.863          0.837   0.974*   0.974*   0.801
Iris       0.982*         0.967   0.953    0.961    0.956

Table 2. Comparison of different similarity/metric learning algorithms on 3 UCI databases (best results marked with an asterisk).

6. Conclusion

We have proposed in this paper an algorithm for learning a general class of similarity measures for kNN classification. This class encompasses, among others, the standard cosine measure, as well as the Dice and Jaccard coefficients. Even though the cosine measure is often associated with textual collections, our experiments show that its use should be preferred over the Euclidean distance on several non-textual collections (such as Iris, Wine and Balance). Being able to efficiently learn appropriate similarity measures, as opposed to distances, for kNN classification is thus of high importance for various collections. The algorithm we have proposed to do so is an extension of the voted perceptron algorithm introduced in [7], with established performance and generalization bounds for both the on-line and batch versions. This extension allows learning different types of similarity functions, and yields significant improvements on several collections.

References

[1] J.-P. Bao, J.-Y. Shen, X.-D. Liu, and H.-Y. Liu. Quick asymmetric text similarity measures. In Proceedings of the 2003 International Conference on Machine Learning and Cybernetics, 2003.
[2] L. Baoli, L. Qin, and Y. Shiwen. An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP), 2004.
[3] M. Collins. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In EMNLP ’02: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002.
[4] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[5] O. Dekel and Y. Singer. Data-driven online to batch conversions. In NIPS, 2005.
[6] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
[7] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Mach. Learn., 37(3), 1999.
[8] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2005.
[9] M. Grabowski and A. Szałas. A technique for learning similarities on complex structures with applications to extracting ontologies. In P. S. Szczepaniak, J. Kacprzyk, and A. Niewiadomski, editors, Proceedings of the 3rd Atlantic Web Intelligence Conference, AWIC 2005, LNAI. Springer Verlag, 2005.
[10] D. P. Helmbold and M. K. Warmuth. On weak learning. J. Comput. Syst. Sci., 50(3), 1995.
[11] A. Hust. Learning similarities for collaborative information retrieval. In Proceedings of the KI 2004 workshop “Machine Learning and Interaction for Text-Based Information Retrieval”, TIR-04, pages 43–54, September 2004.
[12] M. Diligenti, M. Maggini, and L. Rigutini. Learning similarities for text documents using neural networks. In Proceedings of the IAPR – TC3 International Workshop on Artificial Neural Networks in Pattern Recognition (ANNPR 2003), 2003.
[13] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, 1996.
[14] R. Nock, M. Sebban, and D. Bernard. A simple locally adaptive nearest neighbor rule with application to pollution forecasting. International Journal of Pattern Recognition and Artificial Intelligence, 17(8), 2003.
[15] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In ICML ’04: Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, USA, 2004. ACM.
[16] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems 18, 2006.