The final publication is available at IOS Press through http://dx.doi.org/10.3233/IDA-163196.

Multi-Modal Deep Distance Metric Learning

Seyed Mahdi Roostaiyan, Ehsan Imani, Mahdieh Soleymani Baghshah*

Computer Engineering Department, Sharif University of Technology, Tehran, Iran
[email protected], [email protected], [email protected]

Corresponding author: M. Soleymani Baghshah, Department of Computer Engineering, Sharif University of Technology (SUT), Azadi St., Tehran, Iran. PO Box: 1458889694. Tel.: +98 2166166654; Fax: +98 21 6601 9246; E-mail: [email protected]

Abstract. In many real-world applications, data contain heterogeneous input modalities (e.g., web pages include images, text, etc.). Moreover, data such as images are usually described using different views (i.e., different sets of features). Learning a distance metric or similarity measure that originates from all input modalities or views is essential for many tasks, such as content-based retrieval. In these cases, similar and dissimilar pairs of data can be used to find a better representation of data in which similarity and dissimilarity constraints are better satisfied. In this paper, we incorporate supervision in the form of pairwise similarity and/or dissimilarity constraints into multi-modal deep networks to combine different modalities into a shared latent space. Using properties of multi-modal data, we design multi-modal deep networks and propose a pre-training algorithm for these networks. In fact, the proposed network has the ability to learn intra- and inter-modal high-order statistics from raw features, and we control its high flexibility via an efficient multi-stage pre-training phase tailored to the properties of multi-modal data. Experimental results show that the proposed method outperforms recent methods on image retrieval tasks.

Keywords: multi-modal data; metric learning; deep networks; similar-dissimilar pairs; pre-training.


1. Introduction
A proper distance metric (or similarity measure) plays an important role in many learning and retrieval tasks. Until now, many methods have been proposed for metric learning [1-5]. In these methods, it is usually assumed that supervisory information in the form of relative distance constraints or similar/dissimilar pairs is available. Some of these methods learn linear [1, 2] or nonlinear [4] transformations on the feature space to find a new representation space in which distance (or similarity) constraints are better satisfied. However, the methods that learn nonlinear transformations (or, implicitly, kernel matrices) are usually either limited to learning a restricted form of non-linear transformation or very time consuming when they are allowed to be flexible (i.e., when they learn the whole kernel matrix). Moreover, the flexible methods that learn the whole kernel matrix are transductive and cannot be used to find similarities for new data. Recently, some deep metric learning methods [6-9] have been proposed that can learn a non-linear transformation to achieve a new representation space in which the distance constraints are better satisfied. However, these deep metric learning models have been designed for input data containing one modality. Therefore, they have not used the properties of multi-modal data in designing the architecture and training of the deep networks. On the other hand, real-world data usually contain different modalities such as text, image, and video. Video with its corresponding audio [10] and annotated images [11, 12] are examples of multi-modal data. In recent years, several multi-modal methods have been proposed to incorporate heterogeneous modalities in classification and retrieval tasks [11-14]. Besides, there are similar challenges in multi-view descriptions of data, where each view is a description of the data obtained by a particular feature extraction method [15, 16].
Because of the somewhat similar nature of multi-modal and multi-view data, they pose similar challenges, and in some studies identical models have even been introduced for both of them [17, 18]. Deep networks are flexible and effective models that have been used in a wide range of applications [19]. They can model the distributions of different modalities and connect them through a common or shared space obtained by a layer above the modality-specific networks. Multi-modal deep models have recently attracted much attention in many applications, such as cross-modal [11, 20] and multi-modal retrieval tasks [14, 21]. A popular deep network architecture for multi-modal data is shown in Fig. 1(a). This architecture, which includes modality-specific networks and a single layer on top of these networks (to find a shared representation), has been used in many multi-modal deep learning models [10, 14]. However, this architecture has not previously been used for metric learning. In this paper, we propose a deep metric learning method for multi-modal data (to the best of our knowledge, our method is the first deep metric learning method for multi-modal data). The proposed Multi-Modal Deep Distance Metric Learning (MMD-DML) framework (see Fig. 1(b)) can include some layers for non-linear metric learning on top of the single layer presenting the shared representation of the modalities. We propose an effective approach for unsupervised pre-training of this model using the properties of multi-modal data. Since we intend to use the multi-modal deep network for metric learning, an optimization problem is presented that considers supervisory information in the form of similar/dissimilar pairs. Stochastic gradient descent is employed for training MMD-DML with batches of similar/dissimilar pairs. In this work, our goal is to learn a distance metric that incorporates multiple modalities.
Retrieval is one of the tasks in which the distance metric plays a critical role, and multiple data views or modalities are usually available in retrieval applications. For example, in Content-Based Image Retrieval (CBIR), different views of an image, which are obtained through various feature extraction techniques, can act as different modalities. Experimental results show the effectiveness of our pre-training method in such retrieval tasks. The rest of this paper is organized as follows: Some related works are reviewed in Section 2. We present some definitions and preliminaries of our proposed model in Section 3, and then the proposed method is described in detail in Section 4. Experimental settings and results of our method for CBIR are presented in Section 5. Finally, we conclude our work in Section 6.

2. Related works
The existing methods of representation learning for multi-modal data can be categorized as below:

- Multiple Kernel Learning (MKL)
- Shallow probabilistic models
- Deep models

These approaches are widely used in unsupervised [10, 11, 20, 22] and supervised [4, 12, 21, 23] multi-modal retrieval.


Figure 1: (a) Unfolded MMD [10]. (b) Unfolded MMD-DML.

2.1. Multiple Kernel Learning
MKL methods can be used to learn a kernel that is a combination of a set of fixed basis kernels. Although these methods were first applied to single-modal data, they can also be utilized for multi-modal data, where different kernels are considered for different modalities [24, 25]. Weighted kernel combination is one of the earliest MKL approaches [24-27], in which the learned kernel space is equivalent to a weighted concatenation of kernel Hilbert spaces. Lanckriet et al. [24] employed weighted kernel combination in the Support Vector Machine (SVM) and learned the optimal kernel weights and SVM parameters simultaneously. Although most MKL methods have been designed for classification purposes and use labels as supervisory information, they can be adapted to use supervisory information in the form of pairwise distance constraints [4] or triplet distance constraints [25, 27]. Recently, Chen et al. [25, 27] proposed methods for learning the similarity of mobile applications using the weighted kernel combination approach. Lin et al. [28] introduced the Weighted Multiple Kernel Embedding (WMKE) method for learning a linear transformation on the space resulting from a weighted kernel combination. Although this method can model correlation between modalities, simple scaling and selection of kernels are the only degrees of freedom considered for integrating modalities. In the Multiple Kernel Partial Order Embedding (MKPOE) method [4], distinct linear transformations are learned on the kernel spaces simultaneously. Unlike WMKE, this method cannot directly model correlation between diverse modalities; however, it can transform the modality spaces efficiently. The learning stage of MKL approaches usually includes the optimization of a very large positive semi-definite (PSD) matrix. Therefore, these methods are not scalable to massive data in real-world applications such as multimedia retrieval tasks. Xia et al.
[18] extended the MKPOE method to an online mode by converting the constrained optimization problem to its unconstrained equivalent and then projecting the parameters onto the constraint space. Xia et al. [23] proposed a similar method called Online Multiple Kernel Similarity learning (OMKS) for the CBIR application. They used different features extracted from images as modalities. To increase performance, several different kernels are considered for each modality, and these kernels are combined to find the similarity measure. They proposed an efficient two-stage optimization technique for finding the kernel space transformations and the optimal combination weights. Wu et al. [29] proposed a similar method called OM-DML, which simultaneously learns a distinct linear transformation on each modality and the optimal weights for combining modalities. By directly learning a linear transformation instead of learning a Mahalanobis metric, OM-DML eliminates the time-consuming PSD projection step required in the OMKS algorithm. This method is also able to seek low-rank solutions by setting the number of dimensions of the new spaces to be less than the number of input dimensions. In this paper, we propose the MMD-DML method, which explicitly learns a non-linear transform, retaining the advantage of kernel-based approaches while not requiring PSD constraints (similar to methods like OM-DML). Unlike the MKL methods that fix the base kernels, our method can learn a flexible non-linear transform on each modality. Furthermore, unlike the MKPOE, OMKS, and OM-DML methods, our method has the ability to model inter-modal correlations using the joint multi-layer network on top of the modality-specific networks.

2.2. Probabilistic shallow and deep network models for multi-modal data
Shallow and deep networks provide a powerful framework using nonlinear activation functions or diverse conditional probability distributions and have been used extensively in various areas, including multi-modal tasks [10, 12, 14-16, 22]. Harmonium [30] is a shallow probabilistic model containing a layer of latent variables as a hidden representation of data. Dual-Wing Harmonium (DWH) [22] is an extension of the exponential-family Harmonium [31] that is applicable to data with two modalities in the visible layer. In this model, images and their accompanying annotations are embedded into a shared latent space. Assumptions about the conditional probability distributions can be leveraged as prior information about the data. Xie et al. [12] extended DWH in their Multi-Modal Distance Metric Learning (MM-DML) method for distance metric learning by minimizing a cost function defined according to similar and dissimilar pairs. Chen et al. proposed supervised extensions of DWH for large-margin predictive subspace learning [15, 16]; supervisory information in the form of labeled data is utilized in these methods. Several multi-modal deep network models have been proposed in recent years. Most of them are unsupervised methods that model the data distribution [10, 14]. Some of the existing methods try to find a latent space that can be constructed from each modality [11, 13]; these methods are useful in cross-modal tasks. For example, in multi-modal retrieval based on Stacked Auto-Encoders (SAEs) [11], an SAE is trained for each of the two modalities of image-tag bimodal data, and then the Euclidean distance between the latent representations of the images and those of their associated tags is minimized. Feng et al. [13] proposed a similar method based on the Restricted Boltzmann Machine (RBM) to map images and text into a low-dimensional common space for the cross-modal retrieval task.
They used a correlation-based loss function to maintain the correspondence between the distinct deep RBMs of the modalities. A deep model using Canonical Correlation Analysis (CCA) [32] to find a shared latent space has also been introduced in [33]. In this model, each modality is transformed through a separate deep network to a space in which the inter-modal correlation of the transformed modalities is maximized. Ngiam et al. [10] proposed an effective Multi-Modal Deep network (MMD) model that learns a shared representation from different modalities in an unsupervised manner. The MMD model is pre-trained in a greedy layer-wise manner and then fine-tuned for multi-modal or cross-modal tasks by backpropagation. Srivastava et al. [14] proposed the Multi-Modal Deep Boltzmann Machine (MMDBM) as an unsupervised method that assigns a deep network to each modality and uses a layer on top of these networks to find a shared latent space. In this method, an RBM is used for each layer and the model is trained in a layer-wise manner using contrastive divergence. This method is similar to MMD [10] but uses DBMs instead of SAEs.

3. Preliminaries
In this section, we present some definitions as well as basic ideas about metric learning from previous works.

3.1. Definitions
In this part, we define the terms used in the following sections.

DEFINITION 1 (Multi-modal space): A multi-modal vector space is 𝔻_M = ℝ^{d_1} × … × ℝ^{d_M}, for which any 𝒙 = (𝒙_1, …, 𝒙_M) ∈ 𝔻_M has M modalities such that 𝒙_1 ∈ ℝ^{d_1}, …, 𝒙_M ∈ ℝ^{d_M}.

DEFINITION 2 (Multi-modal retrieval): Given a query object q ∈ 𝔻_M and a target domain D_t ⊂ 𝔻_M with T objects, we intend to find an ordering O = (o_1, …, o_T) of D_t such that ∀ i < j, dist(q, o_i) < dist(q, o_j).
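As a concrete illustration of Definition 2, the following minimal numpy sketch orders a toy target set by distance to a query. The Euclidean distance is a hypothetical stand-in for dist; any metric (including the one learned by MMD-DML) could take its place.

```python
import numpy as np

def retrieve(query, targets):
    """Order target objects by increasing distance to the query.

    query: 1-D array (a multi-modal vector with modalities concatenated).
    targets: 2-D array, one target object per row.
    Returns the permutation O = (o_1, ..., o_T) of row indices such that
    dist(q, o_i) < dist(q, o_j) whenever i < j.
    """
    dists = np.linalg.norm(targets - query, axis=1)  # dist(q, o) for every o
    return np.argsort(dists)

# Toy example: three 2-D target objects and a query at the origin.
q = np.array([0.0, 0.0])
D_t = np.array([[3.0, 4.0],   # distance 5
                [1.0, 0.0],   # distance 1
                [0.0, 2.0]])  # distance 2
order = retrieve(q, D_t)
print(order.tolist())  # nearest target first
```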

DEFINITION 3 (Similar/dissimilar pairs): The similar and dissimilar pair sets are defined as

    𝒮 = {(𝒙, 𝒙′)} ⊂ 𝔻_M × 𝔻_M,   𝒟 = {(𝒙, 𝒙′)} ⊂ 𝔻_M × 𝔻_M.   (1)

For each (𝒙, 𝒙′) ∈ 𝒮, 𝒙 and 𝒙′ are regarded as a similar pair in the training stage, and pairs in the set 𝒟 are regarded as dissimilar ones.

3.2. Metric learning
In this section, we first present some important and popular optimization problems for metric learning. Then, the most popular multi-modal metric learning method is introduced.

Xing et al. proposed a distance metric learning method that minimizes the distance between similar pairs while separating dissimilar pairs by a margin [1]. Hence, the optimization problem does not consider any loss for dissimilar pairs that are far enough from each other:

    arg min_𝑨 Σ_{(𝒙,𝒚)∈𝒮} ‖𝒙 − 𝒚‖²_𝑨   s.t.  ∀(𝒙, 𝒚) ∈ 𝒟, ‖𝒙 − 𝒚‖²_𝑨 ≥ 1,  𝑨 ⪰ 0,   (2)

where ‖𝒙 − 𝒚‖²_𝑨 = (𝒙 − 𝒚)ᵀ𝑨(𝒙 − 𝒚) = d_𝑨(𝒙, 𝒚) denotes the Mahalanobis distance between data points 𝒙 and 𝒚. Davis et al. [3] proposed an optimization problem that imposes a margin on similar pairs as well as dissimilar ones. Indeed, the distances between similar pairs that are adequately close to each other do not enter the loss function:

    arg min_𝑨 r(𝑨) = tr(𝑨) − log det(𝑨)   s.t.  d_𝑨(𝒙, 𝒚) ≤ 𝓊, (𝒙, 𝒚) ∈ 𝒮,  d_𝑨(𝒙, 𝒚) ≥ ℓ, (𝒙, 𝒚) ∈ 𝒟.   (3)
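The Mahalanobis distance ‖𝒙 − 𝒚‖²_𝑨 appearing in these optimization problems can be computed directly; below is a small numpy sketch with hypothetical toy values.

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance ||x - y||_A^2 = (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# With A = I the metric reduces to the squared Euclidean distance.
A = np.eye(2)
print(mahalanobis_sq(x, y, A))   # 25.0

# A diagonal A re-weights the feature axes; here the second axis counts 4x.
A2 = np.diag([1.0, 4.0])
print(mahalanobis_sq(x, y, A2))  # 9 + 4*16 = 73.0
```

Learning the PSD matrix 𝑨 is what Eqs. (2) and (3) optimize; the sketch only evaluates the metric for a fixed 𝑨.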

Here, r(𝑨) is a special case of the LogDet divergence, which has properties such as scale and translation invariance that are suitable for metric learning [34]. Xie et al. proposed the MM-DML method [12] with the following optimization problem based on the dual-wing harmonium:

    arg min_Θ (1/|𝒳|) ℒ(𝒳; Θ) + λ (1/|𝒮|) Σ_{(𝒙,𝒚)∈𝒮} ‖t(𝒙) − t(𝒚)‖²   s.t.  ∀(𝒙, 𝒚) ∈ 𝒟, ‖t(𝒙) − t(𝒚)‖² ≥ 1,   (4)

where Θ denotes the model parameters, ℒ(𝒳; Θ) shows the data likelihood in DWH, λ is a regularization parameter, and t(𝒙) is the latent representation of 𝒙. The MM-DML optimization problem in Eq. (4) is an extension of the one introduced in Eq. (2). By softening the constraints, the optimization problem in Eq. (4) can be reformulated as:

    arg min_Θ (1/|𝒳|) ℒ(𝒳; Θ) + λ₁ (1/|𝒮|) Σ_{(𝒙,𝒚)∈𝒮} ‖t(𝒙) − t(𝒚)‖² + λ₂ (1/|𝒟|) Σ_{(𝒙,𝒚)∈𝒟} max(0, 1 − ‖t(𝒙) − t(𝒚)‖²),   (5)

where λ₁ and λ₂ are regularization parameters. The MM-DML method utilizes stochastic gradient descent to directly optimize the feature transformation instead of learning the Mahalanobis metric 𝑨 used by Xing et al. [1]. Although the optimization problem of MM-DML is not convex and, without an intelligent parameter initialization strategy, MM-DML is prone to falling into an improper local minimum, it can provide some benefits. For example, a low-rank solution, which is desirable in the context of Mahalanobis metric learning [3], can be achieved by explicitly learning a feature transformation that provides dimensionality reduction. In general, learning a non-linear transformation has some advantages over learning a Mahalanobis metric or a kernel matrix. Deep networks provide a powerful framework to learn flexible non-linear transformations. However, all of the existing deep metric learning methods [6-9] are designed for input data containing only one modality.

4. Proposed method In this section, we propose the MMD-DML method that uses the deep learning approach to find a flexible non-linear transformation leading to an effective distance metric for multi-modal data. We use a multi-stage pre-training phase utilizing unlabeled multi-modal data. Then, we impose margin constraints for both similar and dissimilar pairs via an optimization problem inspired by the ITML method [3]. The batch-mode gradient descent technique is utilized to find the solution of the proposed optimization problem that considers similar/dissimilar pairs.

4.1. Optimization problem
Fig. 1(b) shows the unfolded structure of the proposed architecture in our MMD-DML method. This model has a separate SAE with an arbitrary number of layers for each modality. The Joint SAE (JSAE) takes the concatenation of the latent representations of the modalities as its input layer and provides a shared representation as its output.

The depth of the SAE considered for the m-th modality is denoted h_m and the depth of JSAE is denoted h_joint. Let 𝒙⁰ = (𝒙_1⁰, …, 𝒙_M⁰) ∈ 𝔻_M; the representations resulting from the different layers of the SAE considered for the m-th modality are denoted 𝒙_m¹, …, 𝒙_m^{h_m} (Fig. 1(b)). Moreover, 𝒙_m^{h_m+1}, …, 𝒙_m^{2h_m} show the decoded representations obtained in the unfolded MMD-DML (Fig. 1(b)). The concatenation of the outputs of the modality-specific SAEs is denoted 𝒋⁰ = (𝒙_1^{h_1}, …, 𝒙_M^{h_M}) and provides the input of JSAE. The representations resulting from the encoder layers of JSAE are denoted 𝒋¹, …, 𝒋^{h_joint}, and 𝒋^{h_joint+1}, …, 𝒋^{2h_joint} denote the outputs of the decoder layers (Fig. 1(b)). The mapping function corresponding to the whole MMD-DML model is denoted f_M(𝒙; Θ), where Θ denotes all model parameters and f_M(𝒙⁰; Θ) = 𝒋^{h_joint}. Let x̂_m^l (l = 0, …, h_m − 1) be the reconstruction of 𝒙_m^l resulting from applying an encoder and the corresponding decoder (of an auto-encoder network with one hidden layer) on 𝒙_m^l, as shown in Fig. 2(a)-(b). Similarly, let ĵ⁰, …, ĵ^{h_joint−1} be the reconstructions obtained for 𝒋⁰, …, 𝒋^{h_joint−1}. We also denote the reconstruction of the m-th modality using the corresponding unfolded SAE by x̂_m (m = 1, …, M), as shown in Fig. 2(c). The notation symbols used in our method are presented in Table 1.

Table 1: The notation symbols used in our method.

Symbol                              Description
𝒮 and 𝒟                             Sets of pairwise similarity and dissimilarity constraints
𝒳                                   The set of available training data (containing only feature vectors, not labels)
ℒ_rm(·, ·)                          The loss function used for the reconstruction of the m-th modality as in Eq. (9) (square loss in our experiments)
𝒙⁰ = (𝒙_1⁰, …, 𝒙_M⁰)                The input containing M modalities
h_m                                 The depth of the SAE considered for the m-th modality
𝒙_m^l                               The representation obtained in the l-th encoder layer of the SAE of the m-th modality
x̂_m^l                               The reconstruction of 𝒙_m^l resulting from applying an auto-encoder network with one hidden layer on 𝒙_m^l
h_joint                             The depth of the JSAE's encoder used as the shared SAE on top of the modality-specific networks
𝒋⁰ = (𝒙_1^{h_1}, …, 𝒙_M^{h_M})      The input of the JSAE (the concatenation of the outputs of the modality-specific SAEs)
𝒋^l                                 The representation obtained by the l-th encoder layer of JSAE
ĵ^l                                 The reconstruction of 𝒋^l resulting from applying an auto-encoder network with one hidden layer on 𝒋^l
f_M(·; Θ)                           The mapping function corresponding to the whole MMD-DML model
x̂_m                                 The reconstruction of the m-th modality using the corresponding unfolded SAE, shown in Fig. 2(c)
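For concreteness, the mapping f_M can be sketched as follows: each modality is passed through its own encoder stack, the outputs are concatenated into 𝒋⁰, and the JSAE encoder produces the shared representation. The dimensions, random weights, and tanh activation below are hypothetical illustrations, not the paper's actual configuration.

```python
import numpy as np

def encode(x, layers):
    """Apply a stack of (W, b) encoder layers with tanh activations."""
    for W, b in layers:
        x = np.tanh(W @ x + b)
    return x

def f_M(x_modalities, modality_nets, joint_net):
    """Sketch of the mapping f_M: encode each modality with its own SAE
    encoder, concatenate the outputs into j^0, and apply the JSAE encoder
    to obtain the shared representation j^{h_joint}."""
    h = [encode(x_m, net) for x_m, net in zip(x_modalities, modality_nets)]
    j0 = np.concatenate(h)          # j^0: concatenation of modality outputs
    return encode(j0, joint_net)    # j^{h_joint}: the shared representation

# Tiny hypothetical instantiation: two modalities (dims 4 and 3), one
# encoder layer each (down to dim 2), and a one-layer JSAE (4 -> 2).
rng = np.random.default_rng(0)
nets = [[(rng.standard_normal((2, 4)), np.zeros(2))],
        [(rng.standard_normal((2, 3)), np.zeros(2))]]
joint = [(rng.standard_normal((2, 4)), np.zeros(2))]
x = (rng.standard_normal(4), rng.standard_normal(3))
shared = f_M(x, nets, joint)
print(shared.shape)  # (2,)
```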

Finally, we define the optimization problem of MMD-DML as:

    arg min_Θ ℒ_r(𝒳; Θ)   s.t.  ∀(𝒙, 𝒙′) ∈ 𝒮, d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) ≤ 𝓊,
                                 ∀(𝒙, 𝒙′) ∈ 𝒟, d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) ≥ ℓ,   (6)

where d(·, ·): ℝ^l × ℝ^l → ℝ is a distance metric defined on ℝ^l and l is the number of units in the last layer of the encoder of the joint network (i.e., the shared representation layer). The loss term ℒ_r(𝒳; Θ) shows the average reconstruction error over 𝒳 and is defined as:

    ℒ_r(𝒳; Θ) = (1/|𝒳|) Σ_{𝒙⁰∈𝒳} Σ_{m=1}^{M} ℒ_rm(𝒙_m⁰, 𝒙_m^{2h_m}),   (7)

where ℒ_rm(𝒙_m⁰, 𝒙_m^{2h_m}) denotes the reconstruction loss used for the m-th modality. As suggested by Ngiam et al. [10], these loss functions can be selected depending on the modality distributions. Since the various features extracted from images usually follow Gaussian distributions [11], we use the convenient squared Euclidean distance loss in all of our CBIR experiments. Using hinge losses instead of the hard margin constraints in Eq. (6), we obtain:

    arg min_Θ ℒ_r(𝒳; Θ) + λ₁ (1/|𝒮|) Σ_{(𝒙,𝒙′)∈𝒮} max(0, d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) − 𝓊)
                        + λ₂ (1/|𝒟|) Σ_{(𝒙,𝒙′)∈𝒟} max(0, ℓ − d(f_M(𝒙; Θ), f_M(𝒙′; Θ))),   (8)

where λ₁ and λ₂ are regularization parameters. In the next subsections, we first introduce a pre-training algorithm to initialize the parameters of our MMD-DML model. Then, a gradient descent optimization technique is utilized to solve the optimization problem in Eq. (8) as the fine-tuning step of the proposed deep model. Since the hinge loss terms in this equation are not differentiable, we simply use the sub-gradient technique by considering the gradient of the hinge loss to be zero at the non-differentiable points.
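The two hinge-loss terms of Eq. (8) can be sketched as below, assuming the pairwise distances d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) have already been computed by a forward pass (all numeric values are hypothetical).

```python
import numpy as np

def metric_hinge_losses(dS, dD, u, l):
    """Hinge-loss terms of Eq. (8), given precomputed distances.

    dS: distances d(f_M(x), f_M(x')) over similar pairs,
    dD: distances over dissimilar pairs,
    u, l: the similarity and dissimilarity margins.
    """
    sim_loss = np.mean(np.maximum(0.0, dS - u))  # similar pairs farther than u
    dis_loss = np.mean(np.maximum(0.0, l - dD))  # dissimilar pairs closer than l
    return sim_loss, dis_loss

dS = np.array([0.1, 0.5, 0.2])  # one similar pair violates u = 0.3
dD = np.array([1.5, 0.8])       # one dissimilar pair violates l = 1.0
sim, dis = metric_hinge_losses(dS, dD, u=0.3, l=1.0)
print(round(float(sim), 6), round(float(dis), 6))
```

Only the violating pairs contribute, which is exactly why the sub-gradient in the fine-tuning step reduces to an indicator-masked sum.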


4.2. Unsupervised pre-training of MMD-DML
Ngiam et al. [10] proposed a pre-training method for MMD in which the network is first initialized in a greedy layer-wise manner by sparse RBMs. After that, the unfolded MMD network is pre-trained by the backpropagation algorithm. In our method, unsupervised pre-training of the network consists of three major steps; the different stages of pre-training are shown in Algorithm 1. The first step includes pre-training of the SAE of each modality (Fig. 2(a)). To achieve a proper starting point, every layer is first initialized by Singular Value Decomposition¹ (SVD) and then pre-trained by the backpropagation² algorithm to provide a suitable dimensionality reduction for the next layer (Fig. 2(b)). The SAE whose layers are found in this greedy manner (one after the other) is then trained as a whole multi-layer network by the backpropagation algorithm (Fig. 2(c)). Indeed, we train the network allocated to the m-th modality to reach a lower reconstruction error for the representation obtained by this network. As mentioned in Section 4.1, the reconstruction loss functions of the modalities are chosen as:

    ℒ_rm(𝒙_m, x̂_m) = (1/2) ‖𝒙_m − x̂_m‖₂²,   (9)

where x̂_m is the reconstruction of 𝒙_m obtained by the SAE of the m-th modality, as shown in Fig. 2(c). However, the loss functions used to measure the input reconstruction error need not necessarily be the square loss; they can be chosen depending on the modality distributions, as recommended by Ngiam et al. [10]. In the second step, the JSAE is pre-trained in a similar manner using the inputs provided by the modality-specific SAEs (Fig. 3). Eventually, in Step 3 of Algorithm 1, the whole unfolded network (Fig. 1(b)) is pre-trained by the backpropagation algorithm to find the shared representation that minimizes the sum of the squared reconstruction errors over all the modalities (i.e., the first term in Eq. (8)).
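The SVD-based initialization of a single auto-encoder layer (the core of Step 1 of Algorithm 1) can be sketched as below. This is a simplified linear sketch under stated assumptions: rows of X are examples, the top-k right singular vectors serve as the encoder weights and their transpose as the decoder; activations, biases, and the subsequent backpropagation updates of the actual algorithm are omitted.

```python
import numpy as np

def svd_init_layer(X, k):
    """SVD-based initialization of a linear auto-encoder layer.

    Rows of X are examples, so the top-k right singular vectors give a
    k x d encoder matrix, and its transpose initializes the decoder.
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    W_enc = Vt[:k]      # k x d encoder weights
    W_dec = W_enc.T     # d x k decoder weights (transposed initialization)
    return W_enc, W_dec

# Rank-2 data in 5 dimensions is reconstructed (almost) exactly by a
# width-2 layer, illustrating why SVD gives a proper starting point.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 5))
W_enc, W_dec = svd_init_layer(X, k=2)
X_hat = (X @ W_enc.T) @ W_dec.T      # encode then decode
err = np.mean((X - X_hat) ** 2)
print(float(err))                    # near machine precision
```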


Figure 2: Pre-training of the modality SAEs (Step 1 of Algorithm 1). (a) SAE of the m-th modality. (b) Layer-wise pre-training of the m-th modality's SAE (for the l-th layer): the weights of this one-hidden-layer network are first initialized by SVD and then updated by error backpropagation on the reconstruction error. (c) Backpropagation to minimize the reconstruction error of the unfolded SAE of each modality.

¹ If we have large-scale data, we can simply ignore the SVD steps or calculate the SVD over a subset of examples.
² The reconstruction loss function used in the first layer of each modality-specific SAE can be selected depending on the modality distribution. For the other layers of every SAE, however, the Euclidean reconstruction loss is common. All reconstruction loss minimization steps in Algorithm 1 are done by batch-mode gradient descent.

Figure 3: Pre-training of the Joint SAE (Step 2 of Algorithm 1). (a) Greedy layer-wise pre-training of the Joint SAE: first using SVD initialization and then using backpropagation to minimize the reconstruction error of each layer. (b) Backpropagation to minimize the reconstruction error of the whole unfolded Joint SAE.

4.3. Supervised fine-tuning of MMD-DML
In this section, we use the gradient descent method to fine-tune the pre-trained MMD-DML network by considering the similar/dissimilar distance losses in the second and third terms of Eq. (8). Utilizing these distance losses, we optimize the MMD-DML parameters (weights and biases of the MMD-DML encoders) as:

    Θ* = arg min_Θ ℒ_metric(Θ; 𝒮, 𝒟)
       = λ₁ (1/|𝒮|) Σ_{(𝒙,𝒙′)∈𝒮} max(0, d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) − 𝓊)
       + λ₂ (1/|𝒟|) Σ_{(𝒙,𝒙′)∈𝒟} max(0, ℓ − d(f_M(𝒙; Θ), f_M(𝒙′; Θ))).   (10)

As mentioned in Section 4.1, the hinge losses in the above objective function are not differentiable at zero, and we use the sub-gradient strategy to train our model. This strategy simply uses the gradient in the differentiable sub-regions. In other words, the sub-gradient of the hinge loss is defined as:

    ∇_Θ max(0, z) = 𝕀(z(Θ) > 0) ∇_Θ z.   (11)

Finally, the gradient of the cost function in Eq. (10) is calculated as:

    ∇_Θ ℒ_metric(Θ; 𝒮, 𝒟) = λ₁ (1/|𝒮|) Σ_{(𝒙,𝒙′)∈𝒮} 𝕀(d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) > 𝓊) ∇_Θ d(f_M(𝒙; Θ), f_M(𝒙′; Θ))
                          − λ₂ (1/|𝒟|) Σ_{(𝒙,𝒙′)∈𝒟} 𝕀(d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) < ℓ) ∇_Θ d(f_M(𝒙; Θ), f_M(𝒙′; Θ)).   (12)

We utilize the batch-mode stochastic gradient descent technique. Therefore, in each step, we calculate Eq. (12) for a mini-batch of similar/dissimilar pairs that is a subset of 𝒮 ∪ 𝒟. Note that Eq. (12) is a summation of gradients attributed to the violating similar/dissimilar pairs in the batch B. We can calculate the gradient originating from every (𝒙, 𝒙′) ∈ B as:

    ∇_Θ d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) = ∇_{f_M(𝒙;Θ)} d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) × ∇_Θ f_M(𝒙; Θ)
                                 + ∇_{f_M(𝒙′;Θ)} d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) × ∇_Θ f_M(𝒙′; Θ).   (13)

Algorithm 1: Pre-training of MMD-DML
Inputs: A set of multi-modal vectors 𝑿⁰ = (𝑿_1⁰, …, 𝑿_M⁰) (each row is one of the examples in 𝒳, and 𝑿_m⁰ is the matrix containing the m-th modality of the examples).
Outputs: Parameters of MMD-DML initialized using pre-training.

Step 1: Pre-training an SAE for each modality
for m = 1 to M do
  - Greedy layer-wise pre-training of the SAE corresponding to the m-th modality:
    for l = 1 to h_m do  // in each iteration, initialize an auto-encoder (AE) with one hidden layer, update its weights, and finally add its encoder layer as the l-th layer of the m-th SAE
      𝑼𝚺𝑽* ← SVD(𝑿_m^{l−1}).
      Initialize the weights of the AE's encoder and decoder layers using 𝑼 and 𝑼ᵀ respectively, and the biases to 0.
      Apply the AE (its encoder and decoder layers) on 𝑿_m^{l−1} to find X̂_m^{l−1}.
      Update the weights of this auto-encoder by minimizing the reconstruction error between 𝑿_m^{l−1} and X̂_m^{l−1} via the backpropagation algorithm.
      Add the AE's encoder layer as the l-th layer of the m-th SAE.
      Use the encoder layer of the AE on 𝑿_m^{l−1} to find 𝑿_m^l.
  - Backpropagation in the unfolded modality SAE:
    Use the encoder layers of the m-th SAE and then their corresponding decoder layers, as in Fig. 2(c), to find X̂_m.
    Update all weights of the m-th SAE by minimizing the reconstruction loss between 𝑿_m⁰ and X̂_m using backpropagation.

Step 2: Pre-training JSAE
  - Greedy layer-wise pre-training of the JSAE:
    for l = 1 to h_joint do
      𝑼𝚺𝑽* ← SVD(𝑱^{l−1}).
      Initialize the weights of the AE's encoder and decoder layers using 𝑼 and 𝑼ᵀ respectively, and the biases to 0.
      Apply the AE (its encoder and decoder layers) on 𝑱^{l−1} to find Ĵ^{l−1}.
      Update the weights of this auto-encoder by minimizing the reconstruction error between 𝑱^{l−1} and Ĵ^{l−1} via the backpropagation algorithm.
      Add the AE's encoder layer as the l-th layer of JSAE.
      Use the encoder layer of the AE on 𝑱^{l−1} to find 𝑱^l.
  - Backpropagation in the unfolded JSAE:
    Use the encoder layers of JSAE and then their corresponding decoder layers, as in Fig. 1(b), to find 𝑱^{2h_joint}.
    Update all weights of the JSAE by minimizing the reconstruction loss between 𝑱⁰ and 𝑱^{2h_joint} using the backpropagation algorithm.

Step 3: Backpropagation in the unfolded MMD-DML
  - Update the weights of the whole network by minimizing ℒ_r(𝑿⁰; Θ) in Eq. (7) via the backpropagation algorithm.


The first term on the right-hand side of Eq. (13) is the gradient of d(f_M(𝒙; Θ), 𝒐) w.r.t. the model parameters, where 𝒐 = f_M(𝒙′; Θ) is the fixed desired MMD-DML network output for 𝒙 and d(f_M(𝒙; Θ), 𝒐) measures the loss of the network in generating 𝒐. Similarly, the second term of Eq. (13) is the gradient of d(f_M(𝒙′; Θ), 𝒐′) w.r.t. the model parameters, where 𝒐′ = f_M(𝒙; Θ) is the fixed desired network output for 𝒙′. In other words, the gradients in Eq. (13) are both similar to the gradients in neural networks for regression problems. Thus, the partial derivatives in Eq. (13) w.r.t. the parameters of each layer can be calculated through the backpropagation algorithm [11].

5. Experiments
In this paper, we use various feature types, such as SIFT and GIST, as different modalities of image data and evaluate MMD-DML on CBIR as a multi-modal retrieval task.

5.1. Datasets
To assess the efficacy of our method on CBIR tasks, we evaluate it on three widely used datasets³: Caltech-256, Corel5k, and Indoor. These are the most common datasets for CBIR tasks, and following [23], the work most related to ours, we select these datasets. Caltech-256 has 256 image categories plus an extra class named "Clutter" [35]. Similar to the work by Chechik et al. [36], we choose 10, 20, and 50 classes from this dataset; in our experiments, these subsets are referred to as "Cal10", "Cal20", and "Cal50", respectively. Corel5k has 50 diverse image categories collected from COREL image CDs [37]. Contrary to Caltech-256, which has a varying number of images per category, each class in Corel5k contains exactly 100 images. Indoor is a dataset previously used for indoor scene recognition [38]; it has 67 categories, each of which contains at least 100 images. Following the work of Xia et al. [23], in order to avoid the dominating effect of a class with a high number of images, we find the number of images in the smallest class and randomly choose samples of this size from each class so that the number of images is the same for all classes. Then, we randomly split the data into four partitions: a training set, a validation set, a query set, and a test set. The training set contains 50% of the images and is used for pre-training the network and extracting pairwise constraints. The validation set contains 10% of the images and is used for tuning the hyper-parameters. The query set and test set contain 10% and 30% of the images, respectively, and are used for evaluation of the method. For comparison of different methods, query objects are chosen from the query set and the test set is regarded as the target domain. To extract pairwise constraints, we create all possible similar pairs in the training set, and for each similar pair (𝒙₁, 𝒙₂) we randomly choose a point 𝒙₃ from another class and create a dissimilar pair (𝒙₁, 𝒙₃).
After that, we keep half of the constraints to train the methods. The effect of using varied numbers of constraints on the performance of the methods is shown in Section 5.5. For OMKS, which uses triplets rather than pairwise constraints, we merge each similar pair (𝒙1 , 𝒙2 ) and dissimilar pair (𝒙1 , 𝒙3 ) to create a triplet (𝒙1 , 𝒙2 , 𝒙3 ).
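The constraint-extraction procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the joint shuffling used to keep matching halves of the similar, dissimilar, and triplet lists are our assumptions.

```python
import random

def make_constraints(labels, keep_ratio=0.5, seed=0):
    """Sketch of Section 5.1: all similar pairs within each class, one random
    cross-class point per similar pair for the dissimilar pair, and the merged
    (x1, x2, x3) triplets used by OMKS."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    similar, dissimilar, triplets = [], [], []
    for y, members in by_class.items():
        others = [i for i, l in enumerate(labels) if l != y]
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                x1, x2 = members[a], members[b]
                x3 = rng.choice(others)          # a point from another class
                similar.append((x1, x2))
                dissimilar.append((x1, x3))
                triplets.append((x1, x2, x3))    # merged triplet for OMKS

    # keep only a ratio of the constraints (half in Section 5.1),
    # shuffling the three lists jointly so they stay in correspondence
    order = list(range(len(similar)))
    rng.shuffle(order)
    keep = order[:int(keep_ratio * len(order))]
    pick = lambda lst: [lst[i] for i in keep]
    return pick(similar), pick(dissimilar), pick(triplets)
```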

5.2. Extracted features

Similar to [23], we use several types of features from each image: Local Binary Patterns, GIST features, Gabor wavelets, color histogram and color moments, edge direction histogram, SIFT features, and SURF features. For the SIFT and SURF features, we use codebook sizes of 200 and 1000, generating four feature types called SIFT200, SIFT1000, SURF200, and SURF1000. Using PCA, we extract 100 features from each feature set whose dimensionality exceeds 100.
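The per-modality dimensionality reduction can be sketched with SVD-based PCA in plain NumPy; the function name and the dictionary interface (feature name to sample matrix) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def reduce_modalities(feature_sets, target_dim=100):
    """PCA each feature set down to `target_dim` dimensions when it is wider
    than that; narrower sets (e.g. a small color histogram) pass through."""
    reduced = {}
    for name, X in feature_sets.items():
        if X.shape[1] <= target_dim:
            reduced[name] = X
            continue
        Xc = X - X.mean(axis=0)                 # center before PCA
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        reduced[name] = Xc @ Vt[:target_dim].T  # project on top components
    return reduced
```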

5.3. Choosing distance metric and margins

We use the measure defined below as the distance metric in Eq. (8):

d(𝒉, 𝒉′) = 1 − ⟨𝒉, 𝒉′⟩ / (‖𝒉‖ ‖𝒉′‖).  (14)

This is the distance metric related to cosine similarity and ranges over (0, 2) in every space. Using this distance metric, we can restrict the values of 𝓊 and ℓ. Indeed, we specified these margins as 𝓊 = 1 − cos(π/12) and ℓ = 1 − cos(π/6) in all the experiments below. As mentioned by Xing et al. [1], the margin value in Eq. (2) corresponds only to scaling, and different margin values may yield equivalent solutions. Suppose we have a near-optimal (w.r.t. visual similarity) pre-trained MMD-DML model (see Section 4.2 for the pre-training stage). The distance between pairs in this model, w.r.t. the Euclidean metric, ranges over (0, 𝒱). The upper bound 𝒱

³ The datasets used in our experiments are available on the project website of the OMKS method: http://www.cais.ntu.edu.sg/~chhoi/OMKS/


can be investigated using the number and type of activation functions of the neurons in the last layer, or it can be estimated from the representation of the examples in 𝒳 by finding the maximum distance between data points. Choosing suitable values for the margins 𝓊 and ℓ in this range results in fast convergence of the gradient descent algorithm, by reducing the number of iterations and causing only insignificant changes to the pre-trained MMD-DML model. For example, suppose the margins are chosen so that 𝓊 < ℓ ≪ 𝔼𝒳[d(𝐱, 𝐲)], where 𝔼𝒳[d(𝐱, 𝐲)] denotes the expected distance between data points in the last layer of MMD-DML. In this situation, the dissimilar hinge loss term in Eq. (8) is mostly inactive and, thus, gradient descent tries to shrink the distance between similar pairs while disregarding a significant portion of dissimilar pairs until 𝔼𝒳[d(𝐱, 𝐲)] gets close to the dissimilarity margin. As a result, the network requires more iterations to achieve a desirable solution and cannot take advantage of the initial point found by pre-training. Notice that this performance degradation is due to blindly choosing margins that are inconsistent with the scale of the features in the shared representation. Consequently, we use the cosine similarity, which is scale invariant, so that the range of margin values does not depend on the properties of the shared representation space. The experiments in the following subsections show that the angular distance metric of Eq. (14) achieves state-of-the-art results in a few iterations.
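The angular distance of Eq. (14) and the margins of Section 5.3 are concrete enough to code directly. The hinge-loss function below is only a hedged sketch of the per-pair terms discussed around Eq. (8), since that equation's exact weighting is given elsewhere in the paper; the names `U` and `L_MARGIN` are ours.

```python
import numpy as np

# Margins from Section 5.3: u = 1 - cos(pi/12), l = 1 - cos(pi/6).
U = 1.0 - np.cos(np.pi / 12)        # similarity margin, about 0.034
L_MARGIN = 1.0 - np.cos(np.pi / 6)  # dissimilarity margin, about 0.134

def cosine_distance(h, h2):
    """Eq. (14): d(h, h') = 1 - <h, h'> / (||h|| ||h'||), ranging over (0, 2)."""
    return 1.0 - np.dot(h, h2) / (np.linalg.norm(h) * np.linalg.norm(h2))

def pairwise_hinge_loss(h, h2, similar):
    """Sketch: push similar pairs below the margin u and dissimilar pairs
    above the margin l; zero loss once the constraint is satisfied."""
    d = cosine_distance(h, h2)
    return max(0.0, d - U) if similar else max(0.0, L_MARGIN - d)
```

Because the metric is scale invariant, these margin values are meaningful regardless of how the shared representation is scaled, which is the point made in the text above.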

5.4. Network architecture

In this section, we evaluate networks with various numbers of layers and different numbers of units to find a suitable architecture. We start with a network having single-layer modality-specific SAEs and no JSAE. At each step, we increase the number of layers in the modality-specific SAEs by one while fixing the other hyper-parameters, and evaluate the resulting model using mean Average Precision (mAP). This process is continued until adding a new layer decreases the network's performance. The number of units in each layer is chosen so that the first layer reduces the dimensionality of each modality to 50 and the subsequent layers do not reduce the dimensionality further. Fig. 4 shows the performance of the network with respect to the number of layers on the different datasets. According to these results, performance tends to degrade when the number of layers goes beyond three (especially for datasets with fewer samples), since networks with more parameters are more prone to overfitting. Indeed, increasing the number of layers makes the model more flexible, but it also raises the number of adjustable parameters (i.e., weights and biases); since in many datasets the number of training samples is not sufficient, overfitting may occur when new layers are added. For the Indoor dataset, the largest dataset, the network can be five layers deep since we have more samples to train the network.
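The greedy depth search described above can be written as a short loop. The callback `train_and_eval` is a hypothetical stand-in for training the model at a given depth and returning its validation mAP; it is our assumption, not part of the paper.

```python
def grow_depth(train_and_eval, max_layers=8):
    """Greedy depth search from Section 5.4: add modality-specific layers one
    at a time and stop as soon as validation mAP drops."""
    best_depth, best_map = 1, train_and_eval(1)
    for depth in range(2, max_layers + 1):
        score = train_and_eval(depth)
        if score < best_map:      # adding this layer hurt performance
            break
        best_depth, best_map = depth, score
    return best_depth, best_map
```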

Figure 4: Performance of the network with different depths.


For each dataset, we then pick the network with the highest performance and replace its last layer with a single-layer JSAE. This network is evaluated using 64, 128, and 256 output neurons; the results are summarized in Table 2. For the activation functions of the network, as recommended in [39], the hyperbolic tangent is used for all encoders and decoders except the decoders of the first layers, due to its symmetry around the origin, which allows faster convergence; the decoders of the first layers employ linear activation functions. The network is then trained for 300 iterations with a batch size of 250.

Table 2: mAP of networks with different output widths.

Output width      64        128       256
Cal10          0.38761   0.41103   0.42248
Cal20          0.26526   0.29705   0.26665
Cal50          0.16651   0.18786   0.18763
Corel5k        0.48642   0.48981   0.48206
Indoor         0.07384   0.08843   0.06734

5.5. Compared methods

We compare our method with three recent methods introduced for multi-modal retrieval. As a baseline, we also report the results of the Unsupervised MMD-DML (U-MMD-DML), the unsupervised version of our method, which can be considered an extension of the Bimodal Deep Network proposed in [10] with some differences in training (mentioned in Section 4.1). Details of the compared methods are provided below:

OMKS [23]: Using training triplets, OMKS optimizes several kernel functions for each modality while learning the optimal weights for a linear combination of these functions. Similar to [23], we used three RBF kernels with σ ∈ {2⁻¹, 2⁰, 2¹} and a cosine similarity kernel for each modality as the base kernels.

MM-DML [12]: This method was described in Section 3.2. For this model, we fixed the number of outputs to 128 and set the values of the λ1 and λ2 parameters through cross-validation.

OM-DML [29]: This method, mentioned in Section 2.1, simultaneously learns a distinct linear transformation on each modality and finds optimal weights for combining the transformed modalities.

Proposed Unsupervised MMD-DML (U-MMD-DML): As opposed to MMD-DML, this version of our method does not use the pairwise constraints to fine-tune the whole network and only uses the unsupervised pre-training shown in Algorithm 1.

Proposed Multi-Modal Deep Distance Metric Learning (MMD-DML): This method was described in Section 4.

5.6. Evaluation in Retrieval and Classification

We evaluate each method on five random splits of the datasets and average the obtained results. First, we use the mAP measure to compare the performance of the different methods and summarize the average results in Table 3. In Fig. 5, we report the performance of these methods in terms of precision at top-k. We also compare these methods using the 11-point interpolated precision-recall curve on the same datasets in Fig. 6.
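The retrieval measures used throughout this section can be computed from a binary relevance vector per ranked result list. This is a standard formulation of AP and precision@k, shown here for clarity rather than taken from the paper's code.

```python
import numpy as np

def average_precision(relevant):
    """AP for one ranked result list; `relevant` is 0/1 over ranks."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    cum = np.cumsum(relevant)
    precisions = cum / np.arange(1, len(relevant) + 1)
    return float((precisions * relevant).sum() / relevant.sum())

def precision_at_k(relevant, k):
    """Fraction of relevant items among the top-k results."""
    return float(np.mean(relevant[:k]))

def mean_average_precision(ranked_lists):
    """mAP: the mean of AP over all queries."""
    return float(np.mean([average_precision(r) for r in ranked_lists]))
```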

Table 3: mAP of different retrieval methods.

           OMKS     OM-DML   MM-DML   U-MMD-DML  MMD-DML
Cal10     0.28544  0.28648  0.30247  0.24418    0.35760
Cal20     0.24715  0.21567  0.21796  0.19869    0.27923
Cal50     0.14299  0.13078  0.11671  0.11475    0.17082
Corel5k   0.42695  0.34496  0.30356  0.27710    0.46844
Indoor    0.06846  0.06236  0.05431  0.05630    0.07453

It can be seen from the results that MMD-DML significantly outperforms all the other methods. MMD-DML's ability to learn a nonlinear transform for each modality can explain the remarkable difference between the performance of our method and that of MM-DML. The relatively high performance of the shallow MM-DML on the Cal10 dataset does not scale to larger datasets, where it becomes suboptimal compared with the other methods. This validates that deep models such as MMD-DML can improve the results for large-scale tasks. Moreover, comparing the results of MMD-DML and U-MMD-DML, we find that in MMD-DML the supervisory information improves performance by a large margin.


Figure 5: Evaluation of methods using precision at top-k.

Figure 6: Evaluation of methods using precision-recall curve.


We also compare the methods in terms of k-nearest neighbor (k-NN) classification accuracy for various values of 𝑘; the results are summarized in Fig. 7. Our proposed method achieves the highest classification accuracy on all the datasets. According to Table 3 and Figs. 5-7, our MMD-DML method outperforms the other methods, with a larger margin when the number of classes in the dataset is lower (e.g., the margin between our method and the second-best method is larger on Cal10 than on Cal50).
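A k-NN evaluation in the learned shared space can be sketched in plain NumPy; we assume majority voting over the angular distance of Eq. (14), which the paper uses for retrieval, though the exact classifier configuration is not specified.

```python
import numpy as np
from collections import Counter

def knn_accuracy(train_h, train_y, test_h, test_y, k=5):
    """k-NN classification accuracy in the learned embedding, using the
    cosine (angular) distance of Eq. (14)."""
    tn = train_h / np.linalg.norm(train_h, axis=1, keepdims=True)
    qn = test_h / np.linalg.norm(test_h, axis=1, keepdims=True)
    dist = 1.0 - qn @ tn.T            # distance to every training point
    correct = 0
    for i, row in enumerate(dist):
        nn = np.argsort(row)[:k]      # indices of the k nearest neighbors
        vote = Counter(train_y[j] for j in nn).most_common(1)[0][0]
        correct += int(vote == test_y[i])
    return correct / len(test_y)
```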

Figure 7: Evaluation of methods in terms of k-NN classification accuracy.

5.7. The Impact of the Ratio of Pairwise Constraints

As mentioned in Section 5.1, we keep a ratio of the pairwise constraints (i.e., supervisory information in the form of similar and dissimilar pairs) to train the supervised methods. We evaluate the methods while changing this ratio and summarize the results in Fig. 8. Several empirical observations can be made from these results. First, MMD-DML performs better than the other methods in most cases. Second, the biggest leap in the performance of the three methods results from the first 20% of the constraints. Third, as mentioned in [23], OMKS becomes nearly saturated after receiving the first 20% of the constraints, and a similar phenomenon happens for the MM-DML and OM-DML methods. MMD-DML, however, keeps taking advantage of supervisory information beyond this level, since the MMD-DML model is more flexible and more supervisory information helps train it more properly.


Figure 8: mAP measure w.r.t. the ratio of pairwise constraints.

6. Conclusion

In this paper, we proposed the MMD-DML framework for distance metric learning on multi-modal data when supervisory information is available in the form of similar/dissimilar pairs. MMD-DML is capable of learning a complicated nonlinear similarity function on multi-modal data (with heterogeneous modalities); in other words, it has the ability to learn intra- and inter-modal high-order statistics from raw features. The high degree of freedom in the MMD-DML hypothesis space is well controlled by an efficient multi-stage pre-training phase. In fact, we first used the properties of multi-modal data to pre-train the network and then fine-tuned it using the supervisory information. Experimental results show the superiority of the proposed method in retrieval and classification tasks. Our method improves the mAP measure on the Cal10, Corel5k, and Indoor datasets by 7.2%, 4.1%, and 0.6%, respectively, compared to the second-best method (OMKS).

References

[1] E.P. Xing, M.I. Jordan, S.J. Russell, and A.Y. Ng, Distance metric learning with application to clustering with side-information, in: Advances in Neural Information Processing Systems, 2003, pp. 521-528.
[2] K.Q. Weinberger, J. Blitzer, and L.K. Saul, Distance metric learning for large margin nearest neighbor classification, in: Advances in Neural Information Processing Systems, 2006, pp. 1473-1480.
[3] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 209-216.
[4] B. McFee and G. Lanckriet, Learning multi-modal similarity, Journal of Machine Learning Research 12 (2011), 491-523.
[5] M.S. Baghshah and S.B. Shouraki, Metric learning for semi-supervised clustering using pairwise constraints and the geometrical structure of data, Intelligent Data Analysis 13 (2009), 887-899.
[6] J. Hu, J. Lu, and Y.-P. Tan, Discriminative deep metric learning for face verification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1875-1882.


[7] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, Deep metric learning via lifted structured feature embedding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004-4012.
[8] J. Hu, J. Lu, and Y.-P. Tan, Deep transfer metric learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 325-333.
[9] E. Hoffer and N. Ailon, Deep metric learning using triplet network, in: International Workshop on Similarity-Based Pattern Recognition, Springer, 2015, pp. 84-92.
[10] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A.Y. Ng, Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689-696.
[11] W. Wang, B.C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, Effective multi-modal retrieval based on stacked auto-encoders, Proceedings of the VLDB Endowment 7 (2014), 649-660.
[12] P. Xie and E.P. Xing, Multi-modal distance metric learning, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI), 2013, pp. 1806-1812.
[13] F. Feng, R. Li, and X. Wang, Deep correspondence restricted Boltzmann machine for cross-modal retrieval, Neurocomputing 154 (2015), 50-60.
[14] N. Srivastava and R.R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, in: Advances in Neural Information Processing Systems, 2012, pp. 2222-2230.
[15] N. Chen, J. Zhu, and E.P. Xing, Predictive subspace learning for multi-view data: a large margin approach, in: Advances in Neural Information Processing Systems, 2010, pp. 361-369.
[16] N. Chen, J. Zhu, F. Sun, and E.P. Xing, Large-margin predictive latent subspace learning for multiview data analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012), 2365-2378.
[17] H. Wang, F. Nie, H. Huang, and C. Ding, Heterogeneous visual features fusion via sparse multimodal machine, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3097-3102.
[18] H. Xia, P. Wu, and S.C. Hoi, Online multi-modal distance learning for scalable multimedia retrieval, in: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, ACM, 2013, pp. 455-464.
[19] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015), 85-117.
[20] W. Wang, X. Yang, B.C. Ooi, D. Zhang, and Y. Zhuang, Effective deep learning-based multi-modal retrieval, The VLDB Journal 25 (2016), 79-101.
[21] P. Wu, S.C. Hoi, H. Xia, P. Zhao, D. Wang, and C. Miao, Online multimodal deep similarity learning with application to image retrieval, in: Proceedings of the 21st ACM International Conference on Multimedia, ACM, 2013, pp. 153-162.
[22] E.P. Xing, R. Yan, and A.G. Hauptmann, Mining associated text and images with dual-wing harmoniums, arXiv preprint arXiv:1207.1423 (2012).
[23] H. Xia, S.C. Hoi, R. Jin, and P. Zhao, Online multiple kernel similarity learning for visual search, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014), 536-549.
[24] G.R. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. Jordan, Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research 5 (2004), 27-72.
[25] N. Chen, S.C. Hoi, S. Li, and X. Xiao, SimApp: A framework for detecting similar mobile applications by online kernel learning, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ACM, 2015, pp. 305-314.
[26] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, Large scale multiple kernel learning, Journal of Machine Learning Research 7 (2006), 1531-1565.
[27] N. Chen, S.C. Hoi, S. Li, and X. Xiao, Mobile app tagging, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, 2016, pp. 63-72.
[28] Y.-Y. Lin, T.-L. Liu, and C.-S. Fuh, Dimensionality reduction for data in multiple feature representations, in: Advances in Neural Information Processing Systems, 2009, pp. 961-968.
[29] P. Wu, S.C. Hoi, P. Zhao, C. Miao, and Z.-Y. Liu, Online multi-modal distance metric learning with application to image retrieval, IEEE Transactions on Knowledge and Data Engineering 28 (2016), 454-467.
[30] P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, University of Colorado at Boulder, Department of Computer Science, 1986.

[31] M. Welling, M. Rosen-Zvi, and G.E. Hinton, Exponential family harmoniums with an application to information retrieval, in: Advances in Neural Information Processing Systems, 2005, pp. 1481-1488.
[32] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936), 321-377.
[33] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, Deep canonical correlation analysis, in: International Conference on Machine Learning, 2013, pp. 1247-1255.
[34] B. Kulis, Metric learning: A survey, Foundations and Trends in Machine Learning 5 (2013), 287-364.
[35] G. Griffin, A. Holub, and P. Perona, Caltech-256 object category dataset, 2007.
[36] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, Large scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (2010), 1109-1135.
[37] S.C. Hoi, W. Liu, M.R. Lyu, and W.-Y. Ma, Learning distance metrics with contextual constraints for image retrieval, in: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2006, pp. 2072-2078.
[38] A. Quattoni and A. Torralba, Recognizing indoor scenes, in: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 413-420.
[39] Y.A. LeCun, L. Bottou, G.B. Orr, and K.-R. Müller, Efficient backprop, in: Neural Networks: Tricks of the Trade, Springer, 2012, pp. 9-48.

Seyed Mahdi Roostaiyan received his B.S. degree from the Department of Computer Engineering, Shahid Chamran University of Ahwaz, Iran, in 2012, and his M.Sc. degree from the Department of Computer Engineering, Sharif University of Technology, Iran, in 2014. His research interests include machine learning and pattern recognition.

Ehsan Imani is a senior undergraduate student in the Computer Engineering Department, Sharif University of Technology. His research interests include machine learning and data mining. He is particularly interested in deep networks and their application to various fields like information retrieval, computer vision, and robot control.

Mahdieh Soleymani Baghshah is an assistant professor in the Computer Engineering Department, Sharif University of Technology. She received her B.S., M.Sc., and Ph.D. degrees from the Department of Computer Engineering, Sharif University of Technology, Iran, in 2003, 2005, and 2010. Her main research interest is machine learning and particularly deep learning.



1. Introduction

A proper distance metric (or similarity measure) plays an important role in many learning and retrieval tasks. To date, many methods have been proposed for metric learning [1-5]. In these methods, it is usually assumed that supervisory information in the form of relative distance constraints or similar/dissimilar pairs is available. Some of these methods learn linear [1, 2] or nonlinear [4] transformations on the feature space to find a new representation space in which distance (or similarity) constraints are better satisfied. However, the methods that learn nonlinear transformations (or, implicitly, kernel matrices) are usually either limited to a restricted form of nonlinear transformations or very time consuming when they are allowed to be flexible (i.e., when they learn the whole kernel matrix). Moreover, the flexible methods that learn the whole kernel matrix are transductive and cannot be used to find similarities for new data. Recently, some deep metric learning methods [6-9] have been proposed that can learn a nonlinear transformation to achieve a new representation space in which the distance constraints are better satisfied. However, these deep models of metric learning have been designed for input data containing one modality; therefore, they have not used the properties of multi-modal data in designing the architecture and training of the deep networks. On the other hand, real-world data usually contain different modalities such as text, image, and video. Video with corresponding audio [10] and annotated images [11, 12] are examples of multi-modal data. In recent years, several multi-modal methods have been proposed to incorporate heterogeneous modalities in classification and retrieval tasks [11-14]. Besides, there are similar challenges in multi-view descriptions of data, where each view is a description of the data obtained with a particular feature extraction method [15, 16]. Because of the somewhat similar nature of multi-modal and multi-view data, they pose similar challenges, and in some studies identical models have even been introduced for both of them [17, 18]. Deep networks are flexible and effective models that have been used in a wide range of applications [19]. They can model the distributions of different modalities and connect them via a common or shared space obtained by a layer above the modality-specific networks. Multi-modal deep models have recently attracted much attention in many applications such as cross-modal [11, 20] and multi-modal retrieval tasks [14, 21]. A popular deep network architecture for multi-modal data is shown in Fig. 1(a). This architecture, which includes modality-specific networks and a single layer on top of these networks (to find a shared representation), has been used in many multi-modal deep learning models [10, 14]. However, this architecture has not previously been used for metric learning. In this paper, we propose a deep metric learning method for multi-modal data (to the best of our knowledge, our method is the first deep metric learning method for multi-modal data). The proposed Multi-Modal Deep Distance Metric Learning (MMD-DML) framework (see Fig. 1(b)) can include some layers for nonlinear metric learning on top of the single layer presenting the shared representation of modalities. We propose an effective approach for unsupervised pre-training of this model using the properties of multi-modal data. Since we intend to use the multi-modal deep network for metric learning, an optimization problem is presented that considers supervisory information in the form of similar/dissimilar pairs. Stochastic gradient descent is employed for training MMD-DML with batches of similar/dissimilar pairs. In this work, our goal is learning a distance metric that incorporates multiple modalities. Retrieval is one of the tasks in which the distance metric plays a critical role, and data views or modalities usually exist in retrieval applications. For example, in Content-Based Image Retrieval (CBIR), different views of an image, obtained through various feature extraction techniques, can act as different modalities. Experimental results show the effectiveness of our pre-training method in such retrieval tasks.

The rest of this paper is organized as follows. Related works are reviewed in Section 2. We present some definitions and preliminaries of our proposed model in Section 3, and the proposed method is described in detail in Section 4. Experimental settings and results of our method for CBIR are presented in Section 5. Finally, we conclude our work in Section 6.

2. Related works

The existing methods of representation learning for multi-modal data can be categorized as below:

- Multiple Kernel Learning (MKL)
- Shallow Probabilistic Models
- Deep Models

These approaches are widely used in unsupervised [10, 11, 20, 22] and supervised [4, 12, 21, 23] multi-modal retrieval.


Figure 1: (a) Unfolded MMD [10]; (b) unfolded MMD-DML.

2.1. Multiple Kernel Learning

MKL methods can be used to learn a kernel that is a combination of a set of fixed base kernels. Although these methods were first applied to single-modal data, they can also be utilized for multi-modal data, where different kernels are considered for different modalities [24, 25]. Weighted kernel combination is one of the earliest MKL approaches [24-27], in which the kernel space is equivalent to a weighted concatenation of kernel Hilbert spaces. Lanckriet et al. [24] employed weighted kernel combination in the Support Vector Machine (SVM) and learned the optimal kernel weights and SVM parameters simultaneously. Most MKL methods have been designed for classification purposes and use labels as supervisory information; however, they can be adapted to use supervisory information in the form of pairwise distance constraints [4] or triplet distance constraints [25, 27]. Recently, Chen et al. [25, 27] proposed methods for learning mobile application similarity using the weighted kernel combination approach. Lin et al. [28] introduced the Weighted Multiple Kernel Embedding (WMKE) method for learning a linear transformation on the space resulting from the weighted kernel combination. Although this method can model correlations between modalities, simple scaling and selection of kernels are the only degrees of freedom considered for integrating modalities. In the Multiple Kernel Partial Order Embedding (MKPOE) method [4], distinct linear transformations are learned on the kernel spaces simultaneously. Unlike WMKE, this method cannot directly model correlations between diverse modalities; however, it can transform the modality spaces efficiently. The learning stages of MKL approaches usually include the optimization of a very large positive semi-definite (PSD) matrix; therefore, these methods are not scalable to the massive data in real-world applications such as multimedia retrieval tasks. Xia et al. [18] extended the MKPOE method to an online mode by converting the constrained optimization problem to its unconstrained equivalent and then projecting the parameters onto the constraint space. Xia et al. [23] proposed a similar method called Online Multiple Kernel Similarity Learning (OMKS) for the CBIR application. They used different features extracted from images as modalities. To increase performance, several different kernels are considered for each modality, and these kernels are combined to find the similarity measure. They proposed an efficient two-stage optimization technique for finding the kernel space transformations and the optimal combination weights. Wu et al. [29] proposed a similar method called OM-DML, which simultaneously learns a distinct linear transformation on each modality and the optimal weights for combining the modalities. By directly learning a linear

transformation instead of learning a Mahalanobis metric, OM-DML eliminates the time-consuming PSD projection step required in the OMKS algorithm. This method is also able to seek low-rank solutions by setting the number of dimensions of the new spaces to be less than the number of input dimensions. In this paper, we propose the MMD-DML method, which explicitly learns a nonlinear transform, keeping the advantage of kernel-based approaches without needing the PSD constraints (similar to methods like OM-DML). Unlike the MKL methods that fix the base kernels, our method can learn a flexible nonlinear transform on each modality. Furthermore, unlike the MKPOE, OMKS, and OM-DML methods, our method has the ability to model inter-modal correlations using the joint multi-layer network on top of the modality-specific networks.
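The weighted kernel combination at the core of the MKL family discussed above reduces to a weighted sum of per-modality Gram matrices. A minimal sketch (the function name is ours, and learning the weights is out of scope here):

```python
import numpy as np

def combined_kernel(kernels, weights):
    """Weighted kernel combination K = sum_m w_m * K_m, as in the MKL
    approaches of [24-27]; non-negative weights keep K a valid (PSD) kernel."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0), "non-negative weights preserve PSD-ness"
    return sum(w * K for w, K in zip(weights, kernels))
```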

2.2. Probabilistic shallow and deep network models for multi-modal data Shallow and deep networks are capable of providing a powerful framework using nonlinear activation functions or diverse conditional probability distributions and have been used extensively in various areas including multi-modal tasks [10, 12, 1416, 22]. Harmonium [30] is a shallow probabilistic model containing a layer of latent variables as a hidden representation of data. Dual-Wing Harmonium (DWH) [22] is an extension from exponential Harmonium [31] which is applicable to data with two modalities in the visible layer. In this model, image and annotations (along with image) are embedded into a shared latent space. Assumptions about conditional probability distribution can be leveraged as prior information about data. Xie et al. [12] extended DWH in their Multi-Modal Distance Metric Learning (MM-DML) method for distance metric learning through minimizing the cost function that has been defined according to similar and dissimilar pairs. Chen et al. proposed supervised extensions of DWH for large margin predictive subspace learning [15, 16]. Supervisory information in the form of labeled data is utilized in these methods. Several models of multi-modal deep networks have been proposed in recent years. Most of them are unsupervised methods that model data distribution [10, 14]. Some of the existing methods try to find a latent space that can be constructed by each modality [11, 13]. These methods are useful in cross-modal tasks. For example, in multi-modal retrieval based on Stacked Auto Encoders (SAEs) [11], an SAE is trained for each of the two modalities of image-tag bimodal data. After that, these methods try to minimize Euclidean distance between the latent representation of the images and that of their associated tags. Feng et al. [13] proposed a similar method based on Restricted Boltzmann Machine (RBM) to map image and text into a low-dimensional common space for cross-modal retrieval task. 
They used a correlation-based loss function to maintain correspondence between the distinct deep RBMs of the modalities. A deep model using Canonical Correlation Analysis (CCA) [32] to find a shared latent space has also been introduced in [33]. In this model, each modality is transformed through a separate deep network to a space where the inter-modal correlation of the transformed modalities is maximized. Ngiam et al. [10] proposed an effective Multi-Modal Deep Network (MMD) model that learns a shared representation from different modalities in an unsupervised manner. The MMD model is pre-trained in a greedy layer-wise manner and then fine-tuned for multi-modal or cross-modal tasks by backpropagation. Srivastava et al. [14] proposed Multi-Modal Deep Boltzmann Machines (MMDBM) as an unsupervised method that assigns a deep network to each modality and uses a layer on top of these networks to find a shared latent space. In this method, an RBM is used for each layer and the model is trained in a layer-wise manner using contrastive divergence. This method is similar to the MMD method [10] but uses DBM instead of SAE.

3. Preliminaries In this section, we present some definitions and also some basic ideas about metric learning presented in previous works.

3.1. Definitions In this part, we define the terms used in the following sections.

DEFINITION 1: Multi-modal space A multi-modal vector space is 𝔻_M = ℝ^{d_1} × … × ℝ^{d_M}, for which any 𝒙 = (𝒙_1, …, 𝒙_M) ∈ 𝔻_M has M modalities such that 𝒙_1 ∈ ℝ^{d_1}, …, 𝒙_M ∈ ℝ^{d_M}.

DEFINITION 2: Multi-modal retrieval Given a query object 𝑞 ∈ 𝔻_M and a target domain 𝐷_t ⊂ 𝔻_M with 𝑇 objects, we intend to find an order 𝑂 = (𝑜_1, …, 𝑜_T) of 𝐷_t such that ∀ 𝑖 < 𝑗, 𝑑𝑖𝑠𝑡(𝑞, 𝑜_i) < 𝑑𝑖𝑠𝑡(𝑞, 𝑜_j).
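As a minimal illustration of Definition 2, the sketch below orders a toy target domain by increasing distance to a query. The 2-D vectors and the Euclidean distance are illustrative placeholders, not the metric learned later in the paper:

```python
import numpy as np

def retrieve(query, targets, dist):
    """Return indices of targets ordered by increasing distance to the query."""
    d = np.array([dist(query, t) for t in targets])
    return np.argsort(d, kind="stable")

# Hypothetical target domain of three 2-D objects.
euclid = lambda a, b: float(np.linalg.norm(a - b))
q = np.array([0.0, 0.0])
D_t = [np.array([3.0, 4.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
order = retrieve(q, D_t, euclid)  # nearest object first
```

The returned order satisfies the condition of Definition 2: every earlier object is at least as close to the query as every later one.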

DEFINITION 3: Similar/dissimilar pairs Similar and dissimilar pair sets are defined as:

𝒮 = {(𝒙, 𝒙′)} ⊂ 𝔻_M × 𝔻_M,  𝒟 = {(𝒙, 𝒙′)} ⊂ 𝔻_M × 𝔻_M. (1)

For each (𝒙, 𝒙′) ∈ 𝒮, 𝒙 and 𝒙′ are regarded as a similar pair in the training stage, and pairs in the set 𝒟 are regarded as dissimilar ones.

3.2. Metric learning In this section, we first present some important and popular optimization problems for metric learning. Then, the most popular multi-modal metric learning method is introduced. Xing et al. proposed a distance metric learning method that minimizes the distance between similar pairs while separating dissimilar pairs by a margin [1]. Hence, the optimization problem does not consider any loss for dissimilar pairs that are far enough from each other:

arg min_𝑨 ∑_{(𝒙,𝒚)∈𝒮} ‖𝒙 − 𝒚‖²_𝑨   s.t.  ∀(𝒙, 𝒚) ∈ 𝒟, ‖𝒙 − 𝒚‖²_𝑨 ≥ 1,  𝑨 ⪰ 0, (2)

where ‖𝒙 − 𝒚‖²_𝑨 = (𝒙 − 𝒚)ᵀ𝑨(𝒙 − 𝒚) = d_𝑨(𝒙, 𝒚) denotes the Mahalanobis distance between data points 𝒙 and 𝒚. Davis et al. [3] proposed an optimization problem that imposes a margin on similar pairs as well as dissimilar ones. Indeed, the distances between similar pairs that are adequately close to each other do not enter the loss function:

arg min_𝑨 r(𝑨) = tr(𝑨) − log det(𝑨)   s.t.  d_𝑨(𝒙, 𝒚) ≤ 𝓊, (𝒙, 𝒚) ∈ 𝒮;  d_𝑨(𝒙, 𝒚) ≥ ℓ, (𝒙, 𝒚) ∈ 𝒟. (3)

Here, 𝑟(𝑨) is a special case of LogDet divergence which has some properties, such as the scale and translation invariance, that are suitable for metric learning [34]. Xie et al. proposed the MM-DML method [11] with the following optimization problem based on dual-wing harmonium: 1 1 2 arg min ℒ(𝒳; Θ) + 𝜆 ∑ ||𝑡(𝒙) − 𝑡(𝒚)||2 s. t. ∀(𝒙, 𝒚) ∈ 𝒟, ||𝑡(𝒙) − 𝑡(𝒚)|| ≥ 1, (4) |𝒮| |𝒳| Θ (𝒙,𝒚)∈𝒮

where Θ is model parameters, ℒ(𝒳; Θ) shows data likelihood in DWH, 𝜆 is a regularizer parameters, and 𝑡(𝒙) is the latent representation of 𝑥. The MM-DML optimization problem in Eq. (4) is an extension of the one introduced in Eq. (2). By softening the constraints, the optimization problem in Eq. (4) can be reformulated as: 1 1 1 arg min ℒ(𝒳; Θ) + 𝜆1 ∑ ||𝑡(𝒙) − 𝑡(𝒚)||2 + 𝜆2 ∑ ma x( 0,1 − ||𝑡(𝒙) − 𝑡(𝒚)||2 ), (5) |𝒮| |𝒳| |𝒟| Θ (𝒙,𝒚)∈𝒮

(𝒙,𝒚)∈𝒟

where λ₁ and λ₂ are regularization parameters. The MM-DML method utilizes stochastic gradient descent to directly optimize the feature transformation instead of learning the Mahalanobis metric (𝑨) used by Xing et al. [1]. Although the optimization problem of the MM-DML method is not convex and, without an intelligent parameter initialization strategy, MM-DML is prone to falling into a poor local minimum, it can provide some benefits. For example, a low-rank solution, which is desirable in the context of Mahalanobis metric learning [3], can be achieved by explicitly learning a feature transformation that provides dimensionality reduction. In general, learning a non-linear transformation has some advantages over learning a Mahalanobis metric or a kernel matrix. Deep networks provide a powerful framework to learn flexible non-linear transformations. However, all of the existing deep metric learning methods [6-10] are suited to input data containing only one modality.
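To make Eq. (2) concrete, the following sketch computes the Mahalanobis distance and checks the dissimilar-pair margin constraint. The PSD matrix A and the two points are arbitrary illustrative choices, not a learned metric:

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance ||x - y||_A^2 = (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

A = np.diag([2.0, 0.5])              # an illustrative PSD metric matrix
x, y = np.array([1.0, 0.0]), np.array([0.0, 2.0])
d2 = mahalanobis_sq(x, y, A)         # 2*1 + 0.5*4 = 4.0
satisfied = d2 >= 1.0                # dissimilar-pair constraint of Eq. (2)
```

When A is the identity this reduces to the squared Euclidean distance; learning A reshapes the space so that the constraints of Eq. (2) or Eq. (3) hold.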

4. Proposed method In this section, we propose the MMD-DML method that uses the deep learning approach to find a flexible non-linear transformation leading to an effective distance metric for multi-modal data. We use a multi-stage pre-training phase utilizing unlabeled multi-modal data. Then, we impose margin constraints for both similar and dissimilar pairs via an optimization problem inspired by the ITML method [3]. The batch-mode gradient descent technique is utilized to find the solution of the proposed optimization problem that considers similar/dissimilar pairs.

4.1. Optimization problem Fig. 1(b) shows the unfolded structure of the proposed architecture in our MMD-DML method. This model has a separate SAE with an arbitrary number of layers for each modality. The Joint SAE (JSAE) takes the concatenation of the latent representations of the modalities as its input layer and provides a shared representation as its output. The depth of the SAE assigned to the m-th modality is denoted h_m and the depth of the JSAE's encoder is denoted h_joint. Let 𝒙^0 = (𝒙_1^0, …, 𝒙_M^0) ∈ 𝔻_M; the representations produced by the encoder layers of the SAE of the m-th modality are denoted 𝒙_m^1, …, 𝒙_m^{h_m} (Fig. 1(b)). Moreover, 𝒙_m^{h_m+1}, …, 𝒙_m^{2h_m} denote the decoded representations obtained in the unfolded MMD-DML (Fig. 1(b)). The concatenation of the outputs of the modality-specific SAEs, 𝒋^0 = (𝒙_1^{h_1}, …, 𝒙_M^{h_M}), provides the input of the JSAE. The representations produced by the encoder layers of the JSAE are denoted 𝒋^1, …, 𝒋^{h_joint}, and 𝒋^{h_joint+1}, …, 𝒋^{2h_joint} denote the outputs of the decoder layers (Fig. 1(b)). The mapping function corresponding to the whole MMD-DML model is denoted f_M(𝒙; Θ), where Θ collects all model parameters and f_M(𝒙^0; Θ) = 𝒋^{h_joint}. Let 𝒙̂_m^l (l = 0, …, h_m − 1) be the reconstruction of 𝒙_m^l obtained by applying an encoder and the corresponding decoder (of an auto-encoder network with one hidden layer) to 𝒙_m^l, as shown in Fig. 2(a)-(b). Similarly, let 𝒋̂^0, …, 𝒋̂^{h_joint−1} be the reconstructions obtained for 𝒋^0, …, 𝒋^{h_joint−1}. We also denote the reconstruction of the m-th modality using the corresponding unfolded SAE as 𝒙̂_m (m = 1, …, M), shown in Fig. 2(c). The notation used in our method is presented in Table 1.

Table 1: The notation used in our method.
Symbol: Description
𝒮 and 𝒟: Sets of pairwise similarity and dissimilarity constraints
𝒳: The set of available training data (containing only feature vectors, not labels)
ℒ_rm(·, ·): The loss function used for the reconstruction of the m-th modality as in Eq. (9) (square loss in our experiments)
𝒙^0 = (𝒙_1^0, …, 𝒙_M^0): The input containing M modalities
h_m: The depth of the SAE considered for the m-th modality
𝒙_m^l: The representation obtained in the l-th encoder layer of the SAE of the m-th modality
𝒙̂_m^l: The reconstruction of 𝒙_m^l resulting from applying an auto-encoder network with one hidden layer to 𝒙_m^l
h_joint: The depth of the JSAE's encoder used as the shared SAE on top of the modality-specific networks
𝒋^0 = (𝒙_1^{h_1}, …, 𝒙_M^{h_M}): The input of the JSAE (the concatenation of the outputs of the modality-specific SAEs)
𝒋^l: The representation obtained by the l-th encoder layer of the JSAE
𝒋̂^l: The reconstruction of 𝒋^l resulting from applying an auto-encoder network with one hidden layer to 𝒋^l
f_M(·; Θ): The mapping function corresponding to the whole MMD-DML model
𝒙̂_m: The reconstruction of the m-th modality using the corresponding unfolded SAE shown in Fig. 2(c)
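The encoder path just described can be sketched as follows. The layer widths, tanh activations, and random initialization are illustrative assumptions; f_M below mirrors the role of the mapping f_M(𝒙; Θ), producing the shared representation from two modality-specific encoders and a JSAE encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(sizes):
    """Stack of tanh layers as (weights, biases) pairs, randomly initialized."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def encode(x, layers):
    for W, b in layers:
        x = np.tanh(x @ W + b)
    return x

# Two modalities with encoder depths h_1 = h_2 = 2 and JSAE encoder depth 1.
enc1 = make_encoder([100, 50, 50])   # modality-1 SAE encoder
enc2 = make_encoder([80, 50, 50])    # modality-2 SAE encoder
jenc = make_encoder([100, 64])       # JSAE encoder on the concatenation j^0

def f_M(x1, x2):
    """Shared representation j^{h_joint} of a two-modality input."""
    j0 = np.concatenate([encode(x1, enc1), encode(x2, enc2)])
    return encode(j0, jenc)

h = f_M(rng.standard_normal(100), rng.standard_normal(80))
```

The decoder half of the unfolded network (used only during pre-training) would mirror these layers in reverse.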

Finally, we define the optimization problem of MMD-DML as:

arg min_Θ ℒ_r(𝒳; Θ)   s.t.  ∀(𝒙, 𝒙′) ∈ 𝒮, d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) ≤ 𝓊;  ∀(𝒙, 𝒙′) ∈ 𝒟, d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) ≥ ℓ, (6)

where d(·, ·): ℝ^l × ℝ^l → ℝ is a distance metric defined in ℝ^l and l is the number of units in the last encoder layer of the joint network (i.e. the dimensionality of the shared representation). The loss term ℒ_r(𝒳; Θ) shows the average reconstruction error over 𝒳 and is defined as:

ℒ_r(𝒳; Θ) = (1/|𝒳|) ∑_{𝒙^0∈𝒳} ∑_{m=1}^{M} ℒ_rm(𝒙_m^0, 𝒙_m^{2h_m}), (7)

where ℒ_rm(𝒙_m^0, 𝒙_m^{2h_m}) denotes the reconstruction loss used for the m-th modality. As suggested by Wang et al. [11], these functions can be selected depending on the modality distributions. Since various features extracted from images usually follow Gaussian distributions [11], we use the convenient squared Euclidean distance loss in all of our CBIR experiments. Using hinge losses instead of the hard margin constraints in Eq. (6), we obtain:

arg min_Θ ℒ_r(𝒳; Θ) + λ₁ (1/|𝒮|) ∑_{(𝒙,𝒙′)∈𝒮} max(0, d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) − 𝓊) + λ₂ (1/|𝒟|) ∑_{(𝒙,𝒙′)∈𝒟} max(0, ℓ − d(f_M(𝒙; Θ), f_M(𝒙′; Θ))), (8)

where λ₁ and λ₂ are regularization parameters. In the next subsections, we first introduce a pre-training algorithm to initialize the parameters of our MMD-DML model. Then, a gradient descent optimization technique is utilized to solve the optimization problem in Eq. (8) as the fine-tuning step of the proposed deep model. Since the hinge loss terms in this equation are not differentiable, we simply use the sub-gradient technique, taking the gradient of the hinge loss to be zero at the non-differentiable points.
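The two hinge terms of Eq. (8) can be sketched over precomputed embeddings. The Euclidean distance, margin values, and toy embeddings below are illustrative placeholders (our experiments use the angular distance of Section 5.3):

```python
import numpy as np

def metric_hinge_loss(emb, S, D, dist, u, l, lam1=1.0, lam2=1.0):
    """Similar/dissimilar hinge terms of Eq. (8) over index pairs S and D."""
    ls = np.mean([max(0.0, dist(emb[i], emb[j]) - u) for i, j in S])
    ld = np.mean([max(0.0, l - dist(emb[i], emb[j])) for i, j in D])
    return lam1 * ls + lam2 * ld

euclid = lambda a, b: float(np.linalg.norm(a - b))
emb = {0: np.array([0.0, 0.0]),      # toy shared representations f_M(x)
       1: np.array([0.0, 0.1]),
       2: np.array([3.0, 0.0])}
loss = metric_hinge_loss(emb, S=[(0, 1)], D=[(0, 2)], dist=euclid, u=0.5, l=2.0)
```

Here the similar pair is already within the margin 𝓊 and the dissimilar pair is beyond ℓ, so both hinge terms vanish; swapping the pairs makes both terms active.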


4.2. Unsupervised Pre-Training of MMD-DML Ngiam et al. [10] proposed a pre-training method for MMD in which the network is first initialized in a greedy layer-wise manner by sparse RBMs. After that, the unfolded MMD network is pre-trained by the backpropagation algorithm. In our method, unsupervised pre-training of the network consists of three major steps, shown in Algorithm 1. The first step includes pre-training the SAE of each modality (Fig. 2(a)). To achieve a proper starting point, every layer is first initialized by Singular Value Decomposition1 (SVD) and then pre-trained by the backpropagation2 algorithm to provide a suitable dimensionality reduction for the next layer (Fig. 2(b)). The SAE whose layers are found in this greedy manner (one after the other) is then trained as a whole multi-layer network by the backpropagation algorithm (Fig. 2(c)). Indeed, we train the network allocated to the m-th modality to reach a lower reconstruction error for the representation obtained by this network. As mentioned in Section 4.1, the reconstruction loss functions of the modalities are chosen as:

ℒ_rm(𝒙_m, 𝒙̂_m) = (1/2) ‖𝒙_m − 𝒙̂_m‖²₂, (9)

where 𝒙̂_m is the reconstruction of 𝒙_m obtained by the SAE of the m-th modality, as shown in Fig. 2(c). However, the loss functions that express the input reconstruction error need not necessarily be the square loss; they can be chosen depending on the modality distributions, as recommended by Wang et al. [11]. In the second step, the JSAE is pre-trained in a similar manner using the inputs provided by the modality-specific SAEs (Fig. 3). Eventually, in Step 3 of Algorithm 1, the whole unfolded network (Fig. 1(b)) is pre-trained by the backpropagation algorithm to find the shared representation that minimizes the sum of the squared reconstruction errors over all the modalities (i.e. the first term in Eq. (8)).

Figure 2: Pre-training of the modality SAEs (Step 1 of Algorithm 1). (a) SAE of the m-th modality. (b) Layer-wise pre-training of the m-th modality's SAE (for the l-th layer) by first using SVD initialization and then updating the weights of this one-hidden-layer network using error backpropagation on the reconstruction error. (c) Backpropagation to minimize the reconstruction error for the unfolded SAE of each modality.

1 If we have large-scale data, we can simply ignore the SVD steps or calculate the SVD over a subset of examples.
2 The reconstruction loss function used in the first layer of each modality-specific SAE can be selected depending on the modality distribution. For the other layers of every SAE, however, the Euclidean reconstruction loss is common. All reconstruction loss minimization steps in Algorithm 1 are done by batch-mode gradient descent.


Figure 3: Pre-training of the Joint SAE (Step 2 of Algorithm 1). (a) Greedy layer-wise pre-training of the Joint SAE by first using SVD initialization and then using backpropagation to minimize the reconstruction error of each layer. (b) Backpropagation to minimize the reconstruction error in the whole unfolded Joint SAE.

4.3. Supervised fine-tuning of MMD-DML In this section, we use the gradient descent method to fine-tune the pre-trained MMD-DML network by considering the similar/dissimilar distance losses in the second and third terms of Eq. (8). Using these distance losses, we optimize the MMD-DML parameters (the weights and biases of the MMD-DML encoders) as:

Θ* = arg min_Θ ℒ_metric(Θ; 𝒮; 𝒟) = λ₁ (1/|𝒮|) ∑_{(𝒙,𝒙′)∈𝒮} max(0, d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) − 𝓊) + λ₂ (1/|𝒟|) ∑_{(𝒙,𝒙′)∈𝒟} max(0, ℓ − d(f_M(𝒙; Θ), f_M(𝒙′; Θ))). (10)

As mentioned in Section 4.1, the hinge losses in the above objective function are not differentiable at zero and we use the sub-gradient strategy to train our model. This strategy simply uses the gradient in the differentiable sub-regions. In other words, the sub-gradient of the hinge loss is defined as:

∇_Θ max(0, z) = 𝕀(z(Θ) > 0) ∇_Θ z. (11)

Finally, the gradient of the cost function in Eq. (10) is calculated as:

∇_Θ ℒ_metric(Θ; 𝒮; 𝒟) = λ₁ (1/|𝒮|) ∑_{(𝒙,𝒙′)∈𝒮} 𝕀(d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) > 𝓊) ∇_Θ d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) − λ₂ (1/|𝒟|) ∑_{(𝒙,𝒙′)∈𝒟} 𝕀(d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) < ℓ) ∇_Θ d(f_M(𝒙; Θ), f_M(𝒙′; Θ)). (12)

We utilized the batch-mode stochastic gradient descent technique. Therefore, in each step, we calculate Eq. (12) for a mini-batch of similar/dissimilar pairs that is a subset of 𝒮 ∪ 𝒟. Note that Eq. (12) is a summation of the gradients attributed to the violating similar/dissimilar pairs in the batch B. We can calculate the gradient originating from every (𝒙, 𝒙′) ∈ B as:

∇_Θ d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) = ∇_{f_M(𝒙;Θ)} d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) × ∇_Θ f_M(𝒙; Θ) + ∇_{f_M(𝒙′;Θ)} d(f_M(𝒙; Θ), f_M(𝒙′; Θ)) × ∇_Θ f_M(𝒙′; Θ). (13)

Algorithm 1: Pre-Training of MMD-DML
Inputs: A set of multi-modal vectors 𝑿^0 = (𝑿_1^0, …, 𝑿_M^0) (each row is one of the examples in 𝒳 and 𝑿_m^0 is the matrix containing the m-th modality of the examples).
Outputs: Parameters of MMD-DML initialized using pre-training.
Step 1: Pre-training an SAE for each modality
for m = 1 to M do
  - Greedy layer-wise pre-training of the SAE corresponding to the m-th modality:
  for l = 1 to h_m do  // in each iteration, initialize an auto-encoder (called AE) with one hidden layer, update its weights, and finally add its encoder layer as the l-th layer of the m-th SAE
    𝑼𝚺𝑽* ← SVD(𝑿_m^{l−1}).
    Initialize the weights of the AE's encoder and decoder layers using the 𝑼 and 𝑼ᵀ matrices respectively, and the biases to 0.
    Apply the AE (its encoder and decoder layers) to 𝑿_m^{l−1} to find 𝑿̂_m^{l−1}.
    Update the weights of this auto-encoder by minimizing the reconstruction error between 𝑿_m^{l−1} and 𝑿̂_m^{l−1} via the backpropagation algorithm.
    Add the AE's encoder layer as the l-th layer of the m-th SAE.
    Use the encoder layer of the AE on 𝑿_m^{l−1} to find 𝑿_m^l.
  - Backpropagation in the unfolded modality SAE:
  Use the encoder layers of the m-th SAE and then their corresponding decoder layers, as in Fig. 2(c), to find 𝑿̂_m.
  Update all weights of the m-th SAE by minimizing the reconstruction loss between 𝑿_m^0 and 𝑿̂_m using backpropagation.
Step 2: Pre-training JSAE
- Greedy layer-wise pre-training of the JSAE:
for l = 1 to h_joint do
  𝑼𝚺𝑽* ← SVD(𝑱^{l−1}).
  Initialize the weights of the AE's encoder and decoder layers using the 𝑼 and 𝑼ᵀ matrices respectively, and the biases to 0.
  Apply the AE (its encoder and decoder layers) to 𝑱^{l−1} to find 𝑱̂^{l−1}.
  Update the weights of this auto-encoder by minimizing the reconstruction error between 𝑱^{l−1} and 𝑱̂^{l−1} via the backpropagation algorithm.
  Add the AE's encoder layer as the l-th layer of the JSAE.
  Use the encoder layer of the AE on 𝑱^{l−1} to find 𝑱^l.
- Backpropagation in the unfolded JSAE:
Use the encoder layers of the JSAE and then their corresponding decoder layers, as in Fig. 1(b), to find 𝑱^{2h_joint}.
Update all weights of the JSAE by minimizing the reconstruction loss between 𝑱^0 and 𝑱^{2h_joint} using the backpropagation algorithm.
Step 3: Backpropagation in the unfolded MMD-DML
- Update the weights of the whole network by minimizing ℒ_r(𝑿^0; Θ) in Eq. (7) via the backpropagation algorithm.
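The greedy SVD initialization of Algorithm 1 can be sketched as follows. This sketch assumes rows of X are examples (so the top-k right singular vectors give the projection onto the principal directions) and, for brevity, omits the backpropagation refinement that Algorithm 1 applies after each initialization:

```python
import numpy as np

def svd_init_layer(X, k):
    """Initialize a linear encoder layer from the SVD of the layer input.

    With rows of X as examples, X = U S V^T and the top-k rows of V^T span
    the principal feature directions; their transpose is a d x k projection.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T

def greedy_pretrain(X, layer_sizes):
    """Stack SVD-initialized encoder layers, layer by layer (Step 1 sketch)."""
    weights = []
    for k in layer_sizes:
        W = svd_init_layer(X, k)
        weights.append(W)
        X = X @ W          # representation fed to the next layer's SVD
    return weights, X

rng = np.random.default_rng(1)
X0 = rng.standard_normal((200, 30))            # 200 examples of one modality
weights, X_top = greedy_pretrain(X0, [20, 10]) # a depth-2 encoder, h_m = 2
```

In the full algorithm each initialized layer is additionally refined by minimizing the reconstruction error with backpropagation before the next layer's SVD is computed.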


The first term on the right-hand side of Eq. (13) is the gradient of d(f_M(𝒙; Θ), 𝑜) w.r.t. the model parameters, where 𝑜 = f_M(𝒙′; Θ) is the fixed desired MMD-DML network output for 𝒙 and d(f_M(𝒙; Θ), 𝑜) measures the loss of the network in generating 𝑜. Similarly, the second term of Eq. (13) is the gradient of d(f_M(𝒙′; Θ), 𝑜′) w.r.t. the model parameters, where 𝑜′ = f_M(𝒙; Θ) is the fixed desired network output for 𝒙′. In other words, the gradients in Eq. (13) are both similar to the gradients in neural networks for regression problems. Thus, the partial derivatives in Eq. (13) w.r.t. the parameters of each layer can be calculated through the backpropagation algorithm [11].
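The indicator-gated sub-gradient of Eqs. (11)-(12) can be illustrated with a scalar toy example. The quadratic distance d(θ) = θ² is a placeholder; in the full model, ∇_Θ d comes from backpropagation through the encoders:

```python
def hinge_subgrad(z, dz_dtheta):
    """Sub-gradient of max(0, z) w.r.t. theta: I(z > 0) * dz/dtheta (Eq. (11))."""
    return (z > 0) * dz_dtheta

# Scalar illustration: similar-pair term max(0, d - u) with d(theta) = theta^2.
theta, u = 2.0, 1.0
d, dd = theta ** 2, 2 * theta            # distance and its derivative
g = hinge_subgrad(d - u, dd)             # hinge active: d = 4 > u = 1
g_inactive = hinge_subgrad(0.5 - u, dd)  # hinge inactive: contributes 0
```

Pairs whose hinge is inactive (similar pairs already within 𝓊, dissimilar pairs already beyond ℓ) thus contribute nothing to the mini-batch gradient, exactly as in Eq. (12).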

5. Experiments In this paper, we use various feature types such as SIFT and GIST as different modalities of image data and evaluate MMD-DML on CBIR as a multi-modal retrieval task.

5.1. Datasets To assess the efficacy of our method in CBIR tasks, we evaluate it on three widely-used datasets3: Caltech-256, Corel5k, and Indoor. These are the most common datasets for CBIR tasks, and we select them following [23], the work most closely related to ours. The Caltech-256 dataset has 256 image categories plus an extra class named “Clutter” [35]. Similar to the work by Chechik et al. [36], we choose 10, 20, and 50 classes from this dataset. In our experiments, these subsets are referred to as “Cal10”, “Cal20”, and “Cal50” respectively. Corel5k has 50 diverse image categories collected from COREL image CDs [37]. Contrary to Caltech-256, which has a varying number of images per category, each of the classes in Corel5k contains exactly 100 images. Indoor is a dataset previously used for indoor scene recognition [38]. This dataset has 67 categories, each of which contains at least 100 images. Following the work of Xia et al. [23], in order to avoid the dominating effect of a class with a high number of images, we find the number of images in the smallest class and randomly choose samples of this size from each class so that the number of images is the same for all classes. Then, we randomly split the data into four partitions: training set, validation set, query set, and test set. The training set contains 50% of the images and is used for pre-training the network and extracting pairwise constraints. The validation set contains 10% of the images and is used for tuning the hyper-parameters. The query set and test set contain 10% and 30% of the images respectively and are used for evaluation of the method. For comparison of the different methods, query objects are chosen from the query set and the test set is regarded as the target domain. To extract pairwise constraints, we create all possible similar pairs in the training set and, for each similar pair (𝒙₁, 𝒙₂), we randomly choose a point 𝒙₃ from another class and create a dissimilar pair (𝒙₁, 𝒙₃).
After that, we keep half of the constraints to train the methods. The effect of using varying numbers of constraints on the performance of the methods is shown in Section 5.5. For OMKS, which uses triplets rather than pairwise constraints, we merge each similar pair (𝒙₁, 𝒙₂) and dissimilar pair (𝒙₁, 𝒙₃) to create a triplet (𝒙₁, 𝒙₂, 𝒙₃).
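The constraint-extraction procedure above can be sketched as follows (the integer class labels are hypothetical):

```python
import random
from itertools import combinations

def make_constraints(labels, seed=0):
    """All similar pairs within a class; one random dissimilar pair per similar pair."""
    rng = random.Random(seed)
    S, D = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            S.append((i, j))
            other = [k for k in range(len(labels)) if labels[k] != labels[i]]
            D.append((i, rng.choice(other)))
    # Triplets for OMKS: merge similar (x1, x2) with dissimilar (x1, x3).
    triplets = [(i, j, k) for (i, j), (_, k) in zip(S, D)]
    return S, D, triplets

S, D, T = make_constraints([0, 0, 1, 1])
```

In the experiments, only half of the pairs produced this way are kept for training.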

5.2. Extracted features Similar to [23], we use several types of features from each image: Local Binary Patterns, GIST features, Gabor wavelets, color histogram and color moments, edge direction histogram, SIFT features, and SURF features. For SIFT and SURF features, we use codebook sizes of 200 and 1000, thus generating four feature types called SIFT200, SIFT1000, SURF200, and SURF1000. Using PCA, we reduce each feature set whose dimension exceeds 100 to 100 features.

5.3. Choosing the distance metric and margins We use the measure defined below as the distance metric in Eq. (8):

d(𝒉, 𝒉′) = 1 − ⟨𝒉, 𝒉′⟩ / (‖𝒉‖ ‖𝒉′‖). (14)

This is the distance metric related to cosine similarity and ranges over (0, 2) in every space. Using this distance metric, we can restrict the values of 𝓊 and ℓ. Indeed, we specified these margins as 𝓊 = 1 − cos(π/12) and ℓ = 1 − cos(π/6) in all the experiments below. As mentioned by Xing et al. [1], the margin value in Eq. (2) corresponds only to scaling, and different margin values may yield equivalent solutions. Suppose we have a near-optimal (w.r.t. visual similarity) pre-trained MMD-DML model (see Section 4.2 for the pre-training stage). The distance between pairs in this model, w.r.t. the Euclidean metric, ranges over (0, 𝒱). The upper bound 𝒱 can be investigated using the number and the type of activation functions of the neurons in the last layer, or can be estimated from the representations of the examples in 𝒳 by finding the maximum distance between data points. Choosing suitable values for the 𝓊 and ℓ margins in this range results in fast convergence of the gradient descent algorithm by reducing the number of iterations and causing only insignificant changes to the pre-trained MMD-DML. For example, suppose the margins are chosen so that 𝓊 < ℓ ≪ 𝔼_𝒳[d(𝒙, 𝒚)], where 𝔼_𝒳[d(𝒙, 𝒚)] denotes the expected value of the distances between data points in the last layer of MMD-DML. In this situation, the dissimilar hinge loss term in Eq. (8) is mostly inactive and, thus, gradient descent tries to shrink the distance between similar pairs while disregarding a significant portion of dissimilar pairs until 𝔼_𝒳[d(𝒙, 𝒚)] gets close to the dissimilarity margin. As a result, the network will require more iterations to achieve a desirable solution and will be unable to take advantage of the initial point found by our MMD-DML pre-training. Notice that this performance degradation is due to blindly choosing margins that are inconsistent with the scale of the features obtained in the shared representation. Consequently, we use the cosine similarity, which is scale invariant, so that the range of margin values does not depend on the properties of the shared representation space. Experiments in the following subsections show that the angular distance metric of Eq. (14) achieves state-of-the-art results in few iterations.

3 The datasets used in our experiments are available on the project website of the OMKS method: http://www.cais.ntu.edu.sg/~chhoi/OMKS/
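The angular distance of Eq. (14) and the margins 𝓊 and ℓ used in our experiments can be sketched as:

```python
import numpy as np

def cos_dist(h, hp):
    """d(h, h') = 1 - <h, h'> / (||h|| ||h'||), ranging over [0, 2]."""
    return 1.0 - float(h @ hp) / (np.linalg.norm(h) * np.linalg.norm(hp))

u = 1.0 - np.cos(np.pi / 12)   # similarity margin, approx. 0.034
l = 1.0 - np.cos(np.pi / 6)    # dissimilarity margin, approx. 0.134

h, hp = np.array([1.0, 0.0]), np.array([1.0, 1.0])
d = cos_dist(h, hp)            # angle of 45 degrees: 1 - cos(pi/4)
```

Because the distance depends only on the angle between representations, these margins are meaningful regardless of the scale of the shared representation space.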

5.4. Network architecture In this section, we evaluate networks with various numbers of layers and different numbers of units to find a suitable architecture. We start with a network having single-layer modality-specific SAEs and no JSAE. At each step, we increase the number of layers in the modality-specific SAEs by one while fixing the other hyper-parameters, and evaluate the resulting model using mean Average Precision (mAP). This process is continued until adding a new layer decreases the network performance. The number of units in each layer is chosen so that the first layer reduces the dimensionality of each modality to 50 and the subsequent layers do not further reduce the dimensionality. Fig. 4 shows the performance of the network with respect to the number of layers on the different datasets. According to these results, performance tends to degrade when the number of layers goes beyond three (especially for datasets with a smaller number of samples) since networks with more parameters are more prone to overfitting. Indeed, when we increase the number of layers, the flexibility of the model is promoted; however, the number of adjustable parameters (i.e. weights and biases) also rises and, since in many datasets the number of training samples is not sufficient, overfitting may occur when new layers are added. For the Indoor dataset, the largest dataset, the network can be five layers deep since we have more samples to train the network.

Figure 4: Performance of the network with different depths.


For each dataset, we then pick the network with the highest performance and replace its last layer with a single-layer JSAE. This network is evaluated using 64, 128, and 256 output neurons. The results are summarized in Table 2. For the activation functions of the network, as recommended in [39], the hyperbolic tangent is used for all encoders and decoders, due to its symmetry around the origin which allows faster convergence, except for the decoders of the first layers, which employ linear activation functions. The network is then trained in 300 iterations with a batch size of 250.

Table 2: mAP of networks with different output widths.
Output width | Cal10   | Cal20   | Cal50   | Corel5k | Indoor
64           | 0.38761 | 0.26526 | 0.16651 | 0.48642 | 0.07384
128          | 0.41103 | 0.29705 | 0.18786 | 0.48981 | 0.08843
256          | 0.42248 | 0.26665 | 0.18763 | 0.48206 | 0.06734

5.5. Compared methods We compare our method with three recent methods introduced for multi-modal retrieval. As a baseline, we also report the results of the Unsupervised Multi-Modal Deep SAE (U-MMD-DML), the unsupervised version of our method, which can be considered an extension of the Bimodal Deep Network proposed in [10] with some differences in training (mentioned in Section 4.1). Details of the methods used for comparison are provided below:

OMKS [23]: Using training triplets, OMKS optimizes several kernel functions for each modality while learning the optimal weights for a linear combination of these functions. Similar to [23], we used three RBF kernels with σ ∈ {2⁻¹, 2⁰, 2¹} and also a cosine similarity kernel for each modality as the base kernels.
MM-DML [12]: This method was described in Section 3.2. For this model, we fixed the number of outputs to 128 and set the values of the λ₁ and λ₂ parameters through cross-validation.
OM-DML [29]: This method, mentioned in Section 2.1, simultaneously learns a distinct linear transformation on each modality and also finds optimal weights for combining the transformed modalities.
Proposed Unsupervised MMD-DML (U-MMD-DML): As opposed to MMD-DML, this version of our method does not use the pairwise constraints to fine-tune the whole network and only uses the unsupervised pre-training shown in Algorithm 1.
Proposed Multi-Modal Deep Distance Metric Learning (MMD-DML): This method was described in Section 4.

5.6. Evaluation in Retrieval and Classification We evaluate each method on five random splits of the datasets and average the obtained results. First, we use the mAP measure to compare the performance of the different methods and summarize the average results in Table 3. In Fig. 5, we report the performance of these methods in terms of precision at top-k. We also compare these methods using the 11-point interpolated precision-recall curve on the same datasets in Fig. 6.
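For reference, the mAP measure can be computed from ranked binary relevance lists as in its standard definition (the toy relevance flags below are illustrative):

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked list, given binary relevance flags in rank order."""
    relevant = np.asarray(relevant, dtype=float)
    if relevant.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float((precision_at_k * relevant).sum() / relevant.sum())

def mean_average_precision(ranked_lists):
    """mAP: mean of per-query average precisions."""
    return float(np.mean([average_precision(r) for r in ranked_lists]))

ap = average_precision([1, 0, 1])   # (1/1 + 2/3) / 2 = 5/6
```

Each ranked list here corresponds to one query from the query set, with the test set as the target domain ordered as in Definition 2.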

Table 3: mAP of different retrieval methods.
Method     | Cal10   | Cal20   | Cal50   | Corel5k | Indoor
OMKS       | 0.28544 | 0.24715 | 0.14299 | 0.42695 | 0.06846
OM-DML     | 0.28648 | 0.21567 | 0.13078 | 0.34496 | 0.06236
MM-DML     | 0.30247 | 0.21796 | 0.11671 | 0.30356 | 0.05431
U-MMD-DML  | 0.24418 | 0.19869 | 0.11475 | 0.27710 | 0.05630
MMD-DML    | 0.35760 | 0.27923 | 0.17082 | 0.46844 | 0.07453

It can be seen from the results that MMD-DML significantly outperforms all the other methods. MMD-DML's ability to learn a nonlinear transformation for each modality can serve as a reason for the remarkable difference between the performance of our method and that of the MM-DML method. The relatively high performance of the shallow MM-DML on the Cal10 dataset does not scale to larger datasets, where it becomes suboptimal compared with the other methods. This validates that deep models such as MMD-DML can improve the results for large-scale tasks. Moreover, comparing the results of MMD-DML and U-MMD-DML, we find that the supervisory information in MMD-DML improves performance by a large margin.


Figure 5: Evaluation of methods using precision at top-k.

Figure 6: Evaluation of methods using precision-recall curve.


We also compare the methods in terms of k-nearest neighbor (k-NN) classification accuracy for various values of k. The results are summarized in Fig. 7. Our proposed method achieves the highest classification accuracy on all the datasets. According to Table 3 and Figs. 5-7, we can see that our MMD-DML method outperforms the other methods, with a larger margin when the number of classes in the dataset is lower (e.g. the margin between our method and the second-best method is larger on Cal10 than on Cal50).

Figure 7: Evaluation of methods in terms of k-NN classification accuracy.

5.7. The Impact of the Ratio of Pairwise Constraints As mentioned in Section 5.1, we keep a ratio of the pairwise constraints (i.e. supervisory information in the form of similar and dissimilar pairs) to train the supervised methods. We evaluate the methods while changing this ratio and summarize the results in Fig. 8. Several empirical observations can be made from these results. First, MMD-DML performs better than the other methods in most cases. Second, the biggest leap in the performance of the three methods results from the first 20% of the constraints. Third, as mentioned in [23], OMKS becomes nearly saturated after receiving the first 20% of the constraints, and a similar phenomenon happens for the MM-DML and OM-DML methods too. However, MMD-DML keeps taking advantage of supervisory information beyond this level since the MMD-DML model is more flexible and more supervisory information helps it to be trained more properly.


Figure 8: mAP measure w.r.t. the ratio of pairwise constraints.

6. Conclusion In this paper, we proposed the MMD-DML framework for distance metric learning on multi-modal data when supervisory information is available in the form of similar/dissimilar pairs. MMD-DML is capable of learning a complicated nonlinear similarity function on multi-modal data (with heterogeneous modalities). In other words, MMD-DML has the ability to learn intra- and inter-modal high-order statistics from raw features. The high degree of freedom in the MMD-DML hypothesis space is well controlled using an efficient multi-stage pre-training phase. In fact, we first used the properties of multi-modal data to pre-train the network and then fine-tuned it using the supervisory information. Experimental results show the superiority of the proposed method in retrieval and classification tasks. Our method improves the mAP measure on the Cal10, Corel5k, and Indoor datasets by 7.2%, 4.1%, and 0.6% respectively, compared to the second-best method (OMKS).

References

[1] E.P. Xing, M.I. Jordan, S.J. Russell, and A.Y. Ng, Distance metric learning with application to clustering with side-information, in: Advances in neural information processing systems, 2003, pp. 521-528.
[2] K.Q. Weinberger, J. Blitzer, and L.K. Saul, Distance metric learning for large margin nearest neighbor classification, in: Advances in neural information processing systems, 2006, pp. 1473-1480.
[3] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th international conference on Machine learning, ACM, 2007, pp. 209-216.
[4] B. McFee and G. Lanckriet, Learning multi-modal similarity, Journal of machine learning research 12 (2011), 491-523.
[5] M.S. Baghshah and S.B. Shouraki, Metric learning for semi-supervised clustering using pairwise constraints and the geometrical structure of data, Intelligent Data Analysis 13 (2009), 887-899.
[6] J. Hu, J. Lu, and Y.-P. Tan, Discriminative deep metric learning for face verification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1875-1882.
[7] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, Deep metric learning via lifted structured feature embedding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004-4012.
[8] J. Hu, J. Lu, and Y.-P. Tan, Deep transfer metric learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 325-333.
[9] E. Hoffer and N. Ailon, Deep metric learning using triplet network, in: International Workshop on Similarity-Based Pattern Recognition, Springer, 2015, pp. 84-92.
[10] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A.Y. Ng, Multimodal deep learning, in: Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 689-696.
[11] W. Wang, B.C. Ooi, X. Yang, D. Zhang, and Y. Zhuang, Effective multi-modal retrieval based on stacked auto-encoders, Proceedings of the VLDB Endowment 7 (2014), 649-660.
[12] P. Xie and E.P. Xing, Multi-modal distance metric learning, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI), 2013, pp. 1806-1812.
[13] F. Feng, R. Li, and X. Wang, Deep correspondence restricted Boltzmann machine for cross-modal retrieval, Neurocomputing 154 (2015), 50-60.
[14] N. Srivastava and R.R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, in: Advances in neural information processing systems, 2012, pp. 2222-2230.
[15] N. Chen, J. Zhu, and E.P. Xing, Predictive subspace learning for multi-view data: a large margin approach, in: Advances in neural information processing systems, 2010, pp. 361-369.
[16] N. Chen, J. Zhu, F. Sun, and E.P. Xing, Large-margin predictive latent subspace learning for multiview data analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012), 2365-2378.
[17] H. Wang, F. Nie, H. Huang, and C. Ding, Heterogeneous visual features fusion via sparse multimodal machine, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3097-3102.
[18] H. Xia, P. Wu, and S.C. Hoi, Online multi-modal distance learning for scalable multimedia retrieval, in: Proceedings of the sixth ACM international conference on Web search and data mining, ACM, 2013, pp. 455-464.
[19] J. Schmidhuber, Deep learning in neural networks: An overview, Neural networks 61 (2015), 85-117.
[20] W. Wang, X. Yang, B.C. Ooi, D. Zhang, and Y. Zhuang, Effective deep learning-based multi-modal retrieval, The VLDB Journal 25 (2016), 79-101.
[21] P. Wu, S.C. Hoi, H. Xia, P. Zhao, D. Wang, and C. Miao, Online multimodal deep similarity learning with application to image retrieval, in: Proceedings of the 21st ACM international conference on Multimedia, ACM, 2013, pp. 153-162.
[22] E.P. Xing, R. Yan, and A.G. Hauptmann, Mining associated text and images with dual-wing harmoniums, arXiv preprint arXiv:1207.1423 (2012).
[23] H. Xia, S.C. Hoi, R. Jin, and P. Zhao, Online multiple kernel similarity learning for visual search, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014), 536-549.
[24] G.R. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. Jordan, Learning the kernel matrix with semidefinite programming, Journal of machine learning research 5 (2004), 27-72.
[25] N. Chen, S.C. Hoi, S. Li, and X. Xiao, SimApp: A framework for detecting similar mobile applications by online kernel learning, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ACM, 2015, pp. 305-314.
[26] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, Large scale multiple kernel learning, Journal of machine learning research 7 (2006), 1531-1565.
[27] N. Chen, S.C. Hoi, S. Li, and X. Xiao, Mobile app tagging, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, 2016, pp. 63-72.
[28] Y.-Y. Lin, T.-L. Liu, and C.-S. Fuh, Dimensionality reduction for data in multiple feature representations, in: Advances in Neural Information Processing Systems, 2009, pp. 961-968.
[29] P. Wu, S.C. Hoi, P. Zhao, C. Miao, and Z.-Y. Liu, Online multi-modal distance metric learning with application to image retrieval, IEEE Transactions on Knowledge and Data Engineering 28 (2016), 454-467.
[30] P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, Technical report, University of Colorado at Boulder, Dept. of Computer Science, 1986.
[31] M. Welling, M. Rosen-Zvi, and G.E. Hinton, Exponential family harmoniums with an application to information retrieval, in: Advances in neural information processing systems, 2005, pp. 1481-1488.
[32] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936), 321-377.
[33] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, Deep canonical correlation analysis, in: International Conference on Machine Learning, 2013, pp. 1247-1255.
[34] B. Kulis, Metric learning: A survey, Foundations and Trends® in Machine Learning 5 (2013), 287-364.
[35] G. Griffin, A. Holub, and P. Perona, Caltech-256 object category dataset, (2007).
[36] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, Large scale online learning of image similarity through ranking, Journal of machine learning research 11 (2010), 1109-1135.
[37] S.C. Hoi, W. Liu, M.R. Lyu, and W.-Y. Ma, Learning distance metrics with contextual constraints for image retrieval, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2006, pp. 2072-2078.
[38] A. Quattoni and A. Torralba, Recognizing indoor scenes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 413-420.
[39] Y.A. LeCun, L. Bottou, G.B. Orr, and K.-R. Müller, Efficient backprop, in: Neural networks: Tricks of the trade, Springer, 2012, pp. 9-48.

Seyed Mahdi Roostaiyan received his B.S. degree from the Department of Computer Engineering, Shahid Chamran University of Ahwaz, Iran, in 2012, and his M.Sc. degree from the Department of Computer Engineering, Sharif University of Technology, Iran, in 2014. His research interests include machine learning and pattern recognition.

Ehsan Imani is a senior undergraduate student in the Computer Engineering Department, Sharif University of Technology. His research interests include machine learning and data mining. He is particularly interested in deep networks and their applications to various fields such as information retrieval, computer vision, and robot control.

Mahdieh Soleymani Baghshah is an assistant professor in the Computer Engineering Department, Sharif University of Technology. She received her B.S., M.Sc., and Ph.D. degrees from the Department of Computer Engineering, Sharif University of Technology, Iran, in 2003, 2005, and 2010, respectively. Her main research interest is machine learning, particularly deep learning.