Multimodal Reconstruction Using Vector Representation

MULTIMODAL RECONSTRUCTION USING VECTOR REPRESENTATION

Shagan Sah, Ameya Shringi, Dheeraj Peri, John Hamilton, Andreas Savakis, Ray Ptucha
Rochester Institute of Technology, Rochester, NY, USA

ABSTRACT

Recent work has demonstrated that neural embeddings from multiple modalities can be utilized to focus the results of generative adversarial networks. However, little work has been done towards developing a procedure for combining vectors from different modalities for the purpose of reconstructing the input. Generally, embeddings from different modalities are concatenated to create a larger input vector. In this paper, we propose learning a Common Vector Space (CVS) in which similar inputs from different modalities cluster together. We develop a framework to analyze the extent of reconstruction and robustness offered by CVS. We apply the CVS to annotating, generating and captioning images on MS-COCO. We show that CVS is on par with techniques used for multiple-modality embeddings while offering more flexibility as the number of modalities increases.

1. INTRODUCTION

An ambitious goal for machine learning and signal processing research is to represent different modalities of data that have the same meaning with a common latent vector representation. Concepts that are similar lie close together in this space, while dissimilar concepts lie far apart. For example, the word automobile, a photo of a car, and a phrase about driving should all map in close proximity. A sufficiently powerful model should be able to store similar concepts in a similar vector representation or produce any of these realizations from the same latent vector. Successfully mapping the visual, audio, and textual modalities in and out of this latent space would significantly impact the broad task of information retrieval.

The mapping of several modalities into a Common Vector Space (CVS) requires a dimension high enough to support an underlying manifold capable of representing a large number of concepts, as well as enabling the mapping in and out of CVS in a conceptually lossless manner. In a somewhat paradoxical fashion, this common connection space must be sufficiently small so that it does not suffer from the curse of dimensionality. To support the underlying concepts of a successful CVS, this research concentrates on image, word, and sentence pairings going into and coming out of this space.


Fig. 1. Depiction of CVS manifold. Similar multi-modal concepts lie close in CVS. This research explores the modalities of words, sentences and images. Mode-specific embeddings convert input encodings into the CVS. Likewise, mode-specific inverse embeddings convert CVS representations into vector representations suitable for decoders, which in turn synthetically generate outputs.

Our work builds upon recent advances in vector representations [1, 2], generative models [3], image/video captioning [4, 5, 6] and machine translation [7] frameworks to form a multi-modal CVS. We demonstrate the application to both within-domain (image-to-image or text-to-text) and cross-domain (text-to-image or image-to-text) generation tasks. Experimental results provide insight into how vector representations are suitable for advanced image and language embeddings.

The contributions of this work are threefold. Firstly, we formulate a vector space model that acts as a bridge between multiple modalities. This is achieved through a neural embedding model that can merge a wide variety of modalities. Secondly, we demonstrate the ability of CVS to perform multi-modal conditioned image generation. Lastly, we show the application of CVS both to traditional tasks such as image classification and image captioning and to the more challenging tasks of class/caption conditioned image generation.

2. RELATED WORK

There have been significant advances in the area of deep multi-modal representation in the last few years. For example, Srivastava et al. [8] used deep Boltzmann machines to generate tags from images or images from tags.


Sohn et al. [9] introduced a novel information-theoretic objective that was shown to improve deep multi-modal learning for language and vision. Joint language and image learning based on image category was shown in [10], where the joint training was used for zero-shot image recognition and image retrieval. Ngiam et al. [11] used an auto-encoder model to learn cross-modal representations and showed results with audio and video datasets. Sohn et al. [12] introduced a multi-class N-tuple loss and showed superior results on image clustering, image retrieval and face re-identification. Eisenschtat et al. [13] introduced a 2-layer bidirectional network to map vectors coming from two data sources by optimizing a correlation loss. Wang et al. [14] learn joint embeddings of images and text by enforcing margin constraints on the training objectives.

The notion of a latent space where similar points are close to each other is a key principle of metric learning. The representations obtained from this formulation need to generalize well when the test data has unseen labels. Models based on metric learning have been used extensively in the domains of face verification [15], image retrieval [16], person re-identification [17] and zero-shot learning [18]. Recently, Wu et al. [19] leveraged this concept to associate data from different modalities. Our work shares similarities with [19]; however, we focus on generating visual/textual data and on creating a model capable of combining more than two modalities.

3. COMMON VECTOR SPACE

Our method of mapping in and out of the Common Vector Space (CVS) can be envisioned as a generalization of encoder-decoder models. The encoder side, which we call the CVS encoder, is a combination of existing pre-trained encoder models paired with a corresponding embedding network. The embedding network converts traditional encoded vector space data (e.g. FC6, GloVe, etc.) into the CVS, such that different modalities share a common latent representation. Similarly, the decoder side, which we call the CVS decoder, is a combination of an inverse embedding network paired with a pre-trained decoder. This converts the CVS representation into the vector representation required to feed a decoder for each modality. Every forward and inverse embedding network in this formulation can be an independent neural network whose weights are trained to minimize reconstruction errors as different modalities go in and come out of the CVS.

Formally, consider data samples $x_n^p$, where $n = 1, \ldots, N$ indexes the input data samples and $p = 1, \ldots, P$ indexes the modalities, such that each modality has a data space $X_p$. Any two samples $x_n^p$ and $x_m^q$ are related by some relationship. In a simple case, this relationship can take only a positive or a negative form.

In a more generalized setting, this would be the degree of correlation between the samples. For example, in an image captioning dataset, the multiple image-caption pairs each have some degree of correlation. A loss function is defined between the input data sample pairs for all modalities. The objective of the loss function is to learn the CVS encoder function for all modalities such that they embed each of the inputs into a common latent space. The common space enables meaningful comparisons across different modalities by optimizing with respect to a metric of interest. Each modality has a corresponding embedding network function $f$ with parameters $\theta$ that maps the input sample $x_n^p$ to a common latent space $H$:

$$f_p(\theta) : x_n^p \rightarrow h_n \in H \qquad (1)$$

Every embedding network function has a corresponding inverse embedding network function $f^{-1}$, with parameters $\theta$, that transforms the latent representation into a meaningful modality representation appropriate for decoding. The inverse embedding network is trained to minimize the reconstruction error of an input sample:

$$f_p^{-1}(\theta) : h_n \rightarrow \hat{x}_n^p \in X_p \qquad (2)$$
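As an illustration of Eqs. (1) and (2), a minimal sketch of one forward/inverse embedding pair is given below. It assumes PyTorch, and the layer widths and CVS dimension are illustrative choices, not values specified in the paper.

```python
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """f_p: maps a pre-trained modality encoding (e.g. an FC6 or GloVe vector) into the CVS."""
    def __init__(self, in_dim=4096, cvs_dim=512):  # dimensions are assumptions
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, cvs_dim))

    def forward(self, x):   # x_n^p -> h_n
        return self.net(x)

class InverseEmbeddingNet(nn.Module):
    """f_p^{-1}: maps a CVS vector back to the encoding space of a pre-trained decoder."""
    def __init__(self, cvs_dim=512, out_dim=4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cvs_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, out_dim))

    def forward(self, h):   # h_n -> x_hat_n^p
        return self.net(h)
```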

3.1. Encoder-Decoder Reconstructions

Reconstructing a feature space through a predictive model can be challenging, since dimensionality reduction could lead to loss of valuable information. During training, let the input feature $x \in \mathbb{R}^m$ be transformed to some latent space $h \in \mathbb{R}^n$ by learning parameters $\theta = (W_e, b_e)$. If $f$ were a collection of fully connected layers, inversion (i.e. $f^{-1}$) could be achieved with a least squares optimization to estimate $\hat{x}$. Alternatively, by forcing the $W_e$ matrices to be orthogonal, we can use the transpose of the weight matrix, $W_e^T$, as the inverse. If $n < m$, the reduction of the input feature space can be viewed as a restriction of the model search to the latent subspace. Unsupervised learning in the form of a reconstruction loss on the features is introduced. This assists in reconstructing the original input features from the encoded latent representations, while constraining the representations to possess certain desirable properties. The reconstruction loss ($L_r$) is defined as:

$$L_r = \|x^p - \hat{x}^p\|^2 \qquad (3)$$

where $x^p$ and $\hat{x}^p$ are the input and reconstructed feature vectors in the same modality $p$, respectively.
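The sketch below illustrates the transpose-as-inverse idea and the reconstruction loss of Eq. (3). It assumes PyTorch; the feature and latent dimensions, the orthogonality penalty and its weight are illustrative assumptions rather than details given in the paper.

```python
import torch

m, n = 4096, 512                                    # assumed feature and latent dimensions
W_e = torch.nn.Parameter(torch.randn(n, m) * 0.01)
b_e = torch.nn.Parameter(torch.zeros(n))

def encode(x):                                      # f:  x in R^m -> h in R^n
    return x @ W_e.t() + b_e

def decode(h):                                      # f^{-1} approximated by the transpose W_e^T
    return (h - b_e) @ W_e

x = torch.randn(8, m)                               # stand-in batch of pre-trained features
x_hat = decode(encode(x))
recon_loss = (x - x_hat).pow(2).sum(dim=1).mean()   # Eq. (3)
# Encourage W_e W_e^T = I so that the transpose behaves as an inverse.
ortho_penalty = (W_e @ W_e.t() - torch.eye(n)).pow(2).sum()
loss = recon_loss + 1e-3 * ortho_penalty
```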


3.2. Joint Learning

In general, multi-modal joint learning techniques fall into two main groups. The first group comprises an embedding network trained to map one modality into the encoded space of a second modality; the feature space of the second modality is considered the common space between both modalities. The second group learns separate embedding networks to map both modalities into a new common vector space. Our model falls into the second category. Thus, to learn embeddings in the common space, we use different losses that are capable of establishing correlations amongst the embeddings.

Fig. 2. Training and testing modes for learning the common vector space. During training, the encoded input modalities are aligned through a loss function. The learned weights are inverted in the test phase to reconstruct the input modality.

3.3. Weighted Structured Loss

We define a loss function that is based on several positive and negative multi-modal samples. The loss is an extension of the lifted structured embedding loss proposed in [20]. Our extension forces all positive samples to be closer than any negative samples, irrespective of the modalities. $\hat{P}$ and $\hat{N}$ are the positive and negative sample pairs across all modalities, respectively.

$$E = \frac{1}{2|\hat{P}|} \sum_{(i,j)\in\hat{P}} \left[ \lambda_{\hat{N}} \log \left( \sum_{(i,k)\in\hat{N}} \exp(\alpha - d_{i,k}) + \sum_{(j,k)\in\hat{N}} \exp(\alpha - d_{j,k}) \right) + \lambda_{\hat{P}}\, d_{i,j} \right]^2 \qquad (4)$$

where $d_{i,j} = \|x_i - x_j\|_2^2$, $d_{i,k} = \|x_i - x_k\|_2^2$, $d_{j,k} = \|x_j - x_k\|_2^2$, and $\lambda_{\hat{N}}$, $\lambda_{\hat{P}}$ are the weights associated with the negative and positive components of the loss, respectively. For reconstruction, the positive component is weighted more than the negative component.

To train the CVS model, we use the word labels as the anchor modality, such that all images and captions have an associated category. The input is a tuple of three entities: an image, a caption, and a word label. Every tuple creates two input pairs that activate two branches of the network. The weighted structured loss is used to update the gradients of only the branch that is active for the pair.
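A minimal sketch of the weighted structured loss of Eq. (4) is given below, assuming PyTorch. Forming positive pairs from shared anchor-category labels within a batch, and the default values of alpha, lambda_p and lambda_n, are illustrative assumptions rather than the authors' implementation.

```python
import torch

def weighted_structured_loss(emb, labels, alpha=10.0, lambda_p=1.0, lambda_n=0.5):
    """Weighted lifted-structured loss (Eq. 4) over one batch of CVS embeddings.

    emb:    (B, D) embeddings already mapped into the CVS (any mix of modalities).
    labels: (B,)   anchor word-category labels; equal labels form positive pairs.
    """
    d = torch.cdist(emb, emb).pow(2)                    # squared distances d_{i,j}
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    neg = ~same                                         # negative-pair mask, N_hat
    pos_pairs = torch.triu(same, diagonal=1).nonzero()  # positive pairs P_hat with i < j

    terms = []
    for i, j in pos_pairs.tolist():
        neg_i = torch.exp(alpha - d[i][neg[i]]).sum()   # sum over (i,k) in N_hat
        neg_j = torch.exp(alpha - d[j][neg[j]]).sum()   # sum over (j,k) in N_hat
        terms.append((lambda_n * torch.log(neg_i + neg_j) + lambda_p * d[i, j]).pow(2))
    return torch.stack(terms).sum() / (2 * len(pos_pairs))
```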

4. EXPERIMENTS AND DISCUSSION

4.1. Training Details

All experiments are trained for 30 epochs and use a batch size of 200. The margin for the weighted structured loss is 10.0. The Adam optimizer is used with a learning rate of $1 \times 10^{-2}$ and decay parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$) as reported in [21]. Pre-trained models are used to extract vector representations for the different modalities. Each modality (caption, image or word) has an encoder-decoder model that can represent the input data as a vector and can also reconstruct the data from an estimate of this latent representation.
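The following sketch wires the reported hyper-parameters together, reusing the weighted_structured_loss sketch above. The single embedding network and the random stand-in batches are placeholders for the per-modality networks and pre-trained features; only the hyper-parameter values come from the paper.

```python
import torch

embed = torch.nn.Sequential(torch.nn.Linear(4096, 1024), torch.nn.ReLU(),
                            torch.nn.Linear(1024, 512))        # placeholder embedding net
optimizer = torch.optim.Adam(embed.parameters(), lr=1e-2, betas=(0.9, 0.999))

for epoch in range(30):                                        # 30 epochs
    feats = torch.randn(200, 4096)                             # batch size 200 (stand-in features)
    labels = torch.randint(0, 16, (200,))                      # stand-in anchor word categories
    loss = weighted_structured_loss(embed(feats), labels, alpha=10.0)  # margin 10.0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```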

4.2. Evaluation on MS-COCO

MS-COCO [22] has 82,783 training images with 80 object categories. Each image has multiple category labels in addition to the five captions associated with it. The category labels are aimed at object detection and segmentation, and each image on average has three labels. Since we leverage the word label as an anchor modality during CVS training, having multiple word categories per image results in improper positive and negative pairing. We therefore selected 16 label categories with a minimal degree of overlap. For example, Figure 3 (left) shows an image with two of the selected object categories, dog and donut. Such an image cannot be treated as a positive pair for one category while negative for the other; hence, we do not select such images. Figure 3 (right) shows a similar example. The final 16-category word dataset contains 16,000 train and 2,900 test images, each with five captions.

Fig. 3. Example images from MS-COCO demonstrating category overlap.

4.2.1. Caption-to-Image and Image-to-Caption

Fig. 4. Image to caption generation examples.

We train variations of embedding networks to generalize the results across modalities. We use the image and caption modalities for the purpose of cross-modal evaluations. The outputs were evaluated in pairwise fashion. Figure 4 shows examples of captions generated from images along with the reconstructed images. Figure 5 shows qualitative results for caption-to-image generation obtained on the test set. It should be noted that the images generated through the direct FC6 vectors are the upper bound on the quality of the generated images. Table 1 reports quantitative evaluations, which indicate that scaling CVS training to larger datasets is challenging. Although the reconstructions are not perfect, especially in the case of generated images, the generated outputs carry the semantic information from the other modality.
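As an illustration of the cross-modal path evaluated here, the sketch below pushes a caption encoding through the CVS and back out as an image-feature estimate, reusing the EmbeddingNet and InverseEmbeddingNet sketches above; the feature dimensions and the final pre-trained image generator are assumptions, not the authors' exact pipeline.

```python
import torch

caption_dim, image_fc6_dim, cvs_dim = 2400, 4096, 512      # assumed sizes

caption_embed = EmbeddingNet(in_dim=caption_dim, cvs_dim=cvs_dim)
image_inverse_embed = InverseEmbeddingNet(cvs_dim=cvs_dim, out_dim=image_fc6_dim)

caption_vec = torch.randn(1, caption_dim)                  # stand-in sentence encoding
h = caption_embed(caption_vec)                             # caption -> CVS
fc6_estimate = image_inverse_embed(h)                      # CVS -> image FC6 estimate
# fc6_estimate would then be fed to a pre-trained image generator (e.g. [3]) to synthesize an image.
```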

Fig. 5. Caption to image generation examples. Images generated through the direct FC6 vectors are the upper bound on the quality of the generated images.

Table 1. Image generation results from all input modalities.

                        Inception Score [23]
    Direct FC6          7.50 ± 0.83
    Image-to-Image      7.52 ± 0.69
    Word-to-Image       1.15 ± 0.01
    Caption-to-Image    1.96 ± 0.15

Table 1 indicates that our CVS framework has limitations when generalizing to larger datasets. We believe this disparity across modalities arises for two reasons: 1) our captioning dataset is very diverse in terms of object combinations, which poses a significant challenge for common vector learning; and 2) as discussed in the previous section, the embedding size has direct implications for reconstruction quality, yet larger embedding sizes require exponentially more samples to avoid sparse regions in the common space. This presents an interesting trade-off between embedding dimension and reconstruction quality. We visualize the embedding space to demonstrate the sparsity in the next section.

4.2.2. MS-COCO Embeddings in CVS

Fig. 6. t-SNE visualization of the common vector space on a validation set. Red, black and blue colors indicate captions, images and word categories, respectively.

Figure 6 shows 2-dimensional t-SNE [24] visualizations of the individual input modalities and the CVS on the test dataset. In addition to being very sparse, the caption and image input spaces do not always form distinct regions for each category. Since the word vectors are pre-trained GloVe representations, they tend to form distinct regions. The CVS framework performs well for certain objects; for example, the giraffe and stop sign categories are well separated and form their own independent clusters. Despite good empirical results, visual analysis shows that several concepts form mixed clusters, indicating the presence of shared class information which could benefit from larger training sets.
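A minimal sketch of how such a t-SNE projection of CVS embeddings can be produced, assuming scikit-learn and matplotlib; the random arrays below are placeholders for the caption, image and word-category CVS vectors.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

caption_emb = np.random.randn(500, 512)   # placeholders for CVS embeddings
image_emb = np.random.randn(500, 512)
word_emb = np.random.randn(16, 512)

coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(
    np.vstack([caption_emb, image_emb, word_emb]))

n_c, n_i = len(caption_emb), len(image_emb)
plt.scatter(coords[:n_c, 0], coords[:n_c, 1], c="red", s=5, label="captions")
plt.scatter(coords[n_c:n_c + n_i, 0], coords[n_c:n_c + n_i, 1], c="black", s=5, label="images")
plt.scatter(coords[n_c + n_i:, 0], coords[n_c + n_i:, 1], c="blue", s=20, label="word categories")
plt.legend()
plt.show()
```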

5. CONCLUSION

This research advances the understanding of vector representations for the purpose of input embedding and output reconstruction across multiple modalities. The proposed model demonstrates flexibility in performing cross-modal image and text generation. This work advances the area of caption conditioned image generation by allowing the common vector space to be shared between vision and language representations. Both inception scores and empirical analysis demonstrate the potential impact of CVS. A visual investigation into the latent representation indicates significant overlap in some categories, which is an area for future research.


6. REFERENCES

[1] Ryan Kiros et al., "Skip-thought vectors," in NIPS, 2015, pp. 3294–3302.

[2] Quoc V. Le and Tomas Mikolov, "Distributed representations of sentences and documents," in ICML, 2014, vol. 14, pp. 1188–1196.

[3] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune, "Plug & play generative networks: Conditional iterative generation of images in latent space," arXiv preprint arXiv:1612.00005, 2016.

[4] Jeffrey Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in CVPR, 2015, pp. 2625–2634.

[5] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," arXiv preprint arXiv:1502.03044, 2015.

[6] Subhashini Venugopalan et al., "Sequence to sequence - video to text," in ICCV, 2015, pp. 4534–4542.

[7] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.

[8] Nitish Srivastava and Ruslan R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in Advances in Neural Information Processing Systems, 2012, pp. 2222–2230.

[9] Kihyuk Sohn, Wenling Shang, and Honglak Lee, "Improved multimodal deep learning with variation of information," in Advances in Neural Information Processing Systems, 2014, pp. 2141–2149.

[10] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele, "Learning deep representations of fine-grained visual descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 49–58.

[11] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng, "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696.

[12] Kihyuk Sohn, "Improved deep metric learning with multi-class n-pair loss objective," in Advances in Neural Information Processing Systems, 2016.

[13] Aviv Eisenschtat and Lior Wolf, "Linking image and text with 2-way nets," arXiv preprint arXiv:1608.07973, 2016.

[14] Liwei Wang, Yin Li, and Svetlana Lazebnik, "Learning deep structure-preserving image-text embeddings," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[15] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.

[16] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus, "Deep image retrieval: Learning global representations for image search," in European Conference on Computer Vision. Springer, 2016, pp. 241–257.

[17] Alexander Hermans, Lucas Beyer, and Bastian Leibe, "In defense of the triplet loss for person re-identification," arXiv preprint arXiv:1703.07737, 2017.

[18] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng, "Zero-shot learning through cross-modal transfer," in Advances in Neural Information Processing Systems, 2013, pp. 935–943.

[19] Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston, "StarSpace: Embed all the things!," arXiv preprint arXiv:1709.03856, 2017.

[20] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese, "Deep metric learning via lifted structured feature embedding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.

[21] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[23] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.

[24] Laurens van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
