
Image to Text Conversion: State of the Art and Extended Work

Abstract—The aim of this article is to study the conversion of information between different modalities (text, image), a need raised by the evolution of human-machine communication, which has introduced modalities that are natural to humans such as gestures, speech, sound and vision. One of the main challenges of this "multimodal" learning is to learn a shared representation between the distinct modalities and to predict the missing data (for example, by retrieval or synthesis) in one modality conditioned on another. Several works address the different types of conversion (Text to Speech, Speech to Picture, Text to Picture and vice versa); in this paper we focus on Text to Picture and Picture to Text synthesis.

Keywords—Modality; learning; image processing; automatic phrase generation; PTT conversion.

I. INTRODUCTION

Artificial intelligence is essentially concerned with building systems that behave and think like human beings, allowing natural communication between humans and machines. Thanks to the evolution of human-machine communication, modalities that are natural to humans have been introduced, and it has therefore become necessary to convert information between these modalities. Several studies have addressed the different types of conversion. In contrast, few have examined Image to Text conversion through image metadata or content prediction, and especially for the Arabic language.

The progress of human-machine communication has also led to new applications such as car navigation aids, design aids for three-dimensional objects and, above all, learning aids, whether for assisting the disabled, foreign-language learners or students. According to several studies, one of the causes of school failure is the pupil's inability to read and to express himself orally and in writing, and the rate of school failure is increasing. Other statistical studies have shown that 86% of children learn letters and sentences quickly when they are accompanied by images. Learning aid is therefore one of the most important fields of application of human-machine communication, as it offers a solution for learners who have difficulty producing phrases in the Arabic language from images, which leads to failure at school. Similarly, according to international statistics, Arabic ranks fourth in number of speakers, so the number of learners of Arabic is increasing and it is necessary to develop a tool that supports learning by generating sentences from images, through their associated metadata or through content prediction.

This paper is organized as follows. A description of the state of the art is presented in Section 2. The details of our contribution are described in Section 3. A practical example using our proposition is given in Section 4. Finally, the conclusion is drawn in Section 5.

II. STATE OF THE ART

A. TTP Synthesis

Nowadays, it is becoming quite interesting to automatically synthesize images from texts using artificial intelligence systems. Much of the work on multimodal learning from texts and images has exploited the retrieval approach, which extracts relevant images from a textual query or vice versa.

1) Conditional production: Recently, recurrent neural network decoders have been exploited to generate textual descriptions conditioned on images [1], conditioning a long short-term memory network [2] on the upper-layer features of a deep convolutional network and using the "MS COCO" dataset [3]. In 2015, this line of work was improved [4] by introducing a recurrent visual attention mechanism, which yielded better results. Recently, recurrent neural networks have also emerged as powerful tools for learning discriminative text feature representations. At the same time, generative adversarial networks (GANs) began to generate very convincing images of particular categories, such as room interiors or faces. Scott Reed et al. [5] introduced a new GAN formulation to model text and images jointly: a system which, starting from detailed descriptions, generates images of flowers and birds. The contribution of this approach is that the proposed model conditions on text descriptions instead of class labels. According to its authors, it is the first end-to-end differentiable architecture from the character level to the pixel level. They also adopted a manifold interpolation regularizer at the level of the GAN generator, which noticeably improved generalization on specific categories. It is a simple and effective approach to image synthesis, thanks to an efficient model for generating images from detailed visual descriptions. Srivastava et al. [6] implemented a deep Boltzmann machine and modeled images and text labels jointly. More recently, several researchers have exploited the ability of deep convolutional decoder networks to generate realistic images; for example, they were used in 2014 [7] for the generator network module.
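The central idea in [5] is to condition the generator on an embedding of the description rather than on a class label. The following is a minimal, illustrative PyTorch sketch of that conditioning step; the layer sizes, the embedding dimension and the name TextConditionedGenerator are our own placeholders, not the architecture of [5].

import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Toy generator that concatenates a noise vector with a text embedding."""
    def __init__(self, noise_dim=100, text_dim=128, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # project (noise + text embedding) to a 4x4 feature map
            nn.ConvTranspose2d(noise_dim + text_dim, 256, 4, 1, 0),
            nn.BatchNorm2d(256),
            nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1),           # 8x8
            nn.BatchNorm2d(128),
            nn.ReLU(True),
            nn.ConvTranspose2d(128, img_channels, 4, 2, 1),  # 16x16 toy output
            nn.Tanh(),
        )

    def forward(self, noise, text_embedding):
        # conditioning: the text embedding is simply concatenated with the noise
        z = torch.cat([noise, text_embedding], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)

# usage: a batch of 4 random codes and (placeholder) sentence embeddings
g = TextConditionedGenerator()
fake = g(torch.randn(4, 100), torch.randn(4, 128))  # -> (4, 3, 16, 16)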

In 2015, other researchers [8] built a deconvolutional network to generate 3D renderings conditioned on a set of graphics codes defining luminosity, shape and position. This approach was later enriched [9] with an encoder network as well as actions: the authors trained a recurrent convolutional encoder-decoder that rotates 3D chair models and human faces according to sequences of rotation actions. Other work [10] used a Laplacian pyramid of generators to synthesize images at multiple resolutions; it produced convincing high-resolution images and also allowed controllable generation, since the generators can be conditioned on class labels. Although they used a standard convolutional decoder, the authors of [11] produced a very efficient and stable architecture that integrates batch normalization to achieve impressive synthesis results.

2) Non conditional production: Some research is not based on conditional production, such as the generative approach that produces answers to questions about the visual content of images [12], later extended with an explicit knowledge base [13]. Zhu et al. [14] applied sequence models jointly to text, in the form of books, and to films, while other work [15] used a variational recurrent autoencoder to generate images from text captions.

B. PTT Synthesis

For a relevant description of images, existing work has typically used two types of approaches.

1) Approach using metadata ("existing text"): This type of method exploits existing text for the description. On the one hand, there are cases where the text comes with the image, as in [16], where the authors produced captions for news images derived from the text of the article containing them, using summarization techniques [17]. On the other hand, there are methods based on retrieval, whose purpose is to gather text relevant to the composition. Some authors [18] used GPS metadata to retrieve text documents relevant to an image. Farhadi et al. [19] base their research on analyzing images into a meaning representation describing the triplet "object, action, scene"; they then use this predicted triplet to retrieve, from a written collection, descriptive sentences that describe similar images. Recent research has also relied on retrieval, using nonparametric methods to compose captions either by transferring whole captions from an enormous database of captioned images [20], or by transferring the relevant individual sentences to form a new caption [21].
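The retrieval-based methods above ([19], [20], [21]) share one mechanism: describe a query image by borrowing the captions of its nearest neighbours in a large captioned collection. The snippet below is a minimal sketch of that idea, assuming a precomputed feature matrix; the feature extractor and the variable names are illustrative, not taken from those papers.

import numpy as np

def transfer_caption(query_feat, db_feats, db_captions, k=3):
    """Return the captions of the k images most similar to the query.

    query_feat : (d,) feature vector of the query image
    db_feats   : (n, d) feature matrix of the captioned database
    db_captions: list of n caption strings
    """
    # cosine similarity between the query and every database image
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    db = db_feats / (np.linalg.norm(db_feats, axis=1, keepdims=True) + 1e-8)
    sims = db @ q
    best = np.argsort(-sims)[:k]           # indices of the k nearest neighbours
    return [db_captions[i] for i in best]  # candidate captions to rerank/compose

# toy usage with random features and dummy captions
feats = np.random.rand(5, 16)
caps = [f"caption {i}" for i in range(5)]
print(transfer_caption(np.random.rand(16), feats, caps, k=2))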

In the context of PTT synthesis, there are also TTS (Text To Speech) systems that take an image as input, convert it into text and then into speech. In order to help blind people read books, newspapers and articles, the authors of [22] used images containing text as the input of the TTS conversion: the image is processed with image-processing techniques to identify each character, and these characters are then converted into voice sounds. To recognize the characters in the image, an algorithm called the 'i' Novel Algorithm is used, whose main module is segment extraction. The programming is done in MATLAB using this character recognition algorithm, and a BeagleBone Black board runs the program, which outputs a text file containing the characters present in the image.

2) Approach for predicting the content of the image: This type of approach constructs descriptions from scratch instead of retrieving existing text. For example, Yang et al. [23] build descriptive texts bottom-up by detecting objects and scenes, then use text statistics to associate verbs expressing the relations between the objects, and finally integrate these descriptions in an HMM framework. In their research [24], Yao et al. address text generation with a complete system based on several hierarchical knowledge ontologies, using a human in the loop to analyze the hierarchical image. Unlike them, the work of Girish Kulkarni et al. [27] and Li et al. [25] detects several objects, their modifiers and their spatial relations in order to generate sentences corresponding to the detected elements, processing the images automatically without human intervention. In other research [27], also aimed at helping blind people, the authors implemented a system that converts images into text and then into speech; this system offers better support for converting images, whether captured or stored, into text and speech. In the first part of this conversion process, PTT, pre-processing and image segmentation techniques are used to obtain a text that is afterwards translated into speech. In the work of Benjamin Z. Yao et al. [28], an approach is proposed for generating texts that describe images and videos through image parsing: it computes a parse graph of the most likely interpretations of an input image, including a structured tree decomposition of the image content into a scene and parts covering all the pixels of the image. Several researchers from the content-based image retrieval (CBIR) field process digital images and extract feature vectors based on low-level image properties (color, shape, texture) [29] [30] [31] [32] [33]. A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain [34] proposed an image model to perform electricity meter readings by capturing them and sending these images as Multimedia Messaging Service (MMS) messages to a server, which performs a series of steps to obtain the text output (a code sketch of these steps is given after the list):

- Reading the received image and converting it into a three-dimensional array of pixels
- Conversion of the color image to black and white
- Suppression of the shading caused by non-uniform light
- Inversion: black pixels become white and white pixels become black
- Elimination of pixels that are neither black nor white
- Suppression of small components
- Conversion to text.
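A minimal OpenCV sketch of this pre-processing chain is given below. It follows the listed steps (grayscale conversion, shading suppression, binarization with inversion, removal of small components); the threshold values and the final OCR call are placeholders and do not come from [34].

import cv2
import numpy as np

def image_to_binary_text_ready(path, min_area=30):
    img = cv2.imread(path)                        # 3-D array of pixels (H, W, 3)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # color -> black and white
    # suppress shading from non-uniform light: divide by an estimated background
    background = cv2.medianBlur(gray, 51)
    flat = cv2.divide(gray, background, scale=255)
    # binarize and invert (text becomes white on black), dropping "gray" pixels
    _, binary = cv2.threshold(flat, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # remove small connected components (noise)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    cleaned = np.zeros_like(binary)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == i] = 255
    # the cleaned image would then be passed to a character recognizer,
    # e.g. pytesseract.image_to_string(cleaned), to obtain the text output
    return cleaned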

For feature extraction, another approach analyzes the edges of the image. In [35], the authors applied an edge-direction method to build an edge-direction histogram: they detect the image edges and then quantize their directions, but the performance remained limited. In [36], the author improved this method by taking the correlations between edges into account through a weighting function.
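As an illustration of the edge-direction idea, the sketch below computes a gradient-direction histogram weighted by edge magnitude (one simple way to realize the weighting mentioned for [36]); the number of bins and the Sobel kernel size are arbitrary choices of ours.

import cv2
import numpy as np

def edge_direction_histogram(gray, bins=36):
    """Histogram of gradient directions, weighted by gradient magnitude."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    direction = np.arctan2(gy, gx)                   # in [-pi, pi]
    hist, _ = np.histogram(direction, bins=bins,
                           range=(-np.pi, np.pi), weights=magnitude)
    return hist / (hist.sum() + 1e-8)                # normalized descriptor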

Girish Kulkarni et al. [27] present a system for automatically generating natural language descriptions of images. This system consists of two parts:

- The first part, content planning, uses detection and recognition algorithms from computer vision, together with statistics extracted from large corpora of visually descriptive text, to determine the best content words to use to describe an image.

- The second part, surface realization, chooses the words to construct natural language phrases according to the predicted content and general statistics of natural language. For sentence generation, estimates of the spatial relations between the objects are used.

To this end, a body of work on image analysis and object detection learns the spatial relationships between labeled parts, whether regions or detections, and these relationships have been exploited as contextual models to improve labeling accuracy [37], [38], [39], [40]. Likewise, Girish Kulkarni et al. [27] combine visually descriptive language statistics with object detections and estimate modifiers through attribute classifiers trained around these object detections, building on various works [41], [42], [43], [44], [45], [46] and using the low-level features of Farhadi et al. [42] to estimate the modifiers.
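The content-planning / surface-realization split can be made concrete with a small sketch: content planning selects (object, preposition, object) triples from detector output, and surface realization turns each triple into a sentence through a template. Everything below (the triple format, the template, the example detections) is illustrative and not the actual model of [27].

def plan_content(detections, relations):
    """Content planning: keep the most confident (obj1, prep, obj2) triples."""
    triples = []
    for (i, j), prep in relations.items():
        a, b = detections[i], detections[j]
        score = a["conf"] * b["conf"]
        triples.append((score, a, prep, b))
    triples.sort(reverse=True, key=lambda t: t[0])
    return triples[:1]                      # keep only the single best triple here

def realize_surface(triples):
    """Surface realization: fill a fixed template with the planned words."""
    sentences = []
    for _, a, prep, b in triples:
        sentences.append(f"There is a {a['modifier']} {a['label']} "
                         f"{prep} the {b['modifier']} {b['label']}.")
    return sentences

# toy detector output: two objects with attribute ("modifier") predictions
dets = [{"label": "cat", "modifier": "brown", "conf": 0.9},
        {"label": "sofa", "modifier": "red", "conf": 0.8}]
rels = {(0, 1): "on"}
print(realize_surface(plan_content(dets, rels)))
# ['There is a brown cat on the red sofa.']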

C. Critical study

The critical study concerns the two families of approaches discussed previously; their main limits are summarized in Tables I and II.

TABLE I. REVIEWS OF SOME APPROACHES USING EXISTING IMAGE METADATA

Approach: A. Farhadi et al. (2010) [19]
Limits: The use of a simplified sentence model instead of an iterative procedure, which allows going further into the sentences and the images thanks to distributional semantics, used to quantify and classify the semantic similarities between linguistic elements according to their distributional properties in a large database.

Approach: V. Ordonez et al. (2011) [20]
Limits: They used nonparametric methods, in which the number of estimated parameters describing the data grows with the number of available samples, unlike conventional methods where this number is fixed in advance.

Approach: P. Kuznetsova et al. (2012) [21]
Limits: The authors chose to use ILP instead of HMM, although it gave results very similar to those obtained with HMM.

Approach: M. Arun et al. (2014) [22]
Limits: Insufficient feature extraction, which affects the accuracy of the results obtained.

TABLE II. REVIEWS OF SOME APPROACHES USING IMAGE CONTENT PREDICTION

Approach: Benjamin Z. Yao et al. (2010) [24]
Limits: It is a complete system, but it relies on lexicalized, domain-specific grammar rules and also requires a very specific and complex meaning representation scheme derived from image processing.

Approach: S. Li et al. (2011) [25]
Limits: They used an approach based on much simpler techniques than previous approaches; the generated sentences are closer to robotic than to human language.

Approach: G. Kulkarni et al. (2013) [27]
Limits: Poor results are usually related to false or missing detections, which may be due to the choice of the object detection technique.

Approach: Y. Shinde and M. Patil (2016) [26]
Limits: The authors perform a pre-processing of the image that can strongly influence the results, since they do not use the original color image but convert it to a gray-level image, where important information provided by the colors can be lost.

III. CONTRIBUTION

Learning aid is one of the most important fields of application of human-machine communication, since it offers a solution, especially in primary education, for pupils who have difficulty producing phrases in the Arabic language from images, which leads to failure at school.

A. General form of the proposed system

In our proposition we build on the work of Yogesh N. Shinde et al. [26], which consists of a PTS (Picture To Speech) conversion system built on an intermediate conversion, the PTT conversion, in which the input image passes through a succession of steps: a pre-processing stage, followed by feature extraction, then edge detection and finally an object recognition phase. After the objects are recognized, their characteristics, such as texture and color, are extracted. Next, the objects' keywords are retrieved from the database and, using the appropriate preposition and the predicted words, the appropriate sentence is constructed; a description of the image is thus generated by the system.

The figure below shows the general form of our contribution.

Fig. 1. General form of the proposed methodology.
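To make the chain of Fig. 1 explicit, the sketch below strings the stages together as plain function calls; every function name here (preprocess, detect_edges, segment, recognize_objects, generate_sentence) is a hypothetical placeholder for the modules detailed in the steps that follow, not an existing API.

def preprocess(image):            # placeholder: denoise / enhance the image
    return image

def detect_edges(image):          # placeholder: e.g. compute an edge map
    return image

def segment(image, edges):        # placeholder: split the image into regions
    return [image]

def extract_features(region):     # placeholder: color / texture descriptors
    return {"color": "brown", "texture": "furry"}

def recognize_objects(regions):   # placeholder: classify each region
    return [{"label": "cat", **extract_features(r)} for r in regions]

def generate_sentence(objects, keyword_db):  # placeholder: Arabic template filling
    return " ".join(keyword_db.get(o["label"], o["label"]) for o in objects)

def picture_to_text(image, keyword_db):
    """Driver reproducing the order of the stages in Fig. 1."""
    pre = preprocess(image)
    edges = detect_edges(pre)
    regions = segment(pre, edges)
    objects = recognize_objects(regions)
    return generate_sentence(objects, keyword_db)

print(picture_to_text("image.jpg", {"cat": "قطة"}))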

To carry out this conversion, a succession of steps is followed:

1) Pre-processing: Pre-processing is an important step in the image-processing domain. It is used to improve the quality and content of an image in order to extract information.

2) Edge detection: It significantly reduces the amount of data and eliminates information that may be considered less relevant, while preserving the important structural properties of the image.
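A minimal sketch of these two steps with OpenCV, assuming Gaussian smoothing for the pre-processing and a Canny detector for the edges; the kernel size and thresholds are arbitrary example values.

import cv2

def preprocess_and_edges(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    smoothed = cv2.GaussianBlur(gray, (5, 5), 0)   # noise reduction (pre-processing)
    edges = cv2.Canny(smoothed, 50, 150)           # edge map (edge detection)
    return smoothed, edges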

3) Segmentation: It allows the extraction of structural information from an image that cannot be seen with the naked eye. The aim is to cut the image into various regions in which the pixels satisfy a certain homogeneity criterion.

4) Feature extraction: In the fields of machine learning, image processing and pattern recognition, feature extraction plays an important role. The extracted features must contain the relevant data of the input. A number of features are extracted for each image, describing its high-level content; then, according to the similarity of these feature vectors, two images can be compared. Two types of methods can be used to extract features:
- extractAll, which extracts features for all the images in the database;
- extractSingleImage, which extracts features for a single image.
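The sketch below shows one simple way to realize extractSingleImage and extractAll with a color-histogram descriptor and a correlation-based comparison; the descriptor choice and the function signatures are our own illustration, not the actual implementation behind those two methods.

import cv2

def extract_single_image(path, bins=(8, 8, 8)):
    """Color histogram descriptor for one image (stand-in for extractSingleImage)."""
    img = cv2.imread(path)
    hist = cv2.calcHist([img], [0, 1, 2], None, list(bins),
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def extract_all(paths):
    """Descriptors for every image in the database (stand-in for extractAll)."""
    return {p: extract_single_image(p) for p in paths}

def similarity(feat_a, feat_b):
    # correlation between histograms: 1.0 means identical color distributions
    return cv2.compareHist(feat_a, feat_b, cv2.HISTCMP_CORREL)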

5) Phrase generation: Based on the extracted characteristics and the semantic database, the sentence in Arabic is generated.
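As an illustration of step 5, the sketch below fills a fixed Arabic template from recognized object keywords and a preposition; the tiny keyword dictionary and the single template stand in for the semantic database and are purely illustrative.

# hypothetical keyword database: recognized labels -> Arabic keywords
KEYWORDS = {"cat": "قطة", "table": "الطاولة", "on": "على"}

def generate_arabic_sentence(obj, prep, place):
    """Fill the template 'There is a <obj> <prep> <place>' in Arabic."""
    return "هناك {} {} {}".format(KEYWORDS[obj], KEYWORDS[prep], KEYWORDS[place])

print(generate_arabic_sentence("cat", "on", "table"))
# -> هناك قطة على الطاولة  ("There is a cat on the table")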

6) Post-verification: The generated sentence is finally checked with external tools and the proposed corrections are taken into account (see Section IV).

IV. EXPERIMENT RESULTS

The figures below show examples of images on which the process of generating a sentence in the Arabic language is carried out.

The automatically produced phrase can then be submitted as:
- a search query (Google),
- a sentence to translate (Google Translate, Bing Translator),
- text for a proofreading tool (MS Word, etc.),
and each time the proposed corrections are taken into account.

Fig. 2. First example.

Fig. 3. Second example.

V. CONCLUSIONS

Our work therefore falls within the context of digital accessibility and automatic language processing, to support people who have trouble producing sentences from images, or learners of the Arabic language who cannot produce a textual representation, and also within the context of artificial intelligence, to enable a machine to produce phrases in the Arabic language from the metadata extracted from images.

According to the study of previous work, there are two main families of approaches that ensure Image to Text conversion: the use of metadata (including text already present in the image) and image description techniques that output text. These works did not address this type of conversion for the Arabic language. For this reason, our future work will focus on Image to Text conversion for the Arabic language.

REFERENCES

[1] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in CVPR, 2015.
[2] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[3] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
[4] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.
[5] S. Reed, Z. Akata, X. Yan, L. Logeswaran, H. Lee, and B. Schiele, "Generative adversarial text to image synthesis," 2016.
[6] N. Srivastava and R. R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," in NIPS, 2012.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in NIPS, 2014.
[8] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox, "Learning to generate chairs with convolutional neural networks," in CVPR, 2015.
[9] J. Yang, S. Reed, M.-H. Yang, and H. Lee, "Weakly-supervised disentangling with recurrent transformations for 3D view synthesis," in NIPS, 2015.
[10] E. L. Denton, S. Chintala, R. Fergus, et al., "Deep generative image models using a Laplacian pyramid of adversarial networks," in NIPS, 2015.
[11] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[12] M. Ren, R. Kiros, and R. Zemel, "Exploring models and data for image question answering," in NIPS, 2015.
[13] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, "Explicit knowledge-based reasoning for visual question answering," arXiv preprint arXiv:1511.02570, 2015.
[14] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in ICCV, 2015.
[15] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov, "Generating images from captions with attention," in ICLR, 2016.
[16] Y. Feng and M. Lapata, "How many words is a picture worth? Automatic caption generation for news images," Proc. Assoc. for Computational Linguistics, pp. 1239-1249, 2010.
[17] L. Zhou and E. Hovy, "Template-filtered headline summarization," Proc. ACL Workshop Text Summarization Branches Out, July 2004.
[18] A. Aker and R. Gaizauskas, "Generating image descriptions using dependency relational patterns," Proc. 28th Ann. Meeting Assoc. for Computational Linguistics, pp. 1250-1258, 2010.
[19] A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth, "Every picture tells a story: Generating sentences for images," Proc. European Conf. Computer Vision, 2010.
[20] V. Ordonez, G. Kulkarni, and T. L. Berg, "Im2Text: Describing images using 1 million captioned photographs," Proc. Neural Information Processing Systems, 2011.
[21] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi, "Collective generation of natural image descriptions," Proc. Conf. Assoc. for Computational Linguistics, 2012.
[22] M. Arun, S. S. Salvadiswar, and J. Sibidharan, "Design and implementation of text to speech conversion for visually impaired using 'i' Novel Algorithm," Journal on Today's Ideas – Tomorrow's Technologies, vol. 2, no. 1, June 2014.

[23] Y. Yang, C. L. Teo, H. Daume III, and Y. Aloimonos, "Corpus-guided sentence generation of natural images," Proc. Conf. Empirical Methods in Natural Language Processing, 2011.
[24] B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu, "I2T: Image parsing to text description," Proc. IEEE, vol. 98, no. 8, Aug. 2010.
[25] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," Proc. 15th Conf. Computational Natural Language Learning, pp. 220-228, June 2011.
[26] Y. N. Shinde and M. Patil, "Translating images into text descriptions and speech synthesis for learning purpose," International Journal for Research in Applied Science & Engineering Technology (IJRASET), vol. 4, no. VI, June 2016.
[27] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "BabyTalk: Understanding and generating simple image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, Dec. 2013.
[28] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu, "I2T: Image parsing to text description," IEEE Transactions on Image Processing, 2008.
[29] I. Kokkinos and P. Maragos, "Synergy between object recognition and image segmentation using the expectation-maximization algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 8, Aug. 2009.
[30] F.-C. Cheng, S.-C. Huang, and S.-J. Ruan, "Illumination-sensitive background modeling approach for accurate moving object detection," IEEE Transactions on Broadcasting, vol. 57, no. 4, Dec. 2011.
[31] D. Joshi, J. Z. Wang, and J. Li, "The Story Picturing Engine: a system for automatic text illustration," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 2, no. 1, Feb. 2006.
[32] M. Hayat, M. Bennamoun, and S. An, "Deep reconstruction models for image set classification," IEEE Transactions on Pattern Analysis and Machine Intelligence.
[33] A. Mian, M. Bennamoun, and R. Owens, "An efficient multimodal 2D-3D hybrid approach to automatic face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 1927-1943, 2007.
[34] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based image retrieval at the end of the early years," IEEE Trans. PAMI, vol. 22, no. 12, 2000.
[35] S. Feng, D. Xu, and X. Yang, "Attention-driven salient edge(s) and region(s) extraction with application to CBIR," Signal Processing, vol. 90, pp. 1-15, 2010.
[36] A. Vailaya, A. Jain, and H. J. Zhang, "On image classification: City images vs. landscape," Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 3-8, 1998.
[37] C. Desai, D. Ramanan, and C. Fowlkes, "Discriminative models for multi-class object layout," Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[38] A. Gupta and L. S. Davis, "Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers," Proc. European Conf. Computer Vision, 2008.
[39] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," Int'l J. Computer Vision, vol. 81, pp. 2-23, Jan. 2009.
[40] A. Torralba, K. P. Murphy, and W. T. Freeman, "Using the forest to see the trees: Exploiting context for visual object detection and localization," Comm. ACM, vol. 53.
[41] T. L. Berg, A. C. Berg, and J. Shih, "Automatic attribute discovery and characterization from noisy web data," Proc. European Conf. Computer Vision, 2010.

[42] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth, "Describing objects by their attributes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[43] V. Ferrari and A. Zisserman, "Learning visual attributes," Proc. Neural Information Processing Systems Conf., 2007.
[44] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, "Attribute and simile classifiers for face verification," Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.

[45] C. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[46] J. Wang, K. Markert, and M. Everingham, "Learning models for object recognition from natural language descriptions," Proc. British Machine Vision Conf., 2009.