Deep Learning Based Technique for Plagiarism Detection in Arabic Texts

Dima Suleiman
Computer Science Department, Princess Sumaya University for Technology
Teacher at the University of Jordan, Amman, Jordan
[email protected]

Arafat Awajan
Computer Science Department, Princess Sumaya University for Technology
[email protected]

Nailah Al-Madi
Computer Science Department, Princess Sumaya University for Technology
[email protected]

Abstract — Plagiarism detection is very important, especially for academicians, researchers, and students. Although many plagiarism detection tools exist, the task remains challenging because of the huge number of online documents. In this research, we propose to use the word2vec model to detect the semantic similarity between words in the Arabic language, which can help in detecting plagiarism. Word2vec is a deep learning technique that represents words as feature vectors with high precision. The quality of the vector representation depends on the quality of the corpus used in the training phase. In this paper, we used the OSAC corpus for training the word2vec model. Moreover, the cosine similarity measure is used to compute the similarity between word vectors. The similarity measures show that simple changes in a text, such as replacing one word or changing the position of verbs and nouns, still result in a similarity value of about 99%, which makes it possible to detect plagiarism even if the text is altered by replacing words with their synonyms or changing the word order.

Keywords — similarity detection; Arabic Language Processing; Word2Vec; Deep Learning; plagiarism.

I. INTRODUCTION

Contextual word representation is very important for many Natural Language Processing (NLP) applications such as text classification, automatic summarization, information retrieval, query suggestion, and plagiarism detection [19]. Its importance comes from the fact that it facilitates finding relationships between two terms and computing their similarity. To compute the contextual representation of words, the word2vec method is used [1]. Word2vec is a deep learning technique [25] that computes the vector representations of words using a neural network with one linear hidden layer trained on a large dataset. Word2vec trains the model using a sliding window: the neighboring words within the window are taken into consideration to compute the probability of word occurrence, and the window keeps sliding over the whole corpus.


This technique basically trains a model based on a neighborhood window of words in a corpus and then projects the result onto n dimensions, where each word is a vector in the n-dimensional space. The words can then be compared using the cosine similarity of their vectors. There are two word2vec models: the Continuous Bag-of-Words model (CBOW), which predicts the current word from the neighboring words, and the continuous Skip-gram model (Skip-gram), where the current word is the input and the model predicts the surrounding words. In both models, a sliding window is used.

Word2vec has been used in many applications for the English language, such as bilingual machine translation between English and Spanish [8] and part-of-speech tagging (POST) for eight languages, one of which was Arabic [10]. It has also been used in keyword extraction, where keywords are extracted without any prior knowledge about the document domain [13]. Furthermore, word2vec has been used in Arabic language processing. Although Arabic Named Entity Recognition (NER) is a challenging task, [11] presented an NER system with significant results. Another application is building morphological analyzers for Arabic, English, and other languages, since this process depends on vector space embeddings [12][14].

Plagiarism detection is one of the active research topics in NLP. It aims to detect the reuse, reproduction, and/or transformation of text from one form to another [23]. Although there are many tools for plagiarism detection, it remains a challenging topic because of the huge number of electronic documents available today, especially in the academic field. In addition, the performance of plagiarism detection tools for Arabic text is still very weak, or such tools are nonexistent. Existing methods depend mainly on paraphrase detection and keyword matching on small datasets. In this work, the word2vec model is used to learn feature vectors for words and to use these features to compute the contextual similarity between words. To increase the performance of this model, a large corpus is used for training.

The rest of this paper is organized as follows: Section II discusses related work. Background concepts such as vector representation and word2vec are covered in Section III. The most important characteristics of the Arabic language are presented in Section IV. Section V describes the corpus, the preprocessing, and the experimental results, and finally Section VI concludes the paper.

II. RELATED WORKS

Many natural language processing (NLP) techniques are based on counting over words, such as PCFGs [20]. Such models suffer from two problems: first, it is difficult to generalize when words that appear in testing do not exist in training; second, the data overfits as a result of the curse of dimensionality. One approach that deals with the curse of dimensionality is the N-gram model. The curse of dimensionality occurs because the word sequences used in training are most likely different from those that will be used in testing. The reason for focusing on this problem is that the main goal of statistical language modeling is to learn the probability function of word sequences [6]. Every word has a distributed representation that can be learned; after learning, word sequences that are semantically similar to the training sentences are recognized, since each word is represented as a feature vector of real numbers.

High-quality continuous vector representations were computed from large datasets in [1], and their quality was measured using word similarity tasks. The computational cost was reduced and the accuracy was improved when the model was compared with existing neural network models. Two models for word embedding were proposed: the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-gram Model (Skip-gram). CBOW is similar to a feedforward neural network that predicts the current word from the words in its context. Skip-gram also uses a feedforward neural network, but instead of predicting the current word from the context, it uses the current word as input to predict the other words in the same sentence within a certain range around the input word.

Several extensions of the CBOW and Skip-gram models were proposed in [7] in order to increase the training speed and to improve the quality of the vectors. The training speed was increased by subsampling frequent words, and the representation accuracy of infrequent words was improved. Two limitations of the CBOW and Skip-gram models were addressed in [7]: the first is that word order is not taken into account, and the second is that idiomatic phrases such as "Air Canada" cannot be represented. The solution is to use a phrase model instead of a word-only model.

The Bag-of-Words (BoW) representation is a fixed-length feature representation with limitations such as ignoring the order of words and not taking the semantics of words into consideration. Many machine learning algorithms take fixed-length features, such as BoW, as input. Because of these limitations, the authors of [9] proposed to predict the words of a document by representing each document as a dense vector instead of a sparse one (one-hot representations, for example, are sparse). They used variable-length texts (sentences, paragraphs, and documents) to learn fixed-length feature representations with an unsupervised algorithm, the Paragraph Vector. In addition to proposing a new algorithm, they also achieved significant results using the Paragraph Vector for sentiment analysis and text classification.

Representing each word as a feature vector while ignoring word morphology causes problems in many natural languages, especially morphologically rich languages with many rare words and large vocabularies. This problem can be solved by representing characters instead of words as vectors [14]: each character n-gram is represented by a vector, and a word is represented by the sum of the vectors of its character n-grams.

Instead of depending on BoW and text preprocessing, the authors of [17] used word and document embeddings for Arabic text classification. The document vector can be obtained either by using the doc2vec model or by averaging all the word vectors of the document. Experiments showed that document embedding performs better than the traditional representation.

Distributed word representations can be used in many applications such as machine translation [8], keyword extraction [13], part-of-speech tagging (POST) [10], Named Entity Recognition [11], sentiment analysis [15], Twitter spam detection [18], and many others. Bilingual machine translation between English and Spanish was performed with 90% precision [8]. Since machine translation is based mainly on phrase tables and dictionaries, distributed word representations can be used to automate the generation of such tables and dictionaries. Even for missing entries in a dictionary, the model can infer them by using the vector spaces of the two languages and learning a linear projection between the vectors. Although the experiments used English and Spanish, this model can be applied to any language pair. CBOW and Skip-gram were used for representing the words.

Another application is unsupervised part-of-speech tagging (POST) [10]. Unsupervised word embeddings are a valuable feature for supervised learning problems. Eight languages were used, including Arabic and English.

Most keyword extraction methods for research papers require domain knowledge. These methods may use supervised machine learning techniques, linguistic rules related to a certain domain, or a combination of the two. Domain-dependent keyword extraction requires human effort, which is a problem. To overcome this problem, [13] proposed a new keyword extraction method using word embeddings that needs no domain knowledge.

Word2vec has also been used in Arabic language applications such as Named Entity Recognition (NER).

NER means defining categories such as Location, Organization, and Person and then classifying atomic elements into one of these categories [16]. NER in most cases depends on gazetteers, which suffer from low coverage, especially when used for processing social media content. The problem with social media is that it mixes Modern Standard Arabic (MSA) and Dialectal Arabic (DA), and such a combination may result in poor performance [11].

A large number of morphological rules can be discovered using an unsupervised method, which in turn can be used to build a morphological analyzer [12]. The method depends on vector space embeddings and includes word morphological transformations. Experiments were made using six languages, including Arabic and English, and nine datasets.

With the growth of social media users and social media networks such as blogs, forums, and others, sentiment analysis has become very important, especially for Arabic users. Arabic reviews were used to determine the polarity of sentiments: in [15], the authors used Arabic word embeddings as input to sentiment classifiers. The corpus used for training consists of 10 billion words collected using a web crawler. The experimental results showed a significant increase in performance compared with existing methods.

Many techniques have been developed to solve one of the most important problems in social media, which is Twitter spam detection [18]. The difficulty of detecting Twitter spam is related to the fact that spamming activities vary in real life and there are many scenarios. In addition, using URLs to manually detect suspicious tweets is a time-consuming process. The researchers in [18] used deep learning to solve the problem, where word vectors were used to learn the tweet syntax. The results showed that the accuracy was increased.

In search engines and search queries, semantic text similarity is very important. Semantic features can be used to determine text similarity [19]. Similarity here refers to representing the meaning instead of depending on similarity of the syntactic and lexical representation. Word embeddings can be used to represent words as vectors in a semantic space. To our knowledge, there is no such research studying the similarity of words for the Arabic language; thus, the semantic similarity of words is studied here. In addition, semantic similarity is promising for plagiarism detection [23].

Deep learning was used in Persian plagiarism detection, where the words are represented using word2vec [26]. The words are represented as multi-dimensional vectors, and a sentence is represented by combining the vectors of the words forming it. To determine the existence of plagiarism, each pair of sentences is compared; if the similarity between the sentences is high, this is an indication of plagiarism. On the other hand, deep learning and word embeddings have not been used in Arabic plagiarism detection. In [27], an Arabic language plagiarism detection tool was proposed; this tool identifies similarity using heuristics at different logical levels, and comparisons between the proposed method and

existing systems such as Turnitin were made using precision and recall. Moreover, several existing Arabic language plagiarism detection methods were surveyed in [28].

This paper differs from the related work discussed above, since it uses word2vec for the first time for Arabic language plagiarism detection, whereas the existing plagiarism detection methods use different techniques and have many drawbacks.

III. BACKGROUND

In order to understand the main technique used in this paper, a few concepts about vector representation and word2vec must be explained. Instead of dealing with words directly, the word2vec model uses a distributional representation of words and a neural network for learning the vector representation. More details are given in the following subsections.

A. Word Vector Representations

The vast majority of rule-based and statistical NLP algorithms consider words as atomic symbols. However, converting a word into a vector is one of the most important ideas for dealing with NLP problems. In this vector representation, the dimension of the vector equals the vocabulary size; thus, if we have a vocabulary of 10 words, then we have 10 vectors and the size of each one is 10. Each vector consists of the value "1" at one index and zeros at the rest of the indexes. Every word in the vocabulary has an index in the vector; for example, the representation of the third word will have 1 at index three and zeros otherwise. This representation is called "one-hot". There are two problems with the one-hot representation: the first is the sparseness of the space, and the second is the difficulty of computing the similarity between two words from their vectors. Similarity between vectors is very important for NLP. For example, if the vector of the word "dog" is [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] and the vector of the word "cat" is [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0], then the dot product of the two vectors is zero. This means that if the word "cat" is observed in a certain context during training, and during testing the word "dog" appears in the same position, the model cannot reuse that information.
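The following short sketch illustrates this limitation. The vocabulary size, the word indexes, and the use of NumPy are illustrative assumptions for demonstration only, not part of the experiments reported in this paper.

import numpy as np

# Illustrative vocabulary of 15 words; "dog" is placed at index 10 and "cat" at index 3,
# mirroring the one-hot vectors shown above.
vocab_size = 15

def one_hot(index, size):
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

dog = one_hot(10, vocab_size)
cat = one_hot(3, vocab_size)

# The dot product of any two distinct one-hot vectors is always zero,
# so no notion of similarity between "dog" and "cat" can be recovered.
print(np.dot(dog, cat))   # 0.0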

B. Word2vec

Semantic and syntactic similarity between words is very important for many NLP systems. Models that deal with atomic units, such as N-gram models [2], have the advantages of simplicity and robustness and can be trained on huge amounts of data. However, simplicity is not always enough; sometimes more advanced techniques are needed that can improve the performance by training more complex models on larger datasets.

Word vector representations can be computed by applying a neural network to a very large dataset. Mikolov and his colleagues [1] proposed what is called word2vec, which consists of two models: the Continuous Bag-of-Words Model (CBOW) and the Continuous Skip-gram Model (Skip-gram); both use a neural network for representing words as vectors. The quality of the proposed approaches was measured using similarity tasks and compared with previous techniques. The results showed improvements in terms of computational cost and accuracy.

Representing words as vectors makes it simple to perform algebraic operations such as the addition and subtraction of two vectors; for example, the result of vector("King") - vector("Man") + vector("Woman") is close to the vector representation of "Queen" [3]. Performing algebraic operations can help in finding semantic relationships between words; for example, if we subtract the vector of the word "امراة" (woman) from the vector of the word "رجل" (man), or vice versa, the difference will be small, since the two words have similar meanings and differ mainly in gender.

Word2vec uses a neural network model trained in two steps: the first step is to use a simple model to learn continuous word vectors, and the second step is to train an N-gram Feedforward Neural Net Language Model (NNLM) [4] on top of these distributed word representations. The main idea of word2vec is to train high-dimensional word vectors on a huge dataset, which results in vectors that capture the semantic relationships between words. These semantic relationships can be used in many NLP applications such as information retrieval, question answering, text classification, sentiment analysis, part-of-speech tagging, machine translation, keyword extraction, and many others. The following subsections provide brief explanations of both models.
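Before turning to those subsections, the short sketch below illustrates this kind of vector arithmetic using pretrained English GloVe vectors distributed through the Gensim downloader. The specific model name and the gensim 4.x API are assumptions made for demonstration; these are not the vectors trained in this paper.

import gensim.downloader as api

# Load small pretrained English vectors (an illustrative assumption;
# the paper trains its own Arabic vectors on the OSAC corpus instead).
vectors = api.load("glove-wiki-gigaword-100")

# vector("king") - vector("man") + vector("woman") should be closest to "queen".
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # expected to contain 'queen'

# The same vectors expose the gender relation discussed above.
print(vectors.similarity("man", "woman"))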

1) Continuous Bag-of-Words Model

The Continuous Bag-of-Words Model (CBOW) and the feedforward NNLM are similar to each other: both use a bag-of-words model, and in both models all words share the projection layer. Moreover, the non-linear hidden layer is removed in order to decrease the computation time. CBOW uses words from the future and from the history, and the classifier used is a log-linear classifier that classifies the middle (current) word. Furthermore, CBOW uses a continuous distributed representation of the context. Figure 1 displays the CBOW architecture. For a given word $w_t$ with context $\{w_{t-c}, \ldots, w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+c}\}$, the CBOW model works to maximize equation (1) [5]:

$$\frac{1}{|V|} \sum_{t=1}^{|V|} \log p\big(w_t \mid w_{t-c}, \ldots, w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}, \ldots, w_{t+c}\big) \quad (1)$$

where $|V|$ is the number of words in the corpus and $c$ is the context size. The context size depends on the sliding window size: if the sliding window size is 9, then it consists of 9 words, which means that the value of $c$ is 4. Therefore, to predict a certain word, the 4 words before it and the 4 words after it are taken into consideration. After predicting the word, the window slides in order to predict the next word, and so on.

2) Continuous Skip-gram Model

The Skip-gram model is similar to CBOW, but instead of using the context to predict the current word, Skip-gram uses the current word to predict the context. In CBOW, the input of the log-linear classifier is the context words, for example ten words (five future words and five history words), and the output is the current word. In Skip-gram, on the other hand, the input is the current word and the output is the context words, for example five history words and five future words. As the number of words considered in the context increases, the quality improves, but the computational complexity increases too. Distant words are given less weight, since they are less related to the current word. Figure 1 displays the Skip-gram architecture. Given the current word $w_t$, the model works to maximize equation (2) [5] in order to predict the context words $\{w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}\}$:

$$\frac{1}{|V|} \sum_{t=1}^{|V|} \sum_{-c \le j \le c,\; j \ne 0} \log p\big(w_{t+j} \mid w_t\big) \quad (2)$$

The same idea of the sliding window used in CBOW is used in Skip-gram.

Figure 1 - CBOW and Skip-gram models architecture [1]
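To make the window mechanism of both models concrete, the sketch below generates (context, target) training examples for a toy sentence. The tokenization, the window size of 2, and the helper name are illustrative assumptions rather than part of the actual word2vec implementation; for CBOW the context predicts the target, while for Skip-gram the target predicts each context word.

def window_pairs(tokens, c=2):
    # Slide a window of +/- c words over the token sequence and collect
    # one (context, target) example per position.
    pairs = []
    for t, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, t - c), min(len(tokens), t + c + 1))
                   if j != t]
        pairs.append((context, target))
    return pairs

sentence = "the aspirin study was published by a british medical team".split()
for context, target in window_pairs(sentence, c=2):
    print(context, "->", target)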

In both models, the input is not the word itself but the vector of the word, represented using the one-hot representation. For example, if we have the four words "Man", "Woman", "Queen", and "King", then the words can be represented as in Figure 2. However, it is clear from this vector representation that we cannot find relationships between the words. Thus, instead of using the one-hot representation, word2vec uses a distributed representation. In the distributed representation, each word is represented using weights; the word representation is spread over all the vector entries instead of only one entry, where every entry contributes to the definition of many words [22]. Figure 3 shows the vectors of the words "Man" and "Woman" after applying word2vec. The similarity between "Man" and "Woman" is high, whereas the similarity between "Man" and "Table" is low. It can be clearly seen that the weights are very sensitive to the quality of the corpus in addition to the semantic and syntactic relationships between words.


Figure 2 - One-hot representation of six words: Man, Woman, Queen, King, Table and Chair

Figure 3 - Distributed representation of two words: Man, Woman
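The difference between the two representations can also be seen numerically. In the sketch below, the dense vectors for "Man", "Woman", and "Table" are invented toy values chosen only to illustrate that distributed representations yield meaningful cosine similarities, unlike one-hot vectors; they are not vectors learned from any corpus.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy distributed vectors (illustrative values only).
man   = np.array([0.9, 0.1, 0.8, 0.2])
woman = np.array([0.8, 0.2, 0.9, 0.1])
table = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine(man, woman))  # high, close to 1
print(cosine(man, table))  # low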

IV. ARABIC LANGUAGE

In this research, the experiments are made on Arabic documents. One of the important characteristics of the Arabic language is that it is morphologically very rich; thus, morphological analysis is very important for NLP. Morphemes may be of two types: concatenative or templatic. A concatenative word form consists of a stem that may be surrounded by clitics, affixes, or both. There are two types of clitics: the first appears at the beginning of the stem and is called a proclitic, and the second may appear at the end of the stem and is called an enclitic. Affixes, on the other hand, are divided into three types: prefixes, suffixes, and circumfixes. A prefix appears between the stem and the proclitic, suffixes appear between the stem and the enclitics, and circumfixes combine both. Words borrowed from foreign languages, such as "تلفاز" (television), do not have a template; such words can still be learned using word2vec. Moreover, the Arabic language contains a small number of stop words that occur with high frequency, such as pronouns, conjunctions, and prepositions; in most NLP preprocessing, stop words are removed.

V. EXPERIMENTS AND RESULTS

In our research, we propose using the word2vec model to compute a feature vector for every word. The feature vectors of words can be used to compute the similarity between words, which in turn can help in detecting plagiarism. In our experiments, we used the publicly available OSAC Arabic corpus, which consists of 22,429 text documents [24]. The text documents in OSAC are divided into ten categories: Health, Astronomy, Entertainments, Low, Education & Family, History, Sports, Stories, Religious and Fatwas, Economics, and Cooking Recipes. The corpus was collected from different websites, the number of words in the corpus is 18,183,511 (18M) words, and all files in the corpus were converted to UTF-8 encoding.

For our experiments, the whole corpus was used for training the model. For testing the model and detecting plagiarism, we chose documents from the corpus itself; the only preprocessing applied to the test documents was stop word removal. A sample of the text after removing stop words is shown in Figure 4.

In order to build our word2vec model, we used the Gensim API in Python. Word2vec has many parameters that can affect the results, such as the model type (CBOW or Skip-gram), the sliding window size, the number of dimensions of each vector, and the number of epochs. We chose the values of these parameters experimentally: we used the CBOW model, a window size of 10, 100 dimensions, and 10 epochs. The similarity between vectors was computed using the cosine similarity, as in equation (3):

$$\text{Cosine Similarity}(word_1, word_2) = \frac{\vec{w}_1 \cdot \vec{w}_2}{|\vec{w}_1| \, |\vec{w}_2|} \quad (3)$$
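A minimal sketch of how such a model might be built and queried with Gensim is shown below. The corpus path, the whitespace tokenization, the illustrative stop word list, the min_count value, and the gensim 4.x parameter names are assumptions made for demonstration, not the exact scripts used in this work. Likewise, the paper reports similarities between whole sentences without spelling out how word vectors are aggregated; one common choice, also used in related work [17][26], is to average the word vectors of a sentence before applying the cosine similarity of equation (3), and that assumption is what the sketch follows.

import glob
import numpy as np
from gensim.models import Word2Vec

# Hypothetical location of the OSAC text files (one document per file, UTF-8).
corpus_files = glob.glob("osac/**/*.txt", recursive=True)

# Assumed minimal preprocessing: whitespace tokenization and stop word removal.
arabic_stopwords = {"في", "من", "على", "أن", "إلى", "عن"}  # illustrative subset only
documents = []
for path in corpus_files:
    with open(path, encoding="utf-8") as f:
        tokens = [w for w in f.read().split() if w not in arabic_stopwords]
        documents.append(tokens)

# Parameters reported in the paper: CBOW (sg=0), window size 10, 100 dimensions, 10 epochs.
# min_count=5 is an extra assumption, not stated in the paper.
model = Word2Vec(documents, sg=0, window=10, vector_size=100, epochs=10, min_count=5)

# Cosine similarity between two word vectors, as in equation (3)
# (assumes both words occur often enough in the training corpus).
print(model.wv.similarity("يعتبر", "يشكل"))

def sentence_vector(tokens):
    # Average the vectors of the in-vocabulary words (assumed aggregation strategy).
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vectors, axis=0)

def sentence_similarity(sent_a, sent_b):
    a, b = sentence_vector(sent_a.split()), sentence_vector(sent_b.split())
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A similarity close to 1.0 between a suspicious sentence and a source sentence
# suggests plagiarism even when single words were replaced or reordered.
# score = sentence_similarity(original_sentence, suspicious_sentence)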

‫كشف دراسة طبية بريطانية األسبرين يعتبر خطورة‬ ‫يعتقد سابقا ً األخص بالنسبة لكبار السن توصلت‬ ‫الدراسة عقار األسبرين يساعد الحد اإلصابة‬ ‫السكتات القلبية الجلطات الدماغية وراء حاالت‬ ‫النزيف المرضى األخص معدة‬ Figure 4- Original Text

Many changes were made to the text, such as replacing a verb with another verb that has either the same meaning or a different meaning. For example, we replaced the verb "يعتبر" with the verb "يشكل"; since the two words have the same meaning, the similarity between the two sentences remained very high, at about 99%. However, when the verb "يعتبر" was replaced with the verb "ياكل", the similarity became lower, since the two words have different meanings. In addition, we replaced the noun "المعدة" with the noun "االمعاء", and the results showed that the similarity remained high because the two words have similar meanings. We also changed the order of the words, for example using "verb, subject, object" instead of "subject, verb, object" and vice versa: for "كشف دراسة طبية" and "دراسة طبية كشف" we obtained 100% similarity, since the meaning remained the same. The results of the different changes can be seen in Table I. Other changes were also made, such as using the plural instead of the singular or the singular instead of the plural, for example using the word "جلطة" instead of the word "جلطات". As the number of changes increases, the similarity between the sentences decreases.

For example, we made many changes to the original sentence in Figure 4. As a result of these changes, the similarity between the sentences decreased while the semantics remained the same, that is, no change occurred in the meaning. When the original sentence was changed to "يشكل األسبرين خطورة بناء على دراسة قامت بھا لجنة طبية بريطانية يعتقد سابقا ً األخص بالنسبة لكبار السن توصلت الدراسة عقار األسبرين يساعد الحد اإلصابة السكتات القلبية الجلطات الدماغية وراء حاالت النزيف المرضى األخص معدة" and to "عكس معروف األسبرين يساعد الحد االصابة السكتات القلبية تبين سبب خطورة حسب دراسة بريطانية", the similarity became 87.58% and 42.33%, respectively; the second sentence is certainly the better rewrite, since it is a good rephrasing with lower similarity and therefore less plagiarism. Finally, we compared the original sentence with a completely different one, "الجو جميل مناسب رحالت", and obtained about 12% similarity.

The main differences between our model and existing ones are that we look for semantic similarity, not only syntactic similarity, and that our model was trained on a large dataset, so the results are accurate and the precision is increased. Moreover, our model can be used to detect similarity between texts from several domains such as sports, education, health care, and others. More training and experiments must be made in order to obtain more results, and we will use more evaluation methods, such as cross validation and other metrics, to evaluate our model.

TABLE I. Cosine similarity for different changes

Change Type | Original Text | New Text | Cosine Similarity
Verb | يعتبر | يشكل | 99.99%
Verb | يعتبر | يأكل | 97.82%
Order of Words | كشف دراسة طبية | دراسة طبية كشف | 100%
Noun | معدة | االمعاء | 98.47%
Part of Statement | كشف دراسة طبية بريطانية أن األسبرين يعتبر خطورة | يعتبر األسبرين خطورة حسب كشف دراسة طبية بريطانية | 99.93%
Part of Statement | كشف دراسة طبية بريطانية أن األسبرين يعتبر خطورة | يشكل األسبرين خطورة بناء على دراسة قامت بھا لجنة طبية بريطانية | 87.58%
Whole Statement | الجملة االصلية (the original sentence) | عكس معروف األسبرين يساعد الحد االصابة السكتات القلبية تبين سبب خطورة حسب دراسة بريطانية | 42.33%
Whole Statement | الجملة االصلية (the original sentence) | الجو جميل مناسب رحالت | 12.41%

VI. CONCLUSION

Plagiarism detection is a serious task that represents a challenge for researchers. In this research, we proposed to use the word2vec model for this purpose. Word2vec is a deep learning technique that uses a large corpus for training; the output of this model is a set of words represented as n-dimensional vectors. The cosine similarity between the vectors was then used to detect plagiarism. In this case, the similarity between vectors is a contextual similarity, since it depends on the probability of occurrence of words within a certain context. Since the quality of the corpus determines the precision of the vector representation, which in turn affects the precision of plagiarism detection, we used the OSAC corpus in our experiments. Our proposed technique is able to detect the similarity between texts when the changes are limited to replacing single words or changing the order of verbs and nouns; in this case, the experiments show that plagiarism can be detected with a similarity of about 99%.

REFERENCES
[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In ICLR.
[2] Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
[3] Mikolov, T., Yih, W.-T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In NAACL HLT 2013.
[4] Bengio, Y., Ducharme, R., & Vincent, P. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
[5] El Mahdaouy, A., Gaussier, E., & El Alaoui, S. (2016). Arabic text classification based on word and document embeddings. In International Conference on Advanced Intelligent Systems and Informatics.
[6] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
[7] Mikolov, T., Sutskever, I., Chen, K., et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS).
[8] Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv:1309.4168 [cs.CL].
[9] Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China. JMLR: W&CP, volume 32.
[10] Lin, C., Ammar, W., & Levin, C. (2015). Unsupervised POS induction with word embeddings. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 1311–1316, Denver, Colorado.
[11] Zirikly, A., & Diab, M. (2015). Named entity recognition for Arabic social media. In Proceedings of NAACL-HLT 2015, pages 176–185, Denver, Colorado.
[12] Soricut, R., & Och, F. (2015). Unsupervised morphology induction using word embeddings. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 1627–1637, Denver, Colorado.
[13] Jiang, B., Xun, E., & Qi, J. (2015). A domain independent approach for extracting terms from research papers. In M. A. Sharaf et al. (Eds.), ADC 2015, LNCS 9093, pages 155–166. Springer International Publishing. DOI: 10.1007/978-3-319-19548-3_13.
[14] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv:1607.04606 [cs.CL].

[15] Dahou, A., Xiong, S., Zhou, J., Haddoud, M., & Duan, P. (2016). Word embeddings and convolutional neural network for Arabic sentiment classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2418–2427, Osaka, Japan.
[16] Gridach, M. (2016). Character-aware neural networks for Arabic named entity recognition for social media. In Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing, pages 23–32, Osaka, Japan.
[17] El Mahdaouy, A., Gaussier, E., & El Alaoui, S. (2017). Arabic text classification based on word and document embeddings.
[18] Wu, T., Liu, S., Zhang, J., & Xiang, Y. (2017). Twitter spam detection based on deep learning. In ACSW '17, January 31 – February 3, 2017, Geelong, Australia. ACM. DOI: http://dx.doi.org/10.1145/3014812.3014815.
[19] Kenter, T., & Rijke, M. (2015). Short text similarity with word embeddings. October 19–23, 2015, Melbourne, Australia. ACM. DOI: http://dx.doi.org/10.1145/2806416.2806475.
[20] Socher, R. (2014). Recursive deep learning for natural language processing and computer vision. Ph.D. thesis, Stanford University.
[21] Brown, P. F., deSouza, P. V., Mercer, R. L., Della Pietra, V. J., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18.
[22] https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
[23] Hariharan, S. (2012). Automatic plagiarism detection using similarity analysis. The International Arab Journal of Information Technology, 9(4).
[24] Saad, M., & Ashour, W. (2010). OSAC: Open Source Arabic Corpora. In Proceedings of the 6th International Conference on Electrical and Computer Systems (EECS'10), Nov 25–26, 2010, Lefke, North Cyprus.
[25] http://byterot.blogspot.com/2015/06/five-crazy-abstractions-my-deeplearning-word2doc-model-just-did-NLP-gensim.html
[26] Gharavi, E., Bijari, K., Zahirnia, K., & Veisi, H. (2016). A deep learning approach to Persian plagiarism detection. FIRE (Working Notes) 2016: 154–159.
[27] Menai, M. E. B. (2012). Detection of plagiarism in Arabic documents. International Journal of Information Technology and Computer Science, 10, 80–89.
[28] Kahloula, B., & Berri, J. (2016). Plagiarism detection in Arabic documents: Approaches, architecture and systems. Journal of Digital Information Management, 14(2).