Classification of the solutions proposed in the correction of the Arabic words derived using the use of surface patterns

Nejja Mohammed1 and Yousfi Abdellah2

1 Équipe TSE, ENSIAS, Université Mohammed V Souissi Rabat- Maroc
2 Équipe ERADIASS FSJES, Université Mohammed V Souissi Rabat- Maroc

Abstract: Automatic spell checking is one of the most important areas of automatic language processing. The performance of a spell checker varies according to the correction mechanisms it implements. The lexicon used by an automatic spell checker is one of its main weaknesses, because its size is often insufficient. To address this issue, we developed a new approach to the automatic correction of misspelled words in Arabic texts. Our method is based on a corpus of surface patterns and roots whose size is small compared to the lexicons of conventional approaches. With this approach, a whole set of misspelled words can be checked from a single data entry. The results are satisfactory, which shows the value of the method and supports its validity.

Keywords: NLPA, misspelled word, spell checking, edit distance, surface patterns.

1. Introduction

Automatic language processing (NLP) is a discipline that combines linguistics and computer science. It aims at modeling and developing IT solutions to be applied to linguistic data, which is why it has become increasingly widespread as the number of studies carried out in NLP has grown. Automatic spelling correction consists in correcting the errors in a text, using a set of models and methods that search for the words closest to the erroneous one in order to offer correction proposals.

With the evolution that computing has experienced over the last decades, several correction studies have emerged and are available for use, namely:





The n-gram approach decomposes a word into sequences of n consecutive items. Each sequence is compared against a training corpus to produce a similarity index that identifies the words closest to the misspelled one (1).

Another, more recent technique relies on hidden Markov models (HMM): each word in the lexicon is represented by a hidden Markov model (2).

Levenshtein (3) defined a distance that computes the minimum number of elementary operations required to go from one word to another. This distance considers only three types of errors: substitution, insertion and deletion.

K. Oflazer presented an approach based on the notion of error-tolerant recognition with finite-state recognizers (4). It relies on a dictionary represented as a finite-state automaton.

A. Savary (5) presented a correction algorithm based on Oflazer's work, modified so that it sets aside the computation of the cut-off edit distance. It first looks for the entered word in the automaton and, in case of failure, it backtracks. Whenever it returns to a previously visited state, it tries to find another continuation path that admits one of the four edit operations (insertion, inversion, omission and replacement).

Gueddah, Yousfi and Belkasmi (6) proposed an approach that improves the ordering of the solutions proposed for an erroneous word in Arabic documents by integrating matrices of editing-error frequencies into the Levenshtein algorithm.

Bakkali and Yousfi (7) proposed an approach that uses Buckwalter's dictionary of stems to integrate morphological analysis into the Levenshtein algorithm.
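As an illustration of the n-gram approach mentioned first in the list above, the following minimal Python sketch ranks lexicon words by a similarity index computed over character bigrams. The source does not say which index is used; the Dice coefficient chosen here, as well as the function names `ngrams`, `dice_similarity` and `closest_words`, are assumptions made for the example.

```python
def ngrams(word, n=2):
    """Return the character n-grams of a word, in order (hypothetical helper)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient over n-gram sets (assumed index, not the one of the source)."""
    ga, gb = set(ngrams(a, n)), set(ngrams(b, n))
    if not ga and not gb:
        # Both words are shorter than n: fall back to plain equality.
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def closest_words(misspelled, lexicon, n=2, top=3):
    """Return the `top` lexicon words most similar to the misspelled word."""
    ranked = sorted(lexicon,
                    key=lambda w: dice_similarity(misspelled, w, n),
                    reverse=True)
    return ranked[:top]
```

A call such as `closest_words("nigt", ["night", "light", "nacht"], top=1)` ranks "night" first, because it shares the most bigrams with the input.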

Nevertheless, automatic spelling correction systems suffer from a major gap: the lexicons used in their dictionaries are inadequate, since they do not contain all the words of the language. A very large corpus is required to build a dictionary that covers all of the words, and such a corpus is generally difficult to build. Moreover, a dictionary of that size makes the access time prohibitive. In this article, we introduce surface patterns into the Levenshtein approach in order to highlight their usefulness during correction.

2. The Levenshtein Distance

The Levenshtein distance, or edit distance, measures the degree of similarity between two strings. It computes the minimum number of basic operations required to go from one word to another, using the following operations:

- Insertion (جع is corrected to جمع)
- Deletion (شسكت is corrected to سكت)
- Substitution (سمغ is corrected to سمع)

In the mathematical sense, the distance is equal to the minimum number of characters to delete, insert, or replace in order to transform a misspelled word into a dictionary word. If the two words are identical, the distance is zero. The Levenshtein algorithm uses a matrix M of size (N + 1) × (P + 1), where N and P are the lengths of the strings T and S to compare, which allows the distance between T and S to be computed recursively. Each cell M[i, j] is equal to the minimum over the elementary operations:

$$M[i,j] = \min \begin{cases} M[i-1,j] + 1 \\ M[i,j-1] + 1 \\ M[i-1,j-1] + \mathrm{Cost}(i,j) \end{cases} \qquad i \in [1,N],\ j \in [1,P]$$

$$\mathrm{Cost}(i,j) = \begin{cases} 0 & \text{if } T(i) = S(j) \\ 1 & \text{if } T(i) \neq S(j) \end{cases}$$

The Levenshtein distance between S and T is therefore found in M[N, P]. However, this approach has a major drawback, which lies in the limitation of the dictionaries used. These dictionaries often lack certain words, hence the idea of using surface patterns to correct derived words.
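The matrix computation described in this section can be sketched in Python as follows. This is a minimal illustration of the standard algorithm, assuming unit costs for the three operations and the usual initialization of the first row and column (M[i, 0] = i and M[0, j] = j); the function name `levenshtein` is chosen for the example.

```python
def levenshtein(t, s):
    """Edit distance between strings t (length N) and s (length P)."""
    n, p = len(t), len(s)
    # (N + 1) x (P + 1) matrix; row 0 and column 0 hold the cost of
    # building a prefix from the empty string.
    m = [[0] * (p + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        m[i][0] = i
    for j in range(p + 1):
        m[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            cost = 0 if t[i - 1] == s[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,         # delete a character of t
                          m[i][j - 1] + 1,         # insert a character of s
                          m[i - 1][j - 1] + cost)  # keep or substitute
    return m[n][p]
```

Each of the word pairs used as examples in this section is at distance 1 under this function, for instance `levenshtein("سمغ", "سمع")`.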

3. The Surface Pattern

The Arabic pattern essentially determines the structure of most words (nouns, conjugated verbs, etc.). Patterns are variations of the word "فعل" obtained by using diacritics or by adding affixes. For example, the pattern of يَكْتُبُونَ is يَفْعَلُونَ: the letters ف, ع, ل replace the letters of the root of يَكْتُبُونَ. Likewise, the pattern of نَالَ is فَعَلَ (8). This type of pattern cannot represent the morphological variations of a word (for example the noun قَائِلٌ derived from the verb قَالَ), which is why Yousfi proposed an adapted pattern called the surface pattern (9).

Example: The conjugation of the word رَعَى to the active participle in the 1st person singular is رَاعِ; therefore, the surface pattern of the root رَعَى is فَعَى, and فَاعِ is the surface pattern of رَاعِ. The surface pattern of آجِرٌ is آفِعٌ and that of آجِرَاتٌ is آفِعَاتٌ.
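Read in reverse, the examples above suggest that a surface pattern acts as a template in which the letters ف, ع, ل stand for the root letters that remain visible in the word. The Python sketch below illustrates this reading on undiacritized strings. It is an interpretation of the examples, not a procedure stated in the source; the function name `instantiate` is hypothetical, and the sketch assumes that ف, ع, ل never occur in a pattern as ordinary letters.

```python
PLACEHOLDERS = "فعل"  # letters that stand for root letters in a pattern


def instantiate(pattern, letters):
    """Fill the placeholders of a surface pattern with the given letters.

    Placeholders are consumed in the order in which they appear in the
    pattern string; every other character is copied unchanged.
    """
    slots = sum(1 for ch in pattern if ch in PLACEHOLDERS)
    if slots != len(letters):
        raise ValueError("pattern expects %d letters, got %d"
                         % (slots, len(letters)))
    remaining = iter(letters)
    return "".join(next(remaining) if ch in PLACEHOLDERS else ch
                   for ch in pattern)
```

With the diacritics dropped, `instantiate("فاع", "رع")` gives "راع" and `instantiate("آفعات", "جر")` gives "آجرات", which matches the examples of this section.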


4. Morphological Correction by Surface Patterns in the Levenshtein Algorithm

Automatic spelling correction usually relies on large built-in dictionaries that supply the corrector with a large stock of Arabic words. The amount of data stored in these dictionaries weighs on the performance of such systems. In this paper, we discuss a special case, which consists in correcting derived words using the concept of surface patterns in order to preserve the performance of automatic spelling correction. To this end, we developed three approaches based on the identification of the surface patterns nearest to the erroneous word, using a Levenshtein distance adapted to Arabic, where T is the erroneous word and S the surface pattern:

$$M[i,j] = \min \begin{cases} M[i-1,j] + 1 \\ M[i,j-1] + 1 \\ M[i-1,j-1] + \mathrm{Cost}(i,j) \end{cases} \qquad i \in [1,N],\ j \in [1,P]$$

$$\mathrm{Cost}(i,j) = \begin{cases} 0 & \text{if } T(i) = S(j) \text{ or } S(j) \in \{\text{ف, ع, ل}\} \\ 1 & \text{if } T(i) \neq S(j) \text{ and } S(j) \notin \{\text{ف, ع, ل}\} \end{cases}$$

and on the correction of the word through the identified surface patterns, using the correction methods of (10).

[Figure: flowchart. Recoverable labels, in order: the erroneous word; select the corresponding pattern (database of patterns); corresponding pattern exists? (no / yes); particular word; pretreatment; select the corresponding root (database of roots); corresponding root exists? (no / yes); correct the word; correct word.]

Fig 1 Correction process of an Arabic word with the surface patterns

These approaches give satisfactory results, but they have a drawback: the desired solution can sometimes be ranked last among the proposals. To remedy this shortcoming, we extended the approaches so as to favor one solution over another. To do this, we modified the formula used to identify the pattern nearest to the word:

$$M[i,j] = \min \begin{cases} M[i-1,j] + 1 \\ M[i,j-1] + 1 \\ M[i-1,j-1] + \mathrm{Cost}(i,j) \end{cases} \qquad i \in [1,N],\ j \in [1,P]$$

where Cost(i, j) now distinguishes four cases, each combining a comparison between T(i) and S(j) with a membership test in {ف, ع, ل}. [The cost values of the four cases are not recoverable from the source.]

In this way, we filter the proposed solutions that have the same frequency of occurrence so that the desired solution is displayed first.
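The identification of the nearest surface pattern, as described at the start of this section, can be sketched in Python under one assumption: in the adapted distance, a pattern character that is one of ف, ع, ل matches any character of the word at zero cost. The sketch works on undiacritized strings and does not cover the modified four-case formula. The names `pattern_distance` and `nearest_patterns` and the small pattern list in the usage note are hypothetical.

```python
PATTERN_LETTERS = set("فعل")  # placeholders for root letters


def pattern_distance(word, pattern):
    """Levenshtein distance in which ف, ع, ل in the pattern act as wildcards."""
    n, p = len(word), len(pattern)
    m = [[0] * (p + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        m[i][0] = i
    for j in range(p + 1):
        m[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            same = word[i - 1] == pattern[j - 1]
            wildcard = pattern[j - 1] in PATTERN_LETTERS
            cost = 0 if (same or wildcard) else 1
            m[i][j] = min(m[i - 1][j] + 1,
                          m[i][j - 1] + 1,
                          m[i - 1][j - 1] + cost)
    return m[n][p]


def nearest_patterns(word, patterns):
    """Return every pattern at minimal adapted distance from the word."""
    scored = [(pattern_distance(word, pt), pt) for pt in patterns]
    best = min(d for d, _ in scored)
    return [pt for d, pt in scored if d == best]
```

For instance, `pattern_distance("يكتبون", "يفعلون")` is 0, and for the erroneous word "يكتبن" the call `nearest_patterns("يكتبن", ["يفعلون", "فاعل", "فعل"])` keeps only "يفعلون". Several patterns can tie at the minimal distance, which is the situation in which an ordering of the proposed solutions becomes necessary.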

5. Conclusion

Every study of automatic spelling correction tries to achieve the highest performance, but no automatic spelling correction system, including ours, reaches perfect performance. Nevertheless, our approach allowed us to reduce the size of the dictionary used, which benefits the performance of our system while maintaining a higher coverage.

References
1. Approximate string matching with q-grams and maximal matches. Ukkonen, E. 1992, Theoretical Computer Science, pp. 191-211.
2. Représentation de chaînes de caractères par des chaînes induites de Markov. Brucq, D and El Youbi, A. 1996, Actes RFIA, pp. 651-658.
3. Binary codes capable of correcting deletions, insertions and reversals. Levenshtein, V. 1966, pp. 707-710.
4. Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction. Oflazer, K. March 1996, Computational Linguistics, Vol. 22, pp. 73-89.
5. Recensement et description des mots composés – méthodes et applications. Savary, A. 14 Dec 2000, pp. 149-158.
6. Introduction of the Weight Edition Errors in the Levenshtein Distance. Gueddah, H, Yousfi, A and Belkasmi, M. 2012, International Journal of Advanced Research in Artificial Intelligence, pp. 30-32.
7. For an Independent Spell-Checking System from the Arabic Language Vocabulary. Bakkali, H. 2014, International Journal of Advanced Computer Science and Applications, Vol. 5.
8. جامع الدروس العربية. Ghilani, Mustapha (الغلاييني, مصطفى). 1999, s.l.: المكتبة العصرية.
9. The morphological analysis of Arabic verbs by using the surface patterns. Yousfi, Abdellah. 2010, Vol. 7.
10. Correction of the Arabic words derived using surface patterns. Nejja, Mohammed and Yousfi, Abdellah. 2014, El Jadida, Morocco: s.n.