Tokenizing an Arabic Script Language


Siamak Rezaei
School of Computer Science, McGill University

Abstract

In any natural language processing (NLP) project, the input text must be tokenized before morphological analysis or parsing. For Arabic script languages, tokenization faces additional problems and plays a correspondingly more crucial role in NLP systems. In this work we elaborate on some of these problems and present solutions to them. The research is based on a project for tokenizing and parsing Persian, Arabic and Kurdish texts.

1 Introduction

Persian uses an extension of the Arabic alphabet (with four additional letters) and is written from right to left. In Persian script, as in Arabic, some of the vowels are represented by diacritics, which are usually left out of the texts that appear in books and newspapers. The “e” sound (kasre) in Persian is used for (1) attaching modifiers such as adjectives to the head noun, e.g. “masjed-e jame” (the Jame mosque), and (2) attaching genitives to the head noun, e.g. “masjed-e jame-e shiraz” (the Jame mosque of Shiraz). Because this vowel is not written in Persian texts, parsing and tokenization of the text become ambiguous. An analogous problem exists in the Arabic script, where the “e” vowel is not usually represented in newspapers and books (religious texts and elementary education material are the exceptions).

In the modified Arabic script of Kurdish, a new letter has been added to the alphabet to represent the “e” vowel. Vowels such as “a” and “o” also have a corresponding letter in the modified Kurdish alphabet (Gautier, 1998). The “a” and “o” vowels do not play a crucial grammatical disambiguation role in Kurdish or Persian, but in Arabic “a” (fathe) and “o” (zamme) are used to mark the object and the subject respectively.

This kind of grammatical ambiguity can be handled by a language model and by using semantics in the parsing stage. There remain other problems that are common to most Arabic script languages. In this work we discuss some of these problems and present solutions to them. We concentrate on punctuation, capitalization, abbreviations, acronyms, and sentence and word boundaries for Arabic script languages.

2 Capitalization, Abbreviations and Acronyms

Persian script, like other Arabic scripts, distinguishes between the initial, medial, final and isolated forms of a letter (see Figure 1). English texts, in contrast, distinguish between capital and non-capital forms. Capitalization in English marks proper nouns (e.g. Lebanon), the beginning of sentences, and acronyms (e.g. BBC).

Figure 1: Final, Medial and Initial Forms in Kurdish Arabic Script

Arabic scripts do not use capitalization for proper nouns, the beginning of sentences, or acronyms. This introduces an extra level of ambiguity in the parsing of Arabic script languages. However, Arabic script languages use the initial, medial, final and isolated forms in a special way that helps the disambiguation process.

Traditional approaches to the transliteration of Arabic script texts ignore this “Arabic style capitalization” rule and use a single Latin letter for each letter of the Arabic script alphabet, regardless of whether it appears in initial, medial, final or isolated form. For example, the Persian acronym for BBC would be represented as /bybysy/. A Persian writer, however, is aware that the /y/ letters in /bybysy/ are in final form and do not attach to the next letter. In other words, the transliteration of BBC in Persian should take into account a control character (short space) that signals final forms in the script. This control character is called the zero-width non-joiner (ZWNJ) in the Unicode literature. We use ˜ to represent this short space in the transliteration of Arabic script languages such as Persian (Rezaei, 1998), so BBC is transliterated as /by˜by˜sy˜/.

A corresponding table for the translation of all Latin letters can be constructed (Table 1) to help the tokenizer recognize the English acronyms used in Arabic script texts (e.g. /by˜by˜sy˜/) and translate them into their correct form (e.g. BBC). The Arabic style capitalization rule thus helps the tokenizer. Some native Persian acronyms also follow this rule: /p˜k˜k˜/ represents the acronym corresponding to PKK in English. Still, there are other acronyms in Persian that are not captured by the above rules. The acronym for CIA, for example, is /sya/ in Persian and should be included in the lexicon (see footnote 1).
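As a rough illustration of how the letter names of Table 1 could drive acronym recognition, the following Python sketch decodes a transliterated sequence of Latin letter names into candidate Latin acronyms. The function name `decode_acronym`, the ASCII tilde (~) standing in for the short-space marker, and the decision to return every candidate when a letter name is ambiguous are assumptions made for this example; they are not taken from the Perl implementation described in this paper.

```python
from itertools import product

# Letter names transcribed from Table 1. A name shared by two Latin
# letters (e.g. "ay" for E and I) maps to both candidates.
LETTER_NAMES = {
    "a": "A", "by": "B", "b": "B", "sy": "C", "d": "D", "ay": "EI",
    "af": "F", "jy": "GJ", "g": "G", "hash": "H", "ka": "K", "al": "L",
    "am": "M", "an": "N", "av": "O", "py": "P", "k": "Q", "ar": "R",
    "as": "S", "ty": "T", "yv": "U", "vy": "V", "v": "W", "ayks": "X",
    "vay": "Y", "z": "Z",
}

SHORT_SPACE = "~"  # ASCII stand-in for the short-space marker (assumption)


def decode_acronym(translit):
    """Return every Latin acronym that a short-space-separated
    sequence of letter names may spell (empty list if none)."""
    names = [n for n in translit.split(SHORT_SPACE) if n]
    if not names or any(n not in LETTER_NAMES for n in names):
        return []  # not a pure sequence of Latin letter names
    choices = (LETTER_NAMES[n] for n in names)
    return ["".join(letters) for letters in product(*choices)]
```

Under these assumptions `decode_acronym("by~by~sy~")` returns `["BBC"]`, while a sequence containing an ambiguous name such as /jy/ yields one candidate per Latin letter, leaving the choice to later stages. A lexical acronym such as /sya/ is not decoded and still has to come from the lexicon.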

Latin letter    Persian transliteration
A               a
B               by or b
C               sy
D               d
E               ay
F               af
G               jy or g
H               hash
I               ay
J               jy
K               ka
L               al
M               am
N               an
O               av
P               py
Q               k
R               ar
S               as
T               ty
U               yv
V               vy
W               v
X               ayks
Y               vay
Z               z

Table 1: Acronyms: Latin vs Persian

Footnote 1: The transliteration used for Persian script consists of the following letters, in alphabetical order: a (alif), b, p, t, c, j, G, H, K, d, Z, r, z, J, s, S, C, x, T, X, e (ain), Q, f, q, k, g, l, m, n, v, h, y, i (hamze). ˜ represents the short space.

3 Punctuation and Sentence Boundary

In addition to the Arabic style capitalization rule, Persian uses the stop in some acronyms (e.g. /ar.py.jy/ for RPG), especially in academic texts. The stop also marks a sentence boundary in Arabic script languages. Certain punctuation marks in Arabic script languages, such as the exclamation mark and the question mark, represent sentence boundaries unambiguously. Other punctuation marks, such as the comma, quotes, brackets and colon, can also be used to indicate boundaries when tokenizing Arabic script texts. Apart from these marks, the slash (/) is used in dates and numbers, and the dash (-) is used to separate the parts of compound words.

The ambiguous and unambiguous punctuation marks, and the contexts in which they can occur, can be represented in a formal notation such as a finite state model. Using this notation one can write a program (e.g. in Perl) for tokenizing dates, numbers, acronyms and abbreviations based on these punctuation marks, the short space (˜), and the closed class of words discussed in the previous section for acronyms and abbreviations.

If there is any ambiguity, the tokenizer should generate all the possible forms. For example, when a stop is ambiguous between the end of a sentence and an acronym, the tokenizer should generate one form for each reading. Later stages of parsing and analysis will remove the incorrect alternatives. Note that a similar ambiguity can arise in English between the stop in “e.g.” and the stop at the end of a sentence.
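The policy of emitting every reading of an ambiguous stop can be sketched as follows. This is a minimal Python illustration, not the authors' Perl code: the function name `stop_readings` and the tag names `WORD`, `ABBREVIATION` and `SENTENCE_END` are hypothetical, and the input is assumed to be a single space-delimited chunk in Latin transliteration.

```python
def stop_readings(chunk):
    """Return every reading of one space-delimited chunk.

    Each reading is a list of (text, tag) pairs. A stop inside the
    chunk is taken as abbreviation-internal; a trailing stop is
    ambiguous, so both readings are returned for later stages.
    """
    def tag(text):
        # A chunk with an internal stop is treated as an abbreviation.
        return "ABBREVIATION" if "." in text else "WORD"

    if not chunk.endswith("."):
        return [[(chunk, tag(chunk))]]

    body = chunk[:-1]
    return [
        [(chunk, "ABBREVIATION")],                  # stop belongs to the token
        [(body, tag(body)), (".", "SENTENCE_END")],  # stop ends the sentence
    ]
```

A chunk such as `ar.py.jy` gets a single reading, whereas `Alm.` gets two, and the parser or lexicon is expected to discard the wrong one.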

4 Space and Word Boundary

One main task of the tokenizer is to detect word boundaries in the text and segment the text for the later stages of the text processing system. As in English, word boundaries in Arabic script languages can be delimited by a space or a punctuation mark. But in Arabic script languages, Arabic style capitalization (discussed earlier) plays a significant role in tokenizing texts in which words or affixes are not properly separated from each other by spaces. In handwritten texts it is very common for the space between words to disappear. Even in online newspapers and printed material one finds instances in which a word comes immediately before another word without a separating space. In addition, an affix is sometimes detached from the word to which it should attach. This happens especially in Persian for a limited set of affixes such as the aspect (continuous) marker /my/ in /myrvm/ (I am going), from the root /rv/ (to go), which can be written as /myrvm/, /my˜rvm/ or even /my rvm/.

As a general rule, two concatenated words can be put into separate tokens if the first part ends in a final form character where one expects a medial form. Here again the short space marker (˜) plays a crucial role in tokenizing cases where the boundary space between two words is missing. Note that in Arabic script languages certain letters have no medial form (alef, dal, zal, re, ze, je, waw), and the algorithm for Arabic style capitalization must take this into account. If such a letter ends the first part of a concatenated pair, the tokenizer cannot determine the word boundary from the letter form. In this case the algorithm should rely on lexical information for further disambiguation.

In Persian, compound nouns like (masjed alagsa) mostly appear without a space separating the two parts of the complex entity (head and adjoined part). If the two parts of a compound are separated in the text (or by the tokenizer), a post-tokenizer or parser should re-attach them later. The same holds when an affix like /my˜/ is detached from its root in the text or by the tokenizer. A post-tokenizer module can deal with these cases. Similarly, the two parts of a complex verb (Karimi-Doostan, 1997) like “zolm kardan” (to oppress) should be joined later if they are separated in the text or by the tokenizer. Complex verbs are characteristic of non-Semitic languages, such as the Iranian and Turkic languages, that have borrowed extensively from Arabic. Arabic itself uses a rich set of semantic templates for its vocabulary of nouns and verbs.
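The word-boundary rule of this section can be sketched in Python as below. The sketch works on the Latin transliteration, uses an ASCII tilde (~) as a stand-in for the short-space marker, and takes a plain set of strings as its lexicon; the function name `split_candidates`, the letter set `NON_JOINING` (built from the transliteration letters for alef, dal, zal, re, ze, je and waw) and the example words are assumptions for illustration only.

```python
SHORT_SPACE = "~"  # ASCII stand-in for the short-space marker (assumption)

# Transliteration letters with no medial form: alef (a, A), dal (d),
# zal (Z), re (r), ze (z), je (J), waw (v).
NON_JOINING = set("aAdZrzJv")


def split_candidates(chunk, lexicon):
    """Return the possible segmentations of one space-delimited chunk."""
    # A short-space marker shows a final form where a medial form was
    # expected, so the chunk is split there.
    if SHORT_SPACE in chunk:
        return [[part for part in chunk.split(SHORT_SPACE) if part]]

    # Otherwise the letter form gives no evidence after a non-joining
    # letter; fall back on the lexicon and keep the unsplit reading too.
    candidates = [[chunk]]
    for i in range(1, len(chunk)):
        left, right = chunk[:i], chunk[i:]
        if left[-1] in NON_JOINING and left in lexicon and right in lexicon:
            candidates.append([left, right])
    return candidates
```

For `my~rvm` the sketch returns the single split `["my", "rvm"]`, which a post-tokenizer would re-attach. For a chunk such as `drKanh` with both `dr` and `Kanh` in the lexicon it returns the unsplit and the split reading, leaving the decision to later stages.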

5 Proper Nouns

In Arabic script languages proper nouns are not marked. Contextual information helps to tag a constituent as a proper noun. To process proper nouns, one can develop a small grammar of potential proper-noun markers for Persian and other Arabic script languages (see footnote 2). In Persian there is no general rule for distinguishing proper nouns from other nouns, but there are some heuristics that one can use. In this section we use regular grammars annotated with confidence measures: (U) Unlikely = 20%, (P) Possible = 50%, (V) Very likely = 80% and (D) Definite = 100%.

/Haj/, /syd/, /myr/ and /SyK/ mark the beginning of proper names and /Kan/ marks the end.

(D) Haj #FORNAME -> #FORNAME
(D) Haj #SURNAME -> #SURNAME
(D) syd #FORNAME -> #FORNAME
(D) myr #SURNAME -> #SURNAME
(D) SyK #FORNAME -> #FORNAME
(D) #FORNAME Kan -> #FORNAME

Adding /zadh/, /pvr/, /af/ or /yan/ to a proper noun (PN) makes a SURNAME. E.g. ahmd + zadh = ahmdzadh.

(V) #FORNAME + zadh -> #SURNAME
(V) #FORNAME + pvr -> #SURNAME
(P) #FORNAME + af -> #SURNAME
(U) #FORNAME + f -> #SURNAME
(U) #FORNAME + yan -> #SURNAME
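One way to read the confidence-labelled suffix rules is as a lookup that scores a candidate word. The Python sketch below is an illustration under stated assumptions: the function name `surname_confidence`, the representation of known forenames as a set, and the choice to keep the highest-scoring rule are hypothetical and not described in this paper. The numeric values are the percentages given for (U), (P), (V) and (D) in this section.

```python
# Confidence labels as defined in this section.
CONFIDENCE = {"U": 0.2, "P": 0.5, "V": 0.8, "D": 1.0}

# (suffix, label) pairs transcribed from the surname rules above.
SURNAME_SUFFIXES = [
    ("zadh", "V"),
    ("pvr", "V"),
    ("af", "P"),
    ("f", "U"),
    ("yan", "U"),
]


def surname_confidence(word, forenames):
    """Return the best confidence that `word` is forename + suffix."""
    best = 0.0
    for suffix, label in SURNAME_SUFFIXES:
        stem = word[: -len(suffix)]
        # The rule fires only when the remainder is a known forename.
        if word.endswith(suffix) and stem in forenames:
            best = max(best, CONFIDENCE[label])
    return best
```

With `forenames = {"ahmd"}`, the word `ahmdzadh` scores 0.8 and `ahmdyan` scores 0.2, while a word that matches no rule scores 0.0.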

Footnote 2: Glover and Knight (1998) and Arbabi et al. (1994) have concentrated on translating Arabic proper nouns into English.

/Abad/ marks the end of place names. E.g. aslam + Abad = aslamAbad.

(P) #Noun + Abad -> #GEOG

Adding /y/ to the end of a place name (GEOG) or a proper name makes a SURNAME. E.g. njf + y = njfy. /yy/ is added instead when the name ends in a vowel.

(V) #GEOG + y -> #SURNAME
(V) #FORNAME + y -> #SURNAME

/abval/, /ebd/, /ebdal/, /abn/, /bnt/, /abv/ and /al/ mark the beginning of Arabic proper names. /aldyn/ and /allh/ mark the end of Arabic names.

(D) abval + #FORNAME -> #FORNAME
(D) ebd + #FORNAME -> #FORNAME
(D) ebdal + #FORNAME -> #FORNAME
(D) ebval + #FORNAME -> #FORNAME
(D) bnt + #FORNAME -> #PN
(D) abv + #FORNAME -> #PN
(V) al + #FORNAME -> #FORNAME
(V) bn + #FORNAME -> #FORNAME
(D) am˜al + #FORNAME -> #FORNAME
(D) #FORNAME + allh -> #FORNAME
(D) #FORNAME + aldyn -> #FORNAME

/mk/ marks the beginning of foreign names, e.g. Macdonald. /brg/, /bvrg/, /svn/ and /sn/ mark the end of foreign SURNAMEs or names.

(D) ˜mk + #FORNAME -> #SURNAME
(V) #FORNAME + svn -> #SURNAME
(P) #FORNAME + sn -> #SURNAME
(D) #PN + ska -> #SURNAME
(V) #PN + sky -> #SURNAME
(D) #PN + bvrg -> #GEOG
(P) #PN + brg -> #GEOG

A list of patterns for recognizing potential proper names from address forms such as “Mr. X, President of Y Inc.” has also been developed to tag items as proper nouns. When there is an unknown word in the text, the system can apply the proper name recognizer to tag it as a proper name. Recognition proceeds in multiple stages, with basic tags applied first, e.g. first name, last name, company start word, company end word, title.

#FORNAME? #SURNAME #TITLE -> #PN

Amvzgar -> #TITLE
astandar -> #TITLE
aeXay -> #TITLE
byvh -> #TITLE
Kbrngar -> #TITLE1
dbyrkl -> #TITLE
riys -> #TITLE
rvznamh˜ngar -> #TITLE
riys˜jmhvry -> #TITLE
rhbr -> #TITLE
sKngv -> #TITLE
srdbyr -> #TITLE
srmrby -> #TITLE
srprst -> #TITLE
Saer -> #TITLE
Shyd -> #TITLE
exv -> #TITLE
karKanh˜dar -> #TITLE
mdyr -> #TITLE
mdyrkl -> #TITLE
mdyreaml -> #TITLE
mrby -> #TITLE
mrbyan -> #TITLE
mrHvm -> #TITLE
mrHvmh -> #TITLE
msivl -> #TITLE
mSavr -> #TITLE
meavn -> #TITLE
mntqd -> #TITLE
nKst˜vzyr -> #TITLE
nvysndh -> #TITLE
nmayndh -> #TITLE
Shrdar -> #TITLE
vzyr -> #TITLE
vkyl -> #TITLE
vlyehd -> #TITLE

#TITLE1 #FORNAME? #SURNAME -> #PN

Aqay -> #TITLE1
Aqay˜dktr -> #TITLE1
Ayt˜allh -> #TITLE1
Ayt˜allh aleXmy -> #TITLE1
astvar -> #TITLE1
amam -> #TITLE1
banv -> #TITLE1
bh˜asm -> #TITLE1
bh˜nam -> #TITLE1
bh˜namhay -> #TITLE1
trjmh: -> #TITLE1
trjmh -> #TITLE1
tymsar -> #TITLE1
Hxrt˜Ayt˜allh -> #TITLE1
Hxrt˜Ayt˜allh aleXmy -> #TITLE1
Kanm -> #TITLE1
Kanm˜dktr -> #TITLE1
dktr -> #TITLE1
dryadar -> #TITLE1
saKth -> #TITLE1
srgrd -> #TITLE1
srtyp -> #TITLE1
srvan -> #TITLE1
grvhban -> #TITLE1
naSr: -> #TITLE1
nvSth: -> #TITLE1
nvSth -> #TITLE1
mvlf: -> #TITLE1
mvlfan: -> #TITLE1
mhnds -> #TITLE1
hjt˜alaslam -> #TITLE1
hjt˜alaslam˜valmslmyn -> #TITLE1
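The rule `#TITLE1 #FORNAME? #SURNAME -> #PN` operates on tokens that already carry basic tags. A minimal Python sketch of that second stage is given below. It assumes the input is a list of (word, tag) pairs produced by an earlier lexicon lookup; the function name `merge_titled_names` and the tag spellings without the leading `#` are choices made for this example only.

```python
def merge_titled_names(tagged):
    """Collapse TITLE1 FORNAME? SURNAME runs into single PN tokens."""
    out = []
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "TITLE1":
            j += 1
            # The forename is optional in the rule.
            if j < len(tagged) and tagged[j][1] == "FORNAME":
                j += 1
            if j < len(tagged) and tagged[j][1] == "SURNAME":
                words = " ".join(word for word, _ in tagged[i:j + 1])
                out.append((words, "PN"))
                i = j + 1
                continue
        # No match starting here: copy the token through unchanged.
        out.append(tagged[i])
        i += 1
    return out
```

A title that is not followed by a surname is left untouched, so later stages still see the original tags.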

#PLACE_TITLE #PN -> #PLACE

Aramgah -> #PLACE_TITLE
astan -> #PLACE_TITLE
bvstan -> #PLACE_TITLE
park -> #PLACE_TITLE
pl -> #PLACE_TITLE
talar -> #PLACE_TITLE
jadh -> #PLACE_TITLE
Kyaban -> #PLACE_TITLE
rstvran -> #PLACE_TITLE
rvstay -> #PLACE_TITLE
saln -> #PLACE_TITLE
Shr -> #PLACE_TITLE
Shrstan -> #PLACE_TITLE
frhngsray -> #TITLE_ORG
frhngstan #NOUN? -> #PLACE_TITLE
kanvn -> #TITLE_ORG
ktabKanh -> #PLACE_TITLE
kvGh -> #PLACE_TITLE
galry -> #PLACE_TITLE
grvh #NOUN? -> #PLACE_TITLE
mdrsh -> #PLACE_TITLE
mjmveh˜frhngy -> #PLACE_TITLE
mnTqh #NUM? -> #PLACE_TITLE
msafrKanh -> #PLACE_TITLE
mydan -> #PLACE_TITLE
htl -> #PLACE_TITLE

#TITLE_ORG #PN -> #ORG

atHadyh #NOUN? -> #PLACE_TITLE
bymarstan -> #TITLE_ORG
park˜svar -> #TITLE_ORG
danSgah -> #TITLE_ORG
danSgah˜Azad˜aslamy -> #TITLE_ORG
sazman -> #TITLE_ORG
synma -> #TITLE_ORG
Srkt -> #TITLE_ORG
Srkt˜shamy -> #TITLE_ORG
Srkt˜shamy˜KaC -> #TITLE_ORG
Srkt˜shamy˜eam -> #TITLE_ORG
Shrdary -> #TITLE_ORG
kStargah -> #TITLE_ORG
kmpany #ADJECTIVE? -> #TITLE_ORG
mjtme -> #TITLE_ORG
nmaySgah -> #TITLE_ORG

Company

#company_start_word #other* #company_last_word -> #company

6 Finite State Implementation

A list of patterns and heuristics can be constructed for Arabic script languages to help identify acronyms and abbreviations. Some of the general guidelines for creating acronyms in Persian are formalized by the following rules.

Any stop not followed by a blank is not a full stop but part of an abbreviation. Note that fractions in Persian use / and not the dot (see footnote 3).

(D) [A-y]+\.[A-y0-9.]+ -> #ABREVIATION

The most general format for abbreviations is:

(D) ([A-y]\.)+ -> #ABREVIATION

This covers examples like “x. k.”. Elsewhere in the literature, any word ending with a dot (.) can be considered an acronym (e.g. /Alm./ for /Almany/ = German). The last letter should be in non-final form, to distinguish the dot from the end of a sentence. I.e.:

(D) [A-y]+\. -> #ABREVIATION

There are other exceptions, and a one-letter word with no dot can also be considered an acronym (e.g. /m/ for /mylady/), except for /v/. The ranges in the following rule follow the alphabetical order of the transliteration, in which /v/ falls between /n/ and /h/:

(D) [A-nh-y] -> #ABREVIATION

Two-letter words can also be considered acronyms, but they are easily confused with a set of prepositions. Hence the confidence measure is very low, and the rule should be used together with a lexicon:

(U) [A-y][A-y] -> #ABREVIATION

If one or two letters are enclosed in parentheses, they represent an acronym, e.g. /(rh)/ or /(e)/:

(D) \([A-y]?[A-y]\) -> #ABREVIATION

Footnote 3: We have used the formal notation of the Perl language syntax.
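The abbreviation rules of this section translate fairly directly into regular expressions. The Python sketch below is an approximation rather than the Perl code used in the project: the class `[A-y]` is replaced by `[A-Za-z]`, the literal stops are written as `\.`, the single-letter rule excludes /v/ explicitly, and the function name `abbreviation_confidence` is hypothetical. Rules are tried in order and the first full match determines the confidence.

```python
import re

LETTER = "[A-Za-z]"  # simplification of the transliteration class [A-y]

# Confidence labels as defined in Section 5.
CONFIDENCE = {"U": 0.2, "P": 0.5, "V": 0.8, "D": 1.0}

ABBREVIATION_RULES = [
    ("D", re.compile(rf"\({LETTER}?{LETTER}\)")),      # (rh), (e)
    ("D", re.compile(rf"{LETTER}+\.[A-Za-z0-9.]+")),   # stop not followed by blank
    ("D", re.compile(rf"(?:{LETTER}\.)+")),            # x.k.
    ("D", re.compile(rf"{LETTER}+\.")),                # Alm.
    ("D", re.compile(r"[A-Za-uw-z]")),                 # one letter, except v
    ("U", re.compile(rf"{LETTER}{LETTER}")),           # two letters, needs lexicon
]


def abbreviation_confidence(token):
    """Return the confidence that `token` is an abbreviation, or None."""
    for label, pattern in ABBREVIATION_RULES:
        if pattern.fullmatch(token):
            return CONFIDENCE[label]
    return None
```

Because the two-letter rule only carries the (U) confidence, a caller would normally check the lexicon first and fall back on this score for unknown words.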

7 Numbers

Numbers are the least ambiguous of the structural types. The Persian/Arabic numbers can be captured by:

(D) (\+|-)?([0-9]+,?)+(/[0-9]+)? -> #Number

Percents are captured by:

(D) #Number + % -> #Percent
(D) #Number drCd -> #Percent

The non-numerical format of numbers is widely used in newspapers. It combines numerical values with words such as /mylyard/, /mylyvn/ and /hzar/. A grammar for parsing such numbers, including decimals and fractions, can be constructed from the following rules (Sanamrad, 1984):

IX) NUM,ORD,FRC,PRPR - Value

yk -> #NUM0
dv -> #NUM0
sh -> #NUM0
Ghar -> #NUM0
pnj -> #NUM0
SS -> #NUM0
hft -> #NUM0
hSt -> #NUM0
nh -> #NUM0

byst -> #NUM1
sy -> #NUM1
Ghl -> #NUM1
pnjah -> #NUM1
SCt -> #NUM1
hftad -> #NUM1
hStad -> #NUM1
nvd -> #NUM1

dh -> #NUM2
yazdh -> #NUM2
dvazdh -> #NUM2
syzdh -> #NUM2
Ghardh -> #NUM2
panzdh -> #NUM2
Sanzdh -> #NUM2
hfdh -> #NUM2
hyjdh -> #NUM2
nvzdh -> #NUM2

Cd -> #NUM3
ykCd -> #NUM3
dvyst -> #NUM3
syCd -> #NUM3
GharCd -> #NUM3
panCd -> #NUM3
SSCd -> #NUM3
hftCd -> #NUM3
hStCd -> #NUM3
nhCd -> #NUM3

#NUM0 -> #NUM4
#NUM1 -> #NUM4
#NUM2 -> #NUM4
#NUM1 v #NUM0 -> #NUM4
#Number -> #NUM4

#NUM3 v #NUM4 -> #NUM5
#NUM3 -> #NUM5
#NUM4 -> #NUM5
#Number -> #NUM5

#NUM5 hzar v #NUM5 -> #NUM6
hzar v #NUM5 -> #NUM6
#NUM5 hzar -> #NUM6
hzar -> #NUM6
#Number -> #NUM6

#NUM5 mylyvn v #NUM5 -> #NUM7
#NUM5 mylyvn v #NUM6 -> #NUM7
#NUM5 mylyvn -> #NUM7
#NUM6 -> #NUM7
#Number -> #NUM7

Gnd Cd -> #NUM
Gnd hzar -> #NUM
Gnd mylyvn -> #NUM
Gnd mylyard -> #NUM
dh + ha hzar -> #NUM
dh + ha mylyvn -> #NUM
dh + ha mylyard -> #NUM
Cd + ha hzar -> #NUM
Cd + ha mylyvn -> #NUM
Cd + ha mylyard -> #NUM
dh + ha -> #NUM
Cd + ha -> #NUM
hzar + ha -> #NUM
mylyvn + ha -> #NUM
mylyard + ha -> #NUM

#NUM5 mylyard v #NUM5 -> #NUM
#NUM5 mylyard v #NUM7 -> #NUM
#NUM5 mylyard -> #NUM
#Number -> #NUM
Cfr -> #NUM
#NUM5 -> #NUM
#NUM7 -> #NUM

nCf -> #Frc
clc -> #Frc
rbe -> #Frc
nym -> #Frc

Non-numeric fractions and percents can also be formed by the following rules (Sanamrad, 1984).

#Frc1 -> #Frc
#Frc2 -> #Frc
#NUM v #NUM #NUM + m -> #Frc
#NUM #NUM + m -> #Frc
#NUM v nym -> #Frc1
#NUM mmyz #NUM -> #Frc2

mnhay #NUM -> #NUM
mnfy #NUM -> #NUM
belavh #NUM -> #NUM
bazafh #NUM -> #NUM
mnhay #Frc -> #Frc
mnfy #Frc -> #Frc
belavh #Frc -> #Frc
bazafh #Frc -> #Frc

Cd + y -> #PR1
#PR1 #NUM -> #PRPR
#PR1 #Frc1 -> #PRPR
#PR1 #Frc2 -> #PRPR
#NUM dr #NUM -> #PRPR
#Frc dr #NUM -> #PRPR
#Frc1 dr #NUM -> #PRPR
#Frc2 dr #NUM -> #PRPR

8 Dates

Numeric dates such as 1376/1/12 or 1378-1-12 are captured by:

(D) [0-9]+(/|-)[0-9][0-9]?(/|-)[0-9][0-9]? -> #Date
(D) [0-9]+(/|-)[0-9][0-9]?(/|-)[0-9][0-9]? #YearType -> #Date

In Persian texts there are three types of calendars: Persian, Islamic and Gregorian. The following closed class items can be used for tagging dates (see footnote 4).

#MNTH mah -> #MNTH
mah #MNTH -> #MNTH

frvrdyn -> #MNTH
ardybhSt -> #MNTH
Krdad -> #MNTH
tyr -> #MNTH
mrdad -> #MNTH
Shryvr -> #MNTH
mhr -> #MNTH
Aban -> #MNTH
AZr -> #MNTH
dy -> #MNTH
bhmn -> #MNTH
asfnd -> #MNTH

Janvyh -> #MNTH
fvryh -> #MNTH
mars -> #MNTH
avryl -> #MNTH
mh -> #MNTH
Jvin -> #MNTH
jvlay -> #MNTH
Agvst -> #MNTH
sptambr -> #MNTH
aktbr -> #MNTH
nvambr -> #MNTH
dsambr -> #MNTH

mHrm -> #MNTH
Cfr -> #MNTH
rbye˜alavl -> #MNTH
rbye˜alcany -> #MNTH
jmady˜alavl -> #MNTH
jmady˜alcany -> #MNTH
rjb -> #MNTH
Seban -> #MNTH
rmxan -> #MNTH
Sval -> #MNTH
Zyqedh -> #MNTH
ZyHjh -> #MNTH

KvrSydy -> #YearType
hjry KvrSydy -> #YearType
Smsy -> #YearType
hjry Smsy -> #YearType
h.S˜. -> #YearType
h S˜ -> #YearType
hjry qmry -> #YearType
hjry -> #YearType
h -> #YearType
h. -> #YearType
qmry -> #YearType
q˜. -> #YearType
q˜ -> #YearType
h.q˜. -> #YearType
h q˜ -> #YearType
mylady -> #YearType
m˜ -> #YearType
m˜. -> #YearType
qbl az mylad -> #YearType
bed az mylad -> #YearType

sal #NUM #YearType -> YEAR
sal #NUM + y #YearType -> YEAR
sal #NUM -> YEAR

Footnote 4: In this work, all the rules with only one item on the left-hand side have been implemented by listing them in the lexicon and marking them with the appropriate feature corresponding to the right-hand side of the rule.
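The numeric patterns of Sections 7 and 8 and the word-number vocabulary can be illustrated with a short Python sketch. Two simplifications should be kept in mind: `classify_numeric` is a plain regular-expression classifier written for this example, and `word_number_value` is a loose evaluator that adds up word values and multiplies by /hzar/, /mylyvn/ and /mylyard/ without enforcing the ordering constraints of the #NUM0 to #NUM7 rules. Both function names are hypothetical, and the input is assumed to be in Latin transliteration with ASCII digits.

```python
import re

NUMBER = r"[+-]?(?:[0-9]+,?)+(?:/[0-9]+)?"
NUMBER_RE = re.compile(NUMBER)
PERCENT_RE = re.compile(NUMBER + r" ?(?:%|drCd)")
DATE_RE = re.compile(r"[0-9]+[/-][0-9][0-9]?[/-][0-9][0-9]?")


def classify_numeric(text):
    """Tag a numeric string as Date, Percent or Number (else None)."""
    if DATE_RE.fullmatch(text):
        return "Date"
    if PERCENT_RE.fullmatch(text):
        return "Percent"
    if NUMBER_RE.fullmatch(text):
        return "Number"
    return None


# Values of the closed-class number words (#NUM0 to #NUM3, plus Cfr = zero).
SMALL = {
    "Cfr": 0, "yk": 1, "dv": 2, "sh": 3, "Ghar": 4, "pnj": 5, "SS": 6,
    "hft": 7, "hSt": 8, "nh": 9, "dh": 10, "yazdh": 11, "dvazdh": 12,
    "syzdh": 13, "Ghardh": 14, "panzdh": 15, "Sanzdh": 16, "hfdh": 17,
    "hyjdh": 18, "nvzdh": 19, "byst": 20, "sy": 30, "Ghl": 40,
    "pnjah": 50, "SCt": 60, "hftad": 70, "hStad": 80, "nvd": 90,
    "Cd": 100, "ykCd": 100, "dvyst": 200, "syCd": 300, "GharCd": 400,
    "panCd": 500, "SSCd": 600, "hftCd": 700, "hStCd": 800, "nhCd": 900,
}
SCALES = {"hzar": 10**3, "mylyvn": 10**6, "mylyard": 10**9}


def word_number_value(phrase):
    """Loosely evaluate a word number such as 'byst v pnj' (else None)."""
    words = [w for w in phrase.split() if w != "v"]  # "v" is the conjunction
    if not words:
        return None
    total = group = 0
    for word in words:
        if word in SCALES:
            # A bare scale word such as "hzar" counts as one thousand.
            total += (group or 1) * SCALES[word]
            group = 0
        elif word in SMALL:
            group += SMALL[word]
        else:
            return None
    return total + group
```

The date pattern is tested before the number pattern because both use the slash. A stricter implementation would follow the grammar level by level and reject ill-formed sequences that this evaluator accepts.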

9 Conclusion

A pipeline of Perl programs has been constructed for tokenizing times, numbers, acronyms and abbreviations in Persian texts. Analogous modules can be constructed for other Arabic script languages when developing NLP systems for them. The multiple outputs of tokenization can later be disambiguated by the morphology, lexicon and parsing modules. The output of the tokenizer will then be used by the parsing system (see footnote 5). The system can also be used in natural language processing projects where no dictionary is available.

Footnote 5: See (Sanamrad and Matsumoto, 1985) and (Rezaei, 1999) for a survey of implemented parsers for Persian.

References

M. Arbabi, S. M. Fischthal, V. C. Cheng, and E. Bart. 1994. Algorithms for Arabic name transliteration. IBM Journal of Research and Development, 38(2):183–193.

Gerard Gautier. 1998. Building a Kurdish Language Corpus: An Overview of the Technical Problems. In Proceedings of the 6th International Conference and Exhibition on Multi-lingual Computing (ICEMCO98), Cambridge.

Bonnie Glover and Kevin Knight. 1998. Translating names and technical terms in Arabic text.

Gholamhossein Karimi-Doostan. 1997. Light Verb Constructions in Persian. Ph.D. thesis, University of Essex.

Siamak Rezaei. 1998. Persian tokenization. Computing Research Laboratory, New Mexico State University, May.

Siamak Rezaei. 1999. Linguistic and Computational Analysis of Word Order and Scrambling in Persian. Ph.D. thesis, Division of Informatics, University of Edinburgh, February.

Mohammad A. Sanamrad and Hauya Matsumoto. 1985. PERSIS: A Natural-Language Analyzer for Persian. Journal of Information Processing, 8(4):271–279.

Mohammad A. Sanamrad. 1984. Machine Processing and Understanding of Spoken and Written Natural Language. Ph.D. thesis, Kobe University, December.