First International Conference on Persian Script and Language Processing, Shahrivar 1391, Faculty of Electrical and Computer Engineering, Semnan University
An Introduction to Noor Corpus and its Language Model

Mohammad Hossein Elahimanesh
Islamic Azad University, Qazvin Branch, Qazvin, Iran
Computer Research Center of Islamic Sciences, Qom, Iran
[email protected]

Behrouz Minaei-Bidgoli
Iran University of Science and Technology, Tehran, Iran
Computer Research Center of Islamic Sciences, Qom, Iran
[email protected]

Mohammad Javad Gholami
Computer Research Center of Islamic Sciences, Qom, Iran
mjgholami@noornet.net

Hossein Juzi
Computer Research Center of Islamic Sciences, Qom, Iran
[email protected]
Abstract— In linguistics, a text corpus is a large collection of text documents. Text corpora are used to extract the hidden regularities of languages. As one application of such statistical research, language models are built for use in information retrieval. In this paper we introduce one of the largest text corpora in Islamic sciences, called the Noor Corpus, and then present the language model of this corpus. The Noor Corpus is the result of a decade of effort by theological researchers and computer engineers at the Computer Research Center of Islamic Sciences (CRCIS). The corpus includes thousands of Islamic books classified into different categories. Most of the texts are in Arabic and Persian: there are about 1.2 billion Arabic words and 616 million Persian words. The bigram language models of this corpus contain about 80 million distinct bigrams in Arabic and 44 million distinct bigrams in Persian.
Keywords: Islamic Corpus; Language Model; Natural Language Processing
I. INTRODUCTION
The rapid growth of textual information in the world has produced a huge amount of information that is very difficult to manage and control. To address this problem, experts in text mining and information retrieval have developed various techniques. One of these techniques is the use of language models, which is part of information retrieval science [1]. A large fraction of the world's textual information resources consists of religious resources, and several companies and research institutes develop these resources.
Digital libraries, such as the Maktaba Shamila library¹ or the CRCIS's Noor library², are among these resources. This paper introduces one of the largest Islamic datasets, prepared by the CRCIS. This dataset, known as the Noor corpus, covers different fields of Islamic science. The secondary purpose of this paper is to build the language model of this corpus for information retrieval purposes. The rest of the paper is organized as follows: in section 2, corpora similar to the Noor corpus are reviewed. In section 3, we define the n-gram language model. Sections 4 and 5 present the Noor corpus statistics along with the results of the language model based on this corpus. Conclusions and future work are presented in section 6.

II. RELATED WORKS
Many previous corpora in Persian and Arabic consist of newswire text acquired from Persian and Arabic news sources. "Arabic Gigaword", whose latest edition is the "Arabic Gigaword Fifth Edition", is an example of this type of corpus. It is a large archive of newswire text prepared at the University of Pennsylvania by the Linguistic Data Consortium (LDC), under catalog number LDC2011T11 [4]. In Persian, an instance of such corpora is "Hamshahri". Darrudi et al. have described this corpus, which has 63 million words (with an average word length of 3.97 characters) [2]. One of the largest collections of early religious texts can be found in Maktaba Shamila. This program contains more than 2,500 Islamic books, presented as a digital library for researchers. Other similar resources can be found in the CRCIS's software programs. Although these resources are very large, few of them have been used for text mining applications. One example of such an application is part-of-speech tagging of the Holy Quran. Elhadj described a statistical part-of-speech tagger based on the Hidden Markov Model for this task [9]. Experimental results of his approach have shown a recognition rate of about 96% on this dataset. Other examples can be found in the Al-Hadith classification literature [3, 5, 6].

This research was supported by CRCIS.
¹ http://shamela.ws
² http://www.noorlib.ir

III. STATISTICAL LANGUAGE MODEL
A statistical language model is simply a probability distribution over all possible sentences³ [7]. Language models are used in many Natural Language Processing (NLP) applications, such as part-of-speech (POS) tagging, speech recognition and word prediction. There are different types of language models; two popular ones are the unigram and the bigram language models. The bigram language model is the most frequently used model in the previous literature [1].

³ Or spoken utterances, documents, or any linguistic unit.

A. Unigram Language Model
The unigram language model defines the probability of a word sequence $W = w_1 w_2 \dots w_n$ as the product of the unconditional probabilities of its words. $P(W)$ is computed according to (1):

$$P(W) = \prod_{i=1}^{n} P(w_i) \tag{1}$$

According to this equation, $P(w_i)$ is equal to the ratio of the number of occurrences of the word $w_i$ to the total number of words in the corpus. A problem that always appears in language models is that a specific word may never have occurred in the corpus. One usual way to smooth such words is to set their occurrence count to one. We used this smoothing technique for unknown words.

B. Bigram Language Model
In the bigram language model, the probability $P(W)$ is defined as the product of the conditional probabilities of the words $w_i$. The conditional probability for each $w_i$ is the probability of $w_i$ given $w_{i-1}$. Equation (2) shows how to calculate $P(W)$:

$$P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \tag{2}$$

To find $P(w_i \mid w_{i-1})$ we follow (3):

$$P(w_i \mid w_{i-1}) = \frac{f(w_{i-1} w_i)}{f(w_{i-1})} \tag{3}$$

The function $f$ represents the frequency of its input string in the corpus. In this type of model, in addition to the smoothing mentioned in the previous part, we need another kind of smoothing for unknown bigrams $w_{i-1} w_i$. Unknown bigrams are those that have not been observed in the corpus. In this paper, we have used the linear interpolation technique introduced by Brants [8]. According to this smoothing technique, the smoothed probability $P(w_i \mid w_{i-1})$ is calculated as shown in (4):

$$P(w_i \mid w_{i-1}) = \lambda_1 \hat{P}(w_i) + \lambda_2 \hat{P}(w_i \mid w_{i-1}) \tag{4}$$

where $\hat{P}(w_i)$ and $\hat{P}(w_i \mid w_{i-1})$ are the relative-frequency estimates used in (1) and (3), and $\lambda_1 + \lambda_2 = 1$.
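The estimates described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CRCIS implementation: the whitespace tokenization, the toy English sentence, and the names `train`, `unigram_prob`, `bigram_prob`, `deleted_interpolation` and `interpolated_prob` are hypothetical. Unknown words are given a count of one, as described above. The weights are set by a bigram adaptation of the deleted-interpolation procedure of Brants [8]; how exactly that procedure was carried over to word bigrams is an assumption here.

```python
from collections import Counter


def train(tokens):
    """Count unigrams and adjacent-word bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def unigram_prob(word, unigrams, total):
    """Relative frequency of a word; unseen words get a count of one."""
    return max(unigrams.get(word, 0), 1) / total


def bigram_prob(prev, word, unigrams, bigrams):
    """f(prev word) / f(prev); 0.0 when prev or the pair is unseen."""
    if unigrams.get(prev, 0) == 0:
        return 0.0
    return bigrams.get((prev, word), 0) / unigrams[prev]


def deleted_interpolation(unigrams, bigrams, total):
    """Estimate (lambda1, lambda2); assumed bigram variant of Brants [8]."""
    l1 = l2 = 0.0
    for (prev, word), f in bigrams.items():
        # Remove one occurrence of the bigram and compare both estimates.
        uni = (unigrams[word] - 1) / (total - 1) if total > 1 else 0.0
        bi = (f - 1) / (unigrams[prev] - 1) if unigrams[prev] > 1 else 0.0
        if bi > uni:
            l2 += f  # the bigram estimate wins: credit lambda2
        else:
            l1 += f  # otherwise credit the unigram weight lambda1
    norm = l1 + l2
    return (l1 / norm, l2 / norm) if norm else (1.0, 0.0)


def interpolated_prob(prev, word, unigrams, bigrams, total, lambdas):
    """lambda1 * P(word) + lambda2 * P(word | prev)."""
    l1, l2 = lambdas
    return (l1 * unigram_prob(word, unigrams, total)
            + l2 * bigram_prob(prev, word, unigrams, bigrams))


# Toy data, for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
unigrams, bigrams = train(tokens)
total = len(tokens)
lambdas = deleted_interpolation(unigrams, bigrams, total)
p_seen = interpolated_prob("the", "cat", unigrams, bigrams, total, lambdas)
p_unseen = interpolated_prob("cat", "mat", unigrams, bigrams, total, lambdas)
```

Because the unigram term is never zero, a bigram that was not observed in the training text (such as "cat mat" in the toy data) still receives a non-zero probability, which is the purpose of the interpolation.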
IV. NOOR CORPUS
In this section we introduce the Noor corpus, which is produced by the CRCIS. The CRCIS, in cooperation with researchers in the fields of computer and religious sciences, has made considerable efforts to digitize Islamic textual documents. The most important aim of this center is to gather Islamic texts and present them as desktop programs and websites. These efforts have resulted in many rich libraries in Islamic sciences. In this part, we describe this corpus from different aspects.

A. Statistical reports of corpus
The Noor corpus includes 7,290 books in total; each book has, on average, two volumes. The largest book is "Bihar al-Anwar", consisting of 110 volumes and more than 89 million characters (14 million words). If we count the characters of this corpus, regardless of the ones added during the enrichment process, we get some 8.2 billion characters. Furthermore, the mean length of a book is 1.1 million characters (256,000 words).

B. Corpus's language distribution
The language distribution describes how the textual documents in this corpus are distributed across languages. Table I shows the languages and the number of books, characters and words in each. In this table, the Arabic-Persian books are bilingual and contain text in both Arabic and Persian.
TABLE I. NOOR CORPUS LANGUAGE-BASED STATISTICS

Language         Number of books   Number of characters   Number of words
Arabic           4,329             5,250,940,980          1,179,373,254
Persian          2,679             2,624,192,540          616,668,604
Arabic-Persian   231               253,711,374            59,742,677
Others           19                17,056,170             4,045,220
Total            7,290             8,176,706,536          1,867,118,969

C. Corpus's books classification
The researchers of the CRCIS have classified the books of the Noor corpus on the basis of religious categories. Most categories in this corpus have led to the production of dedicated software in the corresponding field. Table II shows the distribution of books in this corpus among the most frequent categories. The distribution of books differs between Arabic and Persian; therefore, Table III shows the distribution of Arabic books and Table IV the distribution of Persian books among different categories.

TABLE II. NOOR CORPUS STATISTICS ON THE BASIS OF TEXT GENRE

Category                        C(book)   C(char)       C(word)
Reasoning-based jurisprudence   695       914,598,202   202,258,740
Legal theory                    427       475,000,555   101,542,542
Fatwa-based jurisprudence       206       134,536,929   30,649,561
Geography                       158       187,141,263   42,415,480
History                         130       179,934,236   41,254,387
Literature                      127       109,525,636   26,020,057
Itinerary                       117       70,927,900    15,981,599
Nahj al-Balagah                 98        144,821,918   33,394,436

TABLE III. ARABIC NOOR CORPUS STATISTICS ON THE BASIS OF TEXT GENRE

Category                                  C(book)   C(char)       C(word)
Reasoning-based jurisprudence             638       829,267,100   182,198,313
Legal theory                              364       397,092,805   83,412,901
Fatwa-based jurisprudence                 118       88,093,446    19,599,199
A full collection of historical sources   102       122,949,459   28,063,832
Geography of Cities                       89        123,946,653   27,959,309
General translations                      73        204,672,655   46,091,378
Peripateticism                            66        30,825,401    6,921,714
Exegesis                                  52        163,688,506   37,044,997

TABLE IV. PERSIAN NOOR CORPUS STATISTICS ON THE BASIS OF TEXT GENRE

Category                    C(book)   C(char)       C(word)
Literature                  84        52,290,640    12,486,285
Fatwa-based Jurisprudence   84        44,894,915    10,682,714
Hajj                        70        21,110,306    4,901,058
Geography of Cities         69        63,194,610    14,456,171
Itinerary                   69        45,703,025    10,557,877
Legal theory                61        72,972,086    17,017,638
Qajar Dynasty               59        77,108,236    17,185,921
General History             53        132,957,692   30,747,480

D. Distribution of words length
The mean length of Arabic words in the Noor corpus is 3.85 characters, and that of Persian words is 3.55 characters. Figure 1 shows how word lengths in Arabic and Persian are distributed in this corpus. As Figure 1 shows, the most frequent word length is three characters in Arabic and two characters in Persian. Note that all tokens in the corpus except punctuation marks are counted in this experiment.

Figure 1. Distribution of word length in the corpus (x-axis: word length in characters, 1 to 10; y-axis: count x 10,000,000; series: Arabic, Persian).
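The word-length statistics of this subsection can be computed with a short script. The sketch below is an illustration only: the whitespace tokenization, the punctuation set `PUNCT`, the sample sentence and the function name `word_length_distribution` are assumptions, since the paper does not describe its tokenizer. Tokens consisting only of punctuation are skipped, in line with the remark above that punctuation marks are excluded.

```python
from collections import Counter

# Hypothetical punctuation set; includes the Arabic comma, semicolon
# and question mark alongside common Latin marks.
PUNCT = set(".,:;!?()[]{}\"'«»،؛؟")


def word_length_distribution(text):
    """Return (mean length, Counter of length -> count), ignoring punctuation."""
    lengths = Counter()
    for token in text.split():
        word = "".join(ch for ch in token if ch not in PUNCT)
        if word:  # skip tokens that were punctuation only
            lengths[len(word)] += 1
    total_words = sum(lengths.values())
    total_chars = sum(n * c for n, c in lengths.items())
    mean = total_chars / total_words if total_words else 0.0
    return mean, lengths


# Made-up sample text, for illustration only.
mean, lengths = word_length_distribution("قال : في البيت ، و هو من العلم .")
```

Plotting `lengths` for the Arabic and Persian parts of a corpus separately would give the two series of a chart like Figure 1.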
V. NOOR CORPUS'S LANGUAGE MODEL
The production of unigrams and bigrams has so far yielded 2,290,525 distinct unigrams and 79,926,320 distinct bigrams for Arabic; for Persian, these numbers are 1,481,642 and 44,442,394, respectively. Producing the bigrams is one of the most challenging steps in preparing the language model for this corpus. Tables V and VI present the ten most frequent Arabic and Persian unigrams and bigrams.
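Counts of this kind can be accumulated in a single pass over the text. The following sketch shows one possible way to do it; it is not the procedure used by the authors. The regular-expression tokenizer `TOKEN_RE`, the function name `count_ngrams`, the sample lines and the choice not to form bigrams across line boundaries are all assumptions. Punctuation marks are kept as separate tokens because they appear as entries in the frequency tables of this section.

```python
import re
from collections import Counter

# Assumed tokenizer: a run of word characters, or one punctuation mark.
# Python's \w is Unicode-aware, so Arabic and Persian letters match it.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def count_ngrams(lines):
    """Accumulate unigram and bigram counts over an iterable of text lines."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        tokens = TOKEN_RE.findall(line)
        unigrams.update(tokens)
        # Bigrams are formed within a line only (an assumption).
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Made-up sample lines, for illustration only.
lines = ["قال : و هو في البيت ،", "و هو من العلم ."]
unigrams, bigrams = count_ngrams(lines)
top_unigrams = unigrams.most_common(10)   # ten most frequent unigrams
top_bigrams = bigrams.most_common(10)     # ten most frequent bigrams
distinct = (len(unigrams), len(bigrams))  # numbers of distinct n-grams
```

For a corpus of this size the bigram table would not fit comfortably in one in-memory `Counter`; sharding the counts by key or using an on-disk store would be a natural extension, which is consistent with the remark that bigram production is the most demanding step.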
TABLE V. TEN MOST FREQUENT ARABIC UNIGRAMS AND BIGRAMS

Id   Ar-unigram word   Count        Ar-bigram      Count
1    و                 92,695,985   و،             22,862,961
2    ،                 85,659,475   و.             11,278,757
3    .                 39,034,326   : قال          3,480,800
4    :                 34,336,554   و لا           3,301,256
5    في                24,970,025   و هو           2,559,540
6    من                23,030,455   و:             2,425,299
7    بن                13,115,677   الله عليه      2,227,760
8    على               12,170,705   عليه و         2,095,200
9    الله              11,204,098   محمد بن        2,041,946
10   لا                10,957,949   عليه السلام    1,943,612

TABLE VI. TEN MOST FREQUENT PERSIAN UNIGRAMS AND BIGRAMS

Id   Pe-unigram word   Count        Pe-bigram   Count
1    و                 37,594,750   و،          2,641,436
2    ،                 23,428,992   . است       2,148,170
3    .                 21,503,370   و.          1,779,863
4    از                13,292,239   است كه      1,648,044
5    كه                13,151,501   كه در       1,197,099
6    در                12,954,586   و در        1,179,030
7    به                11,755,538   است و       1,140,450
8    را                9,631,841    ، است       1,051,191
9    :                 9,494,427    را به       997,047
10   است               8,967,840    و از        866,822

As discussed in the section on language models, the lambda values used to smooth the language model are calculated with the algorithm offered by Brants [8]. The values obtained from this algorithm for both languages, Arabic and Persian, are given in Table VII. Finally, Table VIII lists the ten most frequent characters for each language.

TABLE VII. LAMBDA QUANTITIES FOR SMOOTHING

     Arabic   Persian
     83.8%    82.4%
     16.2%    17.6%

TABLE VIII. TEN MOST FREQUENT CHARACTERS IN NOOR CORPUS

Id   Arabic-char   Count         Persian-char   Count
1    ا             564,139,203   ا              268,465,150
2    ل             478,263,812   ر              144,880,133
3    م             254,095,481   ن              138,419,184
4    ي             247,568,989   د              130,967,016
5    و             228,604,769   و              123,187,225
6    ن             226,842,132   م              116,182,786
7    ب             166,237,462   ه              111,763,273
8    ه             163,037,900   ي              100,064,001
9    ر             160,472,941   ب              90,631,497
10   ع             144,963,025   ل              78,044,202

VI. CONCLUSION AND FUTURE WORKS
In this paper we have introduced the Noor corpus, which is the largest in comparison with other Islamic textual corpora. The number of books in this corpus is about three times that of the Shamila library. In number of words, the Noor corpus can be compared with corpora of newswire text. This corpus and its language model can provide the information needed for many text analysis tasks such as part-of-speech tagging, word prediction and machine translation. For future work, we can investigate more complex language models for this corpus in order to develop more efficient information retrieval.

REFERENCES
[1] C. Manning, P. Raghavan, H. Schütze, “Introduction to Information Retrieval,” Cambridge University Press, 2009.
[2] E. Darrudi, M. R. Hejazi, F. Oroumchian, “Assessment of a modern Persian corpus,” Proceedings of the 2nd Workshop on Information Technology & its Disciplines (WITID), ITRC, Iran, 2004.
[3] K. Jbara, “Knowledge Discovery in Al-Hadith Using Text Classification Algorithm,” Proceeding of 6th Journal of American Science, 2011.
[4] Linguistic Data Consortium, “The Arabic Gigaword corpus,” (LDC2011T11), 2011.
[5] M. N. Al-Kabi, S. I. Al-Sinjilawi, “A Comparative Study of the Efficiency of Different Measures to Classify Arabic Text,” University of Sharjah Journal of Pure and Applied Sciences, 4(2), pp. 13-26, 2007.
[6] M. N. Al-Kabi, G. Kanaan, R. Al-Shalabi, “Al-Hadith Text Classifier,” Proceeding of 5th Journal of Applied Sciences, pp. 584-587, 2005.
[7] R. Rosenfeld, “Two Decades of Statistical Language Modeling: Where Do We Go From Here?,” Proceedings of the IEEE, 88(8), 2000.
[8] T. Brants, “TnT: A statistical part of speech tagger,” Proceedings of the 6th Conference on Applied Natural Language Processing, Apr. 29-May 04, Association for Computational Linguistics, Morristown, New Jersey, USA, 2000.
[9] Y. O. M. Elhadj, “Statistical Part-Of-Speech Tagger for Traditional Arabic Texts,” Proceeding of 5th Journal of Computer Sciences, V 11, pp. 794-800, 2009.