Automatic Thai Keyword Extraction from Categorized Text Corpus

Choochart Haruechaiyasak, Prapass Srichaivattana, Sarawoot Kongyoung and Chaianun Damrongrat
Information Research and Development Division
National Electronics and Computer Technology Center
112 Thailand Science Park, Phahon Yothin Rd., Klong 1, Klong Luang, Pathumthani 12120, Thailand
Tel. (0)2-564-6900 ext. 2260  Fax. (0)2-564-6873  Email: [email protected]

ABSTRACT

Information Extraction (IE) is the process of discovering implicit and potentially important keywords underlying an unstructured natural-language text corpus. Most previously proposed solutions to IE construct a set of words from the given text corpus during a preprocessing step. Because written Thai does not explicitly use any word-delimiting characters, identifying individual words, i.e., word segmentation, is a challenging task and has become one of the important research topics in Natural Language Processing (NLP). In this paper, an alternative to word segmentation for extracting important keywords from a categorized text corpus is proposed. The approach is based on the analysis of frequent substring sets and, as a result, is language-independent, i.e., it does not rely on any dictionary or grammatical knowledge of the language. We refer to this method as Automatic Categorized Keyword Extraction (ACKE). Applying the ACKE algorithm to a text corpus yields sets of keywords which are highly distinct between the different categories of the given text corpus.

Keywords: Text Mining, Information Extraction, Mining Sequential Patterns, Substring Pattern Analysis, Natural Language Processing.

1. INTRODUCTION

Text mining is the application of data mining to text processing and Information Retrieval (IR). It describes a process of discovering useful information or knowledge from an unstructured text corpus [1, 2]. One important task in text mining is Information Extraction (IE), the process of discovering specific pieces of information from a corpus of natural-language texts [3, 4]. Most previously proposed algorithms for IE model textual documents as a set of words [3-5], i.e., they work at the word level. These algorithms are suitable for Latin-based languages, such as English and Spanish, in which words are delimited by special characters such as the period (.), the comma (,) and the space. These languages are often referred to as segmented languages.

However, the word-level approach cannot be directly applied to languages that do not explicitly use any word-delimiting characters, such as Chinese, Japanese, Korean and Thai. These languages are referred to as non-segmented languages. Identifying individual words in non-segmented languages, i.e., word segmentation, has been one of the most widely studied research topics in Natural Language Processing (NLP). For written Thai, many word segmentation algorithms are available, but none of them yields perfect results due to the ambiguity in language usage [6-8]. In addition, most word segmentation algorithms are language-dependent, i.e., they rely on a dictionary and the syntax of one particular language.

In this paper, we propose an alternative, language-independent solution to IE. The approach constructs a set of frequent substrings by using the technique of mining sequential patterns [9]. Mining sequential patterns is an extension of association rule mining that imposes ordering constraints on the process of constructing the rule set [10]. The original algorithm for mining sequential patterns was designed for well-structured databases of customer purchasing transactions; applying it could reveal interesting customer buying patterns as time-ordered sequences. Here, the method for mining sequential patterns is adapted to the context of substring pattern generation. In addition, the suffix array data structure is used to construct character indexes from the texts in order to increase the efficiency of string processing and manipulation.
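As an aside on this indexing step, the short Python sketch below (our own illustration, not code from the paper) shows one simple way a suffix array over a document's characters can be built and then queried with binary search to test whether a candidate substring occurs in that document. The naive sorted-suffixes construction is assumed only for clarity; more efficient constructions exist.

    def build_suffix_array(text):
        # Starting positions of all suffixes, sorted lexicographically.
        # Naive O(n^2 log n) construction, kept simple for illustration.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def contains(text, suffix_array, pattern):
        # Binary search for the first suffix >= pattern; the pattern occurs
        # in `text` iff that suffix starts with the pattern.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[suffix_array[mid]:] < pattern:
                lo = mid + 1
            else:
                hi = mid
        return lo < len(suffix_array) and text[suffix_array[lo]:].startswith(pattern)

    # Toy usage: index one document, then query candidate substrings.
    doc = "abcabd"
    sa = build_suffix_array(doc)
    print(contains(doc, sa, "cab"))   # True
    print(contains(doc, sa, "ad"))    # False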

In particular, we consider the IE problem of automatically extracting important keywords from a categorized text corpus. We refer to this process as Automatic Categorized Keyword Extraction (ACKE). The resulting sets of categorized keywords help identify important concepts underlying each category. For example, a set of keywords {movie, music, show, television, actor, singer} would likely be extracted from the category Entertainment.

The rest of this paper is organized as follows. In the next section, we review related work on substring pattern generation and analysis. In Section 3, the approach for generating frequent substring patterns based on the method of mining sequential patterns is described. In Section 4, the experiments and some results are presented. The paper concludes in Section 5.

2. RELATED WORK

An efficient method for computing term frequency and document frequency for all substrings in a text corpus was proposed in [11]. The algorithm can generate arbitrarily long n-grams by processing the suffix arrays of the corpus. This method is applied to the whole corpus in order to generate all substrings, without particularly focusing on the IE problem. In [12], a study of language-independent string pattern analysis was presented. Similar to our work, their study considers the generation of substrings from text corpora of non-segmented languages; however, their work focuses on NLP issues such as morphology and Part-of-Speech (POS) tagging. In [13], an efficient algorithm for discovering optimal string patterns was proposed, but it is based on mining association rules in order to find the proximity of segmented words in given texts. Our work is based on mining sequential patterns in order to generate important substring patterns from given texts; therefore, our approach imposes an ordering constraint on characters and works at the substring (series of characters) level. In [14], a method for automatically extracting open compounds from text corpora was proposed. Our work is similar in that it is based on detecting significant changes in the co-occurrences of substrings. However, their method considers open compounds based on co-occurrences over the whole text corpus, whose categories are undefined, whereas our work extracts keywords from a categorized text corpus based on detecting significant distributional changes in the occurrences of substrings, and information entropy is used to select only those keywords which are highly distinct across categories.

3. AUTOMATIC CATEGORIZED KEYWORD EXTRACTION

The Automatic Categorized Keyword Extraction (ACKE) algorithm is a process of discovering important keywords from a categorized text corpus. The algorithm is based on the method for mining sequential patterns originally proposed in [9]. ACKE is composed of two main steps: (1) generating frequent substring patterns that satisfy certain constraints, which we refer to as substring generation, and (2) merging those frequent substrings into keywords, which we refer to as substring merging.

The details of the algorithm are given as follows.

Problem statement: Let A = {a_1, a_2, …, a_m} be a finite set of alphabets or characters in a particular language. Given a predefined set of categories C = {C_1, C_2, …, C_n} and a set of categorized text collections T = {T_1, T_2, …, T_n}, where T_i contains a set of textual documents {t_i1, t_i2, …, t_i|T_i|}, t_ij ∈ C_i and |T_i| is the total number of documents belonging to C_i, the ACKE algorithm produces a set of categorized substrings S = {S_1, S_2, …, S_n}, where S_i contains a set of substrings {s_i1, s_i2, …, s_i|S_i|}, s_ij ∈ C_i and |S_i| is the total number of substrings belonging to C_i. Each s_ij is a substring composed of a series of characters drawn from A and represents a keyword identifying important concepts underlying category C_i. We call a substring s_ij of length L an L-charset, i.e., an L-gram in the NLP context. For example, the substring "aabbc" is a 5-charset.

To be included in the set of categorized substrings S, each substring s_ij must satisfy two specified conditional parameters: a minimum support (min-sup) and a maximum total information entropy (max-total-ent). The support of substring s_ij under C_i, denoted by sup_i(s_ij), is defined as the number of documents belonging to C_i which contain s_ij divided by the total number of documents in C_i, i.e., |T_i|. Considering a substring s_ij under the category set C = {C_1, C_2, …, C_n}, we can form its support-value set:

  sup(s_ij) = {sup_1(s_ij), sup_2(s_ij), …, sup_n(s_ij)}

The probability-value set for s_ij is constructed by normalizing the support values over all categories:

  prob_i(s_ij) = sup_i(s_ij) / Σ_{1≤k≤n} sup_k(s_ij)

The entropy-value set for s_ij is calculated with the following formula:

  ent_i(s_ij) = − prob_i(s_ij) ∗ log(prob_i(s_ij))

The total information entropy (total-ent) is the sum of the information entropy over all categories:

  total-ent(s_ij) = Σ_{1≤i≤n} ent_i(s_ij)

Suppose min-sup and max-total-ent are set to 10% and 2.5, respectively. Then only those substrings which appear in at least 10% of the documents within some category and have a total information entropy of no more than 2.5 will be output by the process. Thus, min-sup removes infrequently occurring substrings, whereas max-total-ent selects only those substrings which are highly distinct for a certain category.
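As a concrete check on these definitions, the following Python sketch (our own illustration, not code from the paper) computes the per-category support, the normalized probability-value set and the total information entropy of a single candidate substring, and then applies the two thresholds. Plain substring tests stand in for the suffix-array lookups used in the paper, and the natural logarithm is assumed since the formula above leaves the base unspecified.

    import math

    def passes_constraints(substring, corpus, min_sup=0.10, max_total_ent=2.5):
        # corpus maps each category name C_i to its list of documents T_i.
        # sup_i(s): fraction of documents in C_i that contain the substring.
        sup = {cat: sum(substring in doc for doc in docs) / len(docs)
               for cat, docs in corpus.items()}
        total_sup = sum(sup.values())
        if total_sup == 0.0:
            return False                      # the substring never occurs
        # prob_i(s): support values normalized over all categories.
        prob = {cat: s / total_sup for cat, s in sup.items()}
        # total-ent(s) = sum over categories of -prob_i * log(prob_i).
        total_ent = -sum(p * math.log(p) for p in prob.values() if p > 0.0)
        return max(sup.values()) >= min_sup and total_ent <= max_total_ent

    # Toy example with two categories: a substring concentrated in one
    # category has high support there and low entropy, so it passes.
    corpus = {"Politics":      ["election law", "election fraud", "party leader"],
              "Entertainment": ["new movie", "pop singer", "tv show"]}
    print(passes_constraints("election", corpus))   # True  (sup = 2/3 in Politics, total-ent = 0)
    print(passes_constraints("zzz", corpus))        # False (never occurs)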

The ACKE substring-generation algorithm is given below. Note that, in order to increase the efficiency of string searching, a suffix array is used to construct character indexes for each document.

Algorithm: ACKE (substring generation)
Input:  a set of characters, A
        a set of documents with suffix arrays, T
        a pre-specified minimum support value, min-sup
        a pre-specified maximum total information entropy, max-total-ent
Output: substring set L, initially ∅

L = A;                              // put single characters as the initial substring set
while L ≠ ∅ do begin                // make a pass over the suffix arrays
    let candidate set R = ∅;
    forall elements a in A do
        append each substring in L extended by a into R;
    forall texts t in T do
        if t contains element r (from R) and t ∈ C_i then
            count_i(r) = count_i(r) + 1;
    end
    // calculating support
    forall categories in C do
        sup_i(r) = count_i(r) / |T_i|;
        total-sup(r) += sup_i(r);
    end
    // calculating entropy and checking min-sup and max-total-ent
    forall categories in C do
        prob_i(r) = sup_i(r) / total-sup(r);
        ent_i(r) = − prob_i(r) ∗ log(prob_i(r));
        total-ent(r) += ent_i(r);
        if (sup_i(r) ≥ min-sup) and (total-ent(r) ≤ max-total-ent) then
            L = L + r;
    end
end

After the potential substring set L has been generated, the substring merging process is applied in order to integrate substrings which are potentially parts of the same keyword. To merge potential substrings, the probability distributions of the substrings under the different categories are compared: only those substrings which contain some overlapping series of characters and have matching probability distributions are merged together to form a keyword. The probability distribution refers to the probability-value set, i.e., prob(s_ij) = {prob_1(s_ij), prob_2(s_ij), …, prob_n(s_ij)}, obtained by normalizing the support-value set of a substring. For example, suppose a keyword is known to be "abcdef" and two substrings, "abcd" and "bcdef", are generated by the ACKE process. Since these two substrings share the overlapping partial substring "bcd", the probability distribution of "abcd", i.e., prob("abcd"), is compared with that of "bcdef", i.e., prob("bcdef"). If prob("abcd") exactly matches prob("bcdef"), the two substrings are merged to form the keyword "abcdef".
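To illustrate the merging test, here is a small Python sketch (again our own illustration, not the paper's code) that merges two substrings when a suffix of one overlaps a prefix of the other and their probability-value sets match. Exact equality of the distributions is assumed, as in the example above, although a small tolerance could be used in practice.

    def try_merge(s1, s2, prob1, prob2, min_overlap=1):
        # Merge two substrings into one keyword when they share an overlapping
        # series of characters and their probability distributions match.
        if prob1 != prob2:                          # matching probability-value sets required
            return None
        # Try both orders, preferring the longest overlap of a suffix of `a`
        # with a prefix of `b` (full containment is not considered here).
        for a, b in ((s1, s2), (s2, s1)):
            for k in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
                if a[-k:] == b[:k]:
                    return a + b[k:]                # e.g. "abcd" + "bcdef" -> "abcdef"
        return None

    # The paper's example: "abcd" and "bcdef" overlap on "bcd" and have
    # identical probability-value sets, so they merge into "abcdef".
    p = {"C1": 0.8, "C2": 0.2}
    print(try_merge("abcd", "bcdef", p, p))         # -> "abcdef"
    print(try_merge("abcd", "xyz", p, p))           # -> None (no overlap)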

4. EXPERIMENTAL RESULTS

In this section, an experiment is performed to extract and construct a keyword set based on the ACKE algorithm. The data set is composed of approximately 11,000 Thai newspaper articles collected from the World Wide Web in 2003. Table 1 shows the predefined categories with the assigned IDs.

Table 1: Predefined category set with assigned IDs for the newspaper collection

ID  Category
0   Politics
1   Economy
2   Education
3   Bangkok
4   International
5   Social
6   Technology
7   Agriculture
8   Entertainment
9   Sports

A sample list of keywords under the different categories is shown in Table 2. These keywords help identify important concepts in the different categories. Due to the space limit of this paper, further experiments to qualitatively evaluate the list of extracted keywords will be included as part of our future work.

5. CONCLUSION

A new approach to language-independent Information Extraction (IE) was proposed. The approach provides an alternative to word segmentation algorithms for extracting keywords from a natural-language text corpus. The algorithm, called Automatic Categorized Keyword Extraction (ACKE), identifies important keywords from a categorized text corpus and consists of two main processes: substring generation and substring merging. Substring generation is based on the method of mining sequential patterns with two conditional parameters, the minimum support (min-sup) and the maximum total entropy (max-total-ent). Specifying min-sup prunes infrequently occurring substrings, keeping only the highly occurring ones, and therefore reduces the execution time; specifying max-total-ent screens for keywords that are highly distinct across the different categories. Substring merging is then applied to the substrings produced by the substring generation process: the probability distributions of the substrings under the different categories are compared, and only those substrings which contain some overlapping series of characters and have matching probability distributions are merged together to form a keyword. Experiments were performed on Thai newspaper articles. Applying the ACKE algorithm to this news corpus generated a set of keywords which can be used to identify the key concepts of each category.

Table 2: A sample of the sorted lists of categorized keywords

Politics: พรรคไทยรักไทย, พรรคประชาธิปัตย์, หัวหน้าพรรค, การทุจริต, รัฐธรรมนูญ, การเลือกตั้ง
Economy: กระทรวงการคลัง, ดอกเบี้ย, กรรมการผู้จัดการ, การลงทุน, การส่งออก, ธนาคาร
Education: การศึกษาขั้นพื้นฐาน, คณะกรรมการการศึกษา, กระทรวงศึกษาธิการ, อุดมศึกษา, สถานศึกษา, ประถม
Bangkok: เดินรถ, ทางด่วน, การจราจร, กรุงเทพมหานคร, การขนส่ง, การก่อสร้าง
International: กรุงแบกแดด, สำนักข่าวต่างประเทศ, กองกำลัง, นิวเคลียร์, ประธานาธิบดี, สหประชาชาติ
Social: เสด็จ, พระองค์, สมเด็จ, สไตล์, ความรัก, แต่งงาน
Technology: นักวิทยาศาสตร์, นักวิจัย, ศึกษาวิจัย, ของมหาวิทยาลัย, ของสหรัฐฯ, วิทยาศาสตร์
Agriculture: สายพันธุ์, เกษตรกร, แปรรูป, การเกษตร, การปลูก, กระทรวงเกษตร
Entertainment: ในละคร, นางเอก, ทางช่อง, พระเอก, รับบท, แฟน
Sports: แมตช์, ผู้จัดการทีม, ลูกหนัง, นักเตะ, ทีมชาติไทย, ชิงชนะเลิศ

6. REFERENCES

[1] M. A. Hearst, "Untangling text data mining," Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3-10, 1999.
[2] A.-H. Tan, "Text mining: The state of the art and the challenges," Proceedings of the PAKDD Workshop on Knowledge Discovery from Advanced Databases, pp. 65-70, 1999.
[3] U. Nahm and R. Mooney, "Text mining with information extraction," Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, pp. 60-67, 2002.
[4] S. Soderland, "Learning information extraction rules for semi-structured and free text," Machine Learning, 34(1-3), pp. 233-272, 1999.
[5] R. Feldman et al., "Text mining at the term level," Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, pp. 65-73, Sept. 1998.
[6] P. Charoenpornsawat, B. Kijsirikul, and S. Meknavin, "Feature-based Thai unknown word boundary identification using Winnow," Proceedings of the 1998 IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS'98), Thailand, 1998.
[7] A. Kawtrakul et al., "Automatic Thai unknown word recognition," Proceedings of the Natural Language Processing Pacific Rim Symposium, pp. 341-348, Thailand, 1997.
[8] V. Sornlertlamvanich, T. Potipiti, and T. Charoenporn, "Automatic corpus-based Thai word extraction with the C4.5 learning algorithm," Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pp. 802-807, 2000.
[9] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," Proceedings of the Fifth International Conference on Extending Database Technology, pp. 3-17, 1996.
[10] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.
[11] M. Yamamoto and K. W. Church, "Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus," Computational Linguistics, 27(1), pp. 1-30, 2001.
[12] T. Yamashita and Y. Matsumoto, "Language independent morphological analysis," Proceedings of the Sixth Applied Natural Language Processing Conference, pp. 232-238, April 2000.
[13] H. Arimura, A. Wataki, R. Fujino, and S. Arikawa, "A fast algorithm for discovering optimal string patterns in large text databases," Proceedings of the 9th International Workshop on Algorithmic Learning Theory, pp. 247-261, 1998.
[14] V. Sornlertlamvanich and H. Tanaka, "The automatic extraction of open compounds from text corpora," Proceedings of the International Conference on Computational Linguistics (COLING'96), pp. 1143-1146, 1996.