
Exploiting Reference Section to Classify Paper's Topics

Naseer Ahmed Sajid, Tariq Ali, Muhammad Tanvir Afzal, Munir Ahmad
Center for Distributed and Semantic Computing (CDSC), Faculty of Engineering and Applied Sciences, Mohammad Ali Jinnah University, Islamabad, Pakistan

[email protected] [email protected] [email protected] [email protected]

ABSTRACT
Classification is an important task in data mining: it organizes data into the relevant nodes of a taxonomy. In the scientific domain, classifying documents into predefined categories is an important research problem that supports a number of tasks such as information retrieval, expert finding, and recommender systems. In Computer Science, the ACM categorization system is commonly used to organize research papers in the topical hierarchy defined by the ACM. Accurately assigning a research paper to a predefined category (ACM topic) is difficult, especially when the paper belongs to multiple topics. In the past, different approaches have been applied to discover the topics of a paper, such as content-based, metadata-based, and semantic analysis. In this paper, we instead exploit the reference section of a research paper to discover its topics. The underlying assumption is that, in most cases, an author cites papers belonging to the same or a similar category. We evaluated our technique on a dataset from the Journal of Universal Computer Science (J.UCS). Our system collected 1460 documents from J.UCS along with their predefined topics, assigned by the authors and verified by the journal's administration. The system used 1010 documents as the training set, extracting their references and grouping them into Topic-Reference pairs, TR {Topic, Reference}. The system was then tested on the remaining 450 documents: the references of each test paper are parsed and matched against the TR {Topic, Reference} pairs, the corresponding list of matched topics is collected, and multiple weights are assigned during the matching process. The system predicted the first-level ACM topic (topics A to K) with 70% accuracy.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Indexing Methods; H.3.3 [Information Search and Retrieval]: Information Filtering, Search Process; H.3.7 [Digital Libraries]: System Issues, Standards.

General Terms
Algorithms, Management, Design, Experimentation

Keywords
Document classification, Scientific document classification, Citations, Multi-label classification, ACM classification

1. INTRODUCTION
The research community has been producing a large number of scientific documents; it is sometimes said that this information doubles every five years. These documents are made searchable over the Internet through search engines, digital libraries, and citation indexes. A system can answer a user's query only if it can accurately classify the indexed documents. Classification is the process by which a system organizes indexed knowledge resources into a taxonomy. The classification task is useful for a number of systems, such as information retrieval, trend analysis, expert finding, and recommender systems. An accurate classification system can further assist authors of research papers during the paper submission process. Classifying research papers is made harder by the fact that a paper may belong to multiple topics, and accurate categorization supports relevant information retrieval. Most previous systems automated classification by assigning exactly one category to each research paper. However, a research paper may belong to more than one category, so we developed our system to support multiple categorizations for a single document; a paper may belong to one or more categories, as shown in Figure 1. The ACM Computing Classification System (ACM CCS) [9] has three levels. Level 1 comprises topics A to K; at the second level, any topic from A to K may have subtopics A.1, A.2, ..., A.m; and at the third level, subtopics A.1.1, A.1.2, ..., A.1.m. Our current work is limited to discovering the first level of the taxonomy; however, the same approach can be applied to the second and third levels. Classification is a two-step process: first a model is generated from a training set, and then the model is used to predict future instances. In the literature, document classification is based on some or all of the text selected from a document.
Different techniques use different classifiers to assign a new document to an existing category. Most existing approaches assign an instance to exactly one class [3, 12, 13, 15, 16], largely because they target generic text classification. To our knowledge, only [3, 12] address classification of scientific publications, and most approaches omit multi-class classification. Document classification is the process of assigning natural-language texts to predefined categories [1]. Structured documents yield higher classification accuracy than unstructured documents [3]. Besides the advantage of structured text, there also exists a strong relationship between documents in the form of citations; Garfield highlighted this citation-based relationship [18]. In our approach, we evaluate the citation relationship between documents for category identification. This work is an investigation of exploiting the reference section for classification; the accuracy of the proposed approach can be increased by adding more information (metadata) about the scientific publications. The references of each document are retrieved and stored in a database.

Figure 1. Classifying a new document into the ACM hierarchy

It is commonly observed that an author mostly cites papers belonging to the same category; based on this heuristic, we built our approach around evaluating the reference section for classification. The problem we address in this paper is assigning a new document (research paper) to a set of previously fixed categories (the ACM hierarchy), as depicted in Figure 1. The predefined categories can be of any type; in computer science, the most commonly used categorization is the ACM CCS [9], and we use its first level for classification. The formal representation of the problem is given in Definition 1.

Definition 1: Let D be a set of documents and C a set of predefined categories. The problem is to classify a document d_t of type D into one or more categories c_i..j belonging to C; a document may belong to multiple categories. Formally:

D = {d_1, d_2, ..., d_n}
C = {c_1, c_2, ..., c_m}
d_t ∈ D  →  {c_i, ..., c_j} ⊆ C          Eq (1)
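Definition 1 amounts to a mapping from documents to non-empty subsets of the category set. A minimal sketch, with illustrative names not taken from the paper:

```python
# Multi-label classification per Definition 1: a document may map to
# several categories drawn from the fixed set C (illustrative sketch).
D = {"d1", "d2", "d3"}            # documents
C = {"A", "B", "C", "D"}          # predefined categories (ACM level 1)

# A classification assigns each document a non-empty subset of C.
assignment = {
    "d1": {"A"},
    "d2": {"A", "D"},             # multi-label: two categories
    "d3": {"B"},
}

def is_valid(assignment, D, C):
    """Every classified document is in D and gets at least one
    category, all drawn from C."""
    return all(d in D and cats and cats <= C
               for d, cats in assignment.items())
```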

We initially performed our experiments on the Journal of Universal Computer Science (J.UCS) dataset; the approach can further be tested on scientific documents from other datasets. To classify a test document, we extract its references, authors, and category information. The references of all documents are already stored in a database, and the extracted references are matched against the stored ones. The J.UCS dataset contains 1460 papers along with all their references. The overall accuracy of our reference-based classification is 70%; it can be increased by incorporating more information about a publication, such as keywords, title, abstract, or even full text. The rest of the paper is organized as follows: Section 2 reviews related approaches to text classification, Section 3 presents the framework and working of our proposed approach, Section 4 reports experimental results on the J.UCS dataset, and Section 5 concludes the paper and provides future directions.

2. LITERATURE REVIEW
Many document classification techniques have been proposed using machine learning, Bayesian networks, Support Vector Machines (SVM), K-Nearest Neighbour (KNN), and Particle Swarm Optimization (PSO). Among these, very few approaches [3, 12] target the classification of scientific papers. A survey of document classification in the context of supervised learning is given in [2]. It analyzes classification in both structured and unstructured settings; the published literature [3] strongly argues that structured documents give higher classification accuracy than unstructured ones. The accuracy and performance of a document classifier depend not only on the machine learning (training) but also on the format, type, and size of the data, the scoring (weighting) methodology used, and the threshold applied for quality extraction of the contents. The survey provides a theoretical comparison of different techniques, but accuracy and efficiency figures for the classifiers are missing. A PSO-based classification of web documents is given in [4]. The documents are first preprocessed: stop words are removed using a list of 548 stop words [5], and word stemming is performed using the well-known Porter stemmer [6]. After preprocessing, the documents are represented as a document-term frequency matrix and finally as term vectors, using the TF-IDF weighting approach given in [7]. Feature selection is done through an entropy weighting scheme [8], combining the local weighting of term k and its global weighting as L_jk × G_k. After feature selection, PSO is used as the classifier. Individual particles are initialized randomly; the structure of each particle at a given iteration is represented as in [4].
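The preprocessing pipeline described for [4] (stop-word removal, then term vectors with TF-IDF weighting) can be sketched in pure Python. The stop-word list below is a tiny illustrative stand-in for the 548-word list of [5], and Porter stemming per [6] is omitted for brevity:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "to", "and", "in"}  # toy list

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words (a real
    pipeline would also apply Porter stemming, as in [6])."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Represent each document as a sparse term -> TF-IDF mapping."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()                      # document frequency per term
    for toks in token_lists:
        df.update(set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)              # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["the cat sat", "the dog sat", "a cat and a dog"]
vecs = tfidf_vectors(docs)
```

Terms shared by every document get weight zero, while rarer terms are up-weighted, which is the property the feature-selection step in [4] relies on.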

In the particle representation, the superscript 0 denotes the initial iteration and n is the number of terms in the document set. Each particle also carries a velocity vector, which corresponds to the update quantity of all weight values. Finally, effectiveness is measured in terms of precision and recall, as in [4].
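The particle and velocity updates in [4] follow the standard PSO formulation. The sketch below shows one generic PSO iteration; the inertia and acceleration constants (w, c1, c2) are illustrative choices, and the exact values used in [4] may differ:

```python
import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One iteration of the standard PSO update. Each particle i holds
    a weight vector over the n terms; pbest[i] is particle i's best
    known position and gbest the swarm's best known position."""
    for i in range(len(positions)):
        for j in range(len(positions[i])):
            r1, r2 = random.random(), random.random()
            # velocity update: inertia + cognitive pull + social pull
            velocities[i][j] = (w * velocities[i][j]
                                + c1 * r1 * (pbest[i][j] - positions[i][j])
                                + c2 * r2 * (gbest[j] - positions[i][j]))
            # position update: move along the new velocity
            positions[i][j] += velocities[i][j]
    return positions, velocities
```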

Experimental evaluations were performed on two standard text datasets, Reuters-21578 and TREC-AP. In this approach [4], no weighting mechanism is defined, and neither multi-category classification nor hierarchical classification is described. The effectiveness of PSO on different datasets for the classification problem is examined in [11]: ten datasets, varying in the number of instances, classes, and parameters per instance, are used, and PSO accuracy in terms of error rate is compared with nine other classifiers. On three datasets, PSO outperformed all the other classifiers. The effect of an increasing number of classes on PSO efficiency is also highlighted, which may be due to the implementation or to the similarity measure used in the fitness function.

Improving document classification with structural content and citation-based evidence is addressed in [12]. Both structural (title and abstract) and citation-based information are considered, with several similarity measures for each: bag of words, cosine, and Okapi for structural similarity; bibliographic coupling, co-citation, Amsler, and Companion for citation-based similarity. Genetic programming is used to classify new documents: the best similarity tree for each class is maintained, and majority voting over the outputs of the classifiers predicts the class of a new document. The authors claim multi-class classification, but the underlying detail is missing. An approach based on Neighborhood Preserving Embedding (NPE) with PSO is given in [13]. NPE preserves the local manifold structure and retains the most discriminative features for classification: document features in the high-dimensional space X are reduced to a lower-dimensional space Y using NPE, and PSO is then applied to the reduced features, similarly to [4]. Extracting discriminative features plays an important role in increasing document classification accuracy. A Bayesian approach for classifying conference papers is given in [3]. 400 education conference papers in four categories (E-learning, Teacher Education, Intelligent Tutoring Systems, and Cognition Issues) are used to construct a Bayesian network, using only keywords. Compound keywords are parsed into single keywords, which are ranked by frequency, and the top 7 keywords per category are taken as input to the network. Each category shares some common keywords along with some individual ones. The network was trained with 100 papers per category.
This technique is efficient because little text is needed (only keywords are used to classify a new paper); conversely, the misclassification error increases when papers provide no keywords or when the authors provide wrong keywords. Linear text classification using category relevance factors (CRF) is given in [10]. A CRF is maintained for all documents belonging to the same category, and a profile vector for each category is built from the CRF feature vector; based on the cosine similarity between a paper and the categories, the document is assigned to the category with maximum similarity. Another similarity-based text classification technique was proposed in [14]. Their framework classifies based on a small set of documents and, due to the small number of documents, achieves completeness; with a small vocabulary the technique is fast, but it does not scale to large vocabularies, and its performance was slightly lower than expected. A comparison with SVM, the Rocchio algorithm, and Naïve Bayes is mentioned, but no comparison tables or graphs are presented. A technique using data mining principles and rule weights was proposed in [15]. It is an efficient fuzzy rule generation approach that produces rules of different lengths. Since fuzzy rule generation can yield a very large rule set, only a small set of rules is accepted and redundant rules are removed; by using minimum support and confidence, the approach avoids exponential growth of the rule set. The method learns the rule weights from a set of labeled training patterns: the initial weight of each rule is 1, and a weight is considered optimal if it maximizes the classification rate. A fuzzy rule-based classification system that specifies weights for rules is proposed in [16]. Two heuristic methods are proposed for rule weight specification: the first is a rule evaluation measure using support and confidence, and the second specifies weights (maximum confidence) with respect to antecedent and consequent classes. For n-dimensional data, their simulation results show that the proposed method outperforms the existing one. In most existing systems, only one topic or class can be assigned to a paper, and most approaches are generic text classifiers; very few [3, 12] target scientific documents. The classification proposed by Chin et al. [3] is based on keywords, which can never achieve accurate classification because many papers provide no keywords, and author-provided keywords are sometimes incorrect, leading to misclassification. Similarly, [12] lacks detailed information about the underlying approach and its results, including the multi-class aspect. Moreover, the experimental results of most existing approaches are not evaluated on scientific publication datasets.
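The CRF-style linear classification of [10], assigning a document to the category whose profile vector has maximum cosine similarity, can be sketched as follows. The profile vectors are toy values, and the construction of profiles from category relevance factors is omitted:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    common = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(doc_vec, profiles):
    """Assign the category whose profile vector is most similar to the
    document vector, in the spirit of the classifier in [10]."""
    return max(profiles, key=lambda c: cosine(doc_vec, profiles[c]))

profiles = {"H": {"retrieval": 1.0, "index": 0.8},
            "D": {"software": 1.0, "testing": 0.7}}
print(classify({"index": 1.0, "retrieval": 0.5}, profiles))  # → H
```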

References extracted from the user-provided document are matched against all the references stored in the database by the matcher component. Our matching algorithm is based on the well-known Levenshtein similarity [17]. Currently, we match each reference of the test document against every reference in the reference database; this process is time-consuming due to the large number of stored references. The matcher's efficiency can be improved with heuristics such as indexing the references by length: organizing the references in order of length makes matching efficient, since each reference of the test paper is compared only with stored references of similar length. For example, if the length of a reference in the test paper is L, we compare it only with references of length L-E to L+E, where E is the allowed error in reference length.
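The length-indexing heuristic described above can be sketched as follows. The bucket structure, the edit-distance threshold `max_dist`, and the default `E` are illustrative choices, not values from the paper:

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance [17]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def build_length_index(references):
    """Bucket the stored references by string length."""
    index = defaultdict(list)
    for ref in references:
        index[len(ref)].append(ref)
    return index

def match(ref, index, E=5, max_dist=3):
    """Compare ref only against stored references whose length is
    within +/- E, as suggested in the text."""
    L = len(ref)
    for length in range(L - E, L + E + 1):
        for cand in index.get(length, []):
            if levenshtein(ref, cand) <= max_dist:
                return cand
    return None
```

Bucketing turns a full scan over the database into a scan of at most 2E + 1 buckets, at the cost of missing matches whose length difference exceeds E.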


3. CITATION BASED CATEGORY IDENTIFICATION
In our current work, we focus initially on evaluating the use of references for category identification. The references of a paper are extracted and matched against the references stored in the database.


From each document we extract the information used to classify scientific documents: author information, category information, and references, as depicted in Figure 2. Author information contains the list of all authors given in the document. Category information contains the list of all categories provided by the author; a paper may carry one or more categories. References contain all entries extracted from the paper's reference section. We use this information to classify a document; in the future it can be enriched with other text from the paper, such as keywords, the abstract, or frequent terms, to increase classification accuracy.


Figure 2. Extracted Metadata information


The process of evaluating references to identify the category of a scientific publication is depicted in Figure 3. The Citation Based Category Identification (CBCI) framework elaborates the idea of identifying a category for a user-provided paper. In CBCI, the reference extractor extracts all references from all papers in the repository provided by the J.UCS dataset [19]. These references are stored in the database, once for all papers, along with the matching topics of the papers, in order to classify new documents provided by users.
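The TR {Topic, Reference} pairing built from the training set can be sketched as follows; the tuple layout and names are illustrative, not taken from the paper's implementation:

```python
from collections import defaultdict

def build_tr_index(training_papers):
    """Build TR {Topic, Reference} pairs: each stored reference maps
    to the set of topics of the training papers that cite it.
    `training_papers` is a list of (topics, references) tuples."""
    tr = defaultdict(set)
    for topics, references in training_papers:
        for ref in references:
            tr[ref].update(topics)
    return tr

# Two training papers: the first labeled H, the second labeled H and D.
training = [({"H"}, ["r1", "r2"]),
            ({"H", "D"}, ["r2", "r3"])]
tr = build_tr_index(training)
# r2 is cited by both papers, so it is paired with topics H and D.
```

At test time, each reference of the focused paper that matches a key of `tr` contributes the associated topics to the candidate list, which the weighting step then scores.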


Figure 3. CBCI flow framework (ACW: Author Category Weight; SCW: System-generated Category Weight)

After matching each reference of the test paper against the references database, the matcher returns category(ies) with a weight termed the System-generated Category Weight (SCW), calculated as

SCW = n × φ / t          Eq (2)

where n is the number of references matched in each category; φ is a threshold, which can be set to any value determined by domain experts (in our limited experiments we set it to 3) and depends on how many references are matched in each category; and t is the total number of references provided in the test paper. This weight represents the strength of the classification category produced by the system, and we compare it with the weight of the category provided by the author. If initial categories are provided by the author, the Author Category Weight (ACW) is calculated by the self-matcher module. Self-citation is detected from the author-provided references and the author names: if any author name matches an author name in any reference of the test document, we call it a self-citation. ACW is calculated as

ACW = g × φ / t          Eq (3)

where g is the number of self-citations, and φ and t are the same threshold and total as in Eq (2). After evaluating both SCW and ACW, the category predictor presents the possible category(ies) to the user.
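Eq (2) and Eq (3) can be computed directly. The function and parameter names below are illustrative, with φ fixed at 3 as in the paper's experiments:

```python
def scw(matched_refs, total_refs, phi=3):
    """System-generated Category Weight, Eq (2): SCW = n * phi / t,
    where n is the number of references matched in the category and
    t the total number of references in the test paper."""
    return matched_refs * phi / total_refs

def acw(self_citations, total_refs, phi=3):
    """Author Category Weight, Eq (3): ACW = g * phi / t, where g is
    the number of self-citations."""
    return self_citations * phi / total_refs

# A test paper with 12 references, 6 matching category H, 2 self-citations:
print(scw(6, 12))   # → 1.5
print(acw(2, 12))   # → 0.5
```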

4. EXPERIMENTAL EVALUATION
To evaluate the proposed scheme, we measured its accuracy on the Journal of Universal Computer Science (J.UCS) dataset. The features of the dataset, including the number of research papers used for training and testing, are given in Table 1.

Table 1: Features of the J.UCS dataset

Total number of research papers                        1460
Average number of research papers in each category      112
Average number of multiclass research papers            234
Total number of references                            16404
Average number of references in each paper               11
Average number of references in each category          1232
Number of research papers for training                 1010
Number of research papers for testing                   450

We evaluated the accuracy of the proposed approach for four categories. The results were evaluated against the author-provided categories in the test set; accuracy may therefore be underestimated where the author-provided categories are themselves wrong. Currently, some portions of the test set are manually investigated by domain experts.

The results assume that the author-provided category is correct, so they report only True result and False result percentages, defined as follows. True result: the system predicts the same category as given by the author in the test document. False result: the system does not predict the category given by the author in the test document. We calculated both percentages for the four categories; the results with respect to these two measures are given in Table 2.

Table 2: Results percentage for the test dataset

Category                  True Result %   False Result %
Information System (H)    52              48
Software (D)              70              30
General Literature (A)    73              37
Computer Application (J)  80              20

Overall accuracy is determined as the mean of the per-category True result percentages:

Accuracy = (1/n) × Σ_i TrueResult_i

where n is the number of individual categories. The overall accuracy of our system is 70%. This accuracy can be further increased by using more metadata or text in combination with the proposed scheme and by removing the error percentage in both the training and test datasets. The error percentage reflects wrong topic assignments to some papers, since the existing topics were assigned manually by the authors; the topic of each paper in our training and testing datasets can be re-evaluated by domain experts.

5. CONCLUSION AND FUTURE WORK
In the scientific domain, classifying documents into predefined categories is an important research problem that supports a number of tasks such as information retrieval, expert finding, and recommender systems. This research is an effort to bridge the gap between authors and correct document categorization by suggesting possible categorizations for an author's work. Our system achieved 70% accuracy, and it supports multiple categorizations for a single paper (multi-label classification). In the future, we aim to optimize our matcher's efficiency using heuristics, and to incorporate additional metadata to improve categorization accuracy.

References
[1] Sebastiani, F. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1-47.
[2] Azeem, S., Asghar, S. 2009. Evaluation of Structured and Un-Structured Document Classification Techniques. International Conference on Data Mining (DMIN'09), July 13-16, 2009, Las Vegas, Nevada, USA, 448-457.
[3] Kok-Chin, K., Choo-Yee, T. 2006. A Bayesian Approach to Classify Conference Papers. MICAI 2006, 1027-1036.
[4] Wang, Z., Zhang, Q., Zhang, D. 2007. A PSO-based web document classification algorithm. Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 30 July - 1 August 2007, Qingdao, China, 659-664.
[5] Wang, Z. G. 2003. Improving on latent semantic indexing for chemistry portal. Master's thesis, Institute of Process Engineering, Chinese Academy of Sciences, Beijing.
[6] Porter, M. F. 1997. An algorithm for suffix stripping. Readings in Information Retrieval, Morgan Kaufmann Publishers Inc., San Francisco, CA, 313-316.
[7] Guerrero-Bote, V. P., Moya-Anegon, F., Herrero-Solana, V. 2002. Document organization using Kohonen's algorithm. Information Processing and Management, Elsevier Science, New York, January 2002, 79-89.
[8] Dumais, S. T. 1991. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, and Computers, February 1991, 229-236.
[9] Coulter, A. 1998. Computing Classification System 1998: Current Status and Future Maintenance. Report of the CCS Update Committee, Computing Reviews, Jan 1998, 1-5.
[10] Deng, Z. H., Tang, S. W., Yang, D. Q., Zhang, M., Wu, X. B., Yang, M. 2002. A Linear Text Classification Algorithm Based on Category Relevance Factors. Digital Libraries: People, Knowledge, and Technology, Lecture Notes in Computer Science, Vol. 2555, 88-98.
[11] De Falco, I., Della Cioppa, A., Tarantino, E. 2006. Evaluation of particle swarm optimization effectiveness in classification. Lecture Notes in Computer Science, 3849, 164-171.
[12] Baoping, Z. 2004. Combining structural and citation-based evidence for text classification. Thirteenth ACM International Conference on Information and Knowledge Management (CIKM '04), ACM, New York, NY, USA, 162-163.
[13] Ziqiang, W., Xia, S. 2009. Document Classification Algorithm Based on NPE and PSO. International Conference on E-Business and Information System Security (EBISS '09).
[14] Senthamarai, K., Ramaraj, N. 2008. Similarity based technique for Text Document Classification. International Journal of Soft Computing, 3(1), 58-62.
[15] Dehzangi, O., Zolghadri, M. J., Taheri, S., Fakhrahmad, S. M. 2007. Efficient Fuzzy Rule Generation: A New Approach Using Data Mining Principles and Rule Weighting. Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), Vol. 2, 134-139.
[16] Guerrero-Bote, V. P., Moya-Anegon, F., Herrero-Solana, V. 2002. Document organization using Kohonen's algorithm. Information Processing and Management, Elsevier Science, New York, January 2002, 79-89.
[17] Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8), 707-710.
[18] Garfield, E. 1955. Citation Indexes for Science. Science, 122, 108-111.
[19] Afzal, M. T., Maurer, H., Balke, W., Kulathuramaiyer, N. 2009. Improving Citation Mining. International Conference on Networked Digital Technologies (NDT 2009), Ostrava, Czech Republic, July 28-31, 2009.