Data Mining for Digital Libraries

Proceedings of the 4th National Conference; INDIACom-2010 Computing For Nation Development, February 25 – 26, 2010 Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi

Data Mining For Digital Libraries – A New Paradigm

Ajendra Isaacs1, Rajeev Paulus2, Navendu Nitin3 and Narendra Gupta4
1,4 Deptt. of Comp.Sc. & I.T., AAI-DU; 2,3 Deptt. of Electronics, AAI-DU
E-mail: 1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected]

ABSTRACT
With the widespread use of digital libraries, data mining is no longer restricted to business intelligence, trend analysis, or scientific and probabilistic research. Although data mining operates on databases and similar stores, it is far more than database querying: it combines pattern matching, artificial intelligence, machine intelligence, and other statistical techniques and algorithms. When it is used to find documents, texts, and images in an existing digital repository of data, known as a digital library, a paradigm change takes place: we analyze, extract, modify, and tweak the results by applying existing algorithms and techniques to the digital documents, making them yield the knowledge we require. This paper seeks to throw some light on the existing means and techniques of data mining in the context of digital libraries.

KEYWORDS
Digital library, text mining, tabular data mining, digitizing data, extraction of documents, names, text etc.

1. INTRODUCTION
In recent times data mining has become a very popular technique for extracting "appropriate information" from raw data, information that has more value than the unmined data. Raw data is usually present in two forms, structured and unstructured, and in that form it does not usually add to our knowledge. Data mining techniques are therefore used to work over this data and torture it until it yields the knowledge that was required in the first place; the tools of torture are the algorithms and techniques of data mining, usually comprising pattern matching algorithms and artificial intelligence methods. These techniques are usually applied to business systems and scientific data; our aim is to use these very techniques in the environs of digital libraries and see how they apply there.
When we shift to digital libraries, we ask ourselves the basic question: what are we mining for? The answer is simple. A digital library confronts us with documents, text, multimedia, and other forms of data, which may even be audio or video in nature. Any specific piece of information with a definite context, whether a text, a speech, an image, or a clip, can be mined for. The data is sifted in the same way that a miner sifts his material in search of what he is seeking; data mining is no different, the only change being in the tools employed. The material to be mined is the text in the data, and the mining tools are the algorithms and the various text matching software that carry out the searches and present the required data. This is just one isolated example and by no means covers the field of data mining, which has many more facets. Using these tools a user can extract information and patterns, and infer rules and associations, which once again add to the knowledge derived from simple data.

This is only a basic introduction to the data mining techniques observed in most systems. In computer parlance it is not so simple: many steps and different techniques, methods, and algorithms are needed for the mining to be successful and to yield proper results. We cannot say for certain that a given technique will yield the result. Data mining has a stark side to it, the probabilistic side, which puts an end to all our certainty and demands that the data miner have his mental sleeves rolled up in advance, ready to use one algorithm or another, or a combination of them.

2. DATA MINING, HOW IT ALL BEGAN
From times past mankind has been asking the basic questions of how, what, when, and where, and in the process new vistas have opened up. Data mining is not limited to the computer age but hearkens back to the cave men, who had to search for animals in order to survive. To find the animals that were their diet, they had to pay attention to the weather and the natural habitat. In doing so they were already on the data mining path, as they tried to understand weather patterns, the animals' response to them, and the resulting supply of available food.
In terms of scientific developments, data mining started with Bayes' theorem in the 1700s and regression analysis in the 1800s, and moved on to data sets, genetic algorithms, decision trees, neural networks, and so on, up to today's software: RapidMiner, Weka, and the R Project, to name a few. Today we are heading towards semantic web search, which will dwarf all previous attempts at data mining as we move towards a much higher abstraction level of data retrieval.

3. DATA MINING PARADIGM
The following classes of tasks are commonly distinguished [1]:
(i) Classification: Arranging data into predefined groups, that is, building a model inferred from a training set. For example, a word processing program might attempt to classify an HTML page as valid or invalid. Commonly used algorithms are decision tree learning, artificial neural networks [2], and Bayesian classification [3],[4],[5].

(ii) Clustering: Grouping together data of the same type, i.e. the data mining algorithm tries to put similar items together [6],[7].
(iii) Regression: Attempting to find a function that models the data with the least error.
(iv) Association rule learning: Searching for relationships between items, from which rules can be established. For example, a customer who buys diapers is likely to buy baby powder, so a supermarket can use this information to place these products together on its shelves.
(v) Prediction: Using training sets and classification models to predict missing values.

4. MINING TECHNIQUES – TABULAR TYPE
Nowadays digital libraries hold a host of information, which may be of a tabular type or a textual type, and the two types require different techniques and algorithms. Mining tabular data means searching rows and columns, fields and records, and the mining methods use mechanisms that are very different from simple searches; new algorithms have been devised and optimized specifically for querying, searching, and extracting tabular data. With such methods we can perform prediction, that is, predict certain kinds of missing data. We can perform classification, that is, group closely resembling information and set up a statistical model that can then be used for prediction. We can also perform clustering, that is, group together data that is similar in type.

5. MINING TECHNIQUES – TEXTUAL TYPE
Data mining techniques applied to digital libraries were earlier of the simple textual, search-engine-based kind. As the techniques have developed and been refined, concept-based retrieval methods have emerged that retrieve data from digital libraries in a conceptual paradigm: the retrieved data need not be textually related to the query but may instead be related conceptually. We can also pose association-based queries that retrieve information by pattern, attaching weights to the elements on which we want to concentrate the search. We can likewise check for data that seems out of place or erroneous. Thus we can reasonably move beyond simplistic pattern-based or word-based searches to a higher level of searching [8],[9].

6. STRATEGIES IN DATA MINING
A spate of methods and combinations of techniques has emerged in the field of data mining [10]; some of them are outlined in the following sections.

7. ARTIFICIAL NEURAL NETWORKS
Artificial neural networks are used a lot in data mining. A number of nodes, i.e. input nodes, output nodes, and intermediary nodes [6], are connected with weights attached to them, and an algorithm is then used to produce a proper classification of the data given to the input nodes.

8. CLASSIFIERS
Tree-based classifiers take data at the root of the tree and pass it down until it is finally classified at the leaves.

9. A COLLECTION OF MODELS
In this method we use several techniques, gather their results, and interpret them as per our requirement.

10. CONCLUSION
In conclusion, data mining is here to stay, especially in the context of digital libraries [11], where we will see more and more applications of this type. Data mining will soon be inextricably linked to digital libraries: digital libraries are here to stay, and the data they hold will become more and more complex, entailing the complex data retrieval techniques of data mining.

11. FUTURE SCOPE
We are soon to embark on a new journey, the synthesis of the fields of digital libraries and data mining. It will give rise to new paradigms as information present in more complex forms, that is, unstructured or semi-structured data, is efficiently mined by the likes of the SPRINT algorithm [12] and artificial neural networks [13], along with a corresponding growth in the ability of data mining to handle larger and more voluminous data more securely, with privacy-preserving enhancements [14].

12. REFERENCES
[1]. Wikipedia.org
[2]. Rakesh S. Patil et al., "Data Mining: Needs & Applications", Proceedings of the 2nd National Conference on Computing For Nation Development; INDIACom-2008, ISSN 0973 – 7529.
[3]. D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
[4]. Sholom M. Weiss and Casimir A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
[5]. Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, December 1993.

[6]. Sona Jandial, "Artificial Neural Network Applied to Data Mining: The Commercial Perspective", Proceedings of the 2nd National Conference on Computing For Nation Development; INDIACom-2008, ISSN 0973 – 7529.
[7]. Pavel Berkhin, Survey of Clustering Data Mining Techniques, Accrue Software, Inc.
[8]. Robert L. Grossman, University of Illinois at Chicago, Laboratory of Advanced Computing.
[9]. Adriaans P. and Zantinge D. (1996) Data Mining.
[10]. Sanjay T. Singh et al., "A Data Mining Framework for B.I.", Proceedings of the 2nd National Conference on Computing For Nation Development; INDIACom-2008, ISSN 0973 – 7529.
[11]. Ajendra Isaacs et al., "Automation of Libraries through ICT Applications – A Tool to Empower National Development", Proceedings of the 2nd National Conference on Computing For Nation Development; INDIACom-2008, ISSN 0973 – 7529.
[12]. Rakesh Agrawal et al., SPRINT: A Scalable Parallel Classifier for Data Mining, IBM Almaden Research Center.
[13]. Yehuda Lindell et al., Privacy Preserving Data Mining, Weizmann Institute of Science, Rehovot, Israel.