
An Exploration of Knowledge Discovery from Data (KDD) Tools

Koushik Dutta* and Dr. T. V. Prasad**
* Deputy Director, IT Services Department, Bureau of Indian Standards, New Delhi 110 002
** Professor & Head, Dept. of Computer Science & Engg., Lingaya’s University, Faridabad 121 002, Haryana
E-Mail: [email protected], [email protected]

Abstract- The rapid evolution of science and technology has led to the generation of huge volumes of data. We have an abundance of data: scientific, medical, demographic, financial and marketing data, to name a few. With its ever-increasing volume, the emphasis is on automatic analysis of data. Data analysis is important for extracting knowledge or information from huge data sets and utilising it for discovering patterns or for predicting future trends. Thus the explosive growth of data has necessitated the development of new techniques and automated tools for transforming data into knowledge.

Keywords: Knowledge discovery, knowledge mining, KDD tools

1. INTRODUCTION

Data Mining, popularly known as Knowledge Discovery from Data (KDD), is the automated or convenient extraction of hidden predictive information implicitly stored or captured in massive data repositories. Since its beginning in the 1980s, the subject has made rapid and significant progress [1]. Data mining, like statistics, is not a business solution; it is just a technology. In the last two decades, numerous commercial data mining and data analysis tools have been built to solve problems across fields such as financial services, life sciences, telecom and insurance. Data mining software allows users to analyze large databases to solve business decision problems. The data mining process operates on information in historical databases that record previous interactions with customers. The software uses this historical information to build a model that predicts customer behaviour, e.g., which customers are likely to respond to a new product [2]. This treatise attempts to compare some of the prevailing KDD tools that organisations use to take appropriate business decisions and to make optimal use of resources for business development.
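The model-building step described in the introduction (historical records in, a prediction of customer behaviour out) can be illustrated with a very small sketch. It is not taken from any of the tools surveyed here: the records in `HISTORY`, the single attribute (number of past purchases) and the one-threshold rule are all assumptions chosen to keep the example short.

```python
from typing import List, Tuple

# Hypothetical historical records: (number of past purchases, responded to the last offer?)
HISTORY: List[Tuple[int, bool]] = [
    (0, False), (1, False), (2, False), (3, True),
    (4, False), (5, True), (6, True), (8, True),
]


def fit_threshold(records: List[Tuple[int, bool]]) -> int:
    """Learn a one-rule model: 'customer responds if purchases >= threshold'.

    Every observed purchase count is tried as a threshold and the one that
    misclassifies the fewest historical records is kept (first one wins on ties).
    """
    best_threshold = records[0][0]
    best_errors = len(records) + 1
    for threshold in sorted({purchases for purchases, _ in records}):
        errors = sum((purchases >= threshold) != responded
                     for purchases, responded in records)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(threshold: int, purchases: int) -> bool:
    """Apply the learned rule to a new customer."""
    return purchases >= threshold


threshold = fit_threshold(HISTORY)
print("learned rule: responds if purchases >=", threshold)
print("customer with 7 purchases responds?", predict(threshold, 7))
```

Commercial tools fit far richer models (decision trees, neural networks and so on), but the workflow is the same: fit on historical interactions, then score new customers.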

2. IMPORTANCE OF KDD TOOLS

Numerous commercial data mining systems are available in the market today. Because data mining is used in a vast array of areas, tools are needed for recognizing and tracking patterns within the data. Such KDD and mining tools help organizations sift through volumes of real-time data to extract meaningful relationships. This helps businesses anticipate customer needs rather than simply react to them, and it allows business users to make informed decisions with the available data that can put a company ahead of its competitors. Acquisition of new customers is the primary means of growth for many businesses. This involves wooing new customers who have never used the company’s products. KDD tools can help segment those prospective customers and increase the response rate that an acquisition marketing campaign can achieve. Thus, in the present day, the use of efficient KDD tools is critical for business analysis and information operations [3].
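The link between segmentation and campaign response rate can be made concrete with a short calculation. The sketch below is illustrative only: the segment labels and the contact records in `CONTACTS` are invented, and "lift" is used in its usual sense of a segment's response rate divided by the overall response rate.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical results of a past acquisition campaign: (segment, responded?)
CONTACTS: List[Tuple[str, bool]] = [
    ("urban_young", True), ("urban_young", True),
    ("urban_young", False), ("urban_young", False),
    ("rural_senior", False), ("rural_senior", False),
    ("rural_senior", False), ("rural_senior", True),
    ("suburban_family", True), ("suburban_family", False),
]


def response_rates(contacts: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of contacted prospects in each segment who responded."""
    contacted: Dict[str, int] = defaultdict(int)
    responded: Dict[str, int] = defaultdict(int)
    for segment, did_respond in contacts:
        contacted[segment] += 1
        responded[segment] += int(did_respond)
    return {segment: responded[segment] / contacted[segment] for segment in contacted}


def lift(contacts: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Segment response rate relative to the overall response rate."""
    overall = sum(did_respond for _, did_respond in contacts) / len(contacts)
    return {segment: rate / overall
            for segment, rate in response_rates(contacts).items()}


for segment, value in sorted(lift(CONTACTS).items()):
    print(f"{segment}: lift {value:.2f}")
```

A lift above 1 marks a segment that responds better than average, so targeting such segments first raises the response rate of the campaign as a whole.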

3. PARAMETERS USED FOR COMPARISON

Numerous tools are available in the market, each offering a range of functionalities. We have analyzed 15 important ones which, apart from performing basic data mining tasks like classification, prediction and clustering, specialize in feature reduction, pattern recognition, anomaly detection, etc. The tools make use of techniques like decision trees, neural networks and K-Means, to name a few. The features of the tools have been highlighted. Most of the tools have been designed to work on Windows and Linux platforms. A few of them, namely SAS Enterprise Miner, IBM Intelligent Miner, Oracle Data Miner and IBM SPSS Modeler, also work on other platforms like AIX, HP-UX and Sun Solaris. Most of the tools have an easy-to-use GUI, while a few have adopted an MS Office based environment. The details are given in the Annexure.
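Of the techniques named above, K-Means is the simplest to show in a few lines. The following is a minimal sketch of the standard algorithm (assign each point to its nearest centroid, then move each centroid to the mean of its points) and not the implementation used by any surveyed tool. The sample `POINTS` and the choice of initial centroids are assumptions made for the illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2 + (point[1] - centroids[i][1]) ** 2,
    )


def k_means(points: Sequence[Point], centroids: Sequence[Point],
            iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and centroid update until the centroids stop moving."""
    current = list(centroids)
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in current]
        for p in points:
            clusters[nearest(p, current)].append(p)
        updated = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else current[i]
            for i, c in enumerate(clusters)  # an empty cluster keeps its old centroid
        ]
        if updated == current:
            break
        current = updated
    labels = [nearest(p, current) for p in points]
    return current, labels


# Hypothetical two-attribute customer data forming two obvious groups.
POINTS: List[Point] = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
                       (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = k_means(POINTS, centroids=[POINTS[0], POINTS[3]])
print(labels)
```

Real tools add refinements such as smarter initialisation and handling of categorical attributes, which this sketch leaves out.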

5. CONCLUSION

It has taken human society more than 300,000 years to create 12 exabytes of data (one exabyte being 1 billion gigabytes), and the amount of data is expected to double in the next three years, according to the School of Information Management and Systems at the University of California, Berkeley. With the ever-increasing volume of data, nearly all data mining applications feature expanded analytics, user-friendly interfaces, and powerful algorithms that allow analysis of structured and unstructured data. The KDD tools accept data from multiple sources like MS Excel, MS Access, MS SQL Server, Oracle and other relational databases, and they have interactive user interfaces. A few of the tools also offer a complete end-to-end solution, from data importing to data scoring and reporting. Text and web mining form an integral part of many of the tools, given that a large share of data is unstructured, residing in websites and e-mails.

REFERENCES

[1] Han Jiawei and Kamber Micheline, “Data Mining: Concepts and Techniques”, 2nd Edition, Morgan Kaufmann Publishers
[2] Home page of Kurt Thearling, http://www.thearling.com/index.htm
[3] Berson Alex, Smith Stephen and Thearling Kurt, “Building Data Mining Techniques”, New York McGraw Hill, 2000
[4] Salford Systems, http://www.salfordsystems.com
[5] IBM Cognos, http://www01.ibm.com/software/data/cognos
[6] C 5.0, www.rulequest.com/see5-info.html
[7] Olson Louis Davis and Dursun Delen, “Advanced Data Mining Techniques”, Berlin Heidelberg Springer, 2000
[8] Wizwhy, http://www.wizsoft.com/default.asp?Win=7
[9] Elder John and Abbott Dean, “A Comparison of Leading Data Mining Tools”, Elder Research
[10] http://www.wizsoft.com/kmcon/Wizwhy/Unique Features.pdf
[11] Superquery, http://www.azmy.com/superquery.htm
[12] Statistica Dataminer, http://www.statsoft.com/products/statisticadata-miner/
[13] DBMiner, http://www.dbminer.com
[14] IntelligentMiner, http://www01.ibm.com/software/data/iminer/
[15] Polyanalyst, http://www.megaputer.com/polyanalyst.php
[16] Oracle Data Miner, http://www.oracle.com/technology/products/bi/odm/index.html
[17] SAS Enterprise Miner, http://www.sas.com/technologies/analytics/datamining/miner
[18] IBM SPSS Modeler, http://www.spss.com/software/modeling/modeler
[19] Data Engine, http://www.dataengine.de/english/sp/index.htm
[20] Knowledge Studio, http://www.angoss.com/analytics_software/KnowledgeSTUDIO.php
[21] AI Trilogy, http://www.wardsystems.com

Annexure

COMPARATIVE CHART OF VARIOUS “KNOWLEDGE DISCOVERY FROM DATA” (KDD) TOOLS

Parameters (Sl. No.): 1. Logo, 2. Developer, 3. Function, 4. Technique, 5. Platform, 6. Features, 7. Interface, 8. System Requirements, 9. References. (Parameter 1, Logo, is an image and is not reproduced.)

CART
2. Developer: Salford Systems
3. Function: Classification
4. Technique: Decision Trees (Classification and Regression Trees)
5. Platform: Windows, Linux
6. Features: Is a decision tree tool that uses the CART algorithm. Makes use of seven different splitting criteria. Specialized backup rule available to handle missing data. Rules do not assume that the values for a missing attribute are the same. Data from 80 different file formats (including Excel, Lotus, and Oracle) can be used.
7. Interface: Regular GUI version of CART available.
8. System Requirements: MS Windows, 2.0 GHz P IV Processor, 2 GB RAM, 2 GB free hard disk storage space; Linux: 32 MB RAM (min.), min. 40 GB of free storage space on hard disk
9. References: [4]

Cognos Business Intelligence
2. Developer: IBM Corporation
3. Function: Prediction
4. Technique: Neural Networks, Rule Induction (if-then)
5. Platform: Windows, Linux, Solaris, AIX, HP Itanium, HP UX
6. Features: Deployment and architecture simplified by Web services architecture. Functionalities like reporting, analysis, scorecards, dashboards and business event management are possible with the software. Single, open API enables integration with existing security, portals, and IT infrastructure.
7. Interface: Has a launch menu to access IBM Cognos 8 Administration and Studios. Has a unique “My Area” icon which grants the user access to a customized workspace.
8. System Requirements: MS Windows, IE 6, 128 MB RAM (min.)
9. References: [5]

C 5.0/SEE 5.0
2. Developer: Rule Quest Research Pty Ltd.
3. Function: Classification
4. Technique: Decision Tree, Rule Induction (if-then)
5. Platform: Windows, Linux
6. Features: See5.0 runs on Windows machines and C 5.0 runs on Unix. Designed to analyze substantial databases containing thousands to hundreds of thousands of records and tens to hundreds of numeric, time, date, or nominal fields. Easy to use and does not presume any special knowledge of Statistics or Machine Learning. Source code is provided to embed classifiers generated by See 5.0/C5.0 in applications [6]. To maximize interpretability, classifiers are expressed as decision trees or sets of if-then rules, forms that are generally easier to understand than ANN.
7. Interface: Application has a main window containing buttons for all applications.
8. System Requirements: MS Windows 2000/XP/Vista; UNIX: Linux/Iris/Solaris; Processor: Intel Pentium IV; RAM: 64 MB (min.)
9. References: [7]

WIZWHY
2. Developer: WIZSOFT
3. Function: Prediction
4. Technique: Rule Induction (if-then and if-and-only-if)
5. Platform: Windows
6. Features: Can be used for data analysis, making predictions and revealing cases that deviate from the rules [8]. It is a rule induction data mining tool that discovers the if-then rules in the data and reveals the necessary and sufficient conditions [9]. Calculates the error probability of each rule and summarizes the data graphically by presenting the main rules and trends.
7. Interface: Has a main project view with multiple components integrated into a single project window. A project is organised through the metaphor of creating and manipulating a variety of objects including data objects, decision tree objects, cluster objects and so on.
8. System Requirements: MS Windows 98/XP/Vista; 32 MB RAM
9. References: [10]

SUPERQUERY
2. Developer: Azmy Thinkware Inc.
3. Function: Classification
4. Technique: Rule Induction
5. Platform: Windows
6. Features: Classifies data and discovers all the facts. Automatically draws graphs and calculates totals and statistics for any column and any filter. With SuperQuery, there is no need to know SQL or any statistical language.
7. Interface: Makes use of statistical methods to address data mining issues. Processes remote databases without creating local copies which enhances performance in case of large data repositories.
8. System Requirements: 486 PC or better, MS-Windows 2000/XP/Vista, 8 MB RAM, and 8 MB free space on HDD
9. References: [11]

Statistica Dataminer
2. Developer: Statistica Inc.
3. Function: Classification, Clustering, Prediction
4. Technique: Decision Trees (CART, CHAID), Neural Networks (including Back propagation), Regression
5. Platform: Windows
6. Features: Makes use of statistical methods to address data mining issues. Processes remote databases without creating local copies which enhances performance in case of large data repositories.
7. Interface: Has a click icon based GUI to create a workflow description of the tasks to be performed.
8. System Requirements: MS Windows-compatible CPU, 32 MB RAM, Windows 2000/XP/Vista/7, and a network server connected to workstations (an existing enterprise database application can be used or it can be provided by StatSoft)
9. References: [12]

DBMINER
2. Developer: DB Technologies Inc.
3. Function: Association Rules, Classification, Clustering
4. Technique: Decision Trees, K-Means
5. Platform: Windows
6. Features: Uses intelligent and automated processes to analyze large volumes of detailed data from relational databases, data warehouses and web data. Also accepts data from multiple sources including MS SQL Server, Excel, OLEDB and other relational databases. DBMiner Insight Solutions provide association, sequence and differential mining capabilities for MS SQL Server Analysis Services Platform and they also provide market basket, sequence discovery and profit optimization for MS Accelerator for Business Intelligence.
7. Interface: DMSQL interface or a GUI interface is used. Allows a cube view of data by interfacing through MS SQL Server’s OLAP.
8. System Requirements: MS Windows XP/Vista, MS SQL Server’s OLAP Service, MS Excel
9. References: [13]

COMPARATIVE CHART OF VARIOUS “KNOWLEDGE DISCOVERY FROM DATA” (KDD) TOOLS Contd…

DB2 Intelligent Miner
2. Developer: IBM Corporation
3. Function: Association Rules, Clustering, Classification, Prediction
4. Technique: Decision Trees (CART), K-Means, Neural Networks, Linear Regression
5. Platform: Windows, Solaris, AIX, OS/390, OS/400
6. Features: Family comprises three products, namely DB2 Intelligent Miner for Data which mines DB2 databases or flat files, DB2 Intelligent Miner for Text which mines textual data including flat files and web pages, and DB2 Intelligent Miner for Scoring documentation. IBM's in-database mining capabilities integrate with existing systems to provide scalable, high performing predictive analysis without moving the data into proprietary data mining platforms.
7. Interface: Simple GUI interface is provided for user convenience.
8. System Requirements: MS Windows XP with SP2/Vista with SP1, 1 GHz Processor, 512 MB RAM, Hard Disk 60 MB, MS Office 2003 with SP1/MS Office 2007, IE 6.0 with SP2/7.0/8.0
9. References: [14]

PolyAnalyst
2. Developer: Megaputer Intelligence Inc.
3. Function: Association Rules, Classification, Clustering, Prediction, Anomaly Detection, Pattern Recognition
4. Technique: Decision Trees, Neural Networks
5. Platform: Windows, Unix
6. Features: Accesses data stored in relational databases using the ODBC interface. Can process flat files, MS Excel and DBF files. Enables data modeling and testing using different machine learning algorithms. Offers a complete end-to-end solution from data importing, cleaning, manipulation, visualization and modeling to scoring and reporting.
7. Interface: Object-oriented GUI available. It is a self-documenting system that provides visual tools for data analysis.
8. System Requirements: MS Windows 2000/XP, 1.0 GHz Processor, 128 MB RAM, 50 MB free space on HDD, MS Internet Explorer 6.0
9. References: [15]

Oracle Data Miner
2. Developer: Oracle Corporation
3. Function: Classification, Prediction, Anomaly Detection, Clustering, Association Rules, Feature Extraction
4. Technique: Decision Trees, Regression, Naïve Bayes, SVM, Enhanced K-Means, Apriori
5. Platform: Windows, Linux, Solaris
6. Features: Embedded in the Oracle database. It identifies patterns and key attributes and discovers associations and clusters. ODM moves the analytical functions into traditional mining servers [16].
7. Interface: User interacts with the software through the Oracle DM GUI, PL/SQL and Java API, Predictive Analytics PL/SQL package, and Oracle Spreadsheet Add-In for Predictive Analytics.
8. System Requirements: MS Windows XP Prof./Vista Business, Enterprise and Ultimate/2000 server with SP1 and all editions of 2003, 512 MB RAM, 2.04 GB free space on HDD
9. References: [16]

SAS Enterprise Miner
2. Developer: SAS Institute Inc.
3. Function: Market Basket Analysis, Predictive Modeling, Time Series Data Preparation and Analysis
4. Technique: Neural Networks, Regression, Ensemble Methods, Decision Trees
5. Platform: Windows, Solaris, Linux, AIX, HP-UX
6. Features: Creates descriptive and predictive models by analysing voluminous data. Apart from fraud detection, the software can be used for business based model comparisons, reporting and management. Data access, management and cleansing are integrated, thereby making data analysis easier. Also supports scalable batch processing through GUI with access to more than 50 file structures.
7. Interface: Interactive GUI with easy to use Graphics Explorer Wizard and Graphics Explore Node.
8. System Requirements: MS Windows, Solaris, Linux, SAS/STAT and Base SAS
9. References: [17]

IBM SPSS Modeler
2. Developer: IBM Corporation
3. Function: Association Rules, Classification, Clustering, Prediction
4. Technique: Apriori, Decision Trees, K-Means Clustering, Neural Networks, Regression, Rule Induction
5. Platform: Windows, Solaris, IBM AIX, HP/UX
6. Features: Has a number of descriptive icons where each icon represents steps like accessing data, preparing data, data visualisation and modelling. It mines large data sets using a client/server model. Server converts data access requests into SQL queries which can then access a relational database [18].
7. Interface: Makes use of descriptive icons to create a data flow description of the functions to be performed.
8. System Requirements: MS Windows XP Prof./Vista/Server 2003, Intel Pentium, AMD 64 & EM 64T, 1 GB RAM and 1 GB free space on HDD
9. References: [18]

Data Engine
2. Developer: Management Intelligenter Technologien Gmbh
3. Function: Classification, Clustering, Decision Trees
4. Technique: Decision Trees, K-Means, Neural Networks, Linear Regression
5. Platform: Windows
6. Features: Integrates statistical tools with neural networks. It has many methods for data cleansing, transformation and for handling missing data. The Data Engine ADL generates C code or produces DLLs which can be incorporated in the application code for subsequent use.
7. Interface: User interacts with the software through a project window that gives a survey of data, graphics and models.
8. System Requirements: MS Windows 2000/XP, IBM compatible PC with 886/50MHz Processor or higher, 64 MB RAM, 25 MB free space on HDD, LabVIEW 6.x
9. References: [19]

Knowledge Studio
2. Developer: Angoss Software Corporation
3. Function: Classification, Clustering, Prediction, Rules
4. Technique: Decision Trees, K-Means Clustering, Neural Networks, Regression
5. Platform: Windows, Solaris
6. Features: Provides a set of scoring and deployment tools in a single workflow environment. It allows analysts to generate application code that can be exported to Visual Basic, C++, Java, XML, PMML and SAS generators, thereby facilitating integration with all data sources within the organization.
7. Interface: Familiar interface based on the MS Office environment. The package has an intuitive GUI for easy deployment and ease of use.
9. References: [20]

AI Trilogy
2. Developer: Ward Systems Group Inc.
3. Function: Classification, Forecasting, Prediction
4. Technique: Genetic Algorithms, Neural Networks
5. Platform: Windows
6. Features: Is a suite of three products: NeuroShell Predictor, NeuroShell Classifier and GeneHunter. Supports ASCII, CSV and Excel files. Application serves as a tool to extract the prominent relationships among process variables.
7. Interface: Has a Windows icon driven user interface and a host of other utilities to provide users with a neural network experimental environment.
8. System Requirements: MS Windows 2000 (with SP4)/XP/Vista/7, Intel Pentium compatible processor, 256 MB RAM
9. References: [21]