IEEE-International Conference On Advances In Engineering, Science And Management (ICAESM-2012), March 30-31, 2012


Gender Specific Classification of Road Accident Patterns through Data Mining Techniques

S. Shanthi (1) and Dr. R. Geetha Ramani (2)

(1) Senior Lecturer, Department of Computer Science and Engineering, Rajalakshmi Institute of Technology, Kuthambakkam, Chennai, India, [email protected]
(2) Professor and Head, Department of Computer Science and Engineering, Rajalakshmi Engineering College, Thandalam, Chennai, India, [email protected]

Abstract—Road accident analysis is a challenging task, and investigating the dependencies between the attributes becomes complex because of the many environmental and road related factors involved. In this research work we apply data mining classification techniques to carry out gender based classification of road accident patterns, of which RndTree and C4.5 used with the AdaBoost meta-classifier give highly accurate results. The training dataset used for the research work is obtained from the Fatality Analysis Reporting System (FARS), provided through the University of Alabama's Critical Analysis Reporting Environment (CARE) system. The results reveal that AdaBoost used with RndTree improves the classifier's accuracy.

Index Terms—Data Mining; Classification Algorithms; AdaBoost; Meta Classifier; Road Accident Data

I. INTRODUCTION

Data Mining [2] has attracted a great deal of attention in the information industry and in society owing to the wide availability of huge amounts of data and the need to convert such data into useful information and knowledge. The information and knowledge gained [2] can be used for applications ranging from market analysis, fraud detection and customer retention to production control and science exploration. Data mining techniques include association, classification, prediction, clustering, etc. Classification algorithms are used to classify large volumes of data and provide interesting results. Applying data mining techniques to social issues has become popular of late. Fatalities due to road accidents contribute substantially to the total death rate of the world: over 1.2 million people [14] die each year on the world's roads and between 20 and 50 million suffer non-fatal injuries. Much of the literature analyses the road related factors that increase the death ratio. The attribute Gender has been selected as the class attribute for our study. Ensemble methods such as bagging and boosting are used to improve the accuracy of weak classifiers [2]. In this paper, we focus on gender based classification by applying various classification algorithms, viz. Naïve Bayes, ID3, RndTree, C4.5 and CART. Among these, the RndTree and C4.5 algorithms, and boosting with RndTree and with C4.5, give the best results.

The rest of this paper is organized as follows. Section II summarises the related work on classification and ensemble algorithms. Section III illustrates the methodology used, which includes the training dataset description, system design, classification algorithms, the ensemble algorithm (AdaBoost) and classifier accuracy measures. In Section IV we present and discuss the experimental results. Finally, Section V concludes the paper.

II. RELATED WORK

The main reason to employ ensemble methods is to improve the accuracy of weak classifiers, and various studies have emphasised the use of ensemble methods such as bagging and boosting. Ensemble methods are a popular way of improving the performance of any weak learning algorithm [6]; the most popular weak learners are decision trees, for example C4.5 or CART, and decision stumps [6]. In [6], AdaBoost integrated with C4.5 is presented to classify data with missing values. The authors of [5] used AdaBoost as a feature selection method; their results revealed that the average performance of AdaBoost is better than that of logistic regression, especially in dealing with missing values [5]. In 1997 the authors of [13] used the multiplicative weight-update technique to derive a new boosting algorithm for learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line [13]; this boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. A combination of the AdaBoost and random forests algorithms was used for constructing a breast cancer survivability prediction model [4]: random forests served as the weak learner of AdaBoost, selecting the high weight instances during the boosting process to improve accuracy and stability and to reduce overfitting [4]. Performance measurements (e.g., accuracy, sensitivity and specificity), the Receiver Operating Characteristic (ROC) curve and the Area Under the receiver operating characteristic Curve (AUC) were used to measure the efficiency of the proposed classifier [4].

ISBN: 978-81-909042-2-3 ©2012 IEEE

The Random Projection Technique has been used in various applications to speed up the training of AdaBoost [1], especially when the input dimension of the data is high. Various application domains were used in [7] to study the relationships and groupings among performance metrics, facilitating the selection of metrics that capture relatively independent aspects of a classifier's performance; factor analysis is applied to the classifier performance space [7]. For evaluating credit risk, Logistic Regression and SVM give the best classification accuracy, with SVM showing higher robustness and generalization ability than the other algorithms [3]; the C4.5 algorithm is sensitive to the input data and its classification accuracy is unstable, but it has better explainability [3]. In our research work we focus on gender specific classification to find accident patterns in road accident data using various classification algorithms. The next section illustrates the methodology used in our research work, which includes the RndTree, C4.5 and AdaBoost algorithms.

III. METHODOLOGY

This research work focuses on gender specific classification of road accident patterns. The existing classification algorithms, viz. Naïve Bayes, ID3, RndTree, C4.5 and CART, are adopted for the classification. The C4.5 algorithm produces classification results with a 26.81% misclassification rate, and the RndTree algorithm with a 14.3% misclassification rate. Since both RndTree and C4.5 leave residual misclassification, the ensemble method AdaBoost is incorporated with RndTree and C4.5 to improve the accuracy. The details of the work are given in the following sub sections.

A. Training Dataset Description
We carry out the experiment with the road accident training dataset obtained from the Fatality Analysis Reporting System (FARS) [15], provided through the Critical Analysis Reporting Environment (CARE) system. This safety data consists of U.S. road accident information from 2005 to 2009, with 272831 records and 23 attributes. To train the classifiers we selected accident details for two states, California and New York, totalling 63327 records with 17 attributes. The selected dataset of 63327 records is divided into a training dataset of 47761 samples and a test dataset of 15566 samples. The training dataset is used to build the model and the test dataset to evaluate it. The list of attributes and their descriptions is given in Table I.

TABLE I. TRAINING DATASET ATTRIBUTES DESCRIPTION

Attribute             Description
Year                  Year of accident
Month                 Month of accident
Day                   Day of accident
Manner_of_Collision   Manner of collision
Person_Type           Driver/Passenger
Seating_Position      Seating position
Age_Range             Age range of the person involved
Gender                Male/Female
Injury_Severity       Injury severity
Transported_By        Transported by emergency vehicle or not
AirBag                Location of the airbag
Protection_System     Type of the protection system used
Dead_on_Arrival       Status of the person on arrival at hospital
Year_of_Death         Year of death
Month_of_Death        Month of death
Drug_Test             Type of drug test
Related_Factors       Road related factors
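The 47761/15566 partition described above can be reproduced with a simple shuffled split. The sketch below is illustrative plain Python, not the procedure actually used by the authors; the record contents are stand-ins and the seed is an assumption for repeatability.

```python
import random

def split_dataset(records, n_train, seed=42):
    """Shuffle the records and carve off the first n_train as training data."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# 63327 selected records split as in the paper: 47761 train / 15566 test
records = list(range(63327))              # stand-ins for accident records
train, test = split_dataset(records, 47761)
print(len(train), len(test))              # 47761 15566
```

Because the split is done on a shuffled copy, the two parts are disjoint and together cover all 63327 records.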

We have applied the classification algorithms using the Gender attribute as the class attribute. The next sub section describes the system model used in this study.

B. System Design
This section describes the steps used in this research work, which are depicted in Fig. 1. The training dataset (accident data) is preprocessed and given to the classification algorithms (ID3, CART, C4.5, Naïve Bayes, RndTree). Their error rates are analysed to select the best base classifiers (RndTree and C4.5), whose accuracy is evaluated using precision and recall. The AdaBoost meta-classifier is then applied to the selected base classifiers to build a knowledge base of trained rules, and the test dataset (accident data) is classified using the ensemble learners to obtain the predicted patterns.

Figure 1. Steps involved in the study

After preprocessing, the training set is given as input to the weak learners (CART, ID3, Naïve Bayes, RndTree, C4.5). The results are evaluated based on error rates, and RndTree and C4.5 give the best results, with 14.3% and 26.81% misclassification rates respectively. To improve the accuracy of RndTree and C4.5, the meta-learner AdaBoost is incorporated with them. The results are evaluated using accuracy measures such as precision, recall and ROC, and it is found that AdaBoost using RndTree significantly improves the accuracy. The test dataset is applied to evaluate the results.

C. Classification Algorithms
This section illustrates the decision tree algorithms RndTree and C4.5. The accuracy of the RndTree decision tree algorithm is better than that of the other classification algorithms [10]. An advantage of decision tree algorithms is that it is easy to derive rules from them.

1) RndTree
Random tree [8] can be applied to both regression and classification problems. The method combines the bagging idea with random selection of features in order to construct a collection of decision trees with controlled variation. Each tree is constructed using the following algorithm:
• Let the number of training cases be N, and the number of variables in the classifier be M.
• The number m of input variables used to determine the decision at a node of the tree is given; m should be much less than M.
• Choose the training set for this tree by sampling n times with replacement from all N available training cases (i.e. take a bootstrap sample).
• Use the rest of the cases to estimate the error of the tree by predicting their classes.
• At each node of the tree, randomly choose m variables on which to base the decision at that node.
• Calculate the best split based on these m variables in the training set.
• Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
• For prediction, a new sample is pushed down the tree and assigned the label of the training samples in the terminal node it ends up in.
• This procedure is iterated over all trees in the ensemble, and the average vote of all trees is reported as the random forest prediction [8].
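The steps above can be sketched in plain Python. This is an illustrative toy implementation, not the RndTree implementation the paper used: Gini impurity as the split criterion and the fallback to all variables when the m sampled ones cannot split are my assumptions, and all function names are mine.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Lowest weighted Gini impurity over candidate (feature, threshold) tests."""
    best, best_score = None, float("inf")
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:   # max value gives empty right side
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best_score:
                best_score, best = score, (f, t)
    return best

def grow_tree(rows, labels, m, rng):
    """Fully grown, unpruned tree; m random variables considered per node."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure leaf
    n_features = len(rows[0])
    split = best_split(rows, labels, rng.sample(range(n_features), m))
    if split is None:                                 # sampled variables were constant
        split = best_split(rows, labels, range(n_features))
    if split is None:                                 # identical rows, mixed labels
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in li], [labels[i] for i in li], m, rng),
            grow_tree([rows[i] for i in ri], [labels[i] for i in ri], m, rng))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def random_forest(rows, labels, n_trees, m, seed=0):
    """One bootstrap sample (drawn with replacement) per tree."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest

def forest_predict(forest, row):
    """Majority vote over all trees in the ensemble."""
    return Counter(predict(t, row) for t in forest).most_common(1)[0][0]

# toy usage: the label is fully determined by the first of two features
rows = [(i % 2, i % 5) for i in range(30)]
labels = ["Male" if r[0] == 0 else "Female" for r in rows]
forest = random_forest(rows, labels, n_trees=7, m=1, seed=1)
print(forest_predict(forest, (0, 3)))
```

With m = 1 of M = 2 variables sampled per node, individual trees differ, but the majority vote recovers the underlying rule on this separable toy data.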
2) C4.5 Given a set S of cases, C4.5 grows an initial tree [12] using the divide-and-conquer algorithm as follows:  If all the cases in S belong to the same class or S is small, the tree is a leaf labeled with the most frequent class in S.  Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree with one branch for each outcome of the test, partition S into corresponding subsets S1, S2, . . . according to the outcome for each case, and apply the same procedure recursively to each subset.
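The test C4.5 chooses at each node is the one maximising the gain ratio (information gain normalised by split information). A minimal sketch for a categorical attribute follows; the function names are mine and the in-memory list representation is an assumption for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, feature):
    """Information gain of splitting on a categorical attribute,
    normalised by the split information, as in C4.5."""
    n = len(labels)
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(r[feature], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0

# a perfectly predictive two-valued attribute has gain ratio 1.0
rows = [("north",), ("north",), ("south",), ("south",)]
labels = ["Male", "Male", "Female", "Female"]
print(gain_ratio(rows, labels, 0))   # 1.0
```

An attribute whose values are independent of the class gives gain ratio 0, so the divide-and-conquer step in the second bullet simply picks the attribute with the highest ratio.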


D. AdaBoost Algorithm
AdaBoost is a boosting algorithm used to improve the accuracy of a base learning method [2]. Accuracy is achieved by iteratively building weak models, which are combined to deliver a better model. The steps involved in boosting are given below.
Step 1: All instances are equally weighted.
Step 2: A learning algorithm is applied.
Step 3: The weights of incorrectly classified examples are increased and those of correctly classified examples decreased.
Step 4: The algorithm concentrates on incorrectly classified "hard" instances; some "hard" instances become "harder", some "softer".
Step 5: A series of diverse experts is generated based on the reweighted data.

E. Accuracy Measures
The accuracy of a classifier on a given set is the percentage of test set tuples that are correctly classified by the classifier [2]. The confusion matrix is a useful tool for analyzing the efficiency of classifiers [2]. Given two classes, the contingency or confusion matrix can be given as in Table II.

TABLE II. CONTINGENCY TABLE FOR TWO CLASSES

                           Predicted Class
                           Class 1               Class 2
Actual Class   Class 1     True Positive (TP)    False Negative (FN)
               Class 2     False Positive (FP)   True Negative (TN)

TP refers to the positive tuples and TN to the negative tuples that were correctly classified by the classifier; FN refers to the positive tuples and FP to the negative tuples that were incorrectly classified [2]. The sensitivity (also called recall, or True Positive Rate, TPR), specificity, precision, False Positive Rate (FPR) and accuracy can be calculated using the following equations [2]:

Sensitivity = TPR = TP / (TP + FN)
Specificity = TN / (FP + TN)
Precision = TP / (TP + FP)
FPR = FP / (FP + TN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
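These formulas translate directly into code. The helper below is a sketch (the function name and the sample counts are mine, not the paper's figures):

```python
def classifier_metrics(tp, fn, fp, tn):
    """Accuracy measures of Section III-E from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # recall / TPR
        "specificity": tn / (fp + tn),
        "precision":   tp / (tp + fp),
        "fpr":         fp / (fp + tn),
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

# hypothetical counts, purely for illustration
m = classifier_metrics(tp=90, fn=10, fp=20, tn=80)
print(m["sensitivity"], m["specificity"], m["accuracy"])   # 0.9 0.8 0.85
```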

Receiver Operating Characteristic (ROC) graphs have long been used in signal detection theory to depict the tradeoff between hit rates and false alarm rates over a noisy channel [16]. An ROC curve is a plot of TPR against FPR (False Positive Rate), which depicts the relative trade-offs between true positives and false positives [16, 2]. The ROC curve space for two classifiers is given in Fig. 2 [9]. If the curve is closer to the diagonal line, the model is less accurate [2]. The area under the receiver operating characteristic curve (AUC) was calculated to assess the prediction accuracy besides the sensitivity, specificity and accuracy. An area of 0.5 represents a random test; an AUC above 0.8 represents good prediction [16].
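To make the boosting procedure of Section III-D concrete, here is an illustrative AdaBoost sketch. It is not the configuration used in the paper: the weak learner here is a one-feature threshold stump rather than RndTree or C4.5, labels are encoded as ±1, and all names are mine.

```python
import math

def stump_learner(rows, labels, weights):
    """Weak learner: best single-feature threshold under the current weights."""
    best, best_err = None, float("inf")
    for f in range(len(rows[0])):
        for t in {r[f] for r in rows}:
            for sign in (1, -1):
                err = sum(w for r, y, w in zip(rows, labels, weights)
                          if (1 if sign * (r[f] - t) > 0 else -1) != y)
                if err < best_err:
                    best_err, best = err, (f, t, sign)
    return best, best_err

def adaboost(rows, labels, rounds=10):
    """Steps 1-5: equal weights, fit, up-weight mistakes, repeat."""
    n = len(rows)
    weights = [1.0 / n] * n                           # Step 1: equal weights
    ensemble = []
    for _ in range(rounds):
        (f, t, sign), err = stump_learner(rows, labels, weights)   # Step 2
        err = max(err, 1e-10)
        if err >= 0.5:                                # no longer better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)       # expert's voting weight
        preds = [1 if sign * (r[f] - t) > 0 else -1 for r in rows]
        # Step 3: increase weights of misclassified instances, decrease the rest
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, preds)]
        z = sum(weights)
        weights = [w / z for w in weights]
        ensemble.append((alpha, f, t, sign))          # Step 5: collect the experts
    return ensemble

def ada_predict(ensemble, row):
    """Weighted vote of all experts."""
    score = sum(a * (1 if s * (row[f] - t) > 0 else -1)
                for a, f, t, s in ensemble)
    return 1 if score > 0 else -1

# toy usage: label is +1 when the single feature exceeds roughly 0.5
rows = [(0.1,), (0.2,), (0.8,), (0.9,)]
labels = [-1, -1, 1, 1]
ensemble = adaboost(rows, labels, rounds=5)
print(ada_predict(ensemble, (0.05,)), ada_predict(ensemble, (0.95,)))   # -1 1
```

The same loop works with any weak learner that accepts instance weights (or a weighted resample), which is how RndTree and C4.5 are plugged in.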


Figure 2. ROC Curve Space

IV. EXPERIMENT RESULTS

We have used Tanagra for our experimental study. It offers several data mining methods from the exploratory data analysis, statistical learning, machine learning and database areas [11]. We divided the accident dataset into two parts: a training dataset of 47761 records and a test dataset of 15566 records, 63327 records in total.

A. Experimental Results of Base Classifiers
In this phase we applied the basic decision tree algorithms RndTree and C4.5 to classify the training dataset. The results of these models are evaluated based on their error rates, precision and recall. The error rates of the RndTree and C4.5 algorithms are given in Fig. 3 and Fig. 4 respectively.

Figure 3. Error Rate of RndTree

Figure 4. Error Rate of C4.5

The comparison between the accuracies of the RndTree and C4.5 classifiers is given in Fig. 5.

Figure 5. Base Classifier's Accuracy - Training Dataset

Fig. 5 shows that among the base classifiers RndTree results in the highest accuracy.

B. Experimental Results of AdaBoost using Base Classifiers
In this phase we applied ensemble methods to classify the training dataset. Fig. 6 and Fig. 7 illustrate the results of AdaBoost using RndTree and AdaBoost using C4.5 respectively.

Figure 6. Error Rate of AdaBoost using RndTree

Figure 7. Error Rate of AdaBoost using C4.5


The accuracies of the ensemble classifiers are depicted in Fig.8.


Figure 8. Ensemble Classifier's Accuracy - Training Dataset

Fig. 8 reveals that the accuracy of AdaBoost using RndTree is higher than that of AdaBoost using C4.5. Fig. 9 compares the accuracies of all four classifiers.

Figure 9. Classifier's Accuracy - Training Dataset

Fig. 9 shows that the AdaBoost using RndTree classifier outperforms all the other algorithms for gender based classification of the road accident dataset.

C. Experimental Results of Accuracy Measures
In this research work we use sensitivity, specificity and precision, which should be high, together with FPR, which should be low, to obtain high accuracy. Different classification algorithms may have their own characteristics on the same dataset [5]. In this work, AdaBoost using RndTree shows better results than RndTree, C4.5 and AdaBoost using C4.5: it is both more specific and more sensitive. The results on the training and test data are the same. The accuracy measures of all four algorithms are given in Table III.

TABLE III. CLASSIFIERS - ACCURACY MEASURES

                     Sensitivity       Specificity       Precision
Classifier           Male     Female   Male     Female   Male     Female   FPR      Accuracy
RndTree              0.965    0.638    0.638    0.965    0.844    0.899    0.362    0.8569
AdaBoost (RndTree)   0.972    0.923    0.923    0.972    0.927    0.942    0.077    0.9559
C4.5                 0.924    0.342    0.342    0.924    0.741    0.688    0.658    0.7319
AdaBoost (C4.5)      0.921    0.645    0.645    0.921    0.841    0.801    0.355    0.8302

Fig. 10 shows that AdaBoost using RndTree is more sensitive and more specific than the other classifiers.

Figure 10. Classifier's Accuracy - Test Dataset

Fig. 11 presents the performance measures using ROC curves. Score 3 (AdaBoost using RndTree) gives the curve nearest to the perfection point (i.e. 1).

Figure 11. Classifier's Accuracy - Test Dataset


Table IV lists the AUC values of all classifiers on the training data (sample size 47761, with 32014 positive and 15747 negative examples). Though the AUC of every classifier is greater than 0.7, the AUC of AdaBoost using RndTree (0.9934) is higher than those of the other classifiers, which confirms that AdaBoost using RndTree is comparatively the best of these classifiers. We obtained the same results for the test data.

TABLE IV. ROC CURVE RESULTS
(Sample size: 47761; Positive examples: 32014; Negative examples: 15747)

Target    RndTree (Score 1)       C4.5 (Score 2)          AdaBoost (RndTree) (Score 3)   AdaBoost (C4.5) (Score 4)
size      AUC = 0.9412            AUC = 0.7419            AUC = 0.9934                   AUC = 0.9125
(%)       Score   FPR     TPR     Score   FPR     TPR     Score   FPR     TPR            Score   FPR     TPR
0         1       0       0       1       0       0       1       0       0              1       0       0
5         1       0       0.075   0.965   0.002   0.074   0.842   0       0.075          0.792   0       0.075
10        1       0       0.149   0.895   0.01    0.144   0.796   0       0.149          0.735   0       0.149
15        1       0       0.224   0.843   0.031   0.208   0.765   0       0.224          0.7     0       0.224
20        1       0       0.298   0.804   0.059   0.27    0.738   0       0.298          0.671   0       0.298
25        1       0       0.373   0.79    0.089   0.329   0.716   0       0.373          0.65    0.0006  0.373
30        1       0       0.448   0.771   0.123   0.387   0.695   0       0.448          0.63    0.003   0.446
35        1       0       0.522   0.756   0.159   0.444   0.676   0       0.522          0.613   0.01    0.517
40        1       0       0.597   0.738   0.197   0.5     0.657   0       0.597          0.596   0.023   0.586
45        0.813   0.013   0.665   0.723   0.238   0.555   0.639   0       0.671          0.581   0.044   0.65
50        0.75    0.045   0.724   0.714   0.281   0.608   0.618   0       0.746          0.566   0.074   0.71
55        0.697   0.086   0.778   0.7     0.325   0.661   0.594   0       0.821          0.552   0.115   0.764
60        0.667   0.136   0.828   0.671   0.372   0.712   0.558   0.003   0.894          0.539   0.167   0.813
65        0.592   0.193   0.875   0.646   0.424   0.761   0.517   0.04    0.95           0.523   0.229   0.857
70        0.5     0.263   0.915   0.611   0.48    0.808   0.486   0.116   0.987          0.51    0.3     0.897
75        0.5     0.339   0.952   0.594   0.54    0.853   0.451   0.243   0.999          0.495   0.381   0.931
80        0.333   0.422   0.986   0.546   0.605   0.896   0.426   0.393   1              0.477   0.478   0.958
85        0       0.545   1       0.45    0.681   0.933   0.405   0.545   1              0.457   0.587   0.979
90        0       0.697   1       0.375   0.77    0.964   0.384   0.697   1              0.429   0.709   0.994
95        0       0.848   1       0.267   0.873   0.988   0.354   0.848   1              0.388   0.849   1
100       0       1       1       0       1       1       0       1       1              0.099   1       1

D. Evaluation of Experimental Results using Test Data
The correctness of the results of the aforesaid classifiers has been evaluated using the test dataset of 15566 records. The results are depicted in Fig. 12.
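AUC values like those in Table IV can be approximated from the (FPR, TPR) columns by trapezoidal integration. A small sketch follows; the function name is mine, and it assumes the points describe a single ROC curve (ties in FPR are handled by sorting).

```python
def auc_from_roc(points):
    """Trapezoidal area under an ROC curve given (FPR, TPR) pairs."""
    pts = sorted(points)                      # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid on each segment
    return area

# sanity checks: perfect classifier vs. chance diagonal
print(auc_from_roc([(0, 0), (0, 1), (1, 1)]))   # 1.0
print(auc_from_roc([(0, 0), (1, 1)]))           # 0.5
```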

Figure 12. Classifier's Accuracy - Test Dataset

V. CONCLUSION

In this paper we analyzed a road accident training dataset using the RndTree and C4.5 algorithms and a combination of the AdaBoost algorithm with RndTree and C4.5, to find patterns through gender based classification. Among these algorithms, AdaBoost using RndTree gives the highest accuracy. The accuracy is evaluated based on precision, recall and ROC curves. The results showed that AdaBoost using RndTree improved the accuracy from 85.7% to 95.59%.

REFERENCES
[1] Biswajit Paul, G. Athithan, M. Narasimha Murty, "Speeding up AdaBoost classifier with random projection", Seventh International Conference on Advances in Pattern Recognition, pp. 251-254, 2009.
[2] Han, J. and Kamber, M., "Data mining: concepts and techniques", Academic Press, ISBN 1-55860-489-8.
[3] Hong Yu, Xiaolei Huang, Xiaorong Hu, Hengwen Cai, "A comparative study on data mining algorithms for individual credit risk evaluation", Int. Conference on Management of e-Commerce and e-Government, 2010.
[4] Jaree Thongkam, Guandong Xu and Yanchun Zhang, "AdaBoost algorithm with random forests for predicting breast cancer survivability", International Joint Conference on Neural Networks, 2008.
[5] Jingran Wen, Xiaoyan Zhang, Ye Xu, Zuofeng Li, Lei Liu, "Comparison of AdaBoost and logistic regression for detecting colorectal cancer patients with synchronous liver metastasis", International Conference on Biomedical and Pharmaceutical Engineering, December 2-4, 2009.
[6] Miao Zhimin, Pan Zhisong, Hu Guyu, Zhao Luwen, "Treating missing data processing based on neural network and AdaBoost", IEEE International Conference on Grey Systems and Intelligent Services, November 18-20, 2007, Nanjing, China.
[7] Naeem Seliya, Taghi M. Khoshgoftaar, Jason Van Hulse, "A study on the relationships of classifier performance metrics", IEEE International Conference on Tools with Artificial Intelligence, pp. 59-66, 2009.
[8] Random Tree Algorithm, http://www.answers.com
[9] ROC Space, http://en.wikipedia.org/wiki/File:ROC_space-2.png


[10] S. Shanthi, R. Geetha Ramani, "Classification of Vehicle Collision Patterns in Road Accidents using Data Mining Algorithms", Int. Journal of Computer Applications, Vol. 35, No. 12, pp. 30-37.
[11] Tanagra data mining tutorials, http://data-mining-tutorials.blogspot.com
[12] Xindong Wu, Vipin Kumar, J. Ross Quinlan, et al., "Top 10 algorithms in data mining", Knowledge and Information Systems, Vol. 14, pp. 1-37.
[13] Yoav Freund and Robert E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, Vol. 55, pp. 119-139, 1997.
[14] World Health Organization, Global status report on road safety: time for action, Geneva, 2009.
[15] www.nhtsa.gov - FARS analytic reference guide.
[16] www.cs.iastate.edu/~jtian/cs573/WWW/Lectures/lecture06ClassifierEvaluation-2up.pdf, Classifier Evaluation Techniques.

Mrs. S. Shanthi completed her M.C.A. from Madurai Kamaraj University and her M.E. in Computer Science and Engineering at Arunai Engineering College, affiliated to Anna University, Chennai, India. She has 7 years of teaching experience. Presently she is working as a Senior Lecturer in the Department of Computer Science and Engineering, Rajalakshmi Institute of Technology, Chennai, and pursuing her Ph.D. (part time) in Computer Science and Engineering at Rajalakshmi Engineering College, affiliated to Anna University, Chennai. Her areas of interest include Data Mining, Data Structures and Analysis of Algorithms, and Network Security. She has published a paper in an international journal and presented many papers at national and international conferences.

Dr. R. Geetha Ramani is working as Professor & Head of the Department of Computer Science and Engineering, Rajalakshmi Engineering College, India. She has more than 15 years of teaching and research experience. Her areas of specialization include Data Mining, Evolutionary Algorithms and Network Security. She has over 50 publications in international conferences and journals to her credit, and has also published a couple of books in the fields of Data Mining and Evolutionary Algorithms. She has completed an external agency project in the field of robotic soccer and is currently working on projects in the field of Data Mining. She has served as a member of the Board of Studies of Pondicherry Central University and is presently a member of the editorial boards of various reputed international journals.
