Personalized Approach Based on SVM and ANN for Detecting Credit Card Fraud

Rong-Chang Chen, Shu-Ting Luo
Department of Logistics Engineering and Management, Graduate Institute of Business Administration, National Taichung Institute of Technology, Taichung, Taiwan 404
E-mail: [email protected]

Xun Liang
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
E-mail: liangxun@icst.pku.edu.cn

Vincent C.S. Lee
Department of Accounting and Finance, School of Business Systems, Monash University, VIC 3800, Australia
E-mail: [email protected]

Abstract-A novel personalized approach has recently been proposed to prevent credit card fraud. This approach aims to prevent fraud from the very first use of a new card, even for users without any real transaction data. The approach shows potential; nevertheless, some problems remain to be solved. A main issue is how to predict accurately with only a few data, since the approach collects quasi-real transaction data via an online questionnaire system, and respondents are commonly unwilling to spend much time answering questionnaires. This study employs both support vector machines (SVM) and artificial neural networks (ANN) to investigate this time-varying fraud problem, and compares the performance of ANN with that of SVM. Results show that SVM and ANN are comparable in training, and that ANN can reach the highest training accuracy. However, ANN tends to overfit the training data and thus performs worse in predicting future data when the number of data is small.

I. INTRODUCTION

Credit cards have lately become a popular transaction tool in many countries. However, the widespread use of credit cards is accompanied by many fraudulent transactions, which cost hundreds of millions of dollars annually. It is thus crucial to use an effective method to solve this problem and decrease the losses caused by fraud. In handling the credit card fraud problem, past real transaction data are conventionally used to create models for predicting a new case [1-4]. This approach gives a good solution under some conditions. However, there are no or few transaction data for new users. Under these circumstances, other users' data are used to predict individual consumer behavior, which usually causes poor performance since consumer behavior varies between individuals. Besides, consumer behavior varies with time. Good credit card fraud detection systems should be able to adapt themselves to changing consumer behavior [5]. Therefore, it is very important to prevent credit card fraud with a better approach.

Rather than detecting credit card fraud from past transaction data, Chen et al. [6-10] proposed a novel approach to the fraud problem. They suggested building up a personalized model based on personal data collected by an online questionnaire system. Since the illegal user's and the cardholder's transaction behaviors are different, fraud can be avoided from the initial use of a credit card. This paper employs the personalized approach based on two powerful learning tools, support vector machines (SVM) and artificial neural networks (ANN), to cope with the fraud problem. Since the behavior of a credit card user may change over time, we use both SVM and ANN to handle this problem. We first collect the questionnaire-responded transaction (QRT) data of users by using an online questionnaire system. The data are then preprocessed and trained by SVM and ANN, and as a result, classifiers for each individual are created. The classifiers are then used to predict new transaction data: when a new transaction takes place, the classifier predicts whether the transaction is normal or not. If the prediction result is abnormal, the transaction is considered fraudulent.

The rest of this paper is organized as follows. In Section II, brief overviews of SVM and ANN are given. The proposed method is illustrated in Section III. Section IV describes and discusses the experimental results. Finally, we conclude by summarizing the findings in Section V.



II. SUPPORT VECTOR MACHINES AND ARTIFICIAL NEURAL NETWORKS

A. Support Vector Machines (SVM)

SVM was developed by Vapnik [11]. It is a relatively new technique and one of the best tools for classification. SVM can separate negative samples from positive ones in a set that contains both classes with a complex distribution. When the test data and the training data are similar, the classification result is usually good. SVM is grounded in statistical learning theory [12]. The main idea of SVM comes from binary classification, namely to find a hyperplane that separates the two classes while minimizing the classification error. This hyperplane maximizes the minimum distance from the hyperplane to the nearest negative and positive points.


In addition, SVM can handle both linear and non-linear separation [13,14], as illustrated in Figs. 1 and 2. SVM uses local information in training: it computes a set of support vectors from the training data that represent the whole data set, while outlier points are eliminated. SVM has several attractive properties that make it a very popular technique: only a small subset of the total training set is needed to separate the classes, computational complexity is reduced by the kernel trick, overfitting is avoided by classifying with a maximum margin, and so on. SVM has already been used successfully for a wide variety of problems, such as pattern recognition [15], system intrusion detection [16], signal de-noising [17], bio-informatics [18], and more.

Fig. 1. Linear classification.

Fig. 2. Non-linear classification.
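The paper itself contains no code. Purely as an illustration of the kind of classifier described above, the following sketch trains a soft-margin SVM with a radial (RBF) kernel on synthetic two-class data; scikit-learn, the toy feature layout, and all parameter values are our assumptions (the authors used mySVM).

```python
# Minimal SVM sketch (scikit-learn assumed); the original work used mySVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic two-class data standing in for genuine (+1) and fraudulent (-1) transactions.
genuine = rng.normal(loc=[2.0, 2.0], scale=0.8, size=(200, 2))
fraud = rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(200, 2))
X = np.vstack([genuine, fraud])
y = np.hstack([np.ones(200), -np.ones(200)])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# RBF ("radial") kernel with a maximum-margin decision boundary.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("test accuracy:", clf.score(X_test, y_test))
```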

B. Artificial Neural Networks (ANN)

ANN is a popular tool for financial decision-making [19-20]. It is an information-processing paradigm inspired by biological nervous systems such as the brain [21]. The most important element of this paradigm is the architecture of the information-processing system, which includes a large number of highly interconnected processing elements (neurons) working together to solve particular problems. Like people, an ANN can learn by example. An ANN is constructed for a specific application, such as stock market prediction [22], fraud detection [1,4], or pattern recognition [23], through a learning process. Neural networks are an incremental mining technique that allows new data to be fed into a trained network to update the previous training result. It is therefore suitable to use ANN to deal with the detection of time-varying credit card fraud. There are many types of ANN models. Among them, the back propagation network (BPN) is one of the most popular at present, since it is easy to comprehend and can easily be implemented in software. In this investigation, we employ BPN to predict new transaction data.

III. APPROACH

The proposed approach for predicting fraud is depicted in Fig. 3.

Fig. 3. The procedure for predicting credit card fraud.

To begin with, we collect the transaction data of new users by using an online, self-completion questionnaire system. We identify this kind of transaction data as questionnaire-responded transaction (QRT) data. This method is very appropriate for new users who have only a few transactions or none at all. After that, the data are trained by SVM and ANN, and hence personalized classifiers are created. Finally, these personalized classifiers are used to predict new transactions as fraudulent or genuine ones.

A. Data Collection

To examine the effectiveness of the proposed approach, QRT data are collected and then personalized models are built up for users. To get representative data for a better modeling of reality, we let users select the priority of six main classes of transaction items and collect different amounts of data according to pre-specified ratios. The online questionnaire system was designed on a Linux platform, the program was written in Java, and the database was SQL 2000. The questions on the questionnaires are generated in keeping with the individual's consumption preferences obtained from surveys [24-27]. Consumer behavior varies considerably from one individual to another, so it is practical to classify behavior in relation to several main attributes. The collected personal data consist mainly of the following parts: age, gender, transaction interval, transaction amount, and transaction item. The details are described as follows.

1) Transaction Intervals: Each day can be divided into several intervals. In general, 4 or 6 intervals are enough to characterize consumer behavior, so we divide each day into 4 intervals. However, the transaction time should be divided into more intervals if the consumer behavior depends considerably on the transaction time.

2) Transaction Items: According to surveys of consumer behavior, each individual has preferred consumption tendencies. Thus, in this paper, we divide transaction items into six major classes: eating, wearing, housing, transporting, educating, and recreating. Each main class can be further divided into more detailed subclasses.

In this paper, we collected the data weekly from students. The influence of the time-varying effect on the prediction performance is investigated.
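The paper does not give the exact feature encoding of a QRT record. The sketch below shows one plausible encoding of the attributes listed above (age, gender, transaction interval, transaction amount, and transaction item class) into a numeric vector with one-hot categorical features; the field names, category lists, and helper function are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical QRT record encoding; field names and category lists are assumptions.
ITEM_CLASSES = ["eating", "wearing", "housing", "transporting", "educating", "recreating"]
INTERVALS = [0, 1, 2, 3]            # 4 transaction intervals per day, as in the paper
GENDERS = ["female", "male"]

def encode_record(age, gender, interval, item_class, amount, label):
    """Return (feature_vector, label); label +1 = genuine, -1 = fraudulent."""
    features = [float(age), float(amount)]
    features += [1.0 if gender == g else 0.0 for g in GENDERS]
    features += [1.0 if interval == i else 0.0 for i in INTERVALS]
    features += [1.0 if item_class == c else 0.0 for c in ITEM_CLASSES]
    return features, label

# Example: a 22-year-old male buying food in interval 1 for 150 units, labeled genuine.
x, y = encode_record(22, "male", 1, "eating", 150, +1)
print(len(x), y)    # 14 features, label +1
```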

B. Data Training

In this study, we employed mySVM [14] and a back-propagation network (BPN) to train all personalized data. The BPN tool we used is SmartNeuron 0.42, which was developed by Professor C.C. Chang of the Department of Logistics Engineering and Management, National Taichung Institute of Technology (NTIT), Taichung, Taiwan. SmartNeuron was built with Visual C++.

1) Support Vector Machines: For mySVM, the training and testing of QRT data were run on a Pentium III 667 PC with the Windows 2000 Professional operating system. To get better results with mySVM, different kernels were selected and different parameter settings were tested to find a better prediction performance. Preliminary training was performed to obtain suitable parameters. Three types of kernels were investigated: dot, polynomial, and radial, with the degree of the polynomial kernel varied from 1 to 5. The first-round results show that a radial kernel gives better testing accuracy; therefore, the base classifier used in the further studies is the radial kernel.
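As a rough sketch of this kernel-selection step, one could compare dot (linear), polynomial (degree 1 to 5), and radial kernels by cross-validation as below; scikit-learn stands in for mySVM, the data are synthetic, and the parameter grid is only an assumed example.

```python
# Hedged sketch of kernel and parameter selection by cross-validation (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(245, 14))                     # 245 records per week, 14 assumed features
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)   # synthetic labels

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},                             # "dot" kernel
    {"kernel": ["poly"], "degree": [1, 2, 3, 4, 5], "C": [0.1, 1, 10]},    # polynomial kernel
    {"kernel": ["rbf"], "gamma": ["scale", 0.1, 1.0], "C": [0.1, 1, 10]},  # "radial" kernel
]
search = GridSearchCV(SVC(), param_grid, cv=10)    # 10-fold cross validation
search.fit(X, y)
print("best kernel/params:", search.best_params_)
print("best 10-fold CV accuracy:", round(search.best_score_, 3))
```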

2) Back Propagation Networks: Several parameters are needed to train data using BPN. Among the most important are the number of hidden layers, the number of hidden nodes, the number of training epochs, the learning rate, and the momentum rate. Setting these parameter values remains an art rather than a science. Complicated problems can be modeled increasingly well by adding hidden layers, but the improvement generally comes at a cost in training time and data overfitting. To address these problems, we evaluated some parameter values in preliminary training, which decided the number of hidden layers, the number of nodes, and the number of epochs, based on recommendations from previous literature and our past experience. Besides, the effect of the ratio of the number of training data to total data is studied, since the data collected by the personalized approach are finite.
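SmartNeuron itself is not publicly documented, so as a stand-in the sketch below configures a back-propagation network with two hidden layers of 3 and 6 nodes, trained by stochastic gradient descent with a learning rate and a momentum term, using scikit-learn's MLPClassifier; all parameter values are placeholders rather than the authors' settings.

```python
# Hedged BPN sketch: 2 hidden layers (3 and 6 nodes), SGD with momentum (scikit-learn assumed).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(245, 14))
y = np.where(X[:, 0] - X[:, 2] > 0, 1, -1)         # synthetic labels

X_scaled = StandardScaler().fit_transform(X)       # scaling helps back-propagation converge

bpn = MLPClassifier(
    hidden_layer_sizes=(3, 6),    # hidden layer 1: 3 nodes, hidden layer 2: 6 nodes
    solver="sgd",                 # stochastic gradient descent
    learning_rate_init=0.1,       # placeholder learning rate
    momentum=0.9,                 # placeholder momentum rate
    max_iter=2000,                # placeholder number of training epochs
    random_state=0,
)
bpn.fit(X_scaled, y)
print("training accuracy:", round(bpn.score(X_scaled, y), 3))
```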

C. Data Overlapping

Consumer behavior usually changes over time and may thus cause data overlapping; i.e., many genuine transactions may be similar to fraudulent transactions. The opposite also occurs, as a fraudulent transaction may appear to be normal. Figure 4 displays an overview of the distribution of the collected personal data produced by a self-organizing map (SOM). The data in white regions are normal, while those in black regions are fraudulent; a darker color indicates a stronger tendency toward fraudulent behavior. As illustrated in this figure, there are some overlapping data; i.e., some genuine data are similar to fraudulent data. Consequently, to get higher detection rates, it is very important to choose an appropriate classifier to separate genuine and fraudulent behaviors.

Fig. 4. An overview of the distribution of personalized credit card data
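The paper does not say which SOM implementation produced Fig. 4. Purely for illustration, and assuming the third-party minisom package (not mentioned by the authors), a map of this kind could be fitted as follows.

```python
# Hedged illustration only: the SOM tool used in the paper is unknown; minisom is assumed here.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(4)
X = rng.normal(size=(245, 14))                     # stand-in for one user's QRT feature vectors

som = MiniSom(10, 10, input_len=14, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)                          # 1000 training iterations

# distance_map() gives a U-matrix view of the cluster structure; overlap between genuine and
# fraudulent records would show up where records of both labels map to the same or nearby units.
u_matrix = som.distance_map()
print("map shape:", u_matrix.shape)                # (10, 10)
print("best matching unit of first record:", som.winner(X[0]))
```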

In order to analyze the contradictory data among the transaction data effectively, we use a simple method to do the calculation. Let 1 denote the normal condition and -1 the abnormal condition, as shown in Fig. 5. Consider 2 adjacent samples from a specified person. The collected data are contradictory when the transaction items, the transaction amounts, and the transaction intervals are the same but the two samples have different signs, or when the transaction items and the transaction intervals are the same and the transaction amounts differ but the sign changes from positive to negative or the opposite (i.e., the lower transaction amount is abnormal while the higher transaction amount is normal). From Figure 5 we can see that there are 4 pieces of contradictory data, which can be counted by adding 1 to the number of times that the sign changes. Thus the ratio of contradictory data, Rc, can be calculated as follows:

Rc = (number of overlapping data) / (number of total data)

Fig. 5. A schematic diagram of contradiction points (label 1 = normal, -1 = abnormal, plotted against transaction amount).
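As a concrete reading of this rule, the sketch below (our own illustration, not the authors' code) scans adjacent records of one respondent, counts the contradictory pairs described above, and divides by the total number of records; the record format is an assumption.

```python
# Hedged sketch of the contradiction-ratio calculation; the rule is paraphrased from the text
# and the record format is an assumption.
def contradiction_ratio(records):
    """records: list of dicts with keys item, interval, amount, label (+1 normal, -1 abnormal)."""
    contradictions = 0
    for prev, cur in zip(records, records[1:]):
        same_item = prev["item"] == cur["item"]
        same_interval = prev["interval"] == cur["interval"]
        labels_differ = prev["label"] != cur["label"]
        if not (same_item and same_interval and labels_differ):
            continue
        if prev["amount"] == cur["amount"]:
            contradictions += 1      # identical transactions carry opposite labels
        elif (prev["amount"] < cur["amount"]) == (prev["label"] < cur["label"]):
            contradictions += 1      # lower amount labeled abnormal, higher amount labeled normal
    return contradictions / len(records) if records else 0.0

sample = [
    {"item": "eating", "interval": 1, "amount": 100, "label": +1},
    {"item": "eating", "interval": 1, "amount": 100, "label": -1},   # contradicts the previous record
    {"item": "housing", "interval": 2, "amount": 900, "label": +1},
]
print(round(contradiction_ratio(sample), 2))   # 0.33 for this toy example
```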

IV. RESULTS AND DISCUSSION

For the convenience of discussion, let us denote the number of data as N, the contradiction ratio as Rc, the ratio of the number of training data to total data as Rt, and the prediction accuracy as P. For BPN, we used the gradient descent method to minimize the total squared error of the output. According to the results of the preliminary training, we decided to use 2 hidden layers, with 3 nodes in hidden layer 1 and 6 nodes in hidden layer 2, in the following experiments.

To investigate the prediction performance of SVM and ANN, a three-stage evaluation is used. In the first stage, the data are divided into two datasets: training data and test data. The training data are used to build classifiers with SVM and BPN, respectively. In the second stage, the test data are used to optimize the parameters of these classifiers, or to select a particular one. The prediction is performed in the final stage to investigate the performance of a classifier on future data. For convenience, we call the accuracy in the second stage the tested accuracy and the accuracy in the third stage the predicted accuracy.

The comparison of the tested accuracy between SVM and BPN is shown in Table I. The results are trained and tested on the weekly data; there are 245 data for each week. For mySVM, the results are based on leave-one-out cross validation (LOO) and 10-fold cross validation (10-fold CV). For BPN, the results are based on 2 hidden layers with 3 and 6 nodes, respectively, and Rt is 0.8. We can see from this table that BPN and SVM are comparable, but there is a larger variation in accuracy for BPN. However, BPN can reach the maximum tested accuracy. The tested accuracy P depends strongly on the contradiction ratio, Rc, as we can see from Table I.

TABLE I. COMPARISON OF TESTED ACCURACY BETWEEN SVM AND BPN

Weeks | mySVM (LOO) | mySVM (10-fold CV) | BPN | Contradiction ratio, Rc
1 | 0.95 | 0.94 | 0.98 | 0.21
2 | 0.94 | 0.94 | 0.92 | 0.24
3 | 0.97 | 0.96 | 0.94 | 0.19
4 | 0.83 | 0.82 | 0.82 | 0.31
5 | 0.89 | 0.89 | 0.88 | 0.28
6 | 0.83 | 0.83 | 0.86 | 0.31

The comparison of the predicted accuracy on future data between SVM and BPN is shown in Table II. In this experiment, the data of one week are first trained to a good level and then used to predict the data of the following week. For example, the first-week data are optimally trained, and the second-week data are then regarded as test data and predicted. The results show that the predicted accuracy also depends strongly on the contradiction ratio, Rc. The difference between the tested accuracy and the predicted accuracy with BPN seems to be larger than that with mySVM, as can be seen by comparing Table I and Table II.

TABLE II. COMPARISON OF PREDICTED ACCURACY BASED ON THE DATA OF THE LAST WEEK

Weeks (trained & tested) | Week (predicted) | mySVM | BPN | Rc (based on 2 adjacent weeks)
1 | 2 | 0.80 | 0.74 | 0.15
2 | 3 | 0.91 | 0.82 | 0.13
3 | 4 | 0.73 | 0.72 | 0.21
4 | 5 | 0.79 | 0.69 | 0.22
5 | 6 | 0.74 | 0.69 | 0.23

Table III shows the predicted accuracy when the data of all previous weeks are accumulated, trained, and tested. After an optimized classifier is selected, the data of the next week are predicted. As displayed in Table III, BPN outperforms SVM in this experiment.

TABLE III. COMPARISON OF PREDICTED ACCURACY BASED ON THE ACCUMULATED DATA OF THE PREVIOUS WEEKS

Weeks (trained & tested) | Week (predicted) | mySVM | BPN
1+2 | 3 | 0.91 | 0.90
1+2+3 | 4 | 0.73 | 0.91
1+2+3+4 | 5 | 0.72 | 0.89
1+2+3+4+5 | 6 | 0.67 | 0.78
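For clarity, the two evaluation protocols behind Tables II and III (train on the previous week only, or on all accumulated previous weeks, then predict the next week) can be sketched as follows; the weekly datasets here are random stand-ins for the QRT data, and scikit-learn replaces mySVM/SmartNeuron.

```python
# Hedged sketch of the two rolling evaluation protocols (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Six weekly datasets of 245 records each, standing in for one user's QRT data.
weeks = [(rng.normal(size=(245, 14)), rng.choice([-1, 1], size=245)) for _ in range(6)]

def fit_and_score(train_parts, test_part):
    X_train = np.vstack([X for X, _ in train_parts])
    y_train = np.hstack([y for _, y in train_parts])
    X_test, y_test = test_part
    clf = SVC(kernel="rbf").fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Table II protocol: train on week t, predict week t+1.
last_week = [fit_and_score([weeks[t]], weeks[t + 1]) for t in range(5)]

# Table III protocol: train on weeks 1..t accumulated, predict week t+1.
accumulated = [fit_and_score(weeks[: t + 1], weeks[t + 1]) for t in range(1, 5)]

print("last-week protocol:", [round(a, 2) for a in last_week])
print("accumulated protocol:", [round(a, 2) for a in accumulated])
```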

A good success rate on the current training set does not imply a good success rate on future data, since the model may overfit the training data.


To investigate the relation between overfitting and the tested accuracy, 30 different tests are performed. The base case uses BPN with 2 hidden layers, having 3 and 6 nodes, respectively. The accuracy difference, DP, is defined as the difference between the tested accuracy on the data of the first week and the predicted accuracy on the data of the second week. Figure 6 shows that the tested accuracy and the accuracy difference have a high correlation (the linear fit in Fig. 6 is y = 0.8533x - 0.6933 with R2 = 0.7851), indicating that higher tested accuracy may cause a higher degree of overfitting. Thus, there is a compromise between tested accuracy and predicted accuracy, depending on the purpose of the prediction.

Fig. 6. The correlation between the tested accuracy and the accuracy difference DP (BPN with 3 nodes in the first hidden layer and 6 nodes in the second hidden layer).

The influence of the network architecture on the prediction accuracy is displayed in Fig. 7. 32 different architectures are tested, including 1 hidden layer with 1, 3, 5, 6, or 7 nodes and 2 hidden layers with 1x1 to 7x7 nodes. The accuracy difference DP ranges from 0.01 to 0.18, showing that the extent of overfitting varies considerably with the architecture.

Fig. 7. The correlation between the tested accuracy and the accuracy difference DP (BPN with 1 hidden layer having 1, 3, 5, 6, or 7 nodes, respectively, and 2 hidden layers with 27 different architectures of nodes).

To investigate the effect of the data number on the accuracy of prediction, the number of training data is varied to see its influence on the prediction accuracy. The number of questions that users are willing to answer differs from person to person. Some users want to answer more questions to obtain higher prediction accuracy and thus reduce their fraud risk; others accept looser accuracy or are unwilling to spend much time answering. Consequently, different users have different amounts of QRT data. As the contradiction ratio increases, the accuracy decreases; Figure 8 shows this trend for N = 100, 200, and 400. The influence of the data number on the accuracy is not significant, as we can see from this figure.

Fig. 8. The effect of the data number N on the accuracy of prediction with different contradiction ratios Rc (BPN with 3 nodes in the first hidden layer and 6 nodes in the second hidden layer).

V. CONCLUSIONS

In this study, we employed a new personalized approach to detect credit card fraud and designed a series of experiments to test the prediction performance of support vector machines and neural networks on credit card fraud. The approach aims to prevent fraud from a user's first use of a card. Unlike the traditional way, we build a model before a new card is used. First, we collect the personalized questionnaire-responded transaction (QRT) data of new users by using an online questionnaire system. After that, the QRT data are trained by using support vector machines (SVM) and back propagation networks (BPN), and personalized classifiers are generated.

Results from this study show that both SVM and BPN can achieve good tested accuracy. However, higher tested accuracy may come with a higher tendency to overfitting, which in turn causes worse prediction of future behavior. In addition, the prediction accuracy depends strongly on the contradiction ratio.


Further studies are encouraged to reduce the influence of the contradictory data on prediction accuracy and to find an optimal solution with both good tested accuracy and predicted accuracy.

ACKNOWLEDGMENT

The authors would like to thank Professor C.C. Chang, Department of Logistics Engineering and Management, National Taichung Institute of Technology, and Miss Fang Liu, Department of Probability and Statistics, School of Mathematical Sciences, Peking University, for their help during the course of this paper. This work was supported by the National Science Council under grant no. NSC 94-2213-E-025-010.

REFERENCES

[1] F.S. Maes, K. Tuyls, B. Vanschoenwinkel, and B. Manderick, "Credit Card Fraud Detection Using Bayesian and Neural Networks," Proceedings of Neuro Fuzzy, Havana, Cuba, 2002.
[2] P. Chan and S. Stolfo, "Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection," Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, California, pp. 164-168, 1997.
[3] P.K. Chan, W. Fan, A.L. Prodromidis, and S.J. Stolfo, "Distributed Data Mining in Credit Card Fraud Detection," IEEE Intelligent Systems, pp. 67-74, Nov.-Dec. 1999.
[4] R. Brause, T. Langsdorf, and M. Hepp, "Neural Data Mining for Credit Card Fraud Detection," Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, 1999.
[5] S.J. Hong and S.M. Weiss, "Advances in Predictive Models for Data Mining," Pattern Recognition Letters, pp. 55-61, 2001.
[6] R.C. Chen, M.L. Chiu, Y.L. Huang, and L.T. Chen, "Detecting Credit Card Fraud by Using Questionnaire-Responded Transaction Model Based on Support Vector Machines," Lecture Notes in Computer Science (LNCS), Vol. 3177, pp. 800-806, 2004.
[7] R.C. Chen, T.S. Chen, Y.E. Chien, and Y.R. Yang, "Novel Questionnaire-Responded Transaction Approach with SVM for Credit Card Fraud Detection," Lecture Notes in Computer Science (LNCS), Vol. 3497, pp. 916-921, 2005.
[8] R.C. Chen, C.J. Lin, L.J. Lai, and Y.E. Chien, "Employing Support Vector Machines to Detect Credit Card Fraud for New Card Users," Asian Journal of Information Technology, Vol. 4, No. 2, pp. 223-228, 2005.
[9] R.C. Chen, C.C. Chang, S.T. Luo, and S.S. Li, "Detection of Credit Card Fraud by Using Support Vector Machines and Neural Networks," Proceedings of the Fourth International Conference on Information and Management Science, Kunming, China, pp. 310-315, 2005.
[10] R.C. Chen, T.S. Chen, and C.C. Lin, "A New Binary Support Vector System for Increasing Detection Rate of Credit Card Fraud," International Journal of Pattern Recognition and Artificial Intelligence, accepted, 2005.
[11] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
[12] http://www.kernel-machines.org/
[13] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, Vol. 20, pp. 273-297, 1995.
[14] S. Ruping, mySVM-Manual, AI Unit, University of Dortmund, October 30, 2000.
[15] C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 955-974, 1998.
[16] R.C. Chen, J. Chen, T.S. Chen, C.H. Hsieh, T.Y. Chen, and K.Y. Wu, "Building an Intrusion Detection System Based on Support Vector Machine and Genetic Algorithm," Lecture Notes in Computer Science (LNCS), Vol. 3498, pp. 409-414, 2005.
[17] B.Y. Sun, D.S. Huang, and H.T. Fang, "Lidar Signal De-noising Using Least Squares Support Vector Machine," IEEE Signal Processing Letters, Vol. 12, No. 2, pp. 101-104, 2005.
[18] M.P.S. Brown, W.N. Grundy, D. Lin, N. Cristianini, W. Sugnet, T.S. Furey, M. Ares Jr., and D. Haussler, "Knowledge-based Analysis of Microarray Gene Expression Data by Using Support Vector Machines," Proceedings of the National Academy of Sciences USA, Vol. 97, pp. 262-267, 2000.
[19] M. Lam, "Neural Network Techniques for Financial Performance Prediction Integrating Fundamental and Technical Analysis," Decision Support Systems, Vol. 37, pp. 567-581, 2004.
[20] W. Cheng, B.W. McClain, and C. Kelly, "Artificial Neural Networks Make Their Mark as a Powerful Tool for Investors," Review of Business, pp. 4-9, Summer 1997.
[21] http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
[22] X. Liang, "Impacts of Internet Stock News on Stock Markets Based on Neural Networks," Lecture Notes in Computer Science (LNCS), Vol. 3497, pp. 897-903, 2005.
[23] D.S. Huang, Systematic Theory of Neural Networks for Pattern Recognition, Publishing House of Electronic Industry of China, Beijing, 1996.
[24] http://www.104pool.com/
[25] http://www.sino21.com/
[26] http://isurvey.com.tw/
[27] http://b-times.com.tw/
