Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 113 (2017) 539–544

www.elsevier.com/locate/procedia

The 4th International Symposium on Emerging Information, Communication and Networks (EICN 2017)

An Empirical Evaluation of Intelligent Machine Learning Algorithms under Big Data Processing Systems

Dima Suleiman a,b,*, Malek Al-Zewairi a, Ghazi Naymat a

a Computer Science Department, King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology, P.O. Box 1438 Al-Jubaiha, Amman 11941, Jordan
b Teacher at the University of Jordan, Amman, Jordan


Abstract

The rapid increase in the magnitude of data produced by industries that needs to be processed using machine learning algorithms to generate business intelligence has created a dilemma for data scientists. This is due to the fact that traditional machine learning platforms such as Weka and R are not designed to handle data with such Volume, Velocity and Variety. Several machine learning algorithms and associated toolkits have been built specifically to work with big data; however, their performance is yet to be evaluated to allow researchers to get the most out of these platforms. In this paper, the authors provide an empirical evaluation of two emerging machine learning platforms under big data processing systems, namely H2O and Sparkling Water, by performing an experimental comparison between the two platforms in terms of performance over several generalization error metrics and model training time using the Santander Bank dataset. To the authors' knowledge, this is the first time such a study has been conducted. The evaluation results showed that the H2O platform significantly outperformed the Sparkling Water platform in terms of model training time, by almost fifty percent, while achieving convergent results.

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the Conference Program Chairs.

Keywords: Big Data; H2O; Sparkling Water; Prediction; Spark; Santander Bank Dataset

* Corresponding author. Tel.: +962-6-5359949; fax: +962-6-5347295.
E-mail address: [email protected]

1877-0509 © 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the Conference Program Chairs.

10.1016/j.procs.2017.08.270


1. Introduction

It has been predicted that the amount of data available in 2020 will be ten times more than what was available in 2013, and may reach forty-four zettabytes, according to the International Data Corporation's annual Digital Universe study1. This proliferation of data has led to the rise of Big Data science. In general, the Big Data concept refers to data that is too huge, too complex and/or too varied in format to be processed by a single computing machine. For a problem to be considered a big data problem, it must exhibit at least one of the three features of big data, namely Volume, Velocity and Variety, often referred to as the 3Vs. The first V (i.e. Volume) means that the data comes in huge sizes; thus, it cannot be processed using simple tools or commodity computers. The second V, Velocity, refers to the speed at which the data is collected; big data often arrives at high velocity. Finally, the third V, Variety, means that the data comes in different formats: structured data, such as data in tabular format; semi-structured data, such as XML; and unstructured data, such as multimedia files2. The rapid increase in the magnitude of data produced by many industries (e.g. IoT sensors, users' click streams, etc.) that needs to be processed by machine learning algorithms creates a dilemma for researchers, since traditional machine learning tools are not designed to handle this amount of data. The main purpose of machine learning is to use knowledge from the past in order to learn how to make educated guesses in the future. In general, a machine learning workflow consists of building a model, tuning it as necessary based on evaluation results, and finally using it to make predictions3,4. Moreover, deep learning algorithms are closely related to Artificial Intelligence.
Deep learning aims to analyze and learn complex problems in order to make decisions similar to what the human brain can do5. Although big data combined with machine learning has opened the door for unique research opportunities in several areas such as healthcare, user behavior analysis and threat intelligence, it has been proven that traditional machine learning toolkits such as Weka and R cannot handle the large proliferation of data that came with big data. Therefore, a new generation of data processing systems has emerged to handle big data, such as Hadoop, Spark, H2O, Sparkling Water and Steam. Hadoop (Apache Hadoop) is an open source implementation of the MapReduce processing engine, designed to distribute the processing of large datasets over clusters of commodity computers6. MapReduce itself is a programming model that takes a large task and divides it into subtasks; this division produces results faster by enabling the subtasks to be executed in parallel7. Similar to Hadoop, Spark supports iterative computations in a cluster environment; however, it features in-memory computation, allowing it to process data much faster than competing technologies. In Spark, the main abstraction is the Resilient Distributed Dataset (RDD), which is used to store data in memory2. H2O is an open source platform that provides libraries for machine learning, parallel processing engines, scalable and fast deep learning, math and data analytics, in addition to tools that facilitate data processing and model evaluation. Sparkling Water, in turn, combines the advantages of H2O and Spark, providing H2O's fast and scalable machine learning platform to developers for use in their Spark applications8.
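The MapReduce model described above can be illustrated with the classic word-count toy example: a map phase splits the work into independent subtasks that emit (word, 1) pairs, and a reduce phase aggregates the pairs per key. This is a generic single-machine sketch, not tied to Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word.
    In Hadoop, many mappers would run in parallel across the cluster,
    each over its own block of the input."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: aggregate the intermediate pairs, summing counts per word.
    Because each key is independent, the pairs can be partitioned by key
    and reduced in parallel as well."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)
```

Dividing the job this way is what lets the subtasks proceed in parallel, which is exactly the speed-up MapReduce is designed for.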
One hot research area is personalized product recommendation, where individual users' shopping behaviors, habits and activities are tracked, recorded and analyzed to give service providers a better understanding of their users' preferences. This allows service providers to deliver customized, targeted ads based on the users' likings, which improves the overall user experience and increases the likelihood of purchasing extra products; building the perfect prediction model, however, remains a challenge. Such problems mandate the intervention of intelligent machine learning algorithms, such as deep learning, accompanied by big data processing systems. Several machine learning algorithms and associated toolkits were built specifically to work with big data problems such as the personalized product recommendation problem. Nonetheless, their performance is yet to be evaluated. In this paper, the authors provide an empirical evaluation study of two machine learning platforms under big data processing systems (namely, H2O and Sparkling Water). Both platforms are evaluated against a publicly available




prediction problem provided by Santander Bank on the Kaggle website† to build a recommendation system that predicts which products their existing customers might purchase based on their past behavior and that of similar customers9. The rest of this paper is structured as follows: Section 2 describes the dataset and its preparation process. Section 3 presents the evaluation results. Finally, the paper is concluded and future work is presented in Section 4.

2. Santander Bank Dataset

The Santander Bank dataset is a publicly available dataset, published as part of a competition by Santander Bank‡ on the Kaggle website, to build a recommendation system that predicts which products the bank's existing customers might purchase in the future based on their past behavior and that of similar customers9. It contains anonymized information about the bank's customers for one and a half years, from January 28th, 2015 to May 28th, 2016, and is split into two sub-datasets (i.e. training and testing) in CSV format. The training dataset consists of forty-eight features in total and more than thirteen million labeled records. The first twenty-four features contain personal information about the individual customer (e.g. age, gender, etc.). The remaining features are the financial products (i.e. services) that the customer had already obtained as of May 28th, 2016 (e.g. current accounts, mortgages, loans, etc.), encoded as a binomial class of either zero or one. The testing dataset contains fewer than one million unlabeled records. For the purpose of this study, the testing dataset was omitted and only the training dataset is used. Interested researchers may refer to the competition website for a full description of the different features/columns of the dataset9.

2.1. Dataset Preprocessing

In order to prepare the dataset for this study, we needed to split the training dataset into three sub-datasets: a training dataset with a 60% ratio, a validation dataset with a 10% ratio, and a testing dataset with a 30% ratio. Moreover, the dataset has some anomalies that required preprocessing and cleansing prior to using it. For instance, both the Payroll and the Pensions columns had a third value, "NA", which was replaced with the value "0". Also, the product columns (i.e. the last twenty-four columns) had to be converted to a categorical (i.e. "Enum") datatype in order to be selectable as the response feature in the H2O and Sparkling Water platforms. Table 1 shows some statistics about the aforementioned datasets.

Table 1. Statistics about the Santander Bank dataset.

Dataset      Ratio   Number of Records   Compressed Size (MB)
Training     60%     8,188,476           303
Testing      30%     4,094,256           158
Validation   10%     1,364,577           60
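The preprocessing steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' actual pipeline: the Payroll/Pensions column names are the ones used in the Kaggle data dictionary as we recall them, so treat them as assumptions, and the seeded shuffle stands in for whatever splitting mechanism the authors actually used.

```python
import random

# Assumed Kaggle column names for the Payroll and Pensions products;
# verify against the competition's data dictionary.
ANOMALOUS_COLUMNS = ("ind_nomina_ult1", "ind_nom_pens_ult1")

def clean_row(row):
    """Replace the anomalous third value "NA" in the Payroll and
    Pensions columns with "0", as described in Section 2.1."""
    for col in ANOMALOUS_COLUMNS:
        if row.get(col) == "NA":
            row[col] = "0"
    return row

def split_dataset(rows, ratios=(0.6, 0.1, 0.3), seed=42):
    """Shuffle and split records into training/validation/testing
    subsets using the 60/10/30 ratios from the paper."""
    rng = random.Random(seed)
    rows = rows[:]          # avoid mutating the caller's list
    rng.shuffle(rows)
    n_train = int(len(rows) * ratios[0])
    n_valid = int(len(rows) * ratios[1])
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])
```

The conversion of the product columns to a categorical datatype is platform-specific (an "Enum" cast in H2O) and is therefore not modeled here.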

3. Results and Evaluation

In this section, the results of the experimental comparison between the two machine learning platforms, namely H2O and Sparkling Water, are presented and discussed. Because the dataset has twenty-four products that have to be predicted separately, we had to build, train, cross-validate and test a separate model for each product. Therefore, twenty-four different models were used to evaluate each platform, forty-eight in total. The models share the same configuration, which was set experimentally as follows:
• 10-fold cross-validation using the validation dataset.

† Founded in April 2010, Kaggle is a website specialized in data science competitions. https://www.kaggle.com
‡ Santander Bank is a subsidiary of the Spanish Santander Group, based in Boston, Massachusetts, United States. https://www.santanderbank.com


• 5 hidden layers with 10 neurons each.
• Shuffle training data.

The rest of the configurations were left at their default values. The evaluation environment runs on a multi-node virtual cluster consisting of twelve nodes as follows:
• Name Node: 2x2.6GHz Xeon E5-2690 v3 CPU, 24 vCPU, 32GB RAM, 4x600GB 10K SAS HDD / RAID 5.
• Data Nodes: 10 nodes (1x3.4GHz Intel Core i7 CPU, 8 vCPU, 16GB RAM, 1x1TB 7200rpm SATA HDD).
• Secondary Name Node: similar specifications to the Name Node.

Table 2 shows the versions of the platforms and software used.

Table 2. Summary of the used software versions.

Software          Version
Python            2.7.13
Spark             2.0.2
H2O               3.10.5.4
Sparkling Water   2.1.8
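The per-product model configuration listed above can be sketched with the H2O Python API roughly as follows. This is a configuration sketch, not the authors' actual script: it requires a running H2O cluster, the file path is hypothetical, and the response column shown (ind_ahor_fin_ult1) is just one assumed example of the twenty-four product columns.

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()  # connect to (or start) an H2O cluster

# Hypothetical path; in the paper the 60% Santander training split is used.
train = h2o.import_file("santander_train.csv")

# Convert the product column to a categorical ("Enum") response,
# as required by Section 2.1. The column name is a placeholder.
response = "ind_ahor_fin_ult1"
train[response] = train[response].asfactor()

# One such model is built per product (24 per platform), all with the
# same experimentally chosen settings described above.
model = H2ODeepLearningEstimator(
    hidden=[10, 10, 10, 10, 10],   # 5 hidden layers, 10 neurons each
    nfolds=10,                     # 10-fold cross-validation
    shuffle_training_data=True,    # shuffle training data
)
model.train(y=response, training_frame=train)
```

All remaining estimator parameters stay at H2O's defaults, mirroring the configuration above.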


The evaluation process was divided into two main parts: the first part includes building, training and validating the model on seen data, using the training and validation datasets respectively, while the second part includes testing the model on unseen data using the testing dataset. This process was repeated for both platforms and for each of the twenty-four products. All results were recorded, but only the average, minimum and maximum values are reported, as shown in Figure 1. For the model training part, the following measures were used to compare the two platforms: Accuracy, Area Under the Curve (AUC), F1-score, Precision, Recall, Specificity, and training Time. For model testing, the Threshold was added as a measure, while the Time and AUC measures were omitted. Since the testing results vary based on the value of the threshold, we only report the results with the highest F1-score. It is worth mentioning that the Time measure was normalized onto the interval (0, 1].
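For reference, the reported generalization metrics follow directly from a binary confusion matrix, and scaling the slowest training run to 1.0 yields the (0, 1] time normalization described above. A small illustrative sketch (not the authors' evaluation code):

```python
def metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix
    counts: true/false positives (tp, fp) and true/false negatives
    (tn, fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

def normalize_times(times):
    """Scale raw training times onto (0, 1]: the slowest run maps to 1.0
    and every positive time stays strictly above 0."""
    longest = max(times)
    return [t / longest for t in times]
```

AUC is omitted from the sketch since it requires the full score distribution rather than a single confusion matrix.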

[Figure 1: six bar-chart panels, (a)–(f), each comparing H2O and SparklingWater across the Accuracy, AUC, F1, Precision, Recall, Specificity, Threshold and Time measures, with average, minimum and maximum values on the y-axes.]

Fig. 1. (a) Average model training results; (b) Average model testing results; (c) Minimum model training results; (d) Minimum model testing results; (e) Maximum model training results; (f) Maximum model testing results.

Figure 1 (a) shows the average model training/cross-validation results for all twenty-four products. It shows that, on average, the H2O platform significantly outperformed the Sparkling Water platform in terms of model training time, by almost half. This is also true for the minimum model training results, as shown in Figure 1 (c). However, both platforms scored comparable results on all other metrics, as shown in Figures 1 (a), (c) and (e).


On the other hand, Figure 1 (b) shows the average model testing results, in which the Sparkling Water platform slightly surpassed the H2O platform in terms of model accuracy and specificity. It greatly surpassed it on the minimum model testing accuracy and specificity metrics, as shown in Figure 1 (d). On the other metrics, however, both scored similar results. Figures 1 (e) and (f) compare the maximum model training and testing results for the two platforms respectively. From the aforementioned evaluation results, one can conclude that, since both platforms achieved convergent results but with a significant edge for the H2O platform in model training time, it is safe to assume the superiority of the H2O platform over the Sparkling Water platform. However, since only one dataset was used in the evaluation, the results can be biased towards one platform, which requires introducing more tests using several datasets. The source code, built models, evaluation results and all related files are available for interested researchers at the project repository on GitHub§.

4. Conclusion

The need for analyzing big data, whether for business, medical, military or scientific applications, has led to the creation of several big data analysis tools and platforms that claim to be fast, accurate and highly scalable; however, their performance is yet to be evaluated. H2O and Sparkling Water are both cutting-edge in-memory big data analysis platforms with support for several machine learning algorithms. In this paper, the authors performed an experimental comparison between the two platforms by comparing their accuracy, AUC, F1-score, precision, recall, specificity and training time in solving a public prediction challenge. The experiments were made using the Santander Bank dataset, which is publicly available on the Kaggle website.
Twenty-four models were built for each platform, one per product, and 10-fold cross-validation was used to assess the models. The experimental results showed that the two platforms achieved convergent results in terms of accuracy, F1-score, precision, recall and specificity, with the Sparkling Water platform slightly surpassing the H2O platform in terms of model accuracy. Nevertheless, H2O achieved a significantly better result in terms of model training time. For future work, more experiments will be conducted using several datasets in order to avoid biased results.

References


1. The Digital Universe and Big Data - EMC. Retrieved February 7, 2017, from https://www.emc.com/leadership/digital-universe/index.htm
2. Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data, 2(1), 24. https://doi.org/10.1186/s40537-015-0032-1
3. Gupta, S. (2016, August 22). Deep Learning vs. traditional Machine Learning algorithms used in Credit Card Fraud Detection (Master's thesis). National College of Ireland, Dublin. Retrieved from http://trap.ncirl.ie/2495/
4. Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1. https://doi.org/10.1186/s40537-014-0007-7
5. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
6. White, T. (2015). Hadoop: The Definitive Guide (4th ed.). O'Reilly Media. Retrieved from http://shop.oreilly.com/product/0636920033448.do
7. Abhishek, S. (2015). Big Data and Hadoop. Marlabs white paper. Retrieved from http://www.marlabs.com/sites/default/files/Marlabs-WhitePaper-BigData-Hadoop.pdf
8. Arora, A., Candel, A., Lanford, J., LeDell, E., & Parmar, V. (2015, August). Deep Learning with H2O. H2O.ai, Inc. Retrieved from https://h2o-release.s3.amazonaws.com/h2o/master/3190/docs-website/h2o-docs/booklets/DeepLearning_Vignette.pdf
9. Santander Product Recommendation | Kaggle. (2016, December). Retrieved February 7, 2017, from https://www.kaggle.com/c/santander-product-recommendation

§ https://github.com/alzewairi/H2O_vs_SparklingWater