
Distributed Machine Learning: A Review of Current Progress

Karim Ouda
School of Computing, University of Leeds, LS2 9JT, United Kingdom

Abstract—The need to solve Machine Learning problems at scale using the power of distributed computing is evident, given the increasing amount of untapped data produced every day and the insights that can be discovered in it. Many solutions have emerged in the last 5 years to help tackle this problem. This paper is a review of the current status and future direction of distributed machine learning, followed by a performance comparison of two popular products in this field: Mahout and Spark+MLlib.



Machine Learning is one of the long-standing key research and application fields in Computer Science, and it is rapidly becoming part of our daily life: think of song and movie recommendation systems, cell phone and web personalization, computer vision and CCTV applications, and so forth. One of the main drivers of the current boom in demand for Machine Learning is the huge amount of data produced since the Web 2.0 era. Facebook alone was processing 500 TB of new data per day in 2012 [1], and it is estimated that the Digital Universe will reach 44 zettabytes in 2020, a 50-fold growth since 2010 [2]. With such a pile of untapped data at hand, companies need to make use of it by finding patterns and insights, which in turn can lead to improved business performance and a better understanding of users. This is where Machine Learning comes to the rescue.

Machine Learning (ML) is a problem-solving technique in which the machine (computer) learns from data automatically, mimicking the way the brain works. The process starts by extracting features (a unique, machine-friendly representation of each data item) and then running one of several ML algorithms to produce a learned model (a representation of the collective patterns and rules in the data), which can then be used to predict, classify and perform other functions on new, unseen data.

A real-life practical application of Machine Learning on Big Data is recommendation systems for songs, videos and movies. Companies like Netflix can earn more money by recommending the right film to the right person, based on that person's history and the collective patterns of other users. Another application is display advertising targeting: companies like Google can gain more revenue if the ad system targets users of interest, showing the right ad to the right person. This is also based on patterns learned from the huge amount of data held by a company like Google.

When trying to apply Machine Learning to Big Data, companies faced many difficulties. ML algorithms and tools are processing- and memory-intensive and were not designed to handle such amounts of data efficiently, which makes sense because this rapid growth of data happened only in the last 10 years. On the other hand, following Moore's law, processing power is doubling every 18 months, with more cores per system. In addition, the cloud computing concept, where commodity distributed machines are used instead of big supercomputers, and the existence of distributed software frameworks and parallel programming models like MapReduce inspired academics and big companies to find ways to parallelize Machine Learning tasks on multicore systems.

The next sections of this paper discuss approaches and solutions to the problem explained above. The common architecture is explained, and the results of a performance comparison between Hadoop and Spark are presented, along with the results of a proof-of-concept (PoC) clustering task run on Spark+MLlib on a cloud platform. Finally, I discuss future trends and directions.

II.


I believe the first effective endeavor to tackle the lack of a distributed framework for machine learning was a 2007 paper from Stanford, co-authored by Andrew Ng, titled “Map-reduce for machine learning on multicore” [3], which was later implemented by the Apache community and is now known as the Mahout project [4]. In that paper the authors showed that any algorithm that fits a specific “Statistical Query Model” can be written in a “summation form” that can be fed to a MapReduce-like framework for processing. They argued that the class of algorithms that depend on statistics or gradients fits this model, and that such calculations can be spread across cores and aggregated at the end of processing. They experimented with 10 algorithms and showed speedups that were linear in the number of cores.

The following is a review of the current solutions, starting with Mahout.
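To make the “summation form” idea concrete before turning to the individual systems, here is a minimal single-process sketch for simple linear regression. It is not code from [3] or from Mahout; the function names and the toy data are hypothetical, and the partitions are processed sequentially where a real framework would run the map step in parallel. The point is only that each partition can be summarized by a handful of sums, and those sums can be added together before solving for the model.

```python
from functools import reduce

def map_partition(rows):
    """Map step: compute partial sums for one data partition.

    For simple linear regression y = w0 + w1 * x, the normal equations
    only need these five sums, so each partition can be summarized
    independently of the others.
    """
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def reduce_sums(a, b):
    """Reduce step: add two partial-sum tuples element-wise."""
    return tuple(u + v for u, v in zip(a, b))

def solve(stats):
    """Solve the 2x2 normal equations from the aggregated sums."""
    n, sx, sy, sxx, sxy = stats
    denom = n * sxx - sx * sx
    w1 = (n * sxy - sx * sy) / denom
    w0 = (sy - w1 * sx) / n
    return w0, w1

def fit(partitions):
    # In a distributed setting each map_partition call would run on its own core.
    partials = map(map_partition, partitions)
    return solve(reduce(reduce_sums, partials))

# Hypothetical toy data generated from y = 1 + 2x, split into two partitions.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
w0, w1 = fit(partitions)
```

Because the reduce step is a plain element-wise addition, the result does not depend on how the rows are split across partitions, which is what makes this class of algorithms a natural fit for MapReduce-like frameworks.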

A. Mahout

Mahout is a scalable machine learning library built on top of Hadoop [5], an open source framework for distributed computing. Mahout originally started as part of the Apache Lucene project in 2008 with the goal of implementing the paper mentioned above; the first version under the name Mahout was released in April 2010 [6]. The library supports many ML algorithms [7], such as:

Naive Bayes

Hidden Markov Models

Logistic Regression

Random Forest

Fuzzy k-Means

Streaming KMeans

Spectral Clustering

B. MLlib (MLbase)

MLbase is the product of a 2013 paper by Tim Kraska et al. titled “MLbase: A Distributed Machine-learning System” [8]. The aim of the paper was to create an optimized, scalable ML framework for users (ML researchers) with only a basic background in distributed systems. In addition, the authors provided a declarative way to specify ML tasks, with an optimizer that can auto-tune and dynamically choose algorithms.

Illustration 1: MLbase high-level ML language example [8]

As we can see in Illustration 1, MLbase aimed to provide the user with a high-level ML language that abstracts away non-ML issues such as handling distributed computing, or even which algorithm to choose. In this example the user specifies the training data (X) and labels (y); the system tries different algorithms, trains the best one, and returns the final model (fnmodel) and summary information to the user.

MLbase has been part of Apache Spark [9] since September 2013 [10] under the name MLlib. Apache Spark is a fast, scalable and general engine for data processing; general in this context means that it supports many data processing models such as graphs, streaming, machine learning and SQL. Unfortunately, the current version (1.3.0) shipped with Spark does not include the ML language and algorithm optimization parts, only the support for the different ML algorithms listed below [11]:

Linear SVM and logistic regression

Classification and regression tree

K-means clustering

Recommendation via alternating least squares

Singular value decomposition

Linear regression with L1- and L2-regularization

Multinomial naive Bayes

Basic statistics

Feature transformations

C. GraphLab

GraphLab [12] is a Python library for graph processing as well as distributed machine learning. It provides integration with Spark/Hadoop and the ability to auto-tune parameters. The product originated as research work at CMU in 2012 [13].

D. SystemML

A system proposed by researchers at IBM that provides a Declarative Machine Learning (DML) language which can be transformed into optimized MapReduce scripts to be executed in a distributed environment [14].

E. Parameter Server

An open source project [15] by researchers from CMU, Google and Baidu [16] which aims to provide a flexible, scalable distributed machine learning framework in which all nodes share global parameters.

F. Sibyl

Google's distributed machine learning system [17], which utilizes MapReduce, Google GFS and other proprietary optimizations.

G. Other Systems

In addition to the ones listed above, there are many other relevant systems, such as YahooLDA [18] for topic modeling, Petuum [19], DistBelief [20], a framework for distributed Deep Learning, and Google Predict [21], a commercial PaaS service for ML tasks.

III.

Since most Distributed Machine Learning solutions are built on top of MapReduce or MapReduce-like systems, I will start by showing the general MapReduce concept and architecture.

A. Apache Hadoop

Illustration 2: Hadoop Architecture [22]

As shown in Illustration 2, a typical distributed processing system requires a distributed filesystem (DFS), which stores the data and replicates it in a distributed fashion so that it can be processed near the core and so that failures can be handled. The NameNode holds a centralized index of the distributed filesystem, while the DataNodes store and provide access to the actual files.

B. MapReduce

The MapReduce processing engine uses the DFS to process data in an efficient and reliable manner. The JobTracker receives MapReduce jobs and distributes tasks (e.g., a Map task for a subset of the data) to TaskTrackers according to the location of the data in the cluster. MapReduce is based on the fact that many computations can be reduced to simple mapping and/or arithmetic operations, and thus can be split across different machines and aggregated afterwards. This model can also be applied to Machine Learning problems.

C. MapReduce & Machine Learning

An example application of MapReduce in Machine Learning is the execution of the k-means algorithm. The list of points is split into partitions, and each partition is sent to a different machine for processing. In a map operation, the distance between each point in the partition and the centroids is calculated, and the point is assigned to the nearest centroid. Once all partitions are processed and each point is assigned to a centroid, a reduce function is applied to calculate the new position of each centroid; this can also be done on a separate machine for each centroid, as shown in Illustration 3.

Illustration 3: MapReduce Example for Machine Learning KMeans [23]

D. Apache Spark

The Spark architecture [24] is simple and flexible. Each application (a Job in Hadoop terminology) has its own executor processes, and Spark can run on top of Hadoop or in its own cluster by sending data and initiating new executors on slave nodes (worker nodes), as shown in Illustration 4.

Illustration 4: Spark Architecture [34]

A. Cluster Setup

Two virtual machines, provisioned by the OpenNebula cloud manager, were used to set up a simple cluster for the experiment. Both machines run Ubuntu 12.04.1 (server edition) x86_64 with kernel version 3.2.0-69-generic. The master machine has 6 GB of memory, 4 cores and 12 VCPUs; the slave machine has 4 GB of memory, 4 cores and 10 VCPUs.

Software versions: 

Apache Spark 1.3.0 [27]

Mahout 0.9 [28]

Hadoop 2.6.0 [29]

B. Datasets

Dataset #1 is a corpus of labeled tweets used for sentiment analysis by Sentiment140 [25]. There are two files in the dataset; “training.1600000.processed.noemoticon.csv” is the file used for testing the MapReduce efficiency of word counting. It is 288 MB in size and contains 1,600,000 rows.

Dataset #2 is a bag-of-words set provided by the University of California, Irvine [26]. The “NYTimes news articles” collection was chosen for the machine learning clustering experiment; the file docword.nytimes.txt was reduced to 680 MB and 50,000,000 records before being used for clustering.

C. Simple MapReduce comparison between Hadoop/Spark

To test the efficiency of both systems, Dataset #1 was passed to the Hadoop and Spark clusters to count all words in all tweets and sort them in descending order. For Hadoop, a modified version of the example in [30] was used: the map function was changed to parse only the last column in the dataset, which contains the tweet text, and an additional map function and a new job were added to sort the final counts by value instead of by key. For the Spark experiment, a modified version of the example in [31] was used, with additional mapping/sorting and an additional line of code to save the output to the filesystem.

D. Spark + MLlib Distributed Machine Learning Experiment

To demonstrate the concept of Distributed Machine Learning, an experiment was conducted to cluster a big data file (Dataset #2) on the Spark/MLlib cluster. A modified version of [32] was used: some changes were made so that the file works with the new MLlib version, and an additional prediction function was added to predict the cluster of a new point based on the learned model. The following command and settings were used to run the experiment, with number of clusters k = 2 and max_iterations = 10:

~/spark-1.3.0-bin-hadoop2.4/bin/spark-submit --master spark://hadoopmaster:7077 --driver-memory 5G --class MyJavaKMeansNYDS my-mllib-kmeans-nyds.jar ~/spark-1.3.0-bin-hadoop2.4/data/mllib/docword.nytimes.splitaa2 2 10

In all experiments Spark showed better performance than its Hadoop counterpart. Below are the final results for each experiment.

A. Simple MapReduce comparison between Hadoop/Spark

TABLE I. Hadoop/Spark MapReduce performance

          Hadoop        Spark
Time      66 seconds    20 seconds

B. Spark + MLlib Distributed Machine Learning Experiment

TABLE II. Spark/MLlib Clustering Performance

Clustering time    10 minutes
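The clustering workload measured above follows the map/reduce k-means pattern described in the architecture section. As a rough illustration only, the sketch below runs that pattern in a single Python process on hypothetical toy data; it is not the MLlib implementation or the modified example from [32], and all function names are assumptions made for this sketch. It reuses the experiment's settings of k = 2 and max_iterations = 10 and ends with a prediction for a new point.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def map_partition(points, centroids):
    """Map step: emit one (centroid_index, (point, 1)) pair per point."""
    return [(nearest(p, centroids), (p, 1)) for p in points]

def reduce_by_centroid(pairs, centroids):
    """Reduce step: average the points assigned to each centroid."""
    sums = {}
    counts = defaultdict(int)
    for idx, (point, one) in pairs:
        if idx in sums:
            sums[idx] = [s + x for s, x in zip(sums[idx], point)]
        else:
            sums[idx] = list(point)
        counts[idx] += one
    # A centroid with no assigned points keeps its previous position.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else tuple(centroids[i])
        for i in range(len(centroids))
    ]

def kmeans(partitions, centroids, max_iterations=10):
    for _ in range(max_iterations):
        pairs = []
        for part in partitions:  # each partition could be mapped on its own worker
            pairs.extend(map_partition(part, centroids))
        new_centroids = reduce_by_centroid(pairs, centroids)
        if new_centroids == centroids:  # assignments are stable, stop early
            break
        centroids = new_centroids
    return centroids

# Hypothetical toy data: two partitions containing two obvious clusters (k = 2).
partitions = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)],
    [(2.0, 0.0), (2.0, 2.0), (10.0, 12.0)],
]
centroids = kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)], max_iterations=10)
prediction = nearest((9.0, 9.0), centroids)  # cluster index for a new point
```

In a real cluster only the small set of centroids has to be sent to the workers on each iteration, while the large list of points stays partitioned where it is stored.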



As the amount of data continues to increase over time, current solutions are moving towards more optimized code, better architectures, auto-tuning, improved scheduling and distributed communication, and better compression techniques. Some researchers are also working on introducing high-level abstract languages [14][33] and constructs for Machine Learning, to make machine learning applications simpler for end users.

VII. CONCLUSION

Machine learning is the present and the future of problem solving in computing. Along with the increasing trends in data and processing power, Distributed Machine Learning solutions are evolving to cater for the needs and challenges of both the scientific and business worlds. In this paper I discussed the concept of Distributed Machine Learning and the problem it solves, reviewed the current solutions, showed the results of the practical experiments conducted, and gave a glimpse of the future direction.


[1] CNet, “Facebook processes more than 500 TB of data daily”
[2] EMC, New Digital Universe Study Reveals Big Data Gap
[3] Chu, Cheng, et al. "Map-reduce for machine learning on multicore." Advances in neural information processing systems 19 (2007): 281.
[4] Apache Mahout: Scalable machine learning and data mining
[5] What Is Apache Hadoop?
[6] Where can I find the origins of the Mahout project?
[7] Mahout Algorithms
[8] Kraska, Tim, et al. "MLbase: A Distributed Machine-learning System." CIDR. 2013.
[9] Apache Spark
[10] Ameet Talwalkar, “MLlib: Spark’s Machine Learning Library”, Presentation
[11] Apache Spark MLlib
[12] GraphLab
[13] Low, Yucheng, et al. "Distributed GraphLab: a framework for machine learning and data mining in the cloud." Proceedings of the VLDB Endowment 5.8 (2012): 716-727.
[14] Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011.
[15] Parameter Server
[16] Li, Mu, et al. "Scaling distributed machine learning with the parameter server." Operating Systems Design and Implementation (OSDI). 2014.
[17] Chandra, Tushar, et al. "Sibyl: a system for large scale machine learning." Keynote I PowerPoint presentation, Jul 28 (2010).
[18] YahooLDA
[19] Petuum
[20] Dean, Jeffrey, et al. "Large scale distributed deep networks." Advances in Neural Information Processing Systems. 2012.
[21] Google Predict
[22] Hadoop
[23] Designing algorithms for Map Reduce
[24] Hadoop multi node cluster setup
[25] Tweets Dataset
[26] Bag of Words Dataset
[27] Spark Download
[28] Mahout Download
[29] Hadoop Download
[30] Hadoop WordCount example
[31] Spark JavaWordCount example file /apache/spark/examples/
[32] MLlib JavaKMeans example file /apache/spark/examples/mllib/
[33] Sparks, Evan R., et al. "MLI: An API for distributed machine learning." Data Mining (ICDM), 2013 IEEE 13th International Conference on. IEEE, 2013.
[34] Spark Architecture