
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: https://doi.org/10.1109/ISCC.2016.7543832

Doopnet: An Emulator for Network Performance Analysis of Hadoop Clusters Using Docker and Mininet

Yuansong Qiao†, Xueyuan Wang†,‡, Guiming Fang*, Brian Lee†
†Software Research Institute, Athlone Institute of Technology, Dublin Road, Athlone, Co. Westmeath, Ireland
‡East China University of Technology, China
*Institute of Software, Chinese Academy of Sciences, China
[email protected], [email protected], [email protected], [email protected]

Abstract—Hadoop is one of the most important Big Data processing and storage systems. In recent years, considerable effort has been put into enhancing Hadoop's performance from a networking perspective. However, few tools exist that help researchers verify their networking algorithm designs in terms of Hadoop's performance. This paper proposes Doopnet, a framework and toolset for creating Hadoop clusters in a virtualized environment and for monitoring and analysing Hadoop's networking characteristics under different network configurations. Doopnet enables users to automatically set up a Hadoop cluster over Docker containers running inside Mininet. The Hadoop traffic is collected inside the containers and virtual switches through network flow monitors. Users can easily modify network topologies or configurations through Mininet, observe the networking behaviour through the network flow monitors, and analyse the effects of different network settings on Hadoop's performance. Examples are presented to demonstrate how to set up the Doopnet testbed and analyse Hadoop traffic.

Keywords—Hadoop; Mininet; Docker; Big Data; Emulator; Simulator

I. INTRODUCTION

The development of Social Networking, Mobile Computing, and the Internet of Things has generated a large volume of data, which has resulted in the creation of various Big Data analytics systems, e.g. Hadoop [1], Spark [2], and Flink [3]. During the computation processes in these systems, data movements occur very frequently between computation nodes in a data centre. An analysis from Facebook shows that, on average, data transmissions take 33% of the whole execution time of MapReduce [4] jobs with reduce phases [5]. Furthermore, some skew effects, e.g. partitioning skew, in MapReduce applications may lead to non-uniform data distributions between different reducers and consequently prolong the execution times significantly [6].

In recent years, researchers have tried to optimize Big Data analytics systems from system and networking perspectives. Especially with the advent of Software Defined Networking (SDN), e.g. OpenFlow [7], it is possible to dynamically reconfigure the network according to the applications' requirements and run-time status. Some examples of these solutions are Pythia [8], FlowComb [9], and Orchestra [10]. Most of these works were evaluated through real-world experimental testbeds, e.g. Pythia organized 10 servers into 2 racks and Orchestra ran its experiments in Amazon's Elastic Compute Cloud (EC2). These evaluation approaches accurately reflect the performance of the proposed solutions; however, they are expensive to set up and it is difficult to reorganize their network topologies.

This paper proposes Doopnet, a network performance evaluation and analysis framework for Hadoop clusters. Doopnet complements real-world experimental testbeds and aims to enable researchers to evaluate their algorithms at an early stage with a low-cost and flexible testbed setup. Doopnet supports setting up a Hadoop cluster (Hadoop version 2, YARN) over Mininet [11] with Docker [12] containers as the virtual hosts. Mininet is a network emulator which can rapidly build a large-scale network on a single server. It natively supports SDN and can work with external OpenFlow controllers. Docker is an open-source project that creates software containers using operating-system-level virtualization technologies on Linux. An application running inside a Docker container behaves as if it were running in a traditional virtual machine.

Mininet creates virtual hosts using Linux namespaces. Each virtual host runs inside a separate network namespace with its own set of network interfaces, IP addresses, and routing table. A virtual host connects to one or more virtual switches through network links, and the link properties, e.g. bandwidth, delay, and loss rate, can be configured through Mininet's Python APIs. However, all virtual hosts share the file system of the host machine by default. This creates a challenge for running a Hadoop cluster inside Mininet virtual hosts, e.g. Hadoop requires that each node has a separate hostname. Docker isolates container resources using the same technologies as Mininet uses for its virtual hosts, e.g. Linux cgroups and kernel namespaces, and both use veth (Virtual Ethernet) pairs to connect the containers/virtual hosts to the virtual switches. Therefore, replacing Mininet virtual hosts with Docker containers is technically feasible.

Dockernet [13] is an active open-source software project that uses Docker containers as Mininet virtual hosts. The project is still under development; we took a snapshot of Dockernet and built Doopnet on top of it.

Dockernet [13] is the only work we have found that integrates Docker into Mininet. It targets a generic emulation environment, whereas Doopnet builds on Dockernet and focuses on creating an emulation environment for Hadoop clusters.

In Doopnet, a Hadoop cluster is configured and run in Docker containers in the same way as in separate physical servers. The network traffic inside each switch and container is monitored and collected, so that the different types of network traffic, including traffic across racks, inside a rack, and inside a host, are all captured for later analysis. Hadoop Counters and Metrics provide statistics on the MapReduce tasks and the Hadoop daemons, but this information is very high level. For example, Hadoop reports the total bytes that a reducer has retrieved from all mappers only after the task has finished; there is no information on the traffic from each mapper to each reducer. By monitoring network traffic in Doopnet, it is possible to examine the traffic in much greater detail.

SDN controllers, e.g. Floodlight [22] and OpenDaylight [23], are a necessary component of Doopnet when evaluating Hadoop performance in SDN contexts. As Mininet can work with any SDN controller, we chose Floodlight for the tests in this paper.

This paper introduces the design of Doopnet, its workflow, and the various tools used in implementing it. Examples are presented to demonstrate how Doopnet is used to analyse Hadoop traffic. To the best of our knowledge, this is the first work addressing this problem.

The rest of the paper is organized as follows: Section II presents related work. Section III illustrates the Doopnet design in detail. Section IV offers examples of using Doopnet. Conclusions and future work are presented in Section V.

II. RELATED WORK

Modelling and simulation of Big Data application performance is a very active area. The European Cooperation in Science and Technology (COST) framework formed a new Action in April 2015 called cHiPSet (High-Performance Modelling and Simulation for Big Data Applications, IC1406) [14]. Many simulators for modelling cloud environments have been proposed, e.g. GreenCloud [15], CloudSim [16], and CloudAnalyst [17]. As these tools are all simulators, it is difficult to emulate the characteristics of real-world Big Data applications and evaluate their performance. Traditional network simulators, e.g. NS-2 and NS-3 [18], face the same challenges. Some researchers propose setting up emulation environments using Mininet for specific purposes. In [19], the authors propose a network flow simulator, xSDN, which enables users to define network flows through configuration files or programmes. As the purpose of Doopnet is to study real-world application performance under different network configurations, it is hard to model Hadoop traffic patterns through xSDN. SDDC [20] proposes a Software Defined Data Centre framework based on Mininet. It integrates Mininet with Software Defined Storage, Software Defined Security, and Software Defined Computation. CCGVS [21] proposes using Mininet to build a lightweight virtual machine farm to evaluate the performance of Big Data systems. Both SDDC and CCGVS concentrate on high-level architecture and do not address the specific issues of deploying a Hadoop cluster over Mininet.

III. DOOPNET FRAMEWORK AND DESIGN

A. Hadoop System Overview

Hadoop is an Apache open source project that offers distributed data processing through a computer cluster. It scales from a single server up to thousands of machines. It contains three key components, i.e. YARN, HDFS, and MapReduce.

YARN (Yet Another Resource Negotiator) assigns and manages the computation resources (e.g. CPU and memory) of the cluster nodes for its applications (e.g. MapReduce). HDFS (Hadoop Distributed File System) is a distributed file system that provides permanent and reliable storage services. MapReduce is a YARN-based parallel data processing framework. It requests computation resources from the YARN ResourceManager and then launches map or reduce tasks in containers. A MapReduce application normally uses HDFS to store input and output files.

To run a MapReduce application in a YARN cluster, the HDFS and YARN services need to be started first. The HDFS daemons include the NameNode, DataNode, and SecondaryNameNode. The YARN daemons include the ResourceManager, NodeManager, WebAppProxy, and the JobHistory server (for MapReduce jobs). The configuration files for the Hadoop daemons are placed in hadoop-/etc/hadoop by default. The hadoop-env.sh and core-site.xml files hold global settings, yarn-env.sh and yarn-site.xml hold YARN-related settings, hdfs-site.xml holds HDFS settings, and mapred-env.sh and mapred-site.xml hold MapReduce settings.

B. Doopnet Architecture

During the start-up stage, the Hadoop daemons read their corresponding configuration files. By default, they also check the hostname and look up the corresponding IP address to bind to. After the Hadoop daemons have started, they access various run-time folders and files on the file system, e.g. for saving logs or for accessing files in HDFS. When a Hadoop cluster is deployed on Docker containers, the run-time folders/files and the hostnames/IP addresses are naturally separated, and the Hadoop configuration files on different machines can be identical. However, the default hostname of a Docker container is the container ID that Docker generates automatically when the container starts. Therefore, the hostname-related settings in the Hadoop configuration files need to be updated after the hostnames have been decided.

In Doopnet, users define the hostnames of all the containers through Python APIs. Doopnet automatically generates the hostname-to-IP-address mapping in the containers (by modifying /etc/hosts), and the Hadoop configuration files are automatically updated with the defined hostnames.
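For illustration, the following is a minimal sketch of how such a topology and hostname/IP mapping could be defined. It assumes the Dockernet fork of Mininet, whose network class provides an addDocker() call, and a hypothetical pre-built image name doopnet/hadoop; the exact class and parameter names in the Doopnet helper may differ.

#!/usr/bin/env python
# Minimal topology sketch (illustrative, not the exact Doopnet class):
# two racks of Docker hosts behind ToR switches, 100 Mbps links, and the
# user-defined hostname -> IP mapping written into every container.
from mininet.net import Mininet        # Dockernet's fork extends this with addDocker()
from mininet.node import RemoteController, OVSSwitch
from mininet.link import TCLink

net = Mininet(controller=None, switch=OVSSwitch, link=TCLink)
net.addController('c0', controller=RemoteController,
                  ip='127.0.0.1', port=6653)          # e.g. Floodlight

s1 = net.addSwitch('s1')                              # ToR switch, rack 1
s2 = net.addSwitch('s2')                              # ToR switch, rack 2
s3 = net.addSwitch('s3')                              # inter-rack switch

hosts = []
for i in range(1, 11):
    # Dockernet-style call; 'doopnet/hadoop' is a hypothetical image name
    d = net.addDocker('d%d' % i, ip='10.0.0.%d' % i, dimage='doopnet/hadoop')
    net.addLink(d, s1 if i <= 5 else s2, bw=100)      # 100 Mbps access links
    hosts.append(d)
net.addLink(s1, s3, bw=100)
net.addLink(s2, s3, bw=100)
net.start()

for d in hosts:                                       # hostname/IP mapping
    d.cmd('hostname %s' % d.name)
    for other in hosts:
        d.cmd("echo '%s %s' >> /etc/hosts" % (other.IP(), other.name))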

Fig. 1. Doopnet Run-Time Modules

The run-time system architecture of Doopnet is shown in Fig. 1. It works in two layers. In the host machine layer, the Mininet process starts Docker containers from pre-created Docker images. The OpenFlow controller generates routing and QoS settings for the virtual switches. The Flow Collectors are responsible for collecting and saving flow statistics from the Flow Monitors of each virtual switch. The virtual node layer contains the Docker containers and virtual switches, which are created by Mininet. Hadoop daemons, Flow Monitors, and Flow Collectors run in every Docker container, and the run-time data from these processes are saved in folders inside the container. Each Docker container contains a VM-to-Host folder which is mapped to the same folder on the host machine, so that data can easily be exchanged between the containers and the host. In the virtual switches, the settings for flow monitoring and QoS control are configured from the host machine.

C. Doopnet Workflow

Doopnet encapsulates the different tools and configurations into scripts to simplify the testbed creation process. This paper presents the raw tools and configurations to illustrate the design details. The Doopnet workflow is shown in Fig. 2. The first part is executed on the host machine and the second part is executed in the Docker containers. The host machine part contains the following steps: (1) Create the required Hadoop images with pre-defined configurations, e.g. Hadoop memory settings (more details are provided in Section D); (2) If SDN is required, start an SDN controller, e.g. Floodlight; (3) Create the Mininet virtual network and Docker containers. Doopnet provides a Python class to support this step, and host-specific settings are passed to the containers at this point, e.g. modified memory settings (Section D) and the container in which each Hadoop daemon will be started; (4) & (5) Start the network traffic collectors and monitors for the virtual switches. Section E presents steps (4) and (5) in more detail.

Fig. 2. Doopnet Workflow (steps executed in the host machine and in the Docker containers)

The Docker container part contains the following steps: (1) & (2) Start the network traffic collectors and monitors in all the Docker containers (Section E); (3) Start the HDFS and YARN daemons in the Docker containers with host-specific configuration files, with each daemon started in its pre-defined container; (4) Run MapReduce applications within a Docker container with the host-specific settings; (5) Back up and analyse the collected network traffic records (Section E).
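As an illustration of steps (3) and (4), the sketch below (continuing the topology example above, where hosts is the list of Docker hosts) starts the Hadoop daemons and launches wordcount from the Doopnet script. It assumes Mininet-style host.cmd() calls on the Docker hosts and a hypothetical Hadoop 2.7.1 installation path /opt/hadoop inside the image; which container hosts the master daemons is also an assumption.

HADOOP = '/opt/hadoop'                 # hypothetical install path in the image

nn = hosts[0]                          # assume d1 hosts the master daemons
nn.cmd('%s/bin/hdfs namenode -format -force' % HADOOP)
nn.cmd('%s/sbin/hadoop-daemon.sh start namenode' % HADOOP)
nn.cmd('%s/sbin/yarn-daemon.sh start resourcemanager' % HADOOP)

for d in hosts:                        # every node runs a DataNode + NodeManager
    d.cmd('%s/sbin/hadoop-daemon.sh start datanode' % HADOOP)
    d.cmd('%s/sbin/yarn-daemon.sh start nodemanager' % HADOOP)

# step (4): run wordcount with four reducers on input already copied to HDFS
nn.cmd('%s/bin/hadoop jar '
       '%s/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar '
       'wordcount -D mapreduce.job.reduces=4 /input /output' % (HADOOP, HADOOP))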

D. Hadoop Resource Configuration and Calculation

Doopnet enables users to specify the memory settings of the Hadoop cluster while creating the Docker images or starting the Docker containers.

TABLE I. HDFS PROCESS NUMBERS
Process Name          Number
NameNode              1 per cluster
SecondaryNameNode     1 per cluster
DataNode              1 per node

TABLE II. YARN PROCESS NUMBERS
Process Name          Number
ResourceManager       1 per cluster
ProxyServer           1 per cluster
HistoryServer         1 per cluster
NodeManager           1 per node

Doopnet runs a Hadoop cluster on one physical machine. It is therefore important to estimate the number of virtual hosts that the machine can support and the memory that the MapReduce application can use. The process names and numbers for the HDFS and YARN services are listed in TABLE I and TABLE II. Experiments show that each Hadoop process takes roughly 200 MB of RAM (between 190 MB and 200 MB) in an empty cluster. Therefore, given the number of Docker containers n, the minimal RAM m (MB) required for running the Hadoop services is:

m = (n × 2 + 5) × 200    (1)

Assume that the physical memory of the machine is p (MB), the memory required by the operating system is o (MB), and the maximum memory that each YARN node can allocate to its containers is y (MB); then:

y = (p − m − o) / n    (2)

TABLE III. MEMORY SETTINGS FOR HADOOP DAEMONS
Parameter                                Example
hadoop-env.sh
  HADOOP_HEAPSIZE                        500
  HADOOP_NAMENODE_INIT_HEAPSIZE          500
yarn-env.sh
  YARN_HEAPSIZE                          500
  YARN_RESOURCEMANAGER_HEAPSIZE          500
  YARN_NODEMANAGER_HEAPSIZE              500
  YARN_TIMELINESERVER_HEAPSIZE           500
mapred-env.sh
  HADOOP_JOB_HISTORYSERVER_HEAPSIZE      250

TABLE IV. MEMORY SETTINGS IN YARN-SITE.XML
Parameter                                Example
yarn.nodemanager.resource.memory-mb      6114
yarn.scheduler.minimum-allocation-mb     256
yarn.scheduler.maximum-allocation-mb     6114
yarn.nodemanager.vmem-pmem-ratio         2.1

TABLE V. MEMORY SETTINGS IN MAPRED-SITE.XML
Parameter                                Example
mapreduce.map.memory.mb                  1024
mapreduce.reduce.memory.mb               1024
mapreduce.map.java.opts                  -Xmx819m
mapreduce.reduce.java.opts               -Xmx819m

If we want to create a testbed with 10 Docker containers running the full Hadoop services, the Hadoop processes consume at least 5 GB of RAM. For a machine with 24 GB of RAM, assuming the operating system requires 4 GB, around 15 GB of RAM remains for the MapReduce applications and each YARN NodeManager can allocate about 1.5 GB of RAM to its containers. This estimation is based on an empty HDFS; if the data stored in HDFS grows large, extra memory is required for the NameNode. The memory configuration in Hadoop contains two parts, i.e. daemon-related and container-related settings. The Hadoop daemon-related memory settings are shown in TABLE III, and the container-related memory settings are listed in TABLE IV and TABLE V. Each MapReduce application requires different memory sizes for running its Map and Reduce tasks. If the container memory is too small for running a task, the container may fail and the task is re-launched; in this situation, the memory settings need to be adjusted.
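A minimal Python sketch applying equations (1) and (2) to this example:

def min_hadoop_ram_mb(n, per_process_mb=200):
    # equation (1): n nodes x 2 per-node daemons + 5 per-cluster daemons
    return (n * 2 + 5) * per_process_mb

def yarn_node_memory_mb(p, o, n):
    # equation (2): memory left per node for YARN containers
    return (p - min_hadoop_ram_mb(n) - o) / n

n = 10
print(min_hadoop_ram_mb(n))                          # 5000 MB (about 5 GB)
print(yarn_node_memory_mb(24 * 1024, 4 * 1024, n))   # 1548 MB (about 1.5 GB)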

E. Flow Monitoring and Collection

Network traffic monitoring and collection are implemented by running network flow monitors and collectors in the host machine and in the Docker containers (Fig. 3). This paper employs NetFlow [24] as one implementation example; other flow monitoring technologies, e.g. sFlow [25], can be integrated in a similar way. Mininet uses Open vSwitch [26] as the default switch emulator, and Open vSwitch supports both the NetFlow and sFlow protocols for monitoring traffic. The NetFlow or sFlow records can be sent to a specified flow collector using the ovs-vsctl command. In Mininet, the virtual switches are created and run in the host machine; consequently, the ovs-vsctl command must be run in the host machine.
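For example, NetFlow export can be enabled on each virtual switch from the host machine roughly as follows. The command follows the standard Open vSwitch NetFlow configuration; the collector ports and the active timeout are illustrative choices.

import subprocess

def enable_netflow(bridge, collector_port, timeout=60):
    # standard Open vSwitch NetFlow configuration via ovs-vsctl
    subprocess.check_call([
        'ovs-vsctl', '--', 'set', 'Bridge', bridge, 'netflow=@nf',
        '--', '--id=@nf', 'create', 'NetFlow',
        'targets="127.0.0.1:%d"' % collector_port,
        'active-timeout=%d' % timeout,
    ])

# one collector port per switch keeps the records separated
for bridge, port in [('s1', 9991), ('s2', 9992), ('s3', 9993)]:
    enable_netflow(bridge, port)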

Fig. 3. Network Traffic Monitoring in Doopnet (NetFlow monitors and collectors for the virtual switches in the host machine; NetFlow monitors for the Ethernet and loopback interfaces, with their collectors and flow collector folders, in each Docker container)

Doopnet employs nfcapd as the NetFlow collector. Each switch sends its NetFlow records to a separate nfcapd process, and each nfcapd process saves the records to a separate local directory, so that the collected results from different switches are easy to keep apart. The nfcapd processes run on the same host machines that contain the corresponding NetFlow monitors (the virtual switches); this configuration avoids NetFlow traffic crossing between host machines. To monitor the traffic inside each Docker container, Doopnet employs fprobe as the NetFlow monitor and starts one fprobe process for each network interface, including the loopback interface. The traffic on the Ethernet interfaces shows the inter-host communications, which may overlap with the traffic monitored on the virtual switches. The NetFlow monitor on the loopback interface shows the traffic between YARN containers within one Docker container, which complements the NetFlow data from the virtual switches. The NetFlow records from the fprobe processes are collected by nfcapd processes, with one nfcapd responsible for each fprobe. After the MapReduce application has finished, any NetFlow analysis tool can be used to study the traffic; the tool we used is nfdump.
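A sketch of the per-interface monitoring inside one Docker container is shown below. The interface names, ports, and output directories are illustrative; only the basic nfcapd (port and output directory) and fprobe (interface and collector address) options are used.

import os
import subprocess

def start_collector(port, out_dir):
    # nfcapd: -D daemonize, -p listening port, -l output directory
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    subprocess.check_call(['nfcapd', '-D', '-p', str(port), '-l', out_dir])

def start_monitor(interface, port):
    # fprobe: export NetFlow records from <interface> to a local collector
    subprocess.check_call(['fprobe', '-i', interface, '127.0.0.1:%d' % port])

# inside container d1: one collector/monitor pair per interface
for iface, port in [('d1-eth0', 9995), ('lo', 9996)]:
    start_collector(port, '/data/netflow/%s' % iface)
    start_monitor(iface, port)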

IV. DOOPNET USAGE EXAMPLES

A. Experiment Setup

The Doopnet design has been verified on an HP server (DL380-G7) with 96 GB of memory, 24 Intel Xeon CPU cores, and a 1 TB hard drive. The Mininet virtual network topology is shown in Fig. 4. Ten Docker containers are organized into two racks. In each rack, five Docker containers connect to a Top-of-Rack (ToR) virtual switch (s1 or s2), and the two ToR switches are connected by an inter-rack virtual switch (s3). The switch port numbers that the Docker containers and switches connect to are also shown in the figure. The link speeds are all set to 100 Mbps.

Fig. 4. Virtual Network Setup for Doopnet Evaluations (rack 1: d1-d5 with IPs 10.0.0.1-10.0.0.5 on ports 1-5 of ToR switch s1; rack 2: d6-d10 with IPs 10.0.0.6-10.0.0.10 on ports 1-5 of ToR switch s2; s1 and s2 each connect through port 6 to the inter-rack switch s3)

Floodlight is used as the OpenFlow controller to configure the above virtual switches, and the Floodlight Static Flow Pusher API is used to set up the queuing policies in the virtual switches. Hadoop 2.7.1 is used in the tests. The Hadoop application executed in the tests is wordcount, provided in the Hadoop examples jar, with four reducers specified when starting the application. The Hadoop settings are the same as those in TABLE III, TABLE IV, and TABLE V. The input files are IETF RFCs with a total size of 429 MB. Three sets of tests have been executed:

1) Test 1: Hadoop Traffic Only. There is no background traffic in the cluster.

2) Test 2: Hadoop with Background Traffic (without QoS Control). All traffic shares the switch links equally. The iperf tool is used to generate background traffic. The iperf servers run in Docker containers d1, d3, d5, d7, and d9, and the iperf clients run in Docker containers d2, d4, d6, d8, and d10, sending traffic to d1, d3, d5, d7, and d9 respectively.

3) Test 3: Hadoop with Background Traffic (with QoS Control). The iperf traffic is put into a low-priority queue in the switches. The queues are set up with min-rate and max-rate limits through the Open vSwitch tool ovs-vsctl, and the queuing policies are installed with the Floodlight Static Flow Pusher API. The policy is that the Hadoop traffic uses 99% of a link while competing with the background traffic; if there is no contention, both the Hadoop traffic and the iperf traffic can fully use the bandwidth.
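A sketch of the Test 3 queue configuration on one uplink port is shown below, using the standard Open vSwitch linux-htb QoS settings. The port names, queue IDs, and exact rates are illustrative, and the flow entries that steer the iperf traffic into the low-priority queue (via a set_queue action) are pushed separately through Floodlight's Static Flow Pusher REST API, which is not shown here.

import subprocess

LINK_BPS = 100 * 1000 * 1000          # 100 Mbps links

def setup_queues(port):
    subprocess.check_call([
        'ovs-vsctl',
        '--', 'set', 'port', port, 'qos=@newqos',
        '--', '--id=@newqos', 'create', 'qos', 'type=linux-htb',
              'other-config:max-rate=%d' % LINK_BPS, 'queues=0=@q0,1=@q1',
        # queue 0: Hadoop traffic, guaranteed 99% of the link under contention
        '--', '--id=@q0', 'create', 'queue',
              'other-config:min-rate=%d' % int(LINK_BPS * 0.99),
              'other-config:max-rate=%d' % LINK_BPS,
        # queue 1: background (iperf) traffic, may still use the full link
        # when there is no contention
        '--', '--id=@q1', 'create', 'queue',
              'other-config:min-rate=%d' % int(LINK_BPS * 0.01),
              'other-config:max-rate=%d' % LINK_BPS,
    ])

for port in ['s1-eth6', 's2-eth6']:   # uplinks to the inter-rack switch s3
    setup_queues(port)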

B. Test Results and Analysis

Fig. 5 shows the execution time of each test. In Test 1 (Hadoop Traffic Only), the wordcount job takes 5675 seconds to finish. In Test 2 (No QoS Control), the background traffic significantly delays the job: the execution time is 32.90% higher than in Test 1. In Test 3 (with QoS Control), the execution time is 16.46% higher than in Test 1 and 12.3% lower than in Test 2. Fig. 6 shows the impact of the QoS policy on the iperf throughput: the average inbound iperf traffic in Test 2 and Test 3 is 95.84 Mbps and 91.5 Mbps respectively, a difference of 4.53%. Fig. 5 and Fig. 6 reveal that the QoS policy can significantly improve the performance of the MapReduce application without noticeably compromising the iperf throughput.

Fig. 5. Execution Time of the 3 Tests (Test 1: Hadoop traffic only; Test 2: Hadoop + background traffic, no QoS; Test 3: Hadoop + background traffic, with QoS; y-axis: execution time in seconds)

Fig. 6. Iperf Average Throughput in Test 2 and Test 3 (y-axis: average iperf throughput in Mbps)

1) Test Results Description

Fig. 7, Fig. 8, and Fig. 9 show more detailed traffic statistics for Test 1 (Hadoop Traffic Only), Test 2 (Hadoop + Background Traffic, without QoS Control), and Test 3 (Hadoop + Background Traffic, with QoS Control) respectively.

Chart (a) in Fig. 7, Fig. 8, and Fig. 9 shows the distribution of the shuffle traffic across the reducers, allowing, for example, the reducer hosts with the maximum and minimum traffic to be identified. It is calculated as (shuffle data received by each reducer) / (shuffle data received by all reducers). Chart (b) shows the distribution of the shuffle traffic across the mappers, allowing the mapper hosts with the maximum and minimum traffic to be identified; it is calculated as (shuffle data transmitted by each mapper host) / (shuffle data transmitted by all mappers). Chart (c) shows the statistics of the shuffle traffic received by a chosen reducer host, allowing the mapper hosts contributing the maximum and minimum traffic to that reducer host to be identified; it is calculated as (shuffle data the chosen reducer host received from each mapper host) / (total shuffle data the chosen reducer host obtained). Charts (d) and (e) show the shuffle traffic statistics from two chosen mapper hosts to each reducer host, allowing the reducer hosts receiving the maximum and minimum traffic from the chosen mapper hosts to be identified; they are calculated as (shuffle data the chosen mapper host transmitted to each reducer host) / (total shuffle data the chosen mapper host transmitted).

2) Test Results Analysis

Comparing chart (a) of Fig. 7, Fig. 8, and Fig. 9, the test results show that the shuffle traffic received by the different reducers follows a similar distribution in the three sets of tests. This is because the MapReduce partitioners in the three tests are the same and, consequently, the same output from a mapper is always mapped to the same reducer.

However, the traffic patterns between the mappers and reducers differ across the three sets of tests, as shown in charts (b, c, d, and e) of Fig. 7, Fig. 8, and Fig. 9. For example, in chart (b) of Fig. 7, Fig. 8, and Fig. 9, the mappers in the different tests have transmitted different amounts of shuffle data to the reducers. One interesting observation is that charts (b) and (c) in Fig. 8 and Fig. 9 have a similar distribution in each test. Chart (b) shows the shuffle data from each mapper host to all reducers, whereas chart (c) shows the shuffle data from each mapper host to a chosen reducer host. This means that the partition ratios on different mappers follow a similar linear relationship statistically. This can be verified by comparing charts (d) and (e) in Fig. 8 and Fig. 9 for each test, which show the shuffle data from two chosen mapper hosts to each reducer: charts (d) and (e) in both Fig. 8 and Fig. 9 have a similar distribution, i.e. the two chosen mapper hosts have transmitted a similar proportion of shuffle data to each reducer. Charts (b, c, d, and e) in Fig. 7 do not follow the same distribution and need to be analysed further. As this paper focuses on how to set up the Doopnet testbed and analyse traffic using Doopnet, a detailed analysis of Hadoop performance is outside its scope.
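For reference, statistics like those behind chart (c) can be pulled directly from the collected NetFlow records. The sketch below assumes the default MapReduce shuffle port 13562 (mapreduce.shuffle.port) and an illustrative collector directory for the records of the inter-rack switch s3; it prints the top mapper hosts by shuffle bytes sent to one chosen reducer host.

import subprocess

SHUFFLE_PORT = 13562     # default mapreduce.shuffle.port in Hadoop 2.x
reducer_ip = '10.0.0.2'  # the chosen reducer host

# top mapper hosts by bytes sent to the chosen reducer over the shuffle port,
# aggregated from the NetFlow records collected for switch s3
print(subprocess.check_output([
    'nfdump', '-R', '/data/netflow/s3', '-s', 'srcip/bytes', '-n', '10',
    'src port %d and dst ip %s' % (SHUFFLE_PORT, reducer_ip),
]).decode())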

V. CONCLUSIONS AND FUTURE WORK

This paper proposes Doopnet, a Hadoop emulation system based on Mininet and Docker for analysing Hadoop network traffic. The aim of Doopnet is to provide insightful network traffic information from a running Hadoop cluster so that researchers can evaluate their algorithms and network configurations in terms of Hadoop performance. Doopnet is designed to complement real-world experimental testbeds so that researchers can evaluate their algorithms at an early stage on a flexible and economical testbed. This paper presents the Doopnet framework, its workflow, and the detailed configurations and tools for creating the Doopnet testbed. Experiments have been performed to demonstrate how Doopnet is used in analysing Hadoop network traffic. Only MapReduce shuffle traffic is shown as an example; Doopnet can also be used to analyse other types of Hadoop network traffic. Future work is to perform more tests to evaluate Doopnet's performance, to release the Doopnet implementation as open source, and to incorporate more Big Data analytics systems, e.g. Spark and Flink, into Doopnet.

ACKNOWLEDGEMENT

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 13/SIRG/2178, with assistance from the Chinese 973 Programme (Grant No. 2013CB329106) and the Enterprise Ireland Technology Gateway Programme COMAND.

REFERENCES

[1] Hadoop: http://hadoop.apache.org/, accessed on 10 Feb 2016.
[2] Spark: http://spark.apache.org/, accessed on 10 Feb 2016.
[3] Flink: https://flink.apache.org/, accessed on 10 Feb 2016.
[4] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI'04), Berkeley, CA, USA: USENIX Association, 2004.
[5] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, "Managing data transfers in computer clusters with Orchestra," ACM SIGCOMM Computer Commun. Rev., vol. 41, no. 4, pp. 98-109, Oct 2011.
[6] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, "A study of skew in MapReduce applications," in 5th Open Cirrus Summit, 2011.
[7] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, "OpenFlow: enabling innovation in campus networks," SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69-74, Mar 2008.
[8] M. Veiga Neves, C. A. F. De Rose, K. Katrinis, and H. Franke, "Pythia: faster Big Data in motion through predictive software-defined network optimization at runtime," in IEEE 28th International Parallel and Distributed Processing Symposium, pp. 82-90, May 2014.
[9] A. Das, C. Lumezanu, Y. Zhang, V. Singh, G. Jiang, and C. Yu, "Transparent and flexible network management for big data processing in the cloud," in 5th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud'13), 2013.
[10] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, "Managing data transfers in computer clusters with Orchestra," ACM SIGCOMM Computer Commun. Rev., vol. 41, no. 4, pp. 98-109, 2011.
[11] B. Lantz, B. Heller, and N. McKeown, "A network in a laptop: rapid prototyping for software-defined networks," in Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, 2010.
[12] Docker home page: http://www.docker.com/, accessed on 10 Feb 2016.
[13] Dockernet source code: https://github.com/mpeuster/dockernet, accessed on 10 Feb 2016.
[14] EU COST Action cHiPSet: http://chipset-cost.eu/, accessed on 10 Feb 2016.
[15] D. Kliazovich, P. Bouvry, and S. U. Khan, "GreenCloud: a packet-level simulator of energy-aware cloud computing data centers," The Journal of Supercomputing, vol. 62, no. 3, pp. 1263-1283, 2012.
[16] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. Rose, and R. Buyya, "CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms," Software: Practice and Experience, vol. 41, no. 1, pp. 23-50, 2011.
[17] B. Wickremasinghe, R. N. Calheiros, and R. Buyya, "CloudAnalyst: a CloudSim-based visual modeller for analysing cloud computing environments and applications," in 24th IEEE International Conference on Advanced Information Networking and Applications (AINA), pp. 446-452, 2010.
[18] NS-2 & NS-3: https://www.nsnam.org/, accessed on 10 Feb 2016.
[19] P. Kathiravelu and L. Veiga, "An expressive simulator for dynamic network flows," in IEEE International Conference on Cloud Engineering (IC2E), 2015.
[20] A. Darabseh, M. Al-Ayyoub, Y. Jararweh, E. Benkhelifa, M. Vouk, and A. Rindos, "SDDC: a software defined datacenter experimental framework," in IEEE 3rd International Conference on Future Internet of Things and Cloud (FiCloud), pp. 189-194, 2015.
[21] Y. Yu, H. Zou, W. Tang, and L. Liu, "A CCG virtual system for big data application communication costs analysis," in IEEE International Conference on Big Data (Big Data), pp. 54-60, 2014.
[22] Floodlight: http://www.projectfloodlight.org/, accessed on 10 Feb 2016.
[23] OpenDaylight: https://www.opendaylight.org/, accessed on 10 Feb 2016.
[24] B. Claise, Ed., "Cisco Systems NetFlow Services Export Version 9," IETF RFC, Oct 2004.
[25] P. Phaal and M. Lavine, "sFlow Version 5," sFlow.org, Jul 2014.
[26] Open vSwitch: http://openvswitch.org/, accessed on 10 Feb 2016.

Fig. 7. Shuffle Traffic Distributions of Test 1 (Hadoop Traffic Only)

Fig. 8. Shuffle Traffic Distributions of Test 2 (Hadoop with Background Traffic, without QoS Control)

Fig. 9. Shuffle Traffic Distributions of Test 3 (Hadoop with Background Traffic, with QoS Control)

(Each figure contains five charts: (a) shuffle traffic distribution between reducers; (b) shuffle traffic from each mapper host to all reducers; (c) shuffle traffic from each mapper host to the reducers in a chosen host; (d) and (e) shuffle traffic from the mappers in two chosen hosts to each reducer host.)