PaaS-BDP: Platform-as-a-Service for Big Data Processing

Thalita Vergilio and Dr Muthu Ramachandran
School of Computing, Creative Technology and Engineering
Leeds Beckett University, Leeds, UK

This research was supported in part by Microsoft Azure for Research.

Abstract—With the popularisation of the cloud, Big Data analytics is now an accessible service to many SMEs. The top four cloud providers in terms of market share, Amazon, Microsoft, IBM and Google, offer Big Data as a managed service on a SaaS service model. This model, however, comes with associated risks of low inter-cloud portability and reduced inter-cloud interoperability between the processing components developed. This paper presents a contribution to the field of Cloud Software Engineering Design, namely a new and unifying approach to big data processing from a cloud consumer's perspective. PaaS-BDP (Platform-as-a-Service for Big Data Processing) is based on a microservices architecture and uses container cluster technology on a PaaS service model to overcome common shortfalls of current big data solutions offered by major cloud providers, such as low portability, lack of interoperability and the risk of vendor lock-in. We introduce a new UML profile to represent the deployment of distributed containers using Docker Swarm technology. This profile is not restricted to big data processing microservices, but can be utilised to model the deployment of any container-based artifacts.

Keywords—big data; containers; docker; MDA; UML; swarm

I. INTRODUCTION

Big data is an area of technological research which has been receiving increased attention in recent years. As the Internet of Things (IoT) expands to different spheres of human life, a large volume of structured, semi-structured and unstructured data is generated at very high velocity. To derive value from big data, businesses and organisations need to detect patterns and trends in historical data. They also need to receive, process and analyse streaming data in real time, or close to real time, a challenge which current technologies and traditional system architectures find difficult to meet. Cloud computing has also been attracting growing interest lately, with prominent research being carried out on topics such as cloud standardisation and cloud federations [2]. While the focus of these works is on cloud providers, our research takes a different approach by adopting the perspective of the cloud consumer. With different service models available, such as infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS) and software-as-a-service (SaaS), it is no longer

essential that companies host their IT infrastructure on-premises. Consequently, an increasing number of small and medium-sized enterprises (SMEs) have ventured into big data analytics, utilising powerful computing resources previously unavailable to them without having to procure their own hardware or maintain an in-house team of highly skilled IT professionals. With the advance of cloud computing, distributed architectures such as service-oriented architecture (SOA) and microservices have gained prominence. Monolithic applications are no longer practical, as it is sometimes more cost-effective to host some components in the cloud whilst keeping, for example, strategic or security-critical components on-premises. This flexibility is critical in defining a company's technological strategy, as is the prevention of a vendor lock-in situation should they opt to host their resources in the cloud. The risk of vendor lock-in is particularly prominent in the big data solutions offered by major cloud providers. These generally follow the managed software-as-a-service model (MSaaS), using either open-source or proprietary software to enable the development of data processing pipelines which are not guaranteed to be portable or interoperable across different clouds. Current MSaaS big data solutions attempt to abstract the complexities of deploying and managing clusters of specialised big data processing software away from the user. While these solutions are simpler to implement, they carry considerable risk of vendor lock-in once in place. This research presents an alternative solution to the problem: PaaS-BDP (Platform-as-a-Service for Big Data Processing), based on a PaaS service model, containerisation, and a systematic, model-driven approach to software engineering. To ensure accessibility and backwards compatibility with existing application designs, we have opted to extend the Unified Modelling Language (UML) with a profile capable of effectively representing the proposed containerised big data service components. Given UML's pervasiveness in the industry, the extensions introduced should be immediately comprehensible to software engineers, and should provide a layer of abstraction to facilitate the design and development of big data processing components for a PaaS service model.

II. LITERATURE REVIEW

A. Big Data

1) Volume, Velocity and Variety: Big data can be defined as data which somehow challenges the processing capabilities of current technology. These challenges are usually categorised around the three Vs: volume, velocity and variety [1-3], with occasional mention of additional Vs such as veracity, value and viability [2]. The volume of data which leading internet companies such as Facebook and Netflix have accumulated has reached hundreds of petabytes [4-5], and it has been estimated that the largest big data company in the world, Google, holds over 10 exabytes of data [6]. This data is kept on disk, stored in data centres all over the world, posing significant architectural challenges when it comes to processing it to extract valuable information with an acceptable level of latency.

The velocity at which data is generated is also a significant factor when engineering applications which will consume and process this data. Netflix's data pipeline, for example, receives approximately 500 billion events a day, which amounts to 1.3 petabytes of incoming data that needs to be processed each day, in real time [7]. Facebook processes hundreds of gigabytes per second across hundreds of real-time data pipelines [8]. These companies have invested in architectures which can consume incoming data at high velocity.

The most accepted classification of big data variety separates it into structured (usually stored in relational databases), semi-structured (data stored in NoSQL databases) and unstructured [9], [1]. Assunção et al. add a mixed category to this classification [2]. Data which originates from surveillance cameras, social networks or tracking devices, for example, differs in structure from data stored in NoSQL databases, which in turn differs from data stored in relational databases. An architecture designed to cater for all types of big data needs to take the variety factor into consideration, as one can no longer assume that data will be stored in a single relational database. This research uses the term big data to refer to data characterised by the three Vs defined above.

2) Batch and Stream Processing: Batch processing was the first and is the most solidly established approach to big data processing. It is based on the MapReduce algorithm, which was designed at Google [10] before going open-source as the Hadoop software framework. Hadoop is a distributed system for processing large volumes of data which is easy to use and extremely powerful [10]. It abstracts the complexities of parallelisation and inter-machine communication away from the user, who only needs to specify the map and reduce functions [10]. Hadoop can handle terabytes of data [11], but one of its main criticisms is its high latency and high start-up overhead [12], which renders it ineffective for real-time systems.
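As an illustration of this division of labour, the sketch below shows the canonical word-count job written against the Hadoop Java API: only the map and reduce functions carry application logic, while Hadoop handles splitting, shuffling and inter-machine distribution. This is a minimal sketch for illustration; class names and input/output paths are our own choices, not part of any discussed system.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Everything below is wiring; parallelisation, shuffling and
    // fault tolerance are provided by the framework.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```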

Stream processing architectures evolved from the need to process real-time data. While batch processing is related to the volume of big data, stream processing relates to velocity. Real-time information, such as which topics are trending on Twitter, needs to come from data which is constantly being updated, with minimal latency. The notion of a data stream is an abstraction used to convey the nature of the data source, continuous and potentially infinite, and the way in which it is processed: in real time (or close to real time), before it is persisted to storage [3, p. 47]. Many reactive architectures were designed to cater for big data streams, such as Yahoo's S4 [13], Apache Storm [14], Borealis [15], StreamCloud [16], Stormy [17], TelegraphCQ [18] and Cyclops-React [19]. Unlike batch processing architectures, which are mostly centred around MapReduce, there is no prevailing technology for stream-based architectures, although Apache Storm is a strong candidate [1]. The main criticism of real-time stream-based systems is that they tend to compromise veracity or precision for velocity or ultra-low latency [20].

3) The Lambda Architecture: The Lambda Architecture was presented by Marz and Warren to describe systems which would adopt a "best of both" approach to the batch/stream dichotomy. These architectures would use the stream layer for real-time data and the batch layer for historical data; the data would then be merged at query level [21]. Examples of this type of architecture can be found in Twitter's Summingbird [22], Yahoo's Storm-YARN [23], IBM's Big Data platform [3], Lambdoop [24], AllJoyn Lambda [25] and Facebook's integration between its Puma, Swift and Stylus stream systems and its data stores [8]. The main criticism of this approach is having to maintain two different complex architectures and duplicate code in the two different layers [26].

The Hadoop ecosystem, offered as a solution for batch and stream big data processing by many cloud providers, is a type of Lambda Architecture. Table 1 lists the open-source technologies (Flink, Hadoop, HBase, Hive, Kafka, Pig, Spark, Storm and ZooKeeper) variously included in the managed Hadoop ecosystem packages of Oracle Big Data, IBM BigInsights, Google Dataproc, Amazon EMR and Microsoft HDInsight.

Table 1. Open-Source Technologies Included in Cloud-Based Big Data Service Packages

Because the Hadoop ecosystem is based on open-source technologies, vendor lock-in is not a major problem for consumers of these cloud services. They do, however, have to maintain big data processing code in different places, using different technologies, which results in low reusability and maintainability.

B. Vendor Lock-In

One of the biggest drawbacks of deploying complex applications in the cloud is the risk presented by vendor lock-in [27-32]. Assis and Bittencourt define vendor lock-in as "technical and monetary costs faced by the customer for migrating from one cloud provider to another when some advantage is observed (e.g. more attractive prices)" [32, p. 55]. This definition emphasises the low-portability aspect of vendor lock-in, portability being defined as the ability to transfer an application from the cloud where it is currently deployed to a cloud from a different provider [29]. Another important aspect of vendor lock-in highlighted in the literature is its effect on interoperability [29], [33-35]. Interoperability occurs when an application or component deployed to a given cloud server exchanges information harmoniously with applications or components deployed to other cloud servers [29]. A distributed architecture where components are hosted by different cloud providers, for example, would rely extensively on some guarantee of interoperability between these providers. Additionally, interoperability between cloud-deployed artefacts and those hosted on-premises must also exist, as companies may choose not to deploy all their resources to the cloud for strategic or security reasons [34]. The following solutions to the vendor lock-in problem have been proposed.

1) Standardisation: Standardisation of cloud resource offerings is considered one way of dealing with the vendor lock-in issue. However, no universal set of standards has yet been identified which would successfully solve the issues of inter-cloud portability and interoperability [33], and the standards that do exist have not been widely adopted by the industry [28].

2) Cloud Federations: Another alternative solution is the establishment of cloud federations [31]. In a cloud federation, providers voluntarily agree to participate and are bound by rules and regulations. This, however, places the focus on the cloud provider rather than on the consumer of cloud services. As this research approaches the vendor lock-in issue from the cloud consumer's perspective, cloud federations are excluded from its scope.

3) Middleware: The introduction of a layer of abstraction to enable distribution and interoperability between different cloud providers has also been proposed as a possible solution to the cloud lock-in problem [28-29]. One criticism of this type of approach, however, is that the lock-in problem is not resolved; it is simply shifted to the enabling middleware layer [28].

4) Unified Models: A model-driven approach to development, combined with a unifying framework for modelling cloud artefacts, has been suggested as a possible solution to the vendor lock-in problem. In fact, the "model once, generate everywhere" precept of MDA (Model Driven Architecture) suggests that software can be cloud platform-agnostic, provided that the necessary code-generating engines are in place [33]. In reality, however, it is difficult to find

concrete examples of perfectly accurate code generation engines capable of producing all of the source code exclusively from the models [28]. MULTICLAPP was proposed as an architectural framework that separates the application design from cloud provider-specific deployment configuration [28]. Although this approach ensures the perpetuation of the models in case of cloud provider migration, application implementation code would still need to be rewritten.

5) Virtualisation: The use of containers or hypervisor technology (virtual machine managers) to deploy software in the cloud is a pattern which minimises the effects of vendor lock-in, as the environment configuration and requirements are packaged together with the deployed application.

a) Virtual Machines: The use of VMs to deploy applications is generally associated with the IaaS cloud service model. Together with the code for the developed application, a VM also contains an entire operating system configured to run that code.

b) Containers: Containers are a lighter alternative to VMs [36]. They are a more recent development, with the most widely accepted technology, Docker, having been open-sourced in March 2013 [37]. As Docker shows evidence of being the de facto container technology, with support from all the major cloud providers, including Microsoft Azure [38], Amazon AWS [39], Google Cloud [40] and IBM [41], we shall focus on it when discussing containerisation. When cloud hosting is considered, containers are generally associated with the PaaS service model because containers run on the host operating system. Unlike VMs, they do not need to be shipped with their own operating system; for this reason, containerisation is also known as OS-level virtualisation [42-43]. If an application is not dependent on a specific operating system, preference should be given to containers over VMs [37], as the latter would add unnecessary bulk and complexity to the deployment. The benefits of using containers become more apparent when implementing distributed architectures [36], as their small size and relative ease of deployment allow for better elasticity across different clouds.

c) Containers and Big Data Processing: If we focus on the volume aspect of big data, it becomes evident that processing cannot be satisfactorily performed by a single machine. In fact, today's leading batch processing technology, Hadoop, is based on a distributed, multi-clustered architecture which enables it to spread the processing load across different machines [44]. A study comparing the performance of Hadoop deployed to containers with that of Hadoop installed on a cluster of physical machines concluded that they perform similarly [45], demonstrating that the overhead added by containers is minimal. This would make containers the virtualisation technology of choice for Hadoop-based installations, given that Hadoop already adds significant start-up overhead to batch processing jobs [12]. As cloud deployments rely on the principle of elasticity, with customers only paying for

the resources they actually use, container technology appears to be the most suitable for this type of scenario. A different study explored the use of containers in IoT devices by comparing the response times of a Constrained Application Protocol (CoAP) server implementation running directly on a Raspberry Pi B+ device with those of the same implementation deployed to a container within the device [42]. Whereas the implementation running directly on the device's hardware was always faster, the overhead introduced by containerisation was deemed acceptable [42].

III. PROPOSED SOLUTION

The proposed solution is a microservices architecture for big data processing using containers on a PaaS service model. Fig. 1 illustrates the recommended architecture, using the Microsoft Azure and Open Science Data Cloud (OSDC) clouds as examples. Interoperability between components deployed within the same cloud, as well as between components deployed to different clouds, is achieved by using container orchestration technology, e.g. Mesos, Kubernetes or Docker Swarm.

Fig. 1. The Proposed Architecture. Containerised processing functions (A and B) are deployed across the Azure and OSDC clouds; a runner reads batch data, subscribes to the stream, calls A and B, and outputs the results.

Fig. 2. Proposed Container Profile

The programming language utilised for building the big data processing microservices must be capable of processing both batch and stream data. This research has identified the Apache Beam SDK as currently the most suitable technology for this purpose. Other languages could be utilised in the future, provided that they conform to the proposed architecture.
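To illustrate how a single Beam pipeline definition can serve both batch and stream sources, a minimal sketch using the Beam Java SDK is shown below. The counting transform, file paths, broker address and topic name are illustrative assumptions for this sketch, not part of the reference implementation; output sinks are omitted for brevity.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class UnifiedPipeline {

  // The processing logic is written once, as a reusable transform,
  // and applied unchanged to both the batch and the stream branch.
  static class CountEvents
      extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> events) {
      return events.apply(Count.perElement());
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Batch branch: bounded historical data read from files
    // (path is a hypothetical example).
    p.apply("ReadBatch", TextIO.read().from("/data/events-*.txt"))
        .apply("CountBatch", new CountEvents());

    // Stream branch: unbounded data read from a Kafka topic, windowed
    // into fixed intervals so the same counting logic can be applied.
    p.apply("ReadStream", KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.<String>create())
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply("CountStream", new CountEvents());

    p.run().waitUntilFinish();
  }
}
```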

A new UML profile is introduced to represent the deployment of microservices to a distributed environment using containers and container clusters. The profile was created using Papyrus, an Eclipse-based modelling environment, and is based on the Docker container technology. Fig. 2 shows the proposed Container UML Profile, which contains seven new stereotypes, detailed in the following sections.

A. Stereotypes

1) Container: The Container stereotype extends the ExecutionEnvironment UML metaclass. It represents an execution environment for an artifact which, in our proposed architecture, consists of a compiled and packaged big data processing pipeline.

2) ContainerDeploymentSpecification: The ContainerDeploymentSpecification stereotype extends the DeploymentSpecification UML metaclass and is based on the deployment specification requirements of Docker, such as the base image the container is to be built from and the ports it is to expose.

3) SwarmCluster: The SwarmCluster stereotype extends the Node UML metaclass. In our architecture, it represents a Docker Swarm cluster of containers distributed within the same cloud or across different clouds.

4) SwarmNode: A SwarmCluster can have many SwarmNodes. The SwarmNode stereotype also extends the Node UML metaclass. It represents "an instance of the Docker engine participating in the swarm" [46]. A SwarmNode can have several containers, and is discoverable via a discovery token, which is represented in the SwarmNode stereotype as a tagged value.

5) SwarmManager: The SwarmManager stereotype extends SwarmNode. It represents a manager node in a Docker Swarm cluster, the node responsible for performing orchestration and management tasks [46].

6) DiscoveryService: Different discovery service technologies can be used with a Docker Swarm cluster, so the DiscoveryService stereotype is defined as an extension of the Artifact UML metaclass. It has a type tagged value, which can be selected from a DiscoveryServiceType enumeration.

7) Scheduler: Different schedulers can be used with a Docker Swarm cluster, so the Scheduler stereotype is also defined as an extension of the Artifact UML metaclass. It has a type tagged value, which can be selected from a SchedulerType enumeration.
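To make the cluster concepts behind these stereotypes concrete, the sketch below drives the standard Docker swarm-mode CLI from Java. The commands (swarm init, join-token, join, service create) are real Docker commands; wrapping them in Java, the host name and the image name are our own illustrative assumptions, not part of the proposed toolchain.

```java
import java.io.IOException;
import java.util.List;

public class SwarmSetup {

  // Run an external command, mirroring its output to the console.
  private static void run(List<String> command)
      throws IOException, InterruptedException {
    Process proc = new ProcessBuilder(command).inheritIO().start();
    if (proc.waitFor() != 0) {
      throw new IllegalStateException("Command failed: " + command);
    }
  }

  public static void main(String[] args) throws Exception {
    // On the manager node (SwarmManager stereotype): initialise the cluster.
    run(List.of("docker", "swarm", "init"));

    // Print the worker join token, i.e. the token captured as a tagged
    // value by the SwarmNode stereotype in the proposed profile.
    run(List.of("docker", "swarm", "join-token", "-q", "worker"));

    // On each worker node (SwarmNode stereotype): join the cluster using
    // the token printed above ("manager-host" is a placeholder).
    // run(List.of("docker", "swarm", "join",
    //     "--token", "<token>", "manager-host:2377"));

    // Deploy a containerised pipeline as a replicated, orchestrated
    // service; the image name is hypothetical.
    run(List.of("docker", "service", "create", "--replicas", "3",
        "--name", "beam-pipeline", "example/beam-pipeline:latest"));
  }
}
```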

IV. EVALUATION

Given the vendor lock-in problem which pervades current big data solutions offered by cloud providers, the products of this research are evaluated in terms of inter-cloud portability and inter-cloud interoperability. Additionally, as existing solutions tend to combine a range of technologies to enable processing of different types of big data, leading to code duplication and low maintainability, these factors are also considered in our evaluation.

Table 2 shows a comparison between the proposed solution and existing big data architectures offered as managed services in the cloud. We have grouped under Lambda Architecture those solutions which are based on separate processing for stream and batch data (Google Cloud Dataproc, Microsoft Azure HDInsight, Amazon EMR, Oracle Big Data Cloud Service and IBM BigInsights).
Evaluation Metric                  Lambda Archit.   Google Dataflow   PaaS-BDP
Inter-Cloud Portability            Medium           Low               High
Inter-Cloud Interoperability       Medium           Medium            High
Code Duplication/Maintainability   High             Low               Low
Usability                          Low              Medium            High

Table 2. Comparison Between PaaS-BDP and Existing Big Data Architectures in the Cloud

Inter-cloud portability is defined as the ease with which a service hosted in one cloud can be migrated to a different cloud. Services based on the Lambda Architecture are generally portable, provided that the technologies used are open-source and available in both the source and the destination clouds. The Dataflow solution is based on an open-source programming model, but it would require a change in service model from SaaS to PaaS in order to be easily portable across clouds. Our proposed solution is highly portable, as it is based on containers and a PaaS service model from the start. Irrespective of which cloud provider is utilised, the containers are built from the same base image, guaranteeing an identical execution environment for the service.

Inter-cloud interoperability is defined as the ease with which a service hosted in one cloud can operate harmoniously with services hosted on different clouds as part of the same architecture. Both the Lambda Architecture and the Dataflow solution were evaluated as medium, as they allow for distribution, although additional setup would need to be carried out to provide service discovery and orchestration. The proposed solution is ranked as high, as it is based on Docker Swarm technology, which provides service discovery and orchestration as part of the Docker distribution, with no need to install, configure and manage an external tool for this purpose. Because big data services can self-register and automatically become part of an orchestrated swarm, irrespective of where the container node is deployed, they are highly interoperable.

Code duplication and maintainability are related metrics, so they are evaluated together. Maintainability is defined as the ease with which the big data processing code can be changed. Code duplication leads to low maintainability, as any change in the processing logic needs to be implemented twice, increasing both the amount of development work involved and the risk of bugs being introduced. Lambda Architecture-based services use different technologies to process batch and stream data, so the logic within the processing pipelines is duplicated, leading to low maintainability. Both Google Dataflow and the solution proposed in this research are based on a unified data processing pipeline for batch and stream data, so code duplication is low and maintainability is improved.

Usability is defined here as the ease with which a developer can approach, learn and work with the technologies involved in a given solution. As the Lambda Architecture involves a number of different technologies such as Hadoop, Spark and Hive, the learning curve is steep. The Google Dataflow solution is based on well-documented open-source SDKs which are easily accessible to software developers. It does not, however, offer a systematic approach to software design and development, leading to code disparity and reducing the potential for collaboration and reuse. The UML profile presented in this paper bridges this gap by providing a means for designing and modelling container-based microservices using a notation that is instantly recognisable, not only by developers, but also by architects and business users.

V. CONCLUSION

This paper presented a contribution to the field of Cloud Software Engineering Design: a new and unifying approach to big data processing in the cloud. This approach is based on a microservices architecture for the development and deployment of big data processing pipelines, using container cluster technology on a PaaS service model. We demonstrated how the issues of low inter-cloud portability and lack of inter-cloud interoperability, identified as common shortcomings of current cloud-based solutions, are overcome by our proposed solution. The issues of code duplication and low maintainability which are known to affect the Lambda Architecture [26] are also addressed: by adopting a unifying programming model for processing batch and stream data, we demonstrated how these metrics are improved.

We introduced a new UML extension for representing distributed container-based deployments. Other relevant UML extensions have been proposed in the literature, such as SoaML [47], CloudML [48], MODAClouds [27] and MULTICLAPP [28]. SoaML extends UML by adding six new stereotypes: Participants, Service Interfaces, Service Contracts, Service Architectures, Service Data and Capabilities [47]. It focuses on the relationship between the service provider and the service consumer, whereas our solution focuses on the application deployment from the perspective of the service consumer. CloudML is an extension of SoaML that incorporates hardware and network resource requirements into the models [48]. It focuses on how instances are provisioned from the point of view of the cloud provider, whereas we focus on the cloud consumer. MODAClouds is a framework that can be used for selecting and comparing different cloud providers [27]. It incorporates cloud provider-specific information in the models, and allows the designer to model applications which are distributed across different clouds. It is not, however, container-based, and uses its own runtime environment to monitor application deployment.

Finally, MULTICLAPP is another framework offering a UML profile for modelling applications distributed across different clouds. Unlike MODAClouds, MULTICLAPP models are platform-agnostic, and the design is kept separate from the development and from the deployment plan. It diverges from the current research in that it uses a model transformation engine to assign developed artifacts to specific cloud platforms, whereas we capture this information in the deployment diagram. The new UML profile represents a contribution to the field of Cloud Software Engineering Design, allowing generic container cluster deployments to be modelled in UML deployment diagrams.

REFERENCES
[1] R. Casado and M. Younas, "Emerging Trends and Technologies in Big Data Processing," Concurr. Comput. Pract. Exper., vol. 27, no. 8, pp. 2078–2091, Jun. 2015.
[2] M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. S. Netto, and R. Buyya, "Big Data computing and clouds: Trends and future directions," J. Parallel Distrib. Comput., vol. 79–80, pp. 3–15, May 2015.
[3] P. Zikopoulos, D. deRoos, K. Parasuraman, T. Deutsch, J. Giles, and D. Corrigan, Harness the Power of Big Data: The IBM Big Data Platform, 1st ed. McGraw-Hill Education, 2013.
[4] S. Krishnan and E. Tse, "Hadoop Platform as a Service in the Cloud," The Netflix Tech Blog, 10-Jan-2013. [Online]. Available: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html. [Accessed: 30-Oct-2016].
[5] N. Bronson, T. Lento, and J. L. Wiener, "Open data challenges at Facebook," 2015, pp. 1516–1519.
[6] A. Lederman, "Let's take a look at some really BIG 'big data,'" 21-Jun-2016. [Online]. Available: http://www.deepwebtech.com/2016/06/lets-take-a-look-at-some-really-big-big-data/. [Accessed: 30-Oct-2016].
[7] "Evolution of the Netflix Data Pipeline," The Netflix Tech Blog, 15-Feb-2016. [Online]. Available: http://techblog.netflix.com/2016/02/evolution-of-netflix-data-pipeline.html. [Accessed: 30-Oct-2016].
[8] G. J. Chen et al., "Realtime Data Processing at Facebook," 2016, pp. 1087–1098.
[9] A. F. Mohammed, V. T. Humbe, and S. S. Chowhan, "A review of big data environment and its related technologies," 2016, pp. 1–5.
[10] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, p. 107, Jan. 2008.
[11] O. O'Malley, "Apache Hadoop Wins Terabyte Sort Benchmark," Yahoo Developer Network, 02-Jul-2008. [Online]. Available: https://developer.yahoo.com/blogs/hadoop/apache-hadoop-wins-terabyte-sort-benchmark-408.html. [Accessed: 30-Oct-2016].
[12] R. Stewart and J. Singer, "Comparing fork/join and MapReduce," Department of Computer Science, Heriot-Watt University, 2012.
[13] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed Stream Computing Platform," 2010, pp. 170–177.
[14] N. Marz, "History of Apache Storm and lessons learned," 06-Oct-2014. [Online]. Available: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html. [Accessed: 28-Oct-2016].
[15] D. J. Abadi et al., "The Design of the Borealis Stream Processing Engine," in CIDR, 2005, vol. 5, pp. 277–289.
[16] V. Gulisano, R. Jimenez-Peris, M. Patino-Martinez, and P. Valduriez, "StreamCloud: A Large Scale Data Streaming System," 2010, pp. 126–137.
[17] S. Loesing, M. Hentschel, T. Kraska, and D. Kossmann, "Stormy: an elastic and highly available streaming service in the cloud," 2012, p. 55.
[18] S. Chandrasekaran et al., "TelegraphCQ: continuous dataflow processing," 2003, p. 668.
[19] J. McClean, "Plumbing Java 8 Streams with Queues, Topics and Signals," Medium, 11-Feb-2015. [Online]. Available: https://medium.com/@johnmcclean/plumbing-java-8-streams-with-queues-topics-and-signals-d9a71eafbbcc. [Accessed: 27-Oct-2016].
[20] T. Akidau, "Have Your Cake and Eat It Too: Further Dispelling the Myths of the Lambda Architecture," 24-Jan-2015. [Online]. Available: https://www.infoq.com/presentations/millwheel#downloadPdf. [Accessed: 28-Oct-2016].
[21] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, 1st ed. Manning Publications, 2015.
[22] M. Hausenblas, "Twitter Open-Sources its MapReduce Streaming Framework Summingbird," 16-Jan-2014. [Online]. Available: https://www.infoq.com/news/2014/01/twitter-summingbird. [Accessed: 28-Oct-2016].
[23] B. Evans and A. Feng, "Storm-YARN Released as Open Source," Yahoo Developer Network Blog, 11-Jun-2013. [Online]. Available: https://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html. [Accessed: 28-Oct-2016].
[24] R. C. Tejedor, "Lambdoop, a framework for easy development of big data applications," 03-Dec-2013.
[25] M. Villari, A. Celesti, M. Fazio, and A. Puliafito, "AllJoyn Lambda: An architecture for the management of smart environments in IoT," 2014, pp. 9–14.
[26] J. Kreps, "Questioning the Lambda Architecture," O'Reilly Media, 02-Jul-2014. [Online]. Available: https://www.oreilly.com/ideas/questioning-the-lambda-architecture. [Accessed: 28-Oct-2016].
[27] D. Ardagna et al., "MODAClouds: A Model-driven Approach for the Design and Execution of Applications on Multiple Clouds," in Proceedings of the 4th International Workshop on Modeling in Software Engineering, Piscataway, NJ, USA, 2012, pp. 50–56.
[28] J. Guillén, J. Miranda, J. M. Murillo, and C. Canal, "A UML Profile for Modeling Multicloud Applications," in Service-Oriented and Cloud Computing, 2013, pp. 180–187.
[29] G. C. Silva, L. M. Rose, and R. Calinescu, "A Systematic Review of Cloud Lock-In Solutions," in 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, 2013, vol. 2, pp. 363–368.
[30] G. C. Silva, L. M. Rose, and R. Calinescu, "Towards a Model-Driven Solution to the Vendor Lock-In Problem in Cloud Computing," in 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, 2013, vol. 1, pp. 711–716.
[31] D. G. Kogias, M. G. Xevgenis, and C. Z. Patrikakis, "Cloud Federation and the Evolution of Cloud Computing," Computer, vol. 49, no. 11, pp. 96–99, Nov. 2016.
[32] M. R. M. Assis and L. F. Bittencourt, "A survey on cloud federation architectures: Identifying functional and non-functional properties," J. Netw. Comput. Appl., vol. 72, pp. 51–71, Sep. 2016.
[33] B. D. Martino, "Applications Portability and Services Interoperability among Multiple Clouds," IEEE Cloud Comput., vol. 1, no. 1, pp. 74–77, May 2014.
[34] J. Opara-Martins, R. Sahandi, and F. Tian, "Critical analysis of vendor lock-in and its impact on cloud computing migration: a business perspective," J. Cloud Comput., vol. 5, no. 1, p. 4, Dec. 2016.
[35] R. Yasrab and N. Gu, "Multi-cloud PaaS Architecture (MCPA): A Solution to Cloud Lock-In," in 2016 3rd International Conference on Information Science and Control Engineering (ICISCE), 2016, pp. 473–477.
[36] D. Bernstein, "Containers and Cloud: From LXC to Docker to Kubernetes," IEEE Cloud Comput., vol. 1, no. 3, pp. 81–84, Sep. 2014.
[37] I. Miell and A. H. Sayers, Docker in Practice, 1st ed. Shelter Island, NY: Manning Publications, 2015.
[38] K. Mouss, "Introducing Docker in Microsoft Azure Marketplace," 08-Jan-2015. [Online]. Available: https://azure.microsoft.com/en-us/blog/introducing-docker-in-microsoft-azure-marketplace/. [Accessed: 16-Feb-2017].
[39] S. Yegulalp, "Docker for AWS: Who's it really for?," 01-Dec-2016. [Online]. Available: http://www.infoworld.com/article/3145696/application-development/docker-for-aws-whos-it-really-for.html. [Accessed: 16-Feb-2017].
[40] C. Metz, "Google Embraces Docker, the Next Big Thing in Cloud Computing," 09-Jun-2014. [Online]. Available: https://www.wired.com/2014/06/eric-brewer-google-docker/. [Accessed: 16-Feb-2017].
[41] L. Dingan, "IBM launches Docker-based containers for its cloud," 22-Jun-2015. [Online]. Available: http://www.zdnet.com/article/ibm-launches-docker-based-containers-for-its-cloud/. [Accessed: 16-Feb-2017].
[42] A. Celesti, D. Mulfari, M. Fazio, M. Villari, and A. Puliafito, "Exploring Container Virtualization in IoT Clouds," in 2016 IEEE International Conference on Smart Computing (SMARTCOMP), 2016, pp. 1–6.
[43] N. Naik, "Migrating from Virtualization to Dockerization in the Cloud: Simulation and Evaluation of Distributed Systems," in 2016 IEEE 10th International Symposium on the Maintenance and Evolution of Service-Oriented and Cloud-Based Environments (MESOCA), 2016, pp. 1–8.
[44] A. Bhardwaj, V. K. Singh, Vanraj, and Y. Narayan, "Analyzing BigData with Hadoop cluster in HDInsight azure Cloud," in 2015 Annual IEEE India Conference (INDICON), 2015, pp. 1–5.
[45] S. Radhakrishnan, B. J. Muscedere, and K. Daudjee, "V-Hadoop: Virtualized Hadoop using containers," in 2016 IEEE 15th International Symposium on Network Computing and Applications (NCA), 2016, pp. 237–241.
[46] "Swarm mode key concepts," Docker Documentation, 02-Mar-2017. [Online]. Available: https://docs.docker.com/engine/swarm/key-concepts/. [Accessed: 02-Mar-2017].
[47] B. Elvesæter, C. Carrez, P. Mohagheghi, A.-J. Berre, S. G. Johnsen, and A. Solberg, "Model-driven Service Engineering with SoaML," in Service Engineering, Springer Vienna, 2011, pp. 25–54.
[48] N. Ferry, A. Rossini, F. Chauvel, B. Morin, and A. Solberg, "Towards Model-Driven Provisioning, Deployment, Monitoring, and Adaptation of Multi-cloud Systems," in 2013 IEEE Sixth International Conference on Cloud Computing, 2013, pp. 887–894.
Puliafito, “Exploring Container Virtualization in IoT Clouds,” in 2016 IEEE International Conference on Smart Computing (SMARTCOMP), 2016, pp. 1–6. N. Naik, “Migrating from Virtualization to Dockerization in the Cloud: Simulation and Evaluation of Distributed Systems,” in 2016 IEEE 10th International Symposium on the Maintenance and Evolution of Service-Oriented and Cloud-Based Environments (MESOCA), 2016, pp. 1–8. A. Bhardwaj, V. K. Singh, Vanraj, and Y. Narayan, “Analyzing BigData with Hadoop cluster in HDInsight azure Cloud,” in 2015 Annual IEEE India Conference (INDICON), 2015, pp. 1–5. S. Radhakrishnan, B. J. Muscedere, and K. Daudjee, “V-Hadoop: Virtualized Hadoop using containers,” in 2016 IEEE 15th International Symposium on Network Computing and Applications (NCA), 2016, pp. 237–241. “Swarm mode key concepts,” Docker Documentation, 02-Mar-2017. [Online]. Available: https://docs.docker.com/engine/swarm/key-concepts/. [Accessed: 02Mar-2017]. B. Elvesæter, C. Carrez, P. Mohagheghi, A.-J. Berre, S. G. Johnsen, and A. Solberg, “Model-driven Service Engineering with SoaML,” in Service Engineering, Springer Vienna, 2011, pp. 25–54. N. Ferry, A. Rossini, F. Chauvel, B. Morin, and A. Solberg, “Towards ModelDriven Provisioning, Deployment, Monitoring, and Adaptation of Multi-cloud Systems,” in 2013 IEEE Sixth International Conference on Cloud Computing, 2013, pp. 887–894.