An Efficient Multidimensional Big Data Fusion Approach in Machine-to-Machine Communication

AWAIS AHMAD, ANAND PAUL, and MAZHAR RATHORE, Kyungpook National University, Korea

HANGBAE CHANG, Chung-Ang University, Korea

Machine-to-Machine communication (M2M) is nowadays increasingly becoming a world-wide network of interconnected, uniquely addressable devices that communicate via standard protocols. The prevalence of M2M is bound to generate a massive volume of heterogeneous, multisource, dynamic, and sparse data, which leads a system toward major computational challenges in analysis, aggregation, and storage. A critical problem is then how to extract useful information efficiently from the massive volume of data. Hence, to guarantee an adequate quality of analysis, diverse and capacious data needs to be aggregated and fused, and it is imperative to enhance the computational efficiency of fusing and analyzing such data. To address these issues, this article proposes an efficient, multidimensional, big data analytical architecture based on a fusion model. The basic concept implicates the division of magnitudes (attributes): big datasets with complex magnitudes can be divided into smaller data subsets using the five levels of the fusion model, which can then be easily processed by the Hadoop processing server. This formalizes the problem of feature extraction in applications such as earth observatory systems, social networking, and networking applications. Moreover, a four-layered network architecture is also proposed that fulfills the basic requirements of the analytical architecture. The feasibility and efficiency of the algorithms used in the fusion model are evaluated on a Hadoop single-node setup on an Ubuntu 14.04 LTS Core i5 machine with a 3.2GHz processor and 4GB memory. The results show that the proposed system architecture efficiently extracts various features (such as land and sea) from the massive volume of satellite data.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design

General Terms: Design, Algorithms, Performance

Additional Key Words and Phrases: M2M, data fusion, Big Data, Hadoop processing server

ACM Reference Format: Awais Ahmad, Anand Paul, Mazhar Rathore, and Hangbae Chang. 2016. An efficient multidimensional Big Data fusion approach in machine-to-machine communication. ACM Trans. Embed. Comput. Syst. 15, 2, Article 39 (May 2016), 25 pages. DOI: http://dx.doi.org/10.1145/2834118

This work was supported by the Brain Korea 21 Plus project (SW Human Resource Development Program for Supporting Smart Life) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (21A20131600005). This work was also supported by an Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) [No. 10041145, Self-Organized Software platform (SoSp) for Welfare Devices].

Authors’ addresses: A. Ahmad, A. Paul, and M. Rathore, The School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea, 702-701; email: [email protected], [email protected], [email protected]; H. Chang, Department of Industrial Security, College of Business and Economics, Chung-Ang University, Korea; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].

© 2016 ACM 1539-9087/2016/05-ART39 $15.00
DOI: http://dx.doi.org/10.1145/2834118

ACM Transactions on Embedded Computing Systems, Vol. 15, No. 2, Article 39, Publication date: May 2016.

1. INTRODUCTION

Recently, a great deal of interest has arisen in the field of big data and its analysis [Agrawal et al. 2011; Cohen et al. 2009; Dean and Ghemawat 2008], mainly driven by the extensive number of research challenges strongly related to real-world applications, such as modeling, processing, querying, mining, and distributing large-scale repositories. The term big data comprises the unstructured data that dwells in the data layers of technical computing applications [Herodotou et al. 2011] and the web [Michael and Miller 2013]. The data stored in the underlying layer of all these technical computing application scenarios have some precise individualities in common: (i) large-scale data, which refers to the size and the data warehouse; (ii) scalability issues, which refer to the application's likelihood of running on a large scale (e.g., big data); (iii) sustaining extraction-transformation-loading (ETL) methods from low-level, raw data to well-thought-out data up to a certain extent; and (iv) development of uncomplicated, interpretable analytics over big data warehouses with a view to delivering intelligent and momentous knowledge from them [Cuzzocrea et al. 2013]. Big data are usually generated by earth observatory systems, social networks, and networking applications [Eaton et al. 2012; Schneider 2012]. These data, accumulated in databases, grow extraordinarily and become complicated to confine, form, store, manage, share, process, analyze, and visualize via typical database software tools.

In the meantime, Machine-to-Machine (M2M) communication has been recognized in Information and Communication Technology (ICT) since it started at the beginning of the 21st century. M2M technology provides the possibility to connect sensors, actuators, things involved in the Internet of Things (IoT), and other devices. Such devices generate overwhelming amounts of data, which makes the coupling between the M2M and big data communities strong [Zhang et al. 2013; Ramaswamy et al. 2013; Zaslavsky et al. 2013]. As yet, there is no widespread approach that supports data acquisition, data aggregation, and data analysis from numerous objects and their exploitation. Based on the aforementioned needs, recent research efforts have focused on the data generation tiers [Haderer et al. 2013], the aggregation tiers [Mosser et al. 2012], the exploitation tiers [Mosser et al. 2013], and lastly, the data analysis tiers. Tasks such as locating, identifying, understanding, and citing data become somewhat more challenging in M2M communication [Labrinidis et al. 2012]. With large-scale data, all of this has to happen in a mechanized manner, because it requires diverse data structures as well as semantics to be articulated in a computer-readable format. Even analyzing a single, simple dataset requires an intelligently designed database. There might be alternative ways to store the same information; in such conditions, one design might have an advantage over others for an individual process and possible drawbacks for other purposes.

In the current big data analytics scenario, various platforms have been provided by relational database vendors for data aggregation and data analysis. Such platforms are either software only, or they provide analytical services that run in a third-party hosted environment. Furthermore, these platforms are meant for new technologies used to analyze massive volumes of data, such as web traffic (e.g., social media) and global positioning system (GPS) data. Nowadays, various analytical platforms are available on the market that can be used for specific applications (i.e., each of these platforms is designed for a particular goal).
The incredible growth in data also poses new challenges, such as how to aggregate massive volumes of data, how to store such data in the limited amount of memory allocated for a particular task, and how to process and analyze these data when no suitable intelligent algorithm is available. Moreover, large-scale data
cannot be tackled by standard reduction techniques, since their runtime becomes impractical. Several other approaches have been developed to enable data reduction techniques that deal with this problem. In the case of Prototype Reduction (PR), the data-level partitioning is based on a distributed partitioning model that sustains the class distribution. This type of reduction splits the original data into various subsets that can be addressed individually; afterward, it combines each partially reduced set into a global solution. Furthermore, torrents of event data need to be distributed over various databases, and large process-mining problems need to be distributed over a network of computers. Data fusion techniques that consider the abstraction level can also be employed. Such techniques are categorized by the abstraction level of their input and output data [Dasarathy 1997]: data in-data out (DAI-DAO), data in-features out (DAI-FEO), features in-features out (FEI-FEO), features in-decision out (FEI-DEO), and decision in-decision out (DEI-DEO). Moreover, several other approaches can be found in the literature [Cecchinel et al. 2014; Ramaswamy et al. 2013; Marchal et al. 2013; Paul and Rho 2015; Paul 2013, 2014, 2015]. However, a generic divide-and-conquer approach based on the fusion technique could be the optimal solution for the said challenges. Hence, in a nutshell, the following two main problems appear when we increase the dataset size during analysis:

—The existing big data architectures face various computational challenges, such as processing and analyzing a large amount of data, i.e., the data generated in real time by various remote sensing satellites, social network applications, and networking applications.

—Continuous feature extraction, such as river or highway detection, from remote sensory big data is a challenging issue.
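As a toy illustration, the abstraction-level taxonomy cited above [Dasarathy 1997] can be encoded as a small lookup. The helper below is hypothetical (not part of the paper's system); it merely names the five categories listed in the text.

```python
# Dasarathy's input/output abstraction levels for fusion techniques.
# Each level maps to its input tag and output tag, e.g. "DAI"/"DAO" for data.
LEVELS = {"data": ("DAI", "DAO"),
          "features": ("FEI", "FEO"),
          "decision": ("DEI", "DEO")}

def fusion_category(inp, out):
    """Label a fusion step by its input/output abstraction levels,
    e.g. raw data in, features out -> "DAI-FEO"."""
    if inp not in LEVELS or out not in LEVELS:
        raise ValueError("levels must be data, features, or decision")
    return f"{LEVELS[inp][0]}-{LEVELS[out][1]}"
```

For example, a feature extractor that consumes raw sensor data and emits features would be classified as `fusion_category("data", "features")`, i.e., DAI-FEO.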
Furthermore, finding the hidden correlation among various input files and obtaining the actual meaning of the data are challenging tasks. Such schemes require efficient algorithms for handling large-scale earth observatory datasets on a limited timescale. Therefore, in this work, we propose a system architecture designed for analyzing big data in M2M using a data fusion model that welcomes both real-time and offline data. To do so, various machines are used: earth observatory systems (i.e., satellites and sensors) and social network data sources (i.e., offline data stored in a server, networking applications), which collect data and directly transmit it to the ground base station or data collection point. The ground base station is composed of data collection points, which pre-process the raw data and extract useful information. Afterward, these data blocks are sent to the message queue, where each data block waits to be uploaded to the processing server. Furthermore, a fusion algorithm is employed that uses five levels of algorithms. These levels help in better understanding the data results and analyzing them in an efficient manner. Such techniques enhance the computational efficiency as well as give better results for predicting future data. Finally, the partitioned data blocks are sent to the Decision Making Unit (DMU), which is used for analysis as well as for storing the results. The decision-making server can utilize those results depending on the requirements of the user. The proposed architecture welcomes both real-time and offline data (e.g., GPRS, xDSL, or WAN). The contribution of the proposed system architecture is summarized as follows:

—The data aggregation technique concatenates all the data being generated by various machines into larger blocks.

—The larger data blocks are arranged in a sequential manner. Afterward, these data blocks are partitioned into smaller data blocks.
—These sub-data blocks are then sent to the Hadoop processing server, in which each sub-data block is processed using the fusion domain.

—The resulting storage device helps users get their desired results at any time, which can be used for future comparison, if needed.

The proposed analytical architecture for big data in M2M has several advantages. At the data acquisition stage, the data is concatenated to form a big data block, which helps the system combine data of the same type. The fusion domain helps enhance the efficiency of the Hadoop processing server by dividing the data into smaller data blocks. Each sub-block is then sent to a server for further processing, which helps increase the processing efficiency. Finally, users can use the desired results for comparison purposes.

2. RELATED WORK

Big data and its analysis are at the forefront of modern science and business. Authors have identified a number of sources of big data, such as online transactions, emails, audios, videos, search queries, health records, social networking interactions, images, click-streams, logs, posts, mobile phones and applications, scientific equipment, and sensors [Zhang et al. 2013]. Such data are difficult to handle using conventional database tools; the challenge is to capture, form, store, manage, share, analyze, and visualize the big data. Also, the three top characteristics of big data, namely variety, volume, and velocity, are elaborated briefly in Section 1. The concept of big data is stimulating broad curiosity in the industrial sector [Lin et al. 2006]. The report provides concrete examples of big data generated by sensors. For instance, manufacturing companies use various machines (e.g., sensors) embedded in their machinery equipment to monitor usage patterns, predict maintenance problems, and enhance product quality. Analyzing the data streams generated by these embedded machines allows manufacturers to improve their products. A massive volume of data is generated by numerous machines deployed in the supply lines of utility providers, which constantly monitor production quality, safety, maintenance, and so forth. Electronic sensors that frequently monitor mechanical and atmospheric conditions are a high-quality example of sensors generating bulk big data. Furthermore, sensors used in healthcare sectors (for monitoring biometrics on the human body, health care diagnoses, patients' conditions, treatment phases, etc.) are a huge source of information for big data, as presented in Lin et al. [2006].
However, gathering data from numerous sensors in an energy-efficient manner remains beyond the scope of the report. The objective of the cloud-based federated framework for sensor services is to enable a seamless exchange of feeds from large numbers of heterogeneous sensors [Ramaswamy et al. 2013]. Another discussion in the literature concerns densely distributed wireless sensor networks used for various applications generating big data. In healthcare sectors, the information (e.g., blood pressure and heart rate) provided by numerous sensors is used to support remote medical care services [Lin et al. 2006; Ross 2004]. The use of patients' location information for arranging the prompt dispatch of ambulances has also been explored. In Zhan et al. [2010] and Baumgartner et al. [2009], researchers observed various animal habitats by gathering massive volumes of data from location sensors attached to animals, since densely deployed wireless sensor networks in large geographic areas yield overwhelming amounts of data. Due to the limited range of wireless communication, the wireless sensor network may
be divided into subnetworks, since it is hard to aggregate the big data generated by a densely distributed wireless sensor network. A software architecture for in-network aggregation in a sensor network environment is described in Heidemann and Silva [2001]. In this scheme, the aggregation significantly reduces the network overhead by means of low-level naming. Also, to establish an optimal local path between source and sink nodes, a directed diffusion method is proposed. Such techniques help the formation of an aggregation tree rooted at the sink nodes. Moreover, if similar data come from a leaf node in the tree, the copies of the similar data are fused into a single message. While employing such techniques, however, the energy of the node drastically decreases. To improve the energy constraints of such opportunistic aggregation, a scheme called the greedy incremental tree is proposed [Intanagonwiwat et al. 2002]. The previous work, however, does not address the problem of synchronization among aggregation nodes, nor does it examine the associated credibility of the aggregation contents. One of the fundamental tasks of data analysis is to access the statistical features of remote sensing data. For instance, the image clustering technique needs to estimate statistical resemblance measurements [Yang et al. 2010], and the majority of classification algorithms need spatial statistics-based expressions that help in creating decision boundaries between different classes [Mishra and Singh 2013]. Furthermore, research on end-member detection revealed a need for the spatial distribution of the end members [Plaza et al. 2004]; the use of statistical characteristics of the datasets in this case is mentioned in Thompson et al. [2010]. For big data, there is also a need to estimate a vector of model parameters given a training dataset.
Such statistical feature estimations give us better and more correct information than simple analysis and could be used to support human interpretation of inferential outputs, perform bias correction, hypothesis testing, and many other potential uses [Wang et al. 2014]. Various studies have been performed on the statistical features of big data [Lu and Li 2013; Kleiner et al. 2012; Li et al. 2012; Wu et al. 2014; Cormode and Garofalakis 2009; Yang et al. 2009; Portilla and Simoncelli 2000]. In the majority of cases, we do not rely on their statistical measurements. To manifest the statistical measurements of big datasets, they need to be represented in some other form; the question is how to characterize big datasets, as the majority of data processing tasks rely on some appropriate data representation. For instance, the wavelet transform is used in image processing scenarios [Mallat 1989]; multi-resolution representations are used in remote sensing scenarios, such as image segmentation [Shah et al. 2010], image de-noising [Liu et al. 2012], image restoration [Liu and Eom 2013], image fusion [Audicana et al. 2004], change detection [Celik and Ma 2010], and feature extraction [Bruce et al. 2002]; and related representations are used for handover in M2M communication [Ahmad et al. 2016], optimized data transmission in device-to-device communication [Ahmad et al. 2015], and image interpretation. Hence, the assessment of statistical measurements of big data in the wavelet transform domain, in real time or offline, is a key challenging area. The majority of the work has been conducted on the statistical features of the wavelet field of an image signal or a small dataset. Such techniques have been applied in various image processing tasks; however, apart from the sparsity characteristic, there are other characteristics as well that the wavelet transform field deals with.
The first such characteristic is the inter-scale constraint, referred to as the tree structure [Crouse et al. 1998], whereas the second is the intra-scale constraint, referred to as the neighborhood coefficient [Liu and Moulin 2001]. Various datasets are transformed using the above-mentioned techniques; however, they fail to capture the correlation among various datasets.

Recent research techniques for analyzing big data in M2M can be classified into two types, namely, real-time big data analysis and offline big data analysis. Additionally, finding the hidden correlation among various datasets, in real time as well as offline, is also a challenging area. Therefore, the designed system architecture provides a useful remedy for delivering desirable results using both types of data analysis.

3. DATA FUSION IN M2M

In recent times, various machines are seamlessly integrated into information networks with a view to providing advanced services to the community. Since M2M is associated with a large number of heterogeneous devices, such as sensors, satellites, and the cellular network, which generate a massive volume of heterogeneous and dynamic data, the data fusion approach could be a useful remedy for the manipulation as well as management of such data, improving the processing efficiency. Data fusion is defined as “a formal framework in which are expressed the means and tools for the alliance of data originating from different sources” [Wald 2010; Nakamura et al. 2007]. The rationale behind the data fusion approach is to obtain information of greater quality in various networks, such as wireless sensor networks, the IoT, and so on. Data fusion is used as a tool composed of theories, methodologies, and algorithms, which are used to integrate the heterogeneous data generated from various sources (sensors, satellites, etc.). Aggregating and mining the resultant data from the databases results in achieving performance accuracy [Zhou et al. 2013]. According to Zhou et al. [2013], various popular data fusion techniques are available in the literature, such as interpretability and integration [Pawlak 2002], resource search and discovery [Rinne et al. 2012], semantic reasoning and interpretation, and so on; these studies are widely based on semantic web technologies [Zhou et al. 2013]. Moreover, big data management and mining are used for gathering meaningful information from the high volume of data generated from multiple sources; these studies are based on data fusion theories and algorithms [Zhou et al. 2013]. Therefore, in this article, the proposed big data architecture incorporates a data fusion approach, which is based on a partitioning and aggregation technique for big data.
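The partitioning-and-aggregation idea can be sketched in a few lines of Python. This is a minimal illustration under assumed byte-oriented records and a fixed block size, not the paper's actual implementation; the function names are hypothetical.

```python
def aggregate(records):
    """Concatenate records generated by various machines into one large block."""
    return b"".join(records)

def partition(block, block_size):
    """Split the large block into smaller sub-blocks that can be processed
    independently (e.g., each sent to a separate processing-server task)."""
    return [block[i:i + block_size] for i in range(0, len(block), block_size)]
```

A round trip through the two steps is lossless: joining the sub-blocks produced by `partition` reconstructs the aggregated block exactly.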
The algorithm focuses on improving the computational efficiency of the system using higher-dimensional data.

4. PRELIMINARIES

This section presents some preliminary assumptions made for a better understanding of the proposed system architecture for big data analytics using M2M. These assumptions are not strict, as they can be changed according to the data inputs as well as the system requirements. Some scenario-related definitions are also given.

4.1. Assumptions and Definitions

—Assumption 1 (Heterogeneous Devices). All the devices (e.g., sensors, satellites, etc.) have a different configuration, i.e., their computational capabilities, mobility patterns (of the satellites), and battery requirements differ from each other.

—Assumption 2 (Communication Radius Model). The communication range of a device “A” (satellite) has the radius “R” centered at “c,” approximately 800km above the surface of the earth. It can be defined as CR(c, R) = {A, q ∈ S : |D(A − q)| ≤ R_A}, where CR represents
communication radius, S represents the set of deployed devices, and D(A − q) is the distance between a device and the earth-surface point q in the M2M network.

—Definition 1 (Medium-Scale Network). Consider a wireless sensor network (WSN) environment in which we want to analyze the big data generated by various sensor nodes. In such an environment, if all the nodes have direct communication access to the sink node, then the network is considered a Medium-Scale Network (MSN). For example, a WSN comprising 100 sensor nodes deployed in an area of 100m × 100m is considered an MSN. This definition can be modeled as ∀p ∈ S, |D(p − BS)| < R_p, where p is a sensor node among the set “S” of deployed sensor nodes, D(p − BS) is the distance between a deployed network node p and the sink node, and R_p is the communication radius of node p.

—Definition 2 (Large-Scale Network). If any of the satellites does not have direct communication access to the ground station, then the M2M network is considered a Large-Scale Network (LSN). For example, an M2M network comprising n satellites and sensor nodes placed in space is considered an LSN. This definition can be modeled as ∃p ∈ S, |D(p − BS)| > R_p, where p is a satellite among the set “S” of placed satellites, D(p − BS) is the distance between a deployed network node p and the ground station, and R_p is the communication radius of node p.
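The two definitions above can be checked programmatically. The sketch below is illustrative (hypothetical helper names; 2-D coordinates stand in for device positions): a network is medium-scale when every node's distance to the sink is below its radius, and large-scale when some node's distance exceeds its radius.

```python
import math

def is_medium_scale(nodes, sink):
    """MSN (Definition 1): every node reaches the sink directly,
    i.e. |D(p - BS)| < R_p for all p in S.
    `nodes` is a list of (position, radius) pairs; `sink` is a position."""
    return all(math.dist(pos, sink) < radius for pos, radius in nodes)

def is_large_scale(nodes, sink):
    """LSN (Definition 2): some node lacks direct access to the ground
    station, i.e. |D(p - BS)| > R_p for at least one p in S."""
    return any(math.dist(pos, sink) > radius for pos, radius in nodes)
```

For instance, two nodes at distances 0 and 5 from the sink with radii 5 and 6 form an MSN, while adding a node at distance 50 with radius 10 makes the network an LSN.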

5. DATA FUSION APPROACH FOR BIG DATA ANALYTICS IN M2M

In this section, we propose a big data analytical architecture using the data fusion approach in M2M communication. First, we present a system model that justifies our proposal; a detailed description of the system model then follows.

5.1. System Model

A set of machines (satellites) deployed in an earth observatory system, denoted by $\hat{U} = \{S_1, S_2, \ldots, S_N\}$, collects data using a set of sensors or cameras from a Gaussian random field, such that each source $Z_i$, $\forall i \in \hat{U}$, is a Gaussian random variable with mean $\mu_i$ and variance $\sigma_i^2$, and the collection of data $Z_{\hat{U}} = \{Z_1, Z_2, \ldots, Z_N\}$ follows a multivariate Gaussian distribution. The covariance matrix of the $N$ Gaussian sources is denoted as $\Sigma = [\sigma_{ij}]_{N \times N}$, with $\sigma_{ij}$ being the covariance between machines $i$ and $j$, as follows:

$$\sigma_{ij} = \sigma_i \sigma_j \, e^{-d_{ij}^2 / c}, \qquad (1)$$

where $c$ is the correlation exponent and $d_{ij}$ denotes the Euclidean distance between machines $i$ and $j$. Given the fact that a Gaussian source is continuous, we assume that the data source is quantized with a pre-configured quantization level $\Delta$; the digitized observation of machine $i$ is denoted by $Z_i^{\Delta}$ [Juan et al. 2013] (interested readers are referred to Juan et al. [2013] for more details). For a small quantization level, the joint entropy $H(Z_{\hat{U}}^{\Delta})$ of the quantized source $Z_{\hat{U}}^{\Delta}$ can be expressed as

$$H\!\left(Z_{\hat{U}}^{\Delta}\right) = \frac{1}{2} \log_2\!\left(\left(\frac{2\pi e}{\Delta^2}\right)^{N} \det \Sigma\right). \qquad (2)$$
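As a numerical sanity check, the covariance model of Eq. (1) and the quantized joint entropy of Eq. (2) can be computed directly. The pure-Python helpers below are illustrative, not the paper's code; they assume a decaying exponential correlation in Eq. (1) and the joint Gaussian entropy with the determinant of the covariance matrix in Eq. (2).

```python
import math

def covariance(sigmas, positions, c):
    """Build Sigma = [sigma_ij] with sigma_ij = sigma_i*sigma_j*exp(-d_ij^2/c),
    as in Eq. (1); `positions` are machine coordinates, c the correlation exponent."""
    n = len(sigmas)
    return [[sigmas[i] * sigmas[j]
             * math.exp(-math.dist(positions[i], positions[j]) ** 2 / c)
             for j in range(n)] for i in range(n)]

def det(m):
    """Determinant by cofactor expansion (adequate for the small N used here)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def joint_entropy(cov, delta):
    """H(Z^Delta) ~ (1/2) * log2((2*pi*e/Delta^2)^N * det Sigma) bits, as in Eq. (2)."""
    n = len(cov)
    return 0.5 * math.log2((2 * math.pi * math.e / delta ** 2) ** n * det(cov))
```

With a single unit-variance source and Δ = 1, the entropy reduces to ½ log₂(2πe), the familiar entropy of a quantized standard Gaussian; nearby machines yield larger off-diagonal covariances and thus a smaller joint entropy than independent sources.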

Fig. 1. Four-layer structure of the proposed Big Data Analytical System.

5.2. Layered Architecture for Big Data in M2M

The proposed four-layered network architecture for big data in M2M is delineated in Figure 1. It consists of four functional layers: (i) the data collection layer, (ii) the communication layer, (iii) the processing layer, and (iv) the service layer. The data collection layer provides the necessary functionality to the entire system, consisting of data sensing, data acquisition, data buffering, and data processing. It is a multidimensional wired/wireless platform that welcomes various protocols, such as RFID, Ethernet, Bluetooth, and Radio Frequency (RF). The communication layer provides the basic functionalities for end-to-end connectivity among the various devices involved in M2M applications. This layer is also responsible for aggregating the measured data from different devices and arranging it into a proper format ready for communication; any routing and MAC protocol can be employed to achieve efficient communication between devices. The processing layer is the core unit of the big data analysis: it receives the aggregated data from the communication layer, processes it, and performs the necessary actions, depending on the nature of the data. The processing layer welcomes all kinds of data, such as ordinary data and big data. In the case of big data, the data is broken down into smaller blocks, and each block is separately processed using the Map-Reduce function. Afterward, the data blocks are aggregated again and stored in the result storage device for future analysis. Finally, the service layer provides connectivity to end users for various features, such as defining business rules, entities, and tasks and evaluating reports. For instance, with visual results in hand, professionals can better understand and explain the derived results; such results could also be used to predict future entities.
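The block-wise Map-Reduce step of the processing layer can be sketched as follows. This is a minimal Python illustration, not Hadoop itself; the function names are hypothetical, and the land/sea labels simply echo the feature-extraction example used in the article.

```python
from collections import Counter

def map_block(block):
    """Map phase: each sub-block yields a partial result
    (here, a histogram of extracted feature labels)."""
    return Counter(block)

def reduce_blocks(partials):
    """Reduce phase: merge the partial per-block results into one aggregate,
    which would then be stored in the result storage device."""
    total = Counter()
    for partial in partials:
        total += partial
    return total
```

For example, mapping the blocks `[["land", "sea", "land"], ["sea", "sea"]]` and reducing the partials yields an aggregate count of two "land" and three "sea" labels, regardless of how the data was split into blocks.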
6. PROPOSED BIG DATA ARCHITECTURE FOR M2M

6.1. Overview

The term big data covers diverse technologies, much like cloud computing. The input of big data comes from social networks (Facebook, Twitter, and LinkedIn), web servers, satellite imagery, sensory data, banking transactions, and so on. Despite the very recent emergence of big data architectures in scientific applications, numerous efforts toward big data analytics architectures can already be found in the literature. Among numerous others [Ramaswamy et al. 2013; Li et al. 2013; Marchal et al. 2014; Mayilvaganan and Sabitha 2013], this article presents a data fusion approach for a big data analytical architecture in M2M. The proposed architecture is used to analyze the massive volume of heterogeneous and dynamic data in an efficient manner. Figure 1 delineates “n” satellites that obtain Earth Observatory big data images through sensors or conventional cameras; such images or scenes are recorded using radiation. Special techniques are applied to process and interpret remote sensing imagery for the purpose of producing conventional maps, thematic maps, resource surveys, and the like. The proposed architecture is composed of various components, which are described below; the functionalities and workings of these parts are described in a later section. The proposed big data architecture welcomes both real-time and offline data. Real-time data are generated by satellites and are directly propagated toward the ground base station, whereas offline data, found on social networks such as Facebook, Twitter, LinkedIn, and YouTube, are first stored in databases; the stored data can then be processed by the system at any time. In the proposed architecture, the data is first collected from various applications, such as the earth observatory system, the M2M network, the IoT, and WSNs. After the data is collected, the pre-processing technique applies the designed rules.
These rules are application-dependent, since the nature of these applications is heterogeneous. Furthermore, a message queue is incorporated that helps in the alignment of the data, which is helpful for the fusion domain. The fusion domain is an important contribution of this article, since it is associated with five sets of rules. Each rule applies its own algorithm, which helps in fusing the data to obtain the actual semantics of each data block. The implementation and performance of the fusion domain are tested in Hadoop (MapReduce). The results processed by the Hadoop server are aggregated and then stored on storage devices. The rationale behind the storage device is to store the resultant values, which can be used in the future for further analysis. After obtaining the results, the proposed architecture visualizes the values in a human-readable graphical format.

6.2. Architecture of the Big Data

6.2.1. Data Collection and Preprocessing Mechanism. The deployment of satellites around the globe encourages the growth of the Earth Observatory System as a cost-effective parallel data acquisition platform that satisfies explicit computational requirements. A standard for parallel processing in this context has already been provided by the Space Science Society [Cohen et al. 2009]. Traditional data processing technologies cannot provide sufficient power to process the massive amount of data collected by satellites. Hence, satellite instruments for an Earth observatory system can be a better choice for data acquisition in a more sophisticated manner. Against this background, there is a need to design a system that can efficiently collect and process the data. Therefore, the designed architecture introduces a data collection and preprocessing mechanism for big data applications.

ACM Transactions on Embedded Computing Systems, Vol. 15, No. 2, Article 39, Publication date: May 2016.

Fig. 2. Deployment scenario for proposed system model.

The designed component gathers data from various satellites around the globe with the help of ground base stations and also collects data from social media, such as Facebook, LinkedIn, and Twitter, as well as from IoT and WSN, as shown in Figure 2. The data is transmitted to the ground base station using a downlink channel. This transmission is achieved either with direct communication or with the help of relay satellites along with an appropriate tracking antenna and communication link in the wireless atmosphere. We assume that the data must be corrected with different methods to remove distortion caused by motion of the platform relative to the Earth, platform attitude, Earth curvature, nonuniformity of illumination, and variations in sensor characteristics. We also assume that the correction of erroneous data is done at the satellites' end. However, for social network application data or networking application data, there is no such correction mechanism; the proposed architecture directly collects such data and starts processing it. From an application point of view, we consider satellite datasets for implementation purposes. The transformation of raw data into image form is performed by the satellites using the Doppler or SPECAN algorithms [EnviSat 2007]. The biggest challenge here is how the ground base station distinguishes among the satellites' data. To cope with this challenge, the satellites transmit their data to the ground station with a unique satellite ID. Initially, the data is collected from various applications, such as an Earth observation system, social networking applications, or networking applications. The proposed architecture accepts both offline and online data. The data generated by satellites is considered online because the proposed architecture does not store the continuous stream of data, as doing so may incorporate additional delay.
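The demultiplexing of incoming transmissions by unique satellite ID can be sketched as follows. This is a minimal illustration, assuming records arrive as (satellite ID, payload) pairs; the actual record layout used by the ground base station is not specified in the article.

```python
# Hypothetical sketch: group raw downlink records by their unique
# satellite ID so each satellite's stream can be processed separately.
from collections import defaultdict


def demux_by_satellite(records):
    """Group (satellite_id, payload) records by satellite ID."""
    streams = defaultdict(list)
    for sat_id, payload in records:
        streams[sat_id].append(payload)
    return streams


# Example: three records from two satellites (IDs are illustrative).
records = [("SAT-01", b"\x01\x02"), ("SAT-02", b"\x03"), ("SAT-01", b"\x04")]
streams = demux_by_satellite(records)
```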
Furthermore, the offline data refers to the social networking applications as well as the networking applications, because their resultant values can be accessed at any time. Therefore, to process such types of data, we need to define certain rules. These rules are application-dependent, based on whether the data is offline or online, and are defined as follows.

Rule (1): continuous stream of data → data collection → store → split messages (e.g., earth observatory data, online streaming, etc.)

Rule (2): storage device → split messages → ready for queue (e.g., social networking applications, networking applications)

In the case of Rule (1), if we are considering remote sensory communication, a continuous stream of data is first collected using data acquisition. After data acquisition, the data is aggregated by a user-defined algorithm. Afterward, according to the technique developed by the user, the data is split into smaller chunks. Each chunk is processed separately, which helps in enhancing data processing. In the case of Rule (2), if we are considering social networking or networking applications, there is no continuous stream of data; that is, we have the data in hand and need to process it by means of a user-defined algorithm. Note that these rules differ slightly from each other: Rule (1) does not need to access the data from the storage device, since the incoming stream of data flows directly to the queue. Alternatively, Rule (2) needs to access the data from the storage device based on various configuration commands. For effective data analysis, data configuration schemes, such as data integration, data cleaning, and data redundancy removal, are used to extract useful information from the raw data. The raw data preprocessing technique combines satellite data with their unique satellite IDs and stores them on its servers. This technique is used to create equal-size data blocks.

Fig. 3. Data collection and preprocessing.
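Rules (1) and (2) and the equal-size block splitting can be sketched as below. This is an illustrative outline, not the authors' implementation; the zero-padding of the final block and the default block size are assumptions made to keep all blocks the same length.

```python
def split_into_blocks(data, block_size):
    """Split a data sequence into equal-size blocks.

    The last block is zero-padded (an assumption for illustration) so
    every block has the same length, as required by the preprocessing
    step described in the text.
    """
    blocks = [list(data[i:i + block_size]) for i in range(0, len(data), block_size)]
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1] += [0] * (block_size - len(blocks[-1]))
    return blocks


def preprocess(source_kind, stream=None, storage=None, block_size=4):
    """Dispatch between the two application-dependent rules."""
    if source_kind == "online":
        # Rule (1): collect the continuous stream, store it, then split.
        collected = list(stream)
        return split_into_blocks(collected, block_size)
    # Rule (2): data already sits on the storage device; split and queue.
    return split_into_blocks(list(storage), block_size)
```

For example, `preprocess("online", stream=range(10))` yields three equal-size blocks, with the last one padded.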
To enhance computational efficiency, the raw data preprocessing technique not only decreases the required storage capacity but also helps in splitting big data blocks into equal sizes, which improves the analysis accuracy at each server. The block diagram of data collection and preprocessing is shown in Figure 3.

6.2.2. Message Queue and Data Fusion Mechanism. After the data has been collected and preprocessed, it is transmitted to the storage server. The storage server can place the data blocks in a queue to enhance processing efficiency. Moreover, the message queue server also queues up the data blocks for the Hadoop processing server so that each block can be processed using the fusion technique. In the Hadoop processing server, each block is further subdivided into smaller blocks so that data fusion can be applied to each subblock. This division of data blocks helps in enhancing the computational efficiency of the processing server. Furthermore, the processing server is supported by a filtration and load balancing server. The basic responsibility of the filtration part is to identify the data useful for analysis: it passes only useful information, whereas the rest of the data is blocked and discarded. The load balancing part of the server, on the other hand, assigns subblocks to the various processing servers. The filtration and load balancing algorithms vary from analysis to analysis; for example, if only sea wave and temperature data need to be analyzed, the measurements of these data are filtered out and segmented into parts. Each processing server has its own algorithm implementation for processing an incoming segment of data from the message queue. Each processing server makes statistical calculations, takes measurements, and performs other mathematical or logical tasks to generate intermediate results for each segment of data. Since these servers perform their tasks independently and in parallel, the performance of the proposed system is dramatically enhanced, and the results for each segment are generated in real time. The results produced by each server (processing and assigning of subblocks) are then sent to the fusion domain for integration, compilation, organization, and storage for further processing. The detailed explanation and algorithms are described below.

The proposed architecture employs the data fusion model originally proposed by the U.S. Joint Directors of Laboratories (JDL) and the U.S. Department of Defense (DoD), with terms of reference for data fusion techniques [Nakamura 2007]. However, we have modified the existing model based on our system requirements and on the abstraction of the data generated during fusion. The data fusion model consists of five processing levels and a data bus connecting all levels, as shown in Figure 2. The descriptions and designed algorithms of the various levels are given below.

• Level 0: Process alignment, which allocates each subblock to a server, helping to minimize the processing load.

Algorithm 1: Level 0
Input: Satellite product
Output: Fixed-size blocks assigned to each process/server
Steps:
1. Extract the image-related data (pixel values).
2. Make fixed-size blocks of size BS = 1,000 × 1,000 from the extracted image. Each block is denoted by Bi, where 1 ≤ i ≤ BS.
3. Select appropriate sources to process Bi. Assign and transmit each distinct block Bi to the various processing servers/processes for Level 1.

• Level 1: Object refinement, which converts the data into a consistent structure, such as the conversion of various types of data (i.e., images, data) into the target source.

Algorithm 2: Level 1
Input: Block Bi
Output: Statistical parameter results
Steps:
1. Convert the data into a consistent structure.
2. Calculate statistical parameters:
   a. Correlations
   b. Regression
   c. Variations
3. Transmit the results, together with the block ID and product ID, to Level 3.

• Level 2: Situation refinement, which provides a contextual description of the hidden correlation between subblocks. This mechanism is based on environmental data that helps in the identification of a situation.
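The per-block statistics of Level 1 (Algorithm 2) can be sketched as follows. This is a minimal, self-contained illustration: the dictionary layout and the choice of range as the "variation" measure are assumptions, and Pearson correlation is written out by hand rather than relying on any particular library.

```python
import statistics


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def level1_refine(block, prev_block=None):
    """Compute the statistical parameters named in Algorithm 2 for one
    block of pixel values (illustrative result layout)."""
    result = {
        "mean": statistics.mean(block),
        "sd": statistics.pstdev(block),
        "variation": max(block) - min(block),
    }
    if prev_block is not None:
        # Correlation with the previous block; a high value suggests
        # both blocks belong to the same area (used at Level 2).
        result["correlation"] = pearson(block, prev_block)
    return result
```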


Fig. 4. Block diagram of fusion algorithm.

Algorithm 3: Level 2
Input: Block Bi results
Output: Identified objects and relationships in the block
Steps:
1. Collect every Bi's result from Level 1.
2. Identify relationships between current and previous results.
3. Identify the situation based on the observed events.
4. Compile the results and transmit them to Level 3.

• Level 3: Threat refinement, which predicts future threats and vulnerabilities for operations. This level is hard to achieve, since it deals with computational complexity and the design of the algorithms.

Algorithm 4: Level 3
Input: Bi results and the identified situation and events
Output: Final decision
Rules used:
1. Statistical parameters of Level 1.
2. Identified situation and events.
Steps:
1. Perform decision making and identification.
2. Broadcast decisions.
3. Store current decisions for future use.

• Level 4: Process refinement, which constantly monitors processing performance and allocates the sources according to the specified goals.

The block diagram of the levels mentioned above is delineated in Figure 4. Based on these levels, we are now in a better position to define the process cycle based on the data fusion model for M2M communications. The process cycle consists of four attributes that are intercorrelated with each other to achieve the desired goal. The various activities of the process cycle are:

• Raw data is collected from the physical environment (Earth observatory system, social network applications, networking applications, etc.). This step corresponds to Level 0 of the data fusion model.
• Once the raw data is collected, the next step is to organize the data in a meaningful form. Afterward, the unwanted data is discarded. These steps can be achieved through comparison and analysis methods. This step corresponds to Level 1 of the data fusion model.
• To extract useful information from the raw data, the data blocks are fused and analyzed. This step corresponds to Level 2 of the data fusion model.
• Finally, the fusion results are delivered to the end user in visual form. The end user is then able to understand and read the data, which helps in making decisions as well as predicting the future.

The process cycle is delineated in Figure 5.

Fig. 5. Process cycle of the data fusion model in big data.

6.2.3. Data Aggregation and Decision-Making Mechanism. Data aggregation and decision making are the final stages of the proposed architecture. Initially, when the fusion results are ready for compilation, the Hadoop processing server sends the partial results to the aggregation server, because these partial results are not yet in an organized form. Therefore, the related results need to be aggregated and organized into a proper form for further processing and storage. In the proposed architecture, aggregation is supported by various algorithms that compile, organize, store, and transmit the results. Again, the algorithm varies from requirement to requirement and depends on the analysis needs. The aggregation server stores the compiled and organized results in result storage so that any server can use them for its processing at any time; the aggregation server also sends a copy of the results to the fusion result storage device, where the results are finally stored. The purpose of the storage server is to make the results available in the future if they are needed for reference. Afterward, the final results are sent to a decision-making server that processes the results to make a decision.
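The aggregation step just described, merging partial per-block results into an organized form per product, can be sketched as follows. The (product ID, block ID, stats) triple format and sorting by block ID are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the aggregation server: partial results arrive
# from independent processing servers in arbitrary order and are
# organized per product, ordered by block ID, before storage.
from collections import defaultdict


def aggregate(partial_results):
    """Group (product_id, block_id, stats) triples by product and order
    each product's results by block ID."""
    by_product = defaultdict(dict)
    for product_id, block_id, stats in partial_results:
        by_product[product_id][block_id] = stats
    return {p: [blocks[i] for i in sorted(blocks)]
            for p, blocks in by_product.items()}
```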
The decision-making server is supported by the decision algorithm, which queries different aspects of the results and then makes various decisions (e.g., in our analysis, we analyze land, sea, and ice datasets). Other findings, such as those from social network applications (e.g., Facebook, LinkedIn, Twitter, YouTube, etc.) and networking applications (IoT, WSN, etc.), as well as fires, storms, tsunamis, and earthquakes, can also be obtained. The decision algorithm must be robust and accurate enough to efficiently produce results, discover hidden patterns, and make decisions. The decision part of the architecture is significant, because even a small error in decision making can degrade the efficiency of the whole analysis. Finally, the results are displayed or the decisions are broadcast so that any application can utilize them, in real time or offline, for its own purposes. The applications can be any business software, general-purpose community software, other social networks, and so on that need these findings (i.e., decisions). A self-explanatory flowchart that supports the working of the proposed architecture is depicted in Figure 6.

7. ANALYSIS AND DISCUSSION

This section provides the implementation of two cases: a remote sensing application and a networking application (an IoT scenario implementation).

Fig. 6. Flowchart of earth observatory big data architecture.

The detailed analysis is performed using a single-node Hadoop setup. Hadoop provides a parallel, high-performance computing facility using a large number of servers, which makes it suitable for analyzing a massive volume of data in an efficient way. To explain further, we divide this section into two implementation cases, Case 1 and Case 2. Case 1 consists of Earth observatory satellite datasets, whereas Case 2 consists of networking application datasets. The detailed descriptions are as follows.

Case 1:

The proposed system architecture is composed of five levels of data fusion, in which various algorithms are employed to analyze and test European Space Agency (ESA) products [European Space Agency (ESA) 2014a] and a networking application scenario. In the ESA products, the Land and Sea areas are detected using the various algorithms developed for data fusion. The ESA datasets are illustrated in Table I. The datasets, also referred to as products, are taken from the ESA [European Space Agency (ESA) 2014b] and cover different areas of Europe and Africa. These products consist of Earth-monitored areas captured by ENVISAT, which captured the ground surface from approximately 800 km above the surface of the Earth. For analysis purposes, five ENVISAT-captured products are considered, consisting of earth data captured by Advanced Synthetic Aperture Radar (ASAR) satellite sensors. These products cover the Land, Sea, Desert, and Forest areas of Poland, Germany, Vietnam, Western Sahara, Mauritania, South Africa, and Spain from 2002 to 2009. Each product is explicitly described in Table I by attributes such as product type, software used for processing and capturing, sensing time, mission, and the area covered.

Table I. Datasets Information

Product 1. ASA_APM_1PNPDE20091007_025628_000000432083_00118_39751_9244 (23MB)
  Product Type: ASA_APM_1P; Software Ver.: ASAR/4.07; Sensing Time: 07-OCT-2009 02:56:29.39749; Mission: ENVISAT; Covered Area: Sea and Land (Vietnam)
Product 2. ASA_APM_1PXPDE20020819_093008_000000622008_00394_02452_0009 (33MB)
  Product Type: ASA_APM_1P; Software Ver.: SAR/3.00S00; Sensing Time: 19-AUG-2002 09:30:08; Mission: ENVISAT; Covered Area: Land (Poland and Germany)
Product 3. ASA_GM1_1PNPDE20100415_224615_000004102088_00345_42483_4425 (9.4MB)
  Product Type: ASA_GM1_1P; Software Ver.: SAR/5.03L03; Sensing Time: 15-APR-2010 22:46:21.294; Mission: ENVISAT; Covered Area: Sea and Land (Forest, Desert) (Western Sahara, Mauritania)
Product 4. ASA_WSM_1PNDPA20050331_075939_000000552036_00035_16121_0775 (55MB)
  Product Type: ASA_WSM_1P; Software Ver.: SAR/4.02; Sensing Time: 31-MAR-2005 07:59:36.4091; Mission: ENVISAT; Covered Area: Sea and Land (Cape Town, South Africa)
Product 5. ASA_WSM_1PXPDE20021117_104431_000000672011_00180_03741_0009 (67MB)
  Product Type: ASA_WSM_1P; Software Ver.: SAR/3.03S00; Sensing Time: 17-NOV-2002 12:58:52.00; Mission: ENVISAT; Covered Area: Sea and Land (Spain)

Table II. Statistics of Products/Datasets for Land Area

Product | No. of blocks taken | No. of sample values (pixels) in each block | Mean        | SD        | Max. Value
01      | 20                  | 20,000                                      | 1,604–3,067 | 598–1,787 | 6,560
02      | 20                  | 20,000                                      | 408–603     | 91–215    | 7,902
03      | 20                  | 20,000                                      | 1,351–3,050 | 462–776   | 6,409
04      | 20                  | 20,000                                      | 928–1,439   | 298–718   | 12,676
05      | 20                  | 20,000                                      | 928–2,372   | 387–979   | 30,608

Initially, the products, which are composed of Land and Sea, are analyzed separately by considering the mean values, the variations in the satellite product image values, and the maximum image values. Approximately 20 Land and Sea blocks, each composed of approximately 20,000 pixels, are taken from each product and analyzed on the mentioned parameters. In the Land blocks, the mean value, as well as the variation in the pixels, is quite high, except for Product 2; the lower mean values there are due to the Forest areas in that product. For the Sea blocks, the mean values and the variation of the pixels are lower. Furthermore, the correlation between blocks is also examined, and two consecutive blocks are found to be correlated at high rates. Therefore, blocks that meet a particular correlation threshold can be considered part of the same area (i.e., either Land or Sea). A summary of the analysis results is shown in Table II for the Land areas and in Table III for the Sea areas. Figures 7 and 8 show the pixel value distributions for Land and Sea, respectively. In Figure 7, the pixel values are mostly above 1,500 and the variation in the pixels is high, whereas Figure 8 shows higher mean values for the Sea block while the variations remain very low. The mean value is high because of the shade and time of the image. It is also observed that the pixel values vary between two different blocks, as shown in Figures 9 and 10. Figure 9 represents the pixel view, which shows lower variation in the pixel values. However, various colors

Table III. Statistics of Products/Datasets for Sea Area

Product | No. of blocks taken | No. of sample values (pixels) in each block | Mean    | SD        | Max. Value
01      | 20                  | 20,000                                      | 253–287 | 231–402   | 10,387
03      | 20                  | 20,000                                      | 264–829 | 937–2,429 | 17,623
04      | 20                  | 20,000                                      | 514–617 | 368–436   | 5,933
05      | 20                  | 20,000                                      | 365–820 | 61–699    | 5,167

Fig. 7. Pixel values distribution (PVD) of various products for Land area.

could be seen in the Land blocks because of the city areas, roads, rivers, houses, and so on, on the land; this results in higher variation in the pixel values, as shown in Figure 10.

Evaluation: On the basis of the ESA product classifications (i.e., Land and Sea areas), the analysis results are applied at the different levels of the fusion domain, that is, the Level 1, Level 2, Level 3, and Level 4 algorithms. The evaluation of the proposed system focuses mainly on the efficiency, based on the Hadoop implementation, of classifying Land and Sea successfully. The system efficiently detected Land and Sea in Products 1, 3, 4, and 5, as delineated in Figure 11.

Case 2:

In Case 2, health-related sensor datasets, including heart rate, ECG, and body temperature datasets, are considered for analysis. The heart rate data is taken from Reiss et al. [2012a, 2012b]; it was generated from various patients while they performed various activities. Furthermore, the ECG dataset corresponding to different activities, generated by Banos et al. [2014], is taken for analysis purposes. Moreover, to consider fire detection scenarios, temperature data is generated by placing heat sensors and cameras in a room. Overall, more than 2GB of data is analyzed. Also, the room


Fig. 8. PVD of various products for the Sea area.

Fig. 9. Pixel view of Sea blocks.

Fig. 10. Pixel view of various Land blocks.



Fig. 11. Resulting images: (a) Product 1; (b) Product 3; (c) Product 4; (d) Product 5.

Fig. 12. Heart rates of various patients.

temperature dataset file consists of four attributes, the heart rate dataset has 54 attributes, and the ECG dataset has 24 attributes per record. We analyzed the heart rates of various patients during various activities, such as watching TV, going up and down stairs, running, and jumping rope. An ICU patient, a 9-month-old infant, is also considered. We observed that the ICU patient's heart rate is quite high: compared to adults, children have a higher heart rate, and this patient also has a serious disease that causes an elevated heart rate. A few times, the heart rate crosses a serious threshold, such as at the readings of 203, 195, and 389. At these readings, the system generates an alarm and performs a quick action by calling doctors. By analyzing the heart rates during other activities, we observed that for strenuous activities, such as jumping rope, ascending stairs, and running, the heart rate increases. The system continuously measures the heart rate and generates an alarm telling the patient to stop the activity when the heart rate crosses its normal range. We also noticed that for normal activities, such as watching TV, walking, or descending stairs, the heart rate is normal and fluctuates slowly. The heart rates of various patients while performing different activities are depicted in Figure 12. Along with the heart rate, the temperature of the patients is also monitored and analyzed continuously. The temperature is measured by placing the sensor on the chest.
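The threshold-based alarm on heart rate readings can be sketched as follows. The article does not state the numeric value of the serious threshold, so the default of 180 bpm here is purely an assumption for illustration.

```python
def heart_rate_alarms(readings, serious_threshold=180):
    """Return (index, value) pairs for readings above the serious
    threshold; in the deployed system each such reading would trigger
    an alarm and a call to the doctors.

    serious_threshold=180 is an assumed value, not from the article.
    """
    return [(i, hr) for i, hr in enumerate(readings) if hr > serious_threshold]
```

For example, a reading sequence containing 203 and 195 would flag exactly those two readings.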


Fig. 13. Temperature of patients.

Normally, the temperature of the body remains almost the same. However, when a person has a fever, the temperature crosses the normal limit (i.e., 38°C). This temperature threshold can vary with the age of the patient. We considered the patients while walking, running, and ascending stairs; however, these activities do not have a major impact on temperature, as we saw in the case of heart rate. We did perceive that when a person has a fever, there is a major deviation in temperature. Figure 13 shows the temperature graph of the patients, in which one patient's fever is detected while walking. During other activities, a patient has a fever twice, as shown in Figure 13. Since the temperature measurements are taken while performing various activities at various times, the temperature changes, as demonstrated by the green line in Figure 13. Furthermore, to check the feasibility of the proposed system architecture, we have also considered fire detection using temperature sensors and flash lights. We deployed temperature sensors in various locations of a home; they send data to the main station whenever a change in temperature occurs. We observed that, at the time of a fire event, the temperature dramatically moves away from the ordinary temperature range, and the light also flashes with high intensity. We also observed that the room temperature sometimes exceeds the normal range for reasons other than fire, such as cooking. Therefore, two thresholds are considered for detecting fire in a room: a serious threshold and a normal threshold. The room temperature data with four to five fire events is illustrated in Figure 14. The serious threshold is set to 60°C, and the normal threshold to 50°C.
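The two-threshold scheme can be sketched as below, using the stated 50°C/60°C values. The statistical check between the two thresholds is simplified to a deviation-from-recent-mean test, and the 15°C deviation limit is an assumption; the article only says that variation and mean mechanisms based on previous history may be used.

```python
def classify_temperature(reading, history, normal=50.0, serious=60.0,
                         deviation_limit=15.0):
    """Classify one room-temperature reading.

    Above the serious threshold: alert immediately (fire brigade/police).
    Between thresholds: compare against the mean of recent history
    (deviation_limit is an assumed parameter) before deciding.
    Otherwise: no action.
    """
    if reading > serious:
        return "alert"
    if reading > normal:
        mean = sum(history) / len(history)
        # A large jump from recent history suggests a real fire rather
        # than a benign cause such as cooking.
        return "alert" if reading - mean > deviation_limit else "analyze"
    return "no-action"
```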
When the temperature of the room exceeds the serious threshold, the system raises an alert to perform the necessary action, and a call is also generated to the fire brigade and police station, as shown at reading nos. 56, 110, 65, and 435. However, when the data exceeds only the normal range, the data is analyzed by computing statistical measures based on the previous history; the system may use variation and average (mean) mechanisms to analyze the temperature for fire detection. Finally, action is taken based on the analysis results; if the temperature is normal, no action is taken. The normal and serious thresholds vary from season to season, depending on the location of the sensor. Efficiency measurements are taken by considering the processing time and the throughput in megabits per second for various datasets, as shown in Figures 15 and 16. The MapReduce implementation of the analysis and fusion algorithms takes a few seconds to process gigabytes of datasets. It takes just 70 seconds to analyze 2GB of "Combine"


Fig. 14. Room Temperature with fire.

Fig. 15. Efficiency of system in terms of processing time on healthcare datasets.

Fig. 16. Efficiency of system in terms of throughput on healthcare datasets.



data on a single node. Since the temperature dataset's size and number of attributes are smaller, it takes a very short time to process. Furthermore, when the size of the dataset increases, the throughput increases due to the distributed and parallel processing of the Hadoop system. For smaller-sized products, the Hadoop implementation is not efficient because of the many input and output operations caused by the Map and Reduce functions. For large-sized products, Hadoop divides the whole product into blocks and performs tasks on them in parallel, resulting in increased efficiency. Since the sizes of the temperature and Activity_HR data are quite low, the throughput for these datasets is very small.

8. CONCLUSIONS AND FUTURE WORK

In this article, we proposed a multidimensional big data architecture based on the fusion model technique. The proposed architecture efficiently processes and analyzes real-time and offline data for decision making. It is composed of various components, such as the data collection unit, message queue, Hadoop processing server, fusion result aggregation and storage unit, and decision-making unit. These units implement algorithms for each level of the architecture depending on the required analysis. The proposed architecture for analyzing real-time big data is generic (i.e., application independent), so it can be used for any big data analysis (e.g., remote sensing, social networking, or networking applications). Furthermore, after data collection, the fusion model and the Hadoop parallel processing operate only on useful information, discarding all other extra data; these characteristics make the architecture a good choice for real-time big data analysis. The algorithms proposed in this article for each unit and subunit are used to analyze remote sensing datasets, which helps to better understand the Land and Sea areas. Moreover, we also proposed a four-layered network architecture for big data analytics based on the fusion model; the proposed layered architecture verifies the working of the proposed system architecture. The proposed architecture invites researchers and organizations to perform any big data analysis by developing algorithms for each level of the architecture depending on their analysis requirements. For future work, we plan to extend the proposed architecture to make it compatible with big data analysis for social applications (e.g., Facebook and Twitter).
We also plan to use the proposed architecture to perform complex analyses on earth observatory data for real-time decision making, such as earthquake prediction, tsunami prediction, fire detection, and earth detection on land as well as at sea.

REFERENCES

Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2011. Big data and cloud computing: Current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology. ACM, 530–533.
Awais Ahmad, Alfred Daniel, and Anand Paul. 2014. Optimized data transmission using cooperative devices in clustered D2D communication. In Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems. ACM, 209–214.
Awais Ahmad, Anand Paul, and M. Mazhar Rathore. 2016. An efficient divide-and-conquer approach for big data analytics in machine-to-machine communication. Neurocomputing 174, Part A, 439–453.
Awais Ahmad, Anand Paul, M. Mazhar Rathore, and Seungmin Rho. 2015. Power aware mobility management of M2M for IoT communications. Mobile Information Systems 2015, Article 521093, 14 pages. DOI: http://dx.doi.org/10.1155/2015/521093
Oresti Banos, Rafael Garcia, Juan A. Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, and Claudia Villalonga. 2014. mHealthDroid: A novel framework for agile development of mobile health applications. In Ambient Assisted Living and Daily Activities. Springer, 91–98.

ACM Transactions on Embedded Computing Systems, Vol. 15, No. 2, Article 39, Publication date: May 2016.

Received April 2015; revised July 2015; accepted October 2015