Big Data Industry Process

Definition: The big data process is the set of activities (business understanding, data collection, data exploration, data preprocessing, data mining, model evaluation, and deployment) carried out together in order to extract hidden information from a mass of data.

Fig.1: General overview of big data process

Big data process activities:
Based on my experience in data science, I summarize the big data process in the following steps.

Step 1: Understand the business
In this step, our concern is to:

Define the problem and its scope well

Have a clear view of the goal

Draw the path to the objective

Page 1 – Big Data Industry Process – Adil ZEAARAOUI

Step 2: Collect the data
Import and collect the data from different sources, such as an RDBMS, a data lake store, or a data warehouse.

Step 3: Understand and explore the data
Before any kind of development, we must first explore our dataset. The exploration consists of:

Exploring the features

Distinguishing categorical features from numerical ones

Doing statistical analysis: min, max, mean, standard deviation, variance, etc.

Visualizing the data: missing values for each feature, unique values, how values are distributed, etc.

Defining the features that matter to the business
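The exploration steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the sample `rows` and the helper names `split_feature_types` and `summarize` are hypothetical, and on real volumes we would more likely rely on a dataframe or distributed engine.

```python
import statistics
from collections import Counter

# Hypothetical sample records; None marks a missing value.
rows = [
    {"age": 25, "income": 3000.0, "city": "Rabat"},
    {"age": 32, "income": None, "city": "Casablanca"},
    {"age": 47, "income": 5200.0, "city": "Rabat"},
    {"age": None, "income": 4100.0, "city": None},
]

def split_feature_types(rows):
    """Separate numerical features from categorical ones by inspecting values."""
    numerical, categorical = [], []
    for name in rows[0]:
        values = [r[name] for r in rows if r[name] is not None]
        is_number = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        (numerical if values and is_number else categorical).append(name)
    return numerical, categorical

def summarize(rows, name, is_numerical):
    """Count missing and unique values, plus basic statistics or value counts."""
    values = [r[name] for r in rows if r[name] is not None]
    summary = {"missing": len(rows) - len(values), "unique": len(set(values))}
    if is_numerical:
        summary.update(
            min=min(values),
            max=max(values),
            mean=statistics.mean(values),
            std=statistics.stdev(values),          # sample standard deviation
            variance=statistics.variance(values),  # sample variance
        )
    else:
        summary["counts"] = Counter(values)  # how the values are distributed
    return summary

numerical, categorical = split_feature_types(rows)
for name in numerical + categorical:
    print(name, summarize(rows, name, name in numerical))
```

The value counts stand in for a visualization here; plotting them as a bar chart or histogram would be the natural next step.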

Step 4: Pre-process the data
This is the most important step in big data; it can take up to 90% of the whole process. Its purpose is to prepare the data before mining it. We must:

Correct wrong input values

Remove missing values

Fill the remaining missing values

Discretize continuous features

Remove correlated features

Normalize features if required

Remove outliers if necessary

Etc.
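Several of the pre-processing tasks listed above can be sketched as follows. This is a small standard-library illustration, not a reference pipeline: the columns `age`, `income` and `income_k`, the plausible age range 0..120, the 0.95 correlation threshold, median imputation and the IQR outlier rule are all assumptions chosen for the example.

```python
import statistics

def fill_missing(values):
    """Replace None with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

def inlier_mask(values, k=1.5):
    """Flag values inside [Q1 - k*IQR, Q3 + k*IQR] (the common IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [low <= v <= high for v in values]

def min_max(values):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins=3):
    """Map each value to an equal-width bin index in 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two columns."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical raw columns: None is a missing value, -1 is a wrong input.
age = [25, 32, None, 47, 51, 38, 29, -1]
income = [3000, 3600, 4100, 5200, 5600, 4400, 3300, 90000]
income_k = [v / 1000 for v in income]  # redundant copy of income, in thousands

# 1. Correct wrong inputs: ages outside 0..120 are treated as missing.
age = [v if v is not None and 0 <= v <= 120 else None for v in age]
# 2. Fill the missing values with the median.
age = fill_missing(age)
# 3. Drop a feature that is almost perfectly correlated with another one.
features = {"age": age, "income": income, "income_k": income_k}
if abs(pearson(income, income_k)) > 0.95:
    del features["income_k"]
# 4. Remove the rows whose income is an outlier.
keep = inlier_mask(features["income"])
features = {n: [v for v, ok in zip(col, keep) if ok] for n, col in features.items()}
# 5. Discretize age into three bins and normalize income.
features["age_bin"] = discretize(features["age"])
features["income"] = min_max(features["income"])
```

The order of the steps is itself a design choice: here outliers are removed before normalizing, so that a single extreme income does not squash all the other values toward zero.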

Step 5: Develop your model (data mining)
After building a clean, ready-to-process dataset, it is time to build our model:

Transform the dataset if required

Apply the machine-learning algorithm
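As an illustration of these two tasks, the sketch below standardizes two features (the transformation) and then applies a k-nearest-neighbours classifier. The choice of k-NN is ours for the sake of the example, since no particular algorithm is prescribed here; the data and the helper names `fit_scaler`, `scale` and `knn_predict` are hypothetical.

```python
import math
import statistics
from collections import Counter

def fit_scaler(column):
    """Learn the mean and standard deviation used to standardize a column."""
    return statistics.mean(column), statistics.pstdev(column)

def scale(value, params):
    """Standardize one value with previously learned (mean, std) parameters."""
    mean, std = params
    return (value - mean) / std

def knn_predict(train_x, train_y, point, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], point)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical clean dataset: two features and a label per customer.
ages = [22, 25, 27, 45, 50, 52]
incomes = [2000, 2300, 2100, 5200, 5600, 6000]
labels = ["low", "low", "low", "high", "high", "high"]

# Transform: put both features on the same scale so neither dominates distances.
age_params, income_params = fit_scaler(ages), fit_scaler(incomes)
train_x = [
    (scale(a, age_params), scale(i, income_params)) for a, i in zip(ages, incomes)
]

# Apply the algorithm to a new customer, scaled with the same parameters.
new_customer = (scale(48, age_params), scale(5000, income_params))
print(knn_predict(train_x, labels, new_customer))
```

Note that the new point is scaled with the parameters learned from the training data, not with its own statistics; reusing the fitted transformation is what keeps training and prediction consistent.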


Step 6: Evaluate and deploy the model
Before deployment, we must validate the model and see how accurate it is. So we must:

Evaluate and test the model

Review and enhance it

Deploy the model

Automate the system workflow
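The evaluation part of this step can be sketched as a hold-out test: we keep part of the data aside, train on the rest, and measure accuracy on the part the model has never seen. Everything in this example is an assumption made for illustration, including the toy threshold model, the 25% test ratio and the `records` data; deployment and workflow automation depend on the target platform and are not shown.

```python
import random

def train_test_split(records, test_ratio=0.25, seed=0):
    """Shuffle the records reproducibly and split them into train and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """Toy model: the midpoint between the mean of each class."""
    low = [x for x, label in train if label == "low"]
    high = [x for x, label in train if label == "high"]
    return (sum(low) / len(low) + sum(high) / len(high)) / 2

def predict(threshold, x):
    return "high" if x >= threshold else "low"

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labelled records: (income, label).
records = [
    (2000, "low"), (2100, "low"), (2200, "low"),
    (2300, "low"), (2400, "low"), (2500, "low"),
    (5000, "high"), (5200, "high"), (5400, "high"),
    (5600, "high"), (5800, "high"), (6000, "high"),
]

train, test = train_test_split(records)
threshold = fit_threshold(train)
y_true = [label for _, label in test]
y_pred = [predict(threshold, x) for x, _ in test]
print("accuracy on held-out data:", accuracy(y_true, y_pred))
```

If the score is not good enough, we go back to the review-and-enhance task (more features, another algorithm, better pre-processing) and evaluate again before deploying.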
