big data and cognitive computing

Article

An Experimental Evaluation of Fault Diagnosis from Imbalanced and Incomplete Data for Smart Semiconductor Manufacturing

Milad Salem, Shayan Taheri and Jiann-Shiun Yuan *

Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816, USA; [email protected] (M.S.); [email protected] (S.T.)
* Correspondence: [email protected]; Tel.: +1-407-823-5719
Received: 18 July 2018; Accepted: 14 September 2018; Published: 21 September 2018
Big Data Cogn. Comput. 2018, 2, 30; doi:10.3390/bdcc2040030; www.mdpi.com/journal/bdcc

 

Abstract: The SECOM dataset contains information about a semiconductor production line, including the products that failed the in-house test line and their attributes. This dataset, similar to most semiconductor manufacturing data, contains missing values, imbalanced classes, and noisy features. In this work, the challenges of this dataset are met and many different approaches for classification are evaluated to perform fault diagnosis. We present an experimental evaluation that examines 288 combinations of different approaches involving data pruning, data imputation, feature selection, and classification methods, to find the suitable approaches for this task. Furthermore, a novel data imputation approach, namely “In-painting KNN-Imputation”, is introduced and is shown to outperform the common data imputation technique. The results show the capability of each classifier, feature selection method, data generation method, and data imputation technique, with a full analysis of their respective parameter optimizations.

Keywords: classification; data imputation; fault detection; machine learning; semiconductor manufacturing

1. Introduction

A semiconductor manufacturing process holds a colossal amount of data that is gathered from many sensors during the process. The gathered data contains valuable information regarding each produced entity, which can be used to gain knowledge about the yield of the production line, the failures that occurred and their results, and much other information that is crucial to the operation of the semiconductor plant. Therefore, how this data is handled is of great importance.

Handling semiconductor data can be challenging. Firstly, a semiconductor manufacturing line holds many pieces of equipment and sensors, each being operational for long periods of time. Therefore, the amount of data that needs to be handled is large. Secondly, failures are rarely seen in this process. Being a highly expensive process, the semiconductor industry has evolved to be inherently controlled on all fronts to lower the number of failures that occur in the process. This fact gives rise to the challenge of an imbalanced dataset, in which failure instances make up only a small portion of all observed instances. Thirdly, being an engineering process in the real world, missing data is common in semiconductor manufacturing. Some parts of the data or specific features might be missing due to malfunction or mismanagement, which creates the challenge of handling missing data. Lastly, many of the signals that are recorded in the process have no correlation with the classification at hand or are noisy. This fact creates a need for feature selection to find the most important values.

The SECOM dataset is a good example of semiconductor data. This dataset, which holds data regarding instances that resulted in failure, is a highly imbalanced dataset with many missing values and noisy features. Therefore, this dataset creates an opportunity for realistic classification in the task


of fault detection and diagnosis. Previous work has been done to classify this dataset and to handle the imbalanced data and irrelevant features. In [1,2], the challenge of imbalanced data was evaluated and approaches for oversampling the minority distribution to create balance between the classes were introduced. In [3], the challenge of imbalanced data was also evaluated from an under-sampling perspective, showing that oversampling performs better on this dataset. In [2,4–6], different approaches for feature selection were proposed to rise to the challenge of noisy features. The challenge of missing data remains unexplored in the SECOM dataset. To the authors’ knowledge, no literature thoroughly classifies the SECOM dataset after performing specialized data imputation.

In this work the SECOM dataset is classified using a wide range of combinations of approaches. The three challenges of data imbalance, missing data, and noisy data are handled via synthetic data generation, data imputation, and feature selection, respectively. Moreover, different classifiers are evaluated, and their performance is analyzed based on the task at hand. To face the challenge of missing data, a novel data imputation technique called “In-painting KNN-Imputation” is introduced, which is inspired by image in-painting. In the end, leveraging the feature importance delivered by the classification model, fault diagnosis is performed, which demonstrates which features and measured parameters during the manufacturing process have a high effect on the failure of the device.

The rest of the paper is organized as follows. In Section 2 the background knowledge is provided, covering semiconductor manufacturing and the manufacturing processes in this industry. The methodology of our work, including the classification stages, the processes of data preparation, and the procedures of constructing, evaluating, and interpreting the model for the data, is discussed in Section 3. The designed and executed experiments and the achieved results are presented in Section 4. The conclusion is given in Section 5.

2. Background

In this section, the general information and concepts regarding smart semiconductor manufacturing and this work are outlined.

2.1. Semiconductor Manufacturing Processes and the Data Volume

A sub-section of microelectronics is semiconductor manufacturing, in which “semiconductor wafers”, the substrates for semiconductor devices consisting of highly pure single-crystal silicon, are processed in a fabrication facility using certain processing steps. The stages in processing these wafers can include:

(1) Cleaning the wafer, in which the base of the semiconductor is cleaned. To accomplish this task, chemical substances are employed that can remove minor particles and residues made in the process of production or created by exposure to air.
(2) Deposition of film, in which thin film layers are formed on the wafer. Some of the common ways to perform this task are: (a) sputtering; (b) electro-deposition; and (c) chemical vapor deposition.
(3) Cleaning after film deposition, in which the tiny particles attached to the wafer after film deposition are eliminated using physical methods such as spraying or brushing.
(4) Resist coating, in which the surface of the wafer is covered with photosensitive material to create resistivity on the wafer surface. The wafer is rotated by centrifugal force with the aim of creating a uniform layer of resist on the wafer surface.
(5) Exposure, in which the wafer is exposed to short-wavelength deep ultraviolet radiation through a mask (containing the formed design patterns).
(6) Development, in which the wafer is sprayed with the purpose of dissolving the exposed areas and consequently revealing the thin film on the wafer surface.
(7) Etching, in which a pattern is created on the wafer by either: (a) bombarding the wafer with ionized atoms to eliminate the film layer; or (b) dissolving and removing the surface layer using chemicals (such as hydrofluoric acid or phosphoric acid).
(8) Insertion of impurities, in which certain elements such as phosphorus or boron ions are inserted into the wafer, creating the semiconducting properties.
(9) Activation, in which the substrate is heated quickly via certain tools such as flash lamps.
(10) Assembly, in which the output wafer is cleaned before it is separated into individual chips through dicing.
(11) Packaging.


Each of the aforementioned stages has different types of sensors installed that can measure parameters of interest during the manufacturing process. These parameters, which create the in-line data, include measurements such as the thicknesses of thin films after the processes of deposition, etching, or polishing, the critical dimensions of the wafer after applying the processes, film thickness and resistance, chamber pressure, gas flows, chuck temperatures, chamber temperatures, and so forth. Furthermore, other types of data are commonly gathered from different workstations, the manufacturing equipment, and the assembly stations. The collected data may belong to different parts, ranging from raw silicon to the final packaged product.

Through monitoring all the steps of semiconductor fabrication in the foundry, considering the number of stages and chambers multiplied by the number of measurements in each stage, an immense amount of data is gathered. The data is not only large due to the complexity of the processes, but also diverse due to the different sources from which it is collected. While being large in size, the gathered data is highly valuable. Using data analysis and machine learning methods, the gathered data is used in tasks of predictive maintenance, fault detection and classification, tracking the manufacturing flow, real-time monitoring of the production process, and yield learning and improvement. Moreover, these methods can be used to extract useful information from a limited number of recorded occurrences to gain insight into the origins of each occurrence.

To fully take advantage of the data and implement the different aforementioned applications, the challenge of the collected data volume needs to be resolved. Analyzing a large corpus of data and making predictions based on it is a time-consuming task. This challenge is particularly hindering in the semiconductor industry, since the time of delivery of products is of utmost importance and is a factor that affects the health and profitability of a semiconductor manufacturing plant. In the next section we present one of the applications of the gathered data which is the focus of this work, i.e., fault detection, and state its importance.

2.2. Fault Detection in Semiconductor Manufacturing

Traditionally, the production and assembly lines are structured to build devices and products from raw materials and elements. These materials and elements are transformed and assembled in gradual, incremental steps to deliver end products and devices. This flow of products and devices from the beginning to the end of the manufacturing line is not without flaws. Possible flaws include breakdowns of machines, inconsistent processing times, inaccessibility of operators and resources, and disturbances. The existing flaws challenge the way the production line operates and need to be faced to ensure profitability.

The key factors for companies and manufacturers within the semiconductor industry to be able to actively participate in the competition are cost, quality, and time of delivery of products. Monitoring and identifying the main and special characteristics of products in a timely manner and detecting abnormalities at the right time during the manufacturing processes are essential tasks in this domain. Integrating modern technologies in this industry has enabled us to perform these tasks in real-time [7–9].
The large amount of data helps in improving predictive maintenance (mainly in determining when maintenance is needed for the equipment). Also, building a model for the task of fault detection and diagnosis is made possible using these data and the corresponding techniques. By making the task of fault detection and diagnosis timely, the downtimes of the machines are reduced, the companies incur lower costs, and the quality of the products is better assured.

There are other challenges with respect to the production environment and its dynamics. It is extremely important to have high yield across the entire line (more than 99% for each step of the manufacturing process). On the other side, it is important to have as little waste (including product scrap, loss of production, environmental waste, and general capital waste) as possible. Another challenging factor is the near-capacity operation of all the equipment and tools within the manufacturing facility. To resolve these challenges, a continuous data capturing, analyzing, and learning system should


be established that is capable of maintaining the yields at high levels (based on the profit margins of the products). The system should also be able to detect and classify the faults, control the processes, and predict the time for maintenance. Without such a system, the current demands and requirements in semiconductor manufacturing will not be satisfied and, as a result, the devices and products will not have sufficient quality for sale.

It is a requirement to perform diagnosis on the systems, tools, and equipment within a semiconductor plant in order to localize a fault and its root cause and perform maintenance. Localization of a fault is done by looking into engineering models and determining whether something has gone wrong in the operation of each element involved in making a semiconductor device. Based on the determined fault model, the consequences of the occurring fault are determined.

2.3. SECOM Dataset

The SECOM dataset, first presented in [10], contains data gathered in a semiconductor manufacturing process. This data consists of 591 attributes and signals recorded throughout different trials, with each trial belonging to a production entity. Each trial would then be concluded by passing or failing an in-house testing line. Analyzing the data and finding correlations between failing the test and the attributes of the trial enables engineers to firstly gain insight into why the failures occur, and secondly which attribute has the highest impact on the failure. However, the SECOM dataset presents some challenges from a classification point of view:







- Overall, 1567 trials were recorded, of which only 104 resulted in failure. This creates a near 1:14 relation between the number of failures and successes recorded, resulting in a highly imbalanced dataset.
- Not all the signals that are recorded are useful for classification. Features exist in the SECOM dataset that present little to no data and do not change throughout the trials; moreover, some features exist that are very noisy or have no effect on the outcome of the test.
- Nearly 7 percent of the data in the SECOM dataset is missing. These data points were either never measured, or were somehow lost and never made it to the final records.
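As a quick illustration of these properties, the snippet below loads the raw SECOM files and reports the class imbalance and the amount of missing data. The file names (secom.data and secom_labels.data, space-separated, as distributed by the UCI repository) are assumptions, not part of this work's pipeline.

import pandas as pd

# File names assumed from the UCI repository version of SECOM (space-separated values).
features = pd.read_csv("secom.data", sep=r"\s+", header=None)
labels = pd.read_csv("secom_labels.data", sep=r"\s+", header=None)[0]   # -1 = pass, 1 = fail

n_fail, n_pass = (labels == 1).sum(), (labels == -1).sum()
print(f"trials: {len(labels)}, failures: {n_fail}, successes: {n_pass}")
print(f"imbalance ratio (fail:pass) is roughly 1:{n_pass / n_fail:.0f}")
print(f"fraction of missing data points: {features.isna().mean().mean():.1%}")
print(f"features with more than 600 missing values: {(features.isna().sum() > 600).sum()}")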

3. Methodology

3.1. Classification System Architecture

Overall, the three problems of data imbalance, unimportant features, and missing values exist in the SECOM dataset, and they are prevalent in data gathered from any real-world process. In this work we use data imputation, synthetic data generation, and feature selection to overcome these problems and optimize the classification of the SECOM dataset.

To begin with, the dataset needs to be cleaned. During the data pruning phase, we delete features that have little to no effect on the outcome of classification, such as constant features or features with low variance. After cleaning the data, the dataset is split into training and testing datasets. Next, the missing values are imputed using the two approaches of “Mean” and “In-painting KNN-Imputation”, the latter being an approach proposed by this work. Then, synthetic data is generated to balance the classes of the dataset. Having done so, the task of feature selection is applied in different ways, to find the features that are most relevant to and correlated with the outcome. Finally, the data is classified using the four classifiers of Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), and Logistic Regression (LR). A summary of these stages can be viewed in Figure 1.


Figure 1. Overview of the diagnosis system architecture.

Cross-Validation and the Proper Way to Do It

To present fair results and avoid over-fitting, ten-fold cross-validation is used in this classification system. To implement ten-fold cross-validation, the dataset is divided into ten sections. Having done so, one of the sections is chosen as the test set, while the remaining nine sections are considered as training sets. This process continues via choosing the next section as the test set, until all ten sections have been considered as the test set. Therefore, in the end, all the instances inside the data set have been tested and all the data has been viewed.

While examining several related works that perform classification on the SECOM data set via a pipeline of preprocessing stages, it can be seen that many choose to apply some or all of the preprocessing on the data before splitting the data into training and test data sets. In [3–5], the task of feature selection was applied before cross-validation, and in [1,2,6,11] this task was applied before isolating one third of the data instances into the test set.

The repercussions of applying preprocessing tasks such as feature selection with the whole dataset as input can be huge on the classification results. In [12], a dataset is generated using 5000 columns of random values as features and 50 random values for the outputs, consisting of two classes of the same size. Since the features are unrelated to the output, the true test error of any classifier is 50%, while by applying feature selection before splitting the data, they were able to achieve a 3% error rate. This huge gap between the real performance of the model and the achieved results comes from the fact that the feature selection stage has had access to all the data and has chosen features with the highest correlation to all the labels. This model would not be able to perform the same when new data is presented to it, rendering its results unreliable.

This fact holds true for other stages such as data imputation and synthetic data generation as well, which encourages one to bring those stages inside the cross-validation loop. In this work, the stages of data imputation, balancing the data, feature selection, and classification are performed inside each fold of the ten-fold cross-validation, to avoid the leakage of information regarding the test data distribution into the training process. The data imputation stage is the only stage to have access to the test set, since it needs to fill the instances in that set as well; however, this stage will not use any information regarding the test data distribution, as described in Section 3.3.2.

To tune each stage, hyper-parameter tuning is done in the form of a loop inside each fold to find the optimum parameters. The details and results of each hyper-parameter tuning are presented in Section 4.
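The sketch below illustrates this nested structure: every preprocessing step is fit only on the training fold, so no information about the test fold leaks into imputation, balancing, or feature selection. The mean imputer stands in for the imputation step, SMOTE comes from the third-party imbalanced-learn package, and the concrete estimators are illustrative choices rather than this work's exact configuration (X and y are assumed to be NumPy arrays).

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFdr, f_classif
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

def cross_validate(X, y, n_splits=10):
    """Ten-fold CV with imputation, SMOTE, and feature selection fitted inside each fold."""
    aucs = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0).split(X, y):
        X_tr, X_te, y_tr, y_te = X[train_idx], X[test_idx], y[train_idx], y[test_idx]

        imputer = SimpleImputer(strategy="mean").fit(X_tr)            # fit on the training fold only
        X_tr, X_te = imputer.transform(X_tr), imputer.transform(X_te)

        X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # balance the training fold only

        selector = SelectFdr(f_classif).fit(X_tr, y_tr)               # feature selection on the training fold only
        X_tr, X_te = selector.transform(X_tr), selector.transform(X_te)

        clf = SVC(kernel="linear", probability=True).fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return np.mean(aucs)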


3.2. Data Pruning

Of the 591 features present in the dataset, 7 features present no data, since their value never changes. Moreover, after normalization of the dataset, 6 features have a data distribution with a drastically low variance, such as features that are mostly constant in all cases but one. These features can be removed due to having very little information.

Furthermore, out of the nearly nine hundred thousand data points existing in the dataset, approximately sixty thousand data points are missing. The missing data points do not have a uniform distribution, and some features or instances contain more of these missing data. The missing data can be viewed for each data instance (row-wise) in Figure 2a.


Figure 2. Number of missing data points in: (a) Each instance; (b) Each feature.

Since no instance shows a drastic amount of missing data, all the instances are kept, and no pruning occurs. The number of missing values for each feature (column-wise) can be inspected as well, which is shown in Figure 2b. As is visible, a handful of features have more than 600 missing values. This fact renders those features useless, as the amount of information that remains available is not enough to base the classification upon or to extrapolate the missing values in these features. Therefore, to prune the data, we delete feature columns that present no change, or those features that are beyond repair and have more than 600 missing values.
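A minimal sketch of this pruning rule is shown below, assuming the data is held in a pandas DataFrame; the variance threshold is an illustrative value, since the exact cut-off is not stated in the text.

import pandas as pd

def prune_features(df: pd.DataFrame, max_missing: int = 600, variance_threshold: float = 1e-4) -> pd.DataFrame:
    """Drop columns that never change, are almost constant after min-max normalization,
    or hold more than `max_missing` missing values."""
    constant = df.columns[df.nunique(dropna=True) <= 1]
    normalized = (df - df.min()) / (df.max() - df.min())            # min-max normalization per feature
    low_variance = df.columns[normalized.var() < variance_threshold]
    too_sparse = df.columns[df.isna().sum() > max_missing]
    return df.drop(columns=constant.union(low_variance).union(too_sparse))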
3.3. Data Imputation

After pruning the data, the amount of missing data points is decreased to 4.5 percent. While these features have missing values, they might hold important information that can affect the result of the failure test, so they cannot simply be deleted. Therefore, to deal with the problem of missing values, we need to apply imputation on the data. Imputation is the task of filling the missing values with real values that both are sensible and maximize the expectation for the data point of interest.

In the literature, different approaches have been proposed for the task of data imputation. One approach is to completely delete any feature that contains missing data. In most cases, this approach, which is known as “complete case analysis”, will inevitably delete a part of the data that has a high effect on the outcome of the classification, rendering the final model inefficient [13]. Moreover, the number of trials that are needed to achieve a higher accuracy is increased, since this data deletion will leave less data to infer upon for each trial. The most prevalent approach to data imputation is to substitute the missing values with the mean of the feature columns [14]. This type of imputation can distort the distribution of the feature vectors by changing their respective standard deviations [15]. Another approach that has been proposed for data imputation consists of models leveraging the local structure of the data to find the missing value. KNN-Imputation [15,16] is one of these approaches, which uses KNN to predict the missing data point. In this work, inspired by image in-painting, we modify the approach of using a KNN classifier for the task of data imputation.
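For reference, both baseline strategies mentioned above (mean substitution and plain KNN-Imputation) are available in scikit-learn; the snippet below is a minimal sketch with an illustrative choice of k, not the configuration used in this work.

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0]])

# Mean imputation: replace each missing value with its column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Standard KNN imputation: replace each missing value with the average of that
# feature over the k nearest (Euclidean-distance) neighbors.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)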


3.3.1. Image In-Painting

When a picture is damaged, an artist is asked to repair the damage by in-painting the missing parts [17]. This task is performed on digital images as well, by looking at the values of each pixel. One can apply a median filter to the missing part and find out the approximate value of a pixel based on its neighbors. An implementation of image in-painting with the method introduced in [18] is shown in Figure 3.


Figure 3. A sample of digital image in-painting: (a) Image with missing section; (b) Repaired image after in-painting.

Intuitively, a painter starts in-painting from the outermost regions of the missing section. This is because the painter is most confident as to what exists in those regions. In in-painting for digital images, one approach that can help increase the quality of the outcome is to choose the pixels that are on the outermost region of the missing section and predict their values first. The more neighbors a pixel has, the more confident one can be about the value of that pixel.

Furthermore, a painter starts in-painting considering the familiarity of different parts of the missing regions. If one missing section is located on a low-textured part of the image (i.e., the branches and body of the tree), and the other is located on a highly textured part of the image (i.e., the leaves), the painter perceives more familiarity in the low-textured section and can predict the missing part with higher confidence.

3.3.2. Our Approach: In-Painting KNN-Imputation

In this work we try to use the concepts of “neighbors”, “confidence” and “familiarity” from image in-painting to predict the values of missing data and perform data imputation. This algorithm is shown in Algorithm 1.

While the neighbors of a pixel in an image are defined by proximity in location, the neighbors of a missing data point in a dataset can be defined by proximity using a distance metric. For example, by using the Euclidean distance, one can find data instances that are similar to the missing value, i.e., “neighbors”.

To define a metric of “confidence”, one can examine the dataset from the data instance and feature perspectives. The more data points that are missing in a data instance, the less confident we are about the missing values in that instance. Moreover, the more missing values a feature column holds, the less confident we are about the value of the missing features. Therefore, we propose a metric of confidence for each missing value that consists of the summation of the number of missing values in the column that holds the missing value of interest and the number of missing values in the row of the missing value of interest.

To bring the concept of “familiarity” to our approach, the accumulative distance of the data instance to its K nearest neighbors is calculated. If this data instance is an outlier, the instance is not familiar, and the calculated distance would be relatively large. On the other hand, if there are many instances existing in the dataset that are similar to the instance of interest, the calculated distance would be relatively small, and the instance would be a familiar instance.
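As a concrete illustration, the Python sketch below computes these two quantities for a feature matrix X whose missing values are encoded as NaN; it is a simplified reading of the definitions above, and the function and variable names are illustrative rather than the paper's exact implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def confidence_and_familiarity(X, k=5):
    """Confidence: missing count of a cell's row plus its column (lower = more confident).
    Familiarity: accumulated distance of an instance to its k nearest neighbors (lower = more familiar)."""
    missing = np.isnan(X)
    row_missing = missing.sum(axis=1)                        # missing values per instance
    col_missing = missing.sum(axis=0)                        # missing values per feature
    confidence = row_missing[:, None] + col_missing[None, :]

    X_filled = np.where(missing, np.nanmean(X, axis=0), X)   # temporary mean-fill, only to measure distances
    distances, _ = NearestNeighbors(n_neighbors=k).fit(X_filled).kneighbors(X_filled)
    familiarity = distances.sum(axis=1)                      # sum of distances to the k nearest neighbors
    return confidence, familiarity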


Algorithm 1. The flow of In-painting KNN-Imputation.

Input parameters: input data and number of neighbors (KParam)
Output parameters: filled data

Fetching input:
    Data ← read input data from file
Calculating the number of missing values:
    for j in Columns do:
        NullSum[j] ← Sum(IsMissing(Data[:, j]))
    Iteration ← Sum(NullSum)
Calculation of the confidence value (sum of missing values in rows and columns):
    Confidence ← 360 × 360 × OneArray(DataDimension)
    for i in Rows do:
        RowSum[i] ← Sum(IsMissing(Data[i, :]))
    for j in Columns do:
        ColumnSum[j] ← Sum(IsMissing(Data[:, j]))
    for i in Rows do:
        for j in Columns do:
            if IsMissing(Data[i, j]) do:
                Confidence[i, j] ← RowSum[i] + ColumnSum[j]
Familiarity = sum of K-neighbor distances:
    NNB ← NearestNeighbors(NeighborsNumber = KParam, DistanceOption = "Euclidean")
    FitForNNB(XArray)
    for i in Rows do:
        Familiarity[i] ← SumOfDistances(KNeighborsForNNB(Data[i, :]))
Normalize familiarity:
    FamiliarityNorm ← Familiarity normalized between [6, 360]
Loop for filling missing values:
    for Counter in Range(Iteration) do:
        Order ← Confidence × FamiliarityNorm
        Target ← argmin(Order)
        (x, y) ← location of Target
        Extract the target row for XTest:
            XTest ← Data[x, :]
            XTrain ← remove row x and column y from Data
            YTrain ← extract column y from Data
        Delete instances with missing values in the target column:
            for Element in YTrain do:
                if IsMissing(Element) do:
                    remove the row of Element from XTrain and YTrain
        Fill the remaining missing parts of XTrain and XTest with the mean values of the features:
            MeanTrain ← Mean(XTrain, Column)
            for FeatureCounter in Columns do:
                for InstanceCounter in Rows do:
                    if IsMissing(XTrain[InstanceCounter, FeatureCounter]) do:
                        XTrain[InstanceCounter, FeatureCounter] ← MeanTrain[FeatureCounter]
            for FeatureCounter in Range(Length(XTest)) do:
                if IsMissing(XTest[FeatureCounter]) do:
                    XTest[FeatureCounter] ← MeanTrain[FeatureCounter]
        Predict the missing value with KNN:
            Regressor ← KNeighborsRegressor(NeighborsNumber = KParam)
            FitForRegressor(XTrain, YTrain)
            Prediction ← RegressorPrediction(XTest)
        Update confidence and store the prediction:
            for j in Columns do:
                Confidence[x, j] ← Confidence[x, j] − 1
            for i in Rows do:
                Confidence[i, y] ← Confidence[i, y] − 1
            Confidence[x, y] ← 360 × 360
            Data[x, y] ← Prediction
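For reference, a compact runnable Python sketch of the same flow is given below. It reuses the confidence_and_familiarity helper from the earlier sketch, simplifies some of the bookkeeping of Algorithm 1 (for example, confidence values are not decremented after each fill), and the names are illustrative rather than the paper's exact implementation.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def inpainting_knn_impute(X, k=5):
    """Fill NaNs one at a time, starting from the cell with the lowest confidence x familiarity score."""
    X = X.astype(float).copy()
    confidence, familiarity = confidence_and_familiarity(X, k)
    familiarity = np.interp(familiarity, (familiarity.min(), familiarity.max()), (1.0, 10.0))
    order = confidence * familiarity[:, None]
    order[~np.isnan(X)] = np.inf                                 # only missing cells are candidates

    while np.isnan(X).any():
        x, y = np.unravel_index(np.argmin(order), order.shape)   # most confident and familiar missing value
        known = ~np.isnan(X[:, y]) & (np.arange(len(X)) != x)    # rows where the target column is observed
        col_means = np.nanmean(np.delete(X, y, axis=1), axis=0)
        rest = np.delete(X[known], y, axis=1)                    # training features, target column removed
        X_train = np.where(np.isnan(rest), col_means, rest)      # mean-fill the remaining gaps
        row = np.delete(X[x], y)
        x_test = np.where(np.isnan(row), col_means, row)
        reg = KNeighborsRegressor(n_neighbors=min(k, int(known.sum())))
        reg.fit(X_train, X[known, y])
        X[x, y] = reg.predict(x_test.reshape(1, -1))[0]
        order[x, y] = np.inf                                     # this cell is now filled
    return X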


Having defined the metrics of “confidence” and “familiarity” and the concept of “neighbors”, an algorithm similar to image in-painting can be applied to impute the data. Using the multiplication of normalized confidence and familiarity, a metric is defined which helps choose the order in which data points should be filled. Similar to how a painter starts in-painting from the sections that they are more confident about, using this metric, data points with higher confidence and familiarity can be identified and filled first.

The task of imputation is done using KNN regression. To begin with, the missing data point (X0, Y0) with the highest confidence and familiarity is found. The output of the regression would be the missing value that fills this data point, considering the values inside the feature column Y0. To train the model, parts of the dataset, excluding the label column and the data instance row (X0), are given as input. Since at this point the dataset contains missing values other than the missing value of interest (X0, Y0), and the regression model does not work with missing values, the rest of the missing values in the training data are filled with the mean value of their respective feature columns. On the other hand, if there are any other missing values in the feature column of interest Y0, their respective instances are deleted from the training data, since they cannot offer any information regarding the missing value of interest (X0, Y0). After the training is done, the data instance row X0 is fed to the model, and the missing value of interest is predicted and placed into the original dataset. This process repeats until all the missing values are filled. Since this KNN regression is done once for each missing value, it is beneficial to use parallel computing to calculate the nearest neighbors for each iteration, reducing the overall time the algorithm consumes.

In this work, this algorithm is first applied to the training datasets inside each fold of the cross-validation. Having filled the training data, one instance from the test data is added on top of the filled data and the algorithm is applied again to fill the test instance. Then, the test instance is removed, and the same steps are taken for the remaining test instances. The reason for filling the training data and isolating it from the test data, as well as isolating the test instances from each other, is to avoid letting any information regarding the distribution of the test data leak into the imputation task.

3.4. Feature Selection

Having filled the dataset via data imputation, the absence of missing values allows the task of feature selection to be carried out. This task is needed to delete unimportant and noisy features and to keep only the relevant data for the next stages. Feature selection in this work is performed using four different approaches (a scikit-learn sketch of several of them is given after the list):

1. Univariate feature selection: This type of feature selection performs univariate statistical tests on the input data and selects the features with the highest scores. The scoring functions used in this work are Chi-squared (CHI2), ANOVA F-value between label and feature (FCLASSIF), mutual information (MFCLASSIF), false positive rate test (SELECTFPR), estimated false discovery rate (SELECTFDR), and family-wise error rate (SELECTFWE).
2. Selecting from a model: A group of estimators can output feature importances after they have been trained on the data. Using this feature importance, the highest-ranking features are selected from models such as RF, Logistic Regression with L1 (LR) and L2 (LR2) regularization, and Support Vector Machine with L1 (SVM) and L2 (SVM2) regularization.
3. Recursive Feature Elimination (RFE): Similar to the previous approach, a model that outputs feature importance is fitted on the training data; however, the features with the lowest importance are removed. This process is repeated, and the model is trained on the new data, until only the features with the highest importance remain. The same models used for the previous approach are used for this approach (RFERF, RFELR, RFELR2, RFESVM, and RFESVM2).
4. Dimensionality reduction: In this work, Principal Component Analysis (PCA) is used to linearly reduce the dimensionality of the data, using Singular Value Decomposition of the data matrix to project each data vector onto a space with lower dimensions.

Overall, 18 different feature selection methods are evaluated.
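The scikit-learn equivalents of a few of these approaches are sketched below; the number of retained features and the base estimators are illustrative choices, not the exact settings used in this work.

from sklearn.feature_selection import SelectKBest, f_classif, SelectFdr, SelectFromModel, RFE
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 1. Univariate selection, e.g. ANOVA F-value (FCLASSIF) or false-discovery-rate control (SELECTFDR).
univariate = SelectKBest(f_classif, k=50)
fdr = SelectFdr(f_classif, alpha=0.05)

# 2. Selecting from a model that exposes feature importances (here an L1-regularized LR).
from_model = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=1.0))

# 3. Recursive feature elimination around a random forest.
rfe = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=50)

# 4. Dimensionality reduction with PCA.
pca = PCA(n_components=50)

# Each selector is fit on the training fold only, e.g.: X_sel = univariate.fit_transform(X_train, y_train)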

3.5. Synthetic Data Generation

To tackle the problem of data imbalance, the number of failure instances needs to become the same as, or comparable to, the number of success instances. This balance can be created either by oversampling the minority distribution (in our case, the failures) or by under-sampling the majority distribution (i.e., the successes). Since oversampling has been shown to outperform under-sampling [3], we use oversampling in this work. The approach used in this work is the Synthetic Minority Oversampling Technique (SMOTE) [19]. This technique is provided with the minority distribution, and it can generate a given number of new instances by sampling and interpolating among the minority instances.

3.6. Classification

After the data is pruned, the features are selected, and the classes are balanced, the task of classification can be performed. Since the dataset is a structured dataset with binary classes, the four classifiers of LR, RF, SVM, and KNN are chosen and analyzed. To validate our classification, we choose ten-fold cross-validation. Ten-fold cross-validation first partitions the data into ten sets; then the classifier is trained on nine sets and tested on one set. This process repeats until all folds have been tested. This approach helps avoid over-fitting, since no special test set has been chosen, and all the dataset has been viewed and tested.

Cross-validation causes the training set to change every time a model is trained. However, to train each model, the training data needs to go through the stages of imputation, synthetic data generation, and feature selection. Therefore, these tasks need to be re-run every time the training data changes, inside each fold of the ten-fold cross-validation.

4. Results

4.1. Experimental Setup

The data from the SECOM dataset goes through the stages of data pruning, data imputation, feature selection, and data balancing, and finally is classified using ten-fold cross-validation. All stages are implemented using Python as software and four GeForce GTX 1080 graphics cards as hardware. The Scikit-Learn tool [20] was leveraged in order to implement the feature selection stages. Overall, 288 different combinations of stages are examined, and the important results are reported hereafter.

The main metric of evaluation in this work is the area under the ROC curve (AUC), since this metric is fair when classifying an imbalanced dataset. Another important metric of performance is recall, which reflects the percentage of the faults that were caught. On the other hand, the true negative ratio (TNR) indicates the percentage of successes that were rightly predicted, and thus how few false alarms were given by the classifier. Higher AUC, recall, and TNR indicate better classification.
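These three quantities can be computed from a confusion matrix and the predicted scores as in the short sketch below (scikit-learn naming; the failure class is treated as the positive class).

from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Return TNR, recall, and AUC for a binary fault-detection problem."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tnr = tn / (tn + fp)        # fraction of successes correctly predicted (few false alarms)
    recall = tp / (tp + fn)     # fraction of failures that were caught
    auc = roc_auc_score(y_true, y_score)
    return tnr, recall, auc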


4.2. Results

The existence of multiple stages in the classification pipeline encourages one to view the effects of each stage separately. In this section, the results are viewed from the perspective of each stage, comparing the performance of each combination within that stage. This comparison is done in two ways: firstly, in the best delivered results for each approach, by looking at the maximum AUC; and secondly, in the averaged delivered results between all approaches. The averaged result is computed by averaging the best results from the hyper-parameter tuning loops that contain the parameter of interest, and not all the results from the loop. By doing this, we can keep one parameter constant and observe the best results it can deliver while the other parameters change.

While comparing the best results will give insight into which combination of stages can deliver the best model, comparing the average results helps the fairness and generality of the comparison by making each model compete in the same combination of stages.

4.2.1. Classification Results

To begin the task of classification, each model needs to be optimized considering the stages the data has passed through. The output of the KNN classifier is heavily affected by the number of neighbors (K). To find the optimum value of K, we classify the data using different values for K and calculate the average AUC over all classifications. This averaging allows us to find the optimum number of neighbors in the training set that need to be examined to make a fair prediction. Since the training set differs between the approaches using SMOTE and those not using it, we average the two groups separately, as shown in Figure 4.

Figure 4. Average AUC when the value of K is changed in KNN classification on imbalanced data and on balanced data using SMOTE.

As can be seen in Figure 4, the optimum value for K is equal to 1 for imbalanced data and equal to 9 for balanced data. After applying SMOTE on the dataset and balancing the data, more data is available for the KNN model to infer upon. Therefore, in the balanced case, the model can rely on a higher number of neighbors without a loss of performance.

This optimization can be done for the number of trees (N) in an RF model as well. Again, the value of interest is changed, and the average AUC is recorded for the imbalanced and balanced data, as seen in Figure 5.
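The sweep itself amounts to a small loop over candidate values, averaging the cross-validated AUC for each setting. The sketch below shows the idea for K in KNN; the candidate grid is illustrative, and in this work the loop runs inside each fold after the preprocessing stages rather than on the raw data as shown here.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k(X, y, candidates=(1, 3, 5, 7, 9, 11)):
    """Average ten-fold AUC for each number of neighbors and return the best K."""
    scores = {k: np.mean(cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                                         cv=10, scoring="roc_auc"))
              for k in candidates}
    return max(scores, key=scores.get)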


Figure 5. Average AUC when the value of N is changed in RF classification on imbalanced and balanced data.

As is visible in Figure 5, the optimum value for N is equal to 1 for imbalanced data and equal to 3 for balanced data. As with KNN, more data is available in the balanced cases. Therefore, the RF model can build more trees on the data without a loss of performance.

To optimize the SVM and LR classifiers, the inverse of regularization strength (C) can be changed. The lower this parameter is, the stronger the regularization that occurs. The result of changing C is reflected in Figure 6.


Figure 6. Average AUC when the value of C is changed in classification via: (a) SVM; (b) LR.

The optimum value of C for SVM, as seen in Figure 6, is 1 for balanced data and 10,000 for imbalanced data. A similar case is seen with LR, in which the optimum value for C is 10 for balanced data and 100 for imbalanced data. To understand the need for a bigger regularization parameter in the cases with imbalanced data, we need to examine the feature selection stage. In the cases of imbalanced data in this stage, the number of features that deliver better results in the classification stage is usually higher, to preserve enough information regarding the minority class. Therefore, compared to the balanced cases, more features are kept and passed on to the next stages. This increase in the number of features that the classification stage receives necessitates a bigger regularization parameter.

To discover which classifier performed better, we average the results over all the different approaches, as shown in Figure 7.


Figure 7. Averaged results of each classifier on: (a) Imbalanced data; (b) Balanced data.

When comparing the performance of the classifiers using AUC, KNN fails to deliver good results in the imbalanced cases; on the other hand, it has the biggest increase in recall when the data is balanced. This is understandable due to the nature of this classifier, which relies heavily on the data provided from a certain number of neighbors. SVM has delivered the highest average AUC in both the imbalanced and balanced cases, closely followed by LR.

Therefore, SVM is on average more capable of delivering classification results that are accurate in both the success and failure classes. On the other hand, KNN has the highest recall, showing that this classifier is more capable of catching failures. Moreover, the highest TNR belongs to RF, implying that this classifier gives fewer false alarms in general.

Choosing an optimized and trained classifier at this point depends highly on the preferences of the user of the model. In semiconductor manufacturing facilities, not capturing a fault can be a costly mistake. Moreover, false alarms can be disruptive with prevalent fault detection systems; therefore, a balanced model is recommended, which can catch faults and still have few false alarms. Based on the results and comparing AUC, SVM and LR would be the preferred models.

The final results of each classifier with the highest AUC are reported in detail in Table 1.

Table 1. Details of classification results with the highest AUC in each model.

Classifier   Imputation    Feature Selection   True Positive   True Negative   False Positive   False Negative   TNR     Precision   Recall   AUC
LR           In-painting   SELECTFDR           68              1098            365              36               75.05   15.70       65.38    70.22
SVM          In-painting   SELECTFDR           67              1104            359              37               75.46   15.72       64.42    69.94
KNN          In-painting   SELECTFWE           81              791             672              23               54.07   10.75       77.88    65.96
RF           In-painting   SELECTFWE           48              1152            311              56               78.74   13.37       46.15    62.45
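As a side note, the TNR, precision, recall, and AUC columns in Table 1 follow directly from the confusion-matrix counts; the helper below is our own illustration (not part of the original code), using the LR row as a worked example. For hard 0/1 predictions, the ROC AUC reduces to the mean of recall and TNR.

```python
# Sketch (our own helper, not from the original code): deriving the Table 1 metrics
# from confusion-matrix counts, using the LR row as a worked example.
def summarize(tp, tn, fp, fn):
    recall = tp / (tp + fn)        # share of actual faults that were caught
    tnr = tn / (tn + fp)           # share of passing units correctly left alone
    precision = tp / (tp + fp)     # share of raised alarms that were real faults
    auc = (recall + tnr) / 2       # ROC AUC of hard 0/1 predictions = mean of TPR and TNR
    return {name: round(100 * val, 2)
            for name, val in [("TNR", tnr), ("Precision", precision),
                              ("Recall", recall), ("AUC", auc)]}

print(summarize(tp=68, tn=1098, fp=365, fn=36))
# {'TNR': 75.05, 'Precision': 15.7, 'Recall': 65.38, 'AUC': 70.22}
```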

As seen in Table 1, the best results in AUC were achieved by training an LR model. Therefore, LR is the superior model in this group of classifiers when considering the best performance. These results also show the superiority of In-painting KNN-Imputation over using the mean value for imputation, since it has delivered the best results in all the examined models.

The best results achieved in this work are compared to the results from a similar work [21], which performs preprocessing after splitting the dataset into training and test sets. This comparison is shown in Table 2.

Table 2. Comparison of the results to a similar work [21].

           Classifier       Evaluation    Imputation       True Positive   True Negative   False Positive   False Negative   TNR     Recall
Our Work   LR               Ten-fold CV   In-painting      68              1098            365              36               75.05   65.38
[21]       Decision Trees   70:30 Split   1NN-Imputation   10              392             46               20               89.50   33.33
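To make this methodological difference concrete, the sketch below keeps imputation, feature selection, and oversampling inside each training fold of a ten-fold cross-validation. The tooling (imbalanced-learn's pipeline, a mean-imputation stand-in for In-painting KNN-Imputation) and the synthetic data are illustrative assumptions, not our exact implementation.

```python
# Sketch: preprocessing fitted inside each training fold, never on the full dataset.
# imbalanced-learn's Pipeline and the mean-imputation stand-in are assumptions for illustration.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, n_features=60, n_informative=10,
                           weights=[0.93], random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),          # stand-in for In-painting KNN-Imputation
    ("select", SelectFdr(score_func=f_classif, alpha=0.05)),
    ("balance", SMOTE(random_state=0)),                  # applied to the training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, scoring="roc_auc",
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print("ten-fold AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```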


As is visible from Table 2, our model was able to achieve a higher recall while, understandably, having a lower TNR. This increase in recall is arguably more valuable than the decrease in TNR when the task of fault detection is performed, resulting in more faults being discovered with an acceptable increase in false alarms.

4.2.2. Feature Selection Results

To evaluate the 18 different types of feature selection used in this work, similar to the previous section, we examine the average of the performance metrics and the best performance of each approach in the balanced and imbalanced cases. These results are shown in Figures 8–11.

Figure 8. Average performance of each feature selection approach on imbalanced data.

Figure 9. Average performance of each feature selection approach on balanced data.


Figure 10. Classification results with the highest AUC in each feature selection on imbalanced data.

Figure 11. Classification results with the highest AUC in each feature selection on balanced data.

In both the balanced and imbalanced cases, SELECTFDR has given one of the best performances. The nature of SELECTFDR encourages the results to favor a lower false discovery rate, therefore focusing on the identification of the positive class, i.e., the faults.
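For reference, a false-discovery-rate selector of this kind is available in scikit-learn [20]; the snippet below is a minimal sketch in which the alpha threshold and the toy data are illustrative assumptions rather than our exact settings.

```python
# Sketch: FDR-controlled feature selection on ANOVA F-test p-values (illustrative settings).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, f_classif

X, y = make_classification(n_samples=400, n_features=100, n_informative=10,
                           weights=[0.93], random_state=0)

selector = SelectFdr(score_func=f_classif, alpha=0.05)
X_reduced = selector.fit_transform(X, y)   # keeps features whose FDR-adjusted p-values pass alpha
print("features kept:", X_reduced.shape[1], "of", X.shape[1])
# SelectFwe (family-wise error) and mutual-information-based selection follow the same pattern.
```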


As seen in Figures 8 and 10, conducting no feature selection on the imbalanced data can yield one of the best results. When the data is imbalanced, more features have the potential to help represent the minority class and aid in the discrimination of the small class from the large class; therefore, feature selection might end up hurting the end results. In the same figures, it can be seen that alongside SELECTFDR, MFCLASSIF has shown good performance. MFCLASSIF relies on the mutual information gathered between the features and the classes and is adjusted to the size of the classes.

As seen in Figures 8 and 9, on average the SELECTFWE performance has the most improvement when the training data is balanced. This is because this feature selection approach uses the family-wise error, and the sizes of the classes have a great effect on the outcome, making it unsuitable for the imbalanced cases.

As seen in Figures 9 and 10, feature selection via SVM and LR can lead to high AUC and recall. Considering these results, the fact that these two classifiers had the best performance in the last section shows their suitability for discriminating the classes in the data distribution of the SECOM dataset.

4.2.3. Synthetic Data Generation Results

To evaluate the outcome of applying SMOTE on the dataset, the average and best performance of classification with imbalanced data and balanced data (using SMOTE) are compared in Figure 12.
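As a reference point, the sketch below applies SMOTE [19] to the training split only, leaving the test split untouched; the use of the imbalanced-learn package and the toy data are assumptions for illustration.

```python
# Sketch: oversampling the minority class of the training split with SMOTE (illustrative tooling).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=30, weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # synthesize minority samples
print("training class counts before:", Counter(y_tr), "after:", Counter(y_bal))
```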

Figure 12. Classification results on balanced and imbalanced data: (a) Average of all approaches; (b) The best approaches (highest AUC).

Using SMOTE has resulted in a boost in performance via increasing AUC and recall. Since SMOTE creates a balanced training dataset, the models have a higher number of instances to train upon (compared to training on an imbalanced dataset) and, therefore, a better understanding of the data distribution. A decrease in TNR is also visible, which belongs to the trade-off between identifying more faults and having fewer false alarms. Therefore, the models trained on balanced data can classify more faults and deliver a better performance with an acceptable change in the number of false alarms. These results show the importance of synthetic data generation.

4.2.4. Data Imputation Results

Training the KNN regression model relies heavily on the number of neighbors that are considered for regression (K). The number of neighbors was varied and the results, shown in Figure 13, indicate that the best number of neighbors to consider for the regression and the familiarity metric is 3.
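For orientation, the sketch below sweeps the neighbor count of a generic KNN imputer (scikit-learn's KNNImputer) and compares its fills with a mean-value fill; this is a plain KNN stand-in under stated assumptions, not the In-painting KNN-Imputation algorithm or its familiarity metric.

```python
# Sketch: sweeping K for a plain KNN imputer; NOT the In-painting KNN-Imputation algorithm.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
X[rng.random(X.shape) < 0.05] = np.nan                     # knock out ~5% of the entries
mask = np.isnan(X)

X_mean = SimpleImputer(strategy="mean").fit_transform(X)   # mean-value baseline
for k in (1, 3, 5, 7):
    X_knn = KNNImputer(n_neighbors=k).fit_transform(X)
    drift = np.abs(X_knn[mask] - X_mean[mask]).mean()      # how far KNN fills move from the mean fill
    print(f"K={k}: mean |KNN fill - mean fill| = {drift:.3f}")
# In the experiments above, K is instead chosen by downstream classification performance (Figure 13).
```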


Figure 13. Classification results using different values for In-painting KNN-Imputation.

After finding the optimum number for K, the In-painting KNN-Imputation results are compared to imputation via replacing the missing values with the mean value of the feature vector. The results are shown in Figure 14.

Figure 14. Classification results on data filled with different imputation approaches: (a) Average of all approaches; (b) The best approaches (highest AUC).

While on average using the mean value for imputation performs similarly to In-painting KNN-Imputation, it is visible in Figure 14b that the best performance these approaches can give is highly different. In-painting KNN-Imputation has increased the AUC and recall in the best approach, raising the limit our models could reach in performance.

To evaluate the metrics defined in our algorithm, the algorithm was implemented using three different order metrics (confidence, familiarity, and the multiplication product of both). In Figure 15, the performance of KNN-Imputation (regular) is compared to In-painting KNN-Imputation with these three different order metrics. The novel metrics of confidence and familiarity introduced in this work can each increase the AUC and recall of the original algorithm. By choosing the order metric as the multiplication of these two metrics, as introduced in Algorithm 1, the increase in AUC and recall is greater. This shows the effectiveness of In-painting KNN-Imputation, since it results in models with better performance and a higher probability of classifying a fault.


Figure 15. Averaged classification results using different order metrics for KNN-Imputation.

4.2.5. Feature Importance

Once the training is finished and the models are trained, it is possible to find out how important each feature is for each model and how much effect it has on the outcome of classification. The best model in this work results from In-painting KNN-Imputation, SELECTFDR feature selection, SMOTE data generation, and LR classification. By examining this model, the importance of each feature is observed.

SELECTFDR filters features based on an estimated false discovery rate computed from the p-values given to it. In this case, these p-values are delivered alongside the ANOVA F-values to the feature selection function. To compute the importance of each feature, the p-value for each feature is extracted inside each fold, and the average is calculated after the ten folds end. The average of the p-values is shown as importance in Figure 16.
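For illustration, the averaging described above can be sketched as follows; the toy data and the direct use of f_classif p-values per training fold are assumptions that approximate, rather than reproduce, our implementation.

```python
# Sketch: averaging ANOVA F-test p-values over ten folds as a per-feature importance proxy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           weights=[0.93], random_state=0)

fold_pvalues = []
for train_idx, _ in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    _, pvals = f_classif(X[train_idx], y[train_idx])   # p-value per feature on this training fold
    fold_pvalues.append(pvals)

importance = np.mean(fold_pvalues, axis=0)              # average p-value across the ten folds
print("features with the smallest average p-values:", np.argsort(importance)[:5])
```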

Figure 16. Importance of each feature.


Figure 16 reveals that the features that have the highest effect on the failures of the units are features 60, 104, 511, and 349. Via this approach, the root cause of a failure can be localized, and thus fault diagnosis is performed.

5. Conclusions

In this work, the SECOM dataset, containing data from a real-world semiconductor manufacturing plant, was examined and classified. During classification, 288 different approaches containing different stages for data imputation, data imbalance, feature selection, and classification were evaluated. Furthermore, a novel process for data imputation inspired by image in-painting was introduced. In the end, feature importance from the perspective of the best approach was analyzed, helping us gain insight into the failures and find the most crucial stages of the manufacturing line. By conducting this experimental evaluation, suitable tools and stages for classifying the SECOM dataset were identified, and the superiority of LR for classification, SMOTE for synthetic data generation, SELECTFDR for feature selection, and "In-painting KNN-Imputation" for data imputation was shown.

Author Contributions: M.S. formulated the idea, performed the simulations, and analyzed the results. S.T. reviewed the related literature, wrote the first draft, and drew the figures. J.-S.Y. formulated the problem, provided guidance in the writing of the manuscript, and reviewed the manuscript. All authors have proof-read the final manuscript.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflicts of interest.

References

1. Kim, J.; Han, Y.; Lee, J. Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process. Adv. Sci. Technol. Lett. 2016, 133, 79–84.
2. Kim, J.K.; Cho, K.C.; Lee, J.S.; Han, Y.S. Feature Selection Techniques for Improving Rare Class Classification in Semiconductor Manufacturing Process; Springer: Cham, Switzerland, 2017; pp. 40–47.
3. Moldovan, D.; Cioara, T.; Anghel, I.; Salomie, I. Machine learning for sensor-based manufacturing processes. In Proceedings of the 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 7–9 September 2017; pp. 147–154.
4. Kerdprasop, K.; Kerdprasop, N. A Data Mining Approach to Automate Fault Detection Model Development in the Semiconductor Manufacturing Process. Int. J. Mech. 2011, 5, 336–344.
5. Wang, J.; Feng, J.; Han, Z. Discriminative Feature Selection Based on Imbalance SVDD for Fault Detection of Semiconductor Manufacturing Processes. J. Circuits Syst. Comput. 2016, 25, 1650143. [CrossRef]
6. Kim, J.; Han, Y.; Lee, J. Euclidean Distance Based Feature Selection for Fault Detection Prediction Model in Semiconductor Manufacturing Process. Adv. Sci. Technol. Lett. 2016, 133, 85–89.
7. Subramaniam, S.K.; Husin, S.H.; Singh, R.S.S. Production monitoring system for monitoring the industrial shop floor performance. Int. J. Syst. Appl. Eng. Dev. 2009, 3, 28–35.
8. Caputo, G.; Gallo, M.; Guizzi, G. Optimization of Production Plan through Simulation Techniques. WSEAS Trans. Inf. Sci. Appl. 2009, 6, 352–362.
9. Munirathinam, S.; Ramadoss, B. Predictive Models for Equipment Fault Detection in the Semiconductor Manufacturing Process. Int. J. Eng. Technol. 2016, 8, 273–285. [CrossRef]
10. McCann, M.; Li, Y.; Maguire, L.; Johnston, A. Causality Challenge: Benchmarking relevant signal components for effective monitoring and process control. J. Mach. Learn. Res. 2010, 6, 277–288.
11. Kim, J.K.; Han, Y.S.; Lee, J.S. Particle swarm optimization-deep belief network-based rare class prediction model for highly class imbalance problem. Concurr. Comput. Pract. Exp. 2017, 29, e4128. [CrossRef]
12. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009.
13. Gelman, A.; Hill, J. Missing-data imputation. In Data Analysis Using Regression and Multilevel/Hierarchical Models; Cambridge University Press: Cambridge, UK, 2006; pp. 529–544.
14. Mockus, A. Missing Data in Software Engineering. In Guide to Advanced Empirical Software Engineering; Springer: London, UK, 2008; pp. 185–200.
15. De Silva, H.; Perera, A.S. Missing data imputation using Evolutionary k-Nearest neighbor algorithm for gene expression data. In Proceedings of the 2016 Sixteenth International Conference on Advances in ICT for Emerging Regions (ICTer), Negombo, Sri Lanka, 1–3 September 2016; pp. 141–146.
16. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525. [CrossRef] [PubMed]
17. Guillemot, C.; Le Meur, O. Image Inpainting: Overview and Recent Advances. IEEE Signal Process. Mag. 2014, 31, 127–144. [CrossRef]
18. Biradar, R.L.; Kohir, V.V. A novel image inpainting technique based on median diffusion. Sadhana 2013, 38, 621–644. [CrossRef]
19. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [CrossRef]
20. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
21. Chomboon, K.; Kerdprasop, K.; Kerdprasop, N.K. Rare class discovery techniques for highly imbalance data. In Proceedings of the International MultiConference of Engineers and Computer Scientists, Hong Kong, China, 13–15 March 2013.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).