Data Mining: A Tool to Increase Productivity in Supply Chain Management

Raul Valverde
Concordia University
[email protected]

Abstract

Data Mining techniques are able to predict the future by analysing the past. Data Mining is able to sift through massive amounts of data and find hidden information and relationships. Data mining has had an important impact on supply chain management: it has made mass customisation and personalisation systems easier to implement, has been used to detect supply chain fraud in e-procurement transactions, reduces supply chain risk by analysing public data, and improves the financial performance of corporations by optimising the use of supply chain data maintained in data centres. The article has the objective of introducing the main data mining concepts and explaining how these have been applied in supply chain management.

Introduction

The definition of Data Mining is one that is often confused. Many feel it is merely a part of the Knowledge Management concept. In fact, Data Mining is referred to in a broader context as Knowledge Discovery in Databases, or KDD. It may appear as if Knowledge Management and KDD themselves are recycled concepts. In fact, many definitions of Knowledge Management are very similar to many of its predecessors: Information Systems, Decision Support Systems, Expert Systems and their earlier forms. Not only do they all exhibit very similar goals, the methods by which they extrapolate information from data are not too dissimilar either. The Knowledge Management concept "emanates from its earlier definition of capturing, storing and analytically processing that resides in the various companies databases for decision making" (Kanter [1]). This does not appear to be any different from that of a Management Information System. Kanter does note that knowledge includes tacit or implicit knowledge of the user, which does not exist within any database, and this does set KM apart from the pack.

Fayyad et al [6], however, assert that Knowledge Discovery in Databases and Data Mining are different: "The term knowledge discovery in databases, or KDD for short, was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the 'high-level' application of particular data mining methods. The term Data Mining has been commonly used by statisticians, data analysts and the MIS (Management Information Systems) community, while Knowledge Discovery in Databases has been mostly used by artificial intelligence and machine learning researchers… Knowledge Discovery in Databases refers to the overall process of discovering useful knowledge from data while data mining refers to the application of algorithms for extracting patterns from data without the additional steps of the Knowledge Discovery in Databases process." Data mining also differs from common statistical methods in the quantity of data processed. Berry and Linoff [2] define Data Mining as "the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover
meaningful patterns and rules." The choice of words 'automatic or semi-automatic' is interesting but not wholly necessary, as many of the mining techniques may be employed manually. Hand et al [4] state that "Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner."

KDD is also deemed to be an extension of the Data Warehousing discipline, as they both involve the collection of data into a 'silo' which may be analysed [12,23]. The major difference between the two disciplines is the way in which the data is organised. Data Warehousing uses a lightly denormalised data set often derived from numerous operational and transactional relational databases. A Data Warehouse in principle is designed to allow analysts to observe the data set from multiple viewpoints. These multi-dimensional forms of data representation are commonly referred to as data 'cubes'. The data set is commonly summarised in the creation of a Data Warehouse, which does not lend itself to many of the analysis techniques employed by Data Mining. Data Mining, on the other hand, works most proficiently when the data set is arranged in tabular format, a single non-normalised table. Unlike a Data Warehouse, Data Mining works with a fixed viewpoint, and the analyst, assisted by the computer, discovers relationships and structures within the data set. To build a data 'cube' you first need to know the relationships in order to design the model prior to populating it with data.

With the similarities of all the various methods of converting data into information and information into knowledge, why would we need to create a new term? Kanter notes that buzzwords are a positive contribution as they draw attention to the subject at hand. Spiegler [3] goes further: "buzzwords in a fast moving society is a double-edged sword. ...buzzwords tend to create a shallow image of ideas and a notion that their introduction is more for marketing and sales consumption to denote innovation." Data Mining is no exception to this; even though it is indeed a separate branch of information discovery, it has not had time to mature. Data Mining will suffer as the increased speed to market of new concepts within the computing industry does not allow a great deal of time to develop its own language. Data Mining is an interesting derivation of Knowledge Management that requires time before its true potential is realised.

So how does Data Mining differ from other information systems and statistical tools? 'Query and reporting' methods are common among relational database systems and applications. By using the standard 'query and reporting' methods of extracting information from data sets, the analyst will only be able to answer simple questions, i.e. 'Who bought what?' Data Warehousing and 'Online analytical processing' (OLAP) go beyond queries, allowing the user to 'drill down' into the data. OLAP is good for drilling into summary and consolidated data to answer more complex historical questions, i.e. "What is the average annual income of households of pet owners by year by region?" Unlike other disciplines, Data Mining techniques are able to predict the future by analysing the past. Data Mining is able to sift through massive amounts of data and find hidden information and relationships. The other methods are unable to predict the future, as the user is only given results to the questions posed. If you used only queries and OLAP tools you would therefore need to know what you were looking for prior to initiating the search, employ good analytical techniques and have plenty of time before you would eventually find your answer. Data Mining does not suffer from these limitations. As Data Mining uses a host of different algorithms to sort through each record in the data set, it is able to unearth patterns and relationships that were previously unknown. Data Mining goes further than OLAP by allowing the analyst to ask the system to give predictions, i.e. "Who is likely to purchase a cell phone and why?"

Data Mining Concepts

The major driving factor behind the creation of Data Mining is the enormous quantity of data being captured. Many Internet-based businesses have in excess of one billion transactions a day and thus possess databases containing gigabytes or even terabytes of data. These vast mountains of data contain potentially valuable information. Data Mining seeks out the useful patterns and valuable information from the surrounding noise of irrelevant values. Data Mining, as mentioned previously, is the process of seeking meaningful relationships within a data set. The relationships and patterns found using Data Mining must be fresh and original, as Hand et al state: "There is little point in regurgitating well-established relationships (unless, the exercise is aimed at 'hypothesis' confirmation, in which one was seeking to determine whether established pattern also exists in a new data set) or necessary relationships (that, for example, all pregnant patients are female)." [4]

The boundaries of Data Mining as part of knowledge discovery are not precise. Knowledge Discovery in Databases involves several stages: selecting the target data, processing the data, transforming the data where necessary and applying algorithms to discover patterns. Some argue that data transformation is an intrinsic aspect of Data Mining, as without first pre-processing the data the analyst will not be able to ask meaningful questions, or the interpretation of the extracted patterns would be impossible.

Data Mining techniques fall under two simple headings, 'supervised or directed' and 'unsupervised or undirected' learning. The directed learning techniques require the analyst to specify a target field or particular variable of interest. The directed algorithm then sifts through the data set establishing relationships and structures between the chosen target and the independent variables. In the undirected approach, the analyst does not give the algorithm an objective. No variable or target field is specified; therefore, the subject of the question is not defined. The associations between data are not restricted to their dependence on the target. Indeed this allows the algorithm to discover relationships and structures in the data independently of any prior implicit knowledge of the user.

Data Mining incorporates six basic activities:

- Classification
- Estimation
- Prediction
- Affinity Grouping or Association rules
- Clustering
- Description and Visualisation

The first three - classification, estimation and prediction - are examples of directed mining. These work by using the available data to build a model based on the target field or chosen variable. The remainder - affinity grouping, clustering and description and visualisation - are undirected techniques whose goal is to establish relationships between the available variables.
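To make the distinction concrete, here is a minimal Python sketch (the customer fields, values and use of scikit-learn are my own illustrative assumptions, not taken from the text): the directed technique is given a target field to predict, while the undirected one is simply asked to find structure.

```python
# Illustrative sketch: directed vs undirected mining on invented customer data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # directed: requires a target field
from sklearn.cluster import KMeans                # undirected: no target field

rng = np.random.default_rng(0)
# Hypothetical customer records: [age, annual_income_k]
X = rng.normal(loc=[[35, 40]] * 100 + [[60, 90]] * 100, scale=5)
y = np.array([0] * 100 + [1] * 100)               # invented target: responded to offer

# Directed (supervised): the analyst names the target variable up front.
directed = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("directed accuracy on training data:", directed.score(X, y))

# Undirected (unsupervised): no target; the algorithm looks for structure on its own.
undirected = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(undirected.labels_))
```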

Figure 1. KDD and Data Mining should combine cognitive psychology with AI, database technology and statistical techniques to produce an insightful model.

If, whilst creating 'learning' (unsupervised) algorithms, we observe the domain expert's cognitive processes and take them into account, we may be able to increase the overall usefulness of the relationships discovered. The implicit knowledge and perceptions of the domain experts ultimately determine the novelty, usefulness and acceptance of the Data Mining project's findings. Most Data Mining systems produce a single set of associations and make no effort to include this in the knowledge base. The experts, who are expected to
gain insight by applying their knowledge, must evaluate these possibly interrelated concepts. The study of cognitive techniques and of the methods by which knowledge is assimilated could lead to better designed systems [4]. Hearst states that during an examination of the popular texts on the KDD subject none dedicated space to examining methods to ensure the knowledge extracted is useful, novel and understandable: "While some KDD papers cover these topics, most contain unfounded assumptions about 'comprehensibility' or 'interestingness'" [16]. As advances in AI allow for a greater understanding of the learning process, combined with the prevalence of 'vertical' applications, we can expect an increase in the ability of Data Mining algorithms to discover 'interesting' and novel patterns. Problems will occur when, as some Data Mining projects have highlighted, the domain experts' empirical observations are incorrect. Examples can be found in the later sections. This paper will not spend much time discussing data processing issues such as cleansing, validation, transformation and variable definition. Instead, it will focus on the basic principles of the techniques employed to identify data sources and extract relationships from the resultant data set.

Where's The Data?

After settling on a task and the methods you are to employ, you can begin to gather data. The initial data will assist in the induction process of building the model. In some cases, this is a straightforward task, with huge quantities of data already being collected. In other domains, this can be the largest challenge of the process. The availability and quality of the data is dependent on the instrumentation collecting the information. In some manufacturing environments, a plethora of different measuring equipment will be used with varying levels of accuracy. In odd cases, as in the example of Evans and Fisher [17], they had to rely on the technicians running the equipment to manually record values periodically. Most domains will fall somewhere between these two extremes, with the domain experts initially assisting in the classification, or creation, of the training data. Langley insists that "In the ideal situation, the expert systems can be tied directly into the flow of data from the operating system's instruments." [18] With modern machinery making a greater amount of running data available to control systems, I would expect a substantial increase in the amount of engineering data captured.

Data Warehouse Overview

The Data Warehouse is a technology, pioneered by Inmon et al, as a means to store analysis data sets without sacrificing the transaction speed of the production applications. The data is stored separately from the operational systems, as Data Warehouses are primarily concerned with historical data, which in many legacy systems has been archived as it became inactive. Traditionally, reports would be run from these data silos so as to minimise the impact on the operational systems. The Data Warehouse has formalised the collection, storage and representation of this data. The data is moved from what Fayyad terms 'Data Tombs' [11] into a format that allows the business to exploit this potentially valuable information. Inmon defines a Data Warehouse as an "integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making" [7]. This definition provides us with a clear view of how it relates to Decision Support Systems (DSS) and, by quickly analysing its components, an understanding of how a Data Warehouse is created.

Integrated, consolidated and consistent data. A Data Warehouse must collect operational data from a multitude of sources, which in its own right is an arduous task. The Data
Warehouse must ensure that the data gathered is in a consistent state: names, unit attributes, domain limits and characteristics are uniformly applied. The business metrics must be described in the same way throughout the entire enterprise. The Data Warehouse therefore will show a consistent image of the various sources from which the original data had been collected.

A subject-oriented, or topic-oriented, view of the data is markedly different to that of the operational databases. Operational systems tend to hold a large amount of function-focused data sets: product information, accounts, etc. The transactional aspect of these systems is often not regarded as important in the business decision-making process; to this end, much of the variable data stored within a Data Warehouse is summarised. A Data Warehouse would contain, for example, the quantity and monetary value of sales by region, thus allowing the user to produce quick comparisons.

The time-variant aspect is important, as data in a Data Warehouse "constitutes a snapshot of the company history as measured by its variables… within a time framework." [8] This is a stark contrast to the operational systems, which are concerned with reporting the current value at that moment in time. As the contents of the data sources are periodically extracted and used to populate the Data Warehouse, a new time slice will be added to the view and all time-dependent aggregate functions, such as monthly or annual sales, will be recalculated.

Non-volatile data is essential, as the Data Warehouse represents a point in time and therefore should never change. As Microsoft literature states, "after the data has been moved to the Data Warehouse successfully, it typically does not change unless the data was incorrect in the first place." [9] To this end, the Data Warehouse should give a consistent view of the company's history, whereas the operational systems will give a representation of the present.

On-line Analytical Processing & On-line Transactional Processing

What is the difference between OLTP, OLAP and Data Mining? On-line Transactional Processing (OLTP) relies on complete transactional data. Therefore, this style of query and reporting must use the operational or business system as a source of data. The most obvious reason for separating the operational (transactional) data from the analysis data has always been the degradation of the response time of the operational systems. Operational systems often rely on high performance and quick response times; the loss of efficiency and the associated costs incurred through the poor performance of such
systems can be easily calculated or measured [13]. For the casual user of a reporting tool an additional 10-second delay would be negligible, yet if extrapolated across a transactional system processing hundreds or thousands of individual operations the effect would be devastating. On-line Analytical Processing (OLAP) is a descriptive querying tool where analysts verify a hypothesis. Typically OLAP analysis uses predefined, summated or aggregated data, such as 'multidimensional cubes', whereas Data Mining requires detailed data at a high level of granularity, heavily denormalised, which is then analysed at the individual record level. Regardless of how the questions are formulated, the results returned by OLAP applications are purely factual. For example, the number of blue shoes sold in March in Paris was 123. Data Mining, on the other hand, is a form of discovery-driven analysis. The use of artificial intelligence and statistical techniques allows the model to make predictions or estimates about the outcomes of future events. "Data Mining techniques are used to find interesting, often complex, and previously unknown patterns in data." [12], i.e., how many blue shoes should be ordered for Europe next year? Han et al in their study into On-line Analytical Mining claim "based on our analysis, there is no fundamental difference between the data cube required for OLAP and that for OLAM, although OLAM analysis may often involve the analysis of more dimensions with finer granularities, or involve the discovery-driven exploration of multi-feature aggregations on the data cube." [23] This affirms that the underlying data source for OLAP and Data Mining does not need to be different, although in practice many Data Warehouses are unsuitable for both.
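The contrast can be made concrete with a short Python sketch (the sales table, column names and use of pandas and scikit-learn are illustrative assumptions, not taken from the text): the OLAP-style step answers a purely factual aggregate question, while the mining-style step fits a model to history and returns a forward-looking estimate.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical detail-level sales records.
sales = pd.DataFrame({
    "city":   ["Paris", "Paris", "London", "Paris", "London"],
    "month":  [1, 2, 3, 3, 3],
    "colour": ["blue", "blue", "red", "blue", "blue"],
    "units":  [40, 55, 30, 28, 33],
})

# OLAP-style question: factual, answered by aggregating what has already happened.
march_blue_paris = sales.query("city == 'Paris' and month == 3 and colour == 'blue'")["units"].sum()
print("Blue shoes sold in Paris in March:", march_blue_paris)

# Mining-style question: fit a (deliberately crude) model on history and predict forward.
monthly = sales.groupby("month", as_index=False)["units"].sum()
model = LinearRegression().fit(monthly[["month"]], monthly["units"])
print("Predicted units for month 4:", round(float(model.predict(pd.DataFrame({"month": [4]}))[0]), 1))
```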

Problems for Data Mining using a Data Warehouse

"Even though data mining in the detail data may account for a very small percentage of the data warehouse activity, the most useful data analysis might be done in the detail data." [13] It appears that Microsoft have conceded that a Data Warehouse may have problems satisfying Data Mining requirements. The granularity of the data warehouse schema will have been optimised for OLAP use. With data summarised to a weekly or greater level, most Data Mining techniques would not be valid. If we looked at the Data Warehouse, we may
notice that the subject hierarchy of the structure worked at a regional level and therefore would be an inadequate data source to investigate customer behaviour in depth. Most commonly, the Data Warehouse has been constructed to fulfil a specific requirement rather than as a generic data repository for global analysis [12,23]. Later in this paper, I cite the resource-intensive nature of the gathering, cleaning and loading of the data into the Data Warehouse as a major limiting factor when contemplating a Data Warehouse project. With this in mind, the data sets created tend to become 'leaner' and narrowly focused. The data structure is formulated from the reporting requirements as set out by the business; this invariably produces a limited focus. Within Data Warehouses, the data are generally pre-processed prior to loading. By applying standard business rules the data will support a high level of consistency, yet the consolidated views may be too narrow. As Data Mining algorithms will not be able to give a comprehensive analysis if the data is over-processed, the Data Warehouse may be a false grail.

How do I gather the data?

Common structures in Data Mining include continuous variables, classification variables, binary structures, and temporal and spatial data.

Data Mining applications are built around models that are similar to a relational table. The model will contain key attributes, input values and fields for predictive elements; this structure will be associated with a Data Mining algorithm. Once the initial model is created, a process of 'training' is used to hone it and increase its overall effectiveness. Model training is carried out by running a sample dataset against the Data Mining algorithm. This type of data set is known as a 'training set'. The algorithm will recursively iterate through the training set several times, extracting the patterns it discovers. The discovered patterns are stored in the Data Mining model. Soni et al illustrate the Data Mining model "as a 'truth table' containing rows for every possible combination of the distinct values for each column of the model" [22].
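Soni et al's 'truth table' view can be illustrated with a few lines of pandas (the column names and the toy training set below are my own invented example, not theirs): every observed combination of the distinct input values is tabulated, together with the outcome counts that training accumulates.

```python
import pandas as pd

# Invented training set: a key attribute, input columns and the field to be predicted.
training = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "income":      ["high", "high", "low", "low", "high", "low"],
    "debt":        ["low", "low", "high", "low", "high", "high"],
    "defaulted":   ["no", "no", "yes", "no", "no", "yes"],
})

# 'Truth table' style model: one row per combination of distinct input values,
# with the count of each outcome observed for that combination during training.
model = (training
         .groupby(["income", "debt", "defaulted"])
         .size()
         .unstack(fill_value=0))
print(model)

# A real algorithm would iterate over the training set several times,
# refining the patterns it stores rather than just counting them once.
```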

Problems identifying sources: Extraction, Transformation and Loading

The primary reason for combining the data from multiple source applications is the ability to cross-reference data between these applications. The main stages are:

- Extraction - ways and means, basic problems
- Transformation - ways and problems
- Loading

A compressed sketch of the three stages follows.
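The sketch below (Python with pandas and SQLite; the source extracts, column mappings and warehouse table are hypothetical stand-ins I have invented so the example runs end to end) reduces extraction, transformation and loading to their bare essentials; a real ETL process would add validation, error handling and incremental loads.

```python
import sqlite3
from io import StringIO
import pandas as pd

# Extraction: in practice these would be pulls from two source applications;
# here small in-memory extracts stand in for them.
orders = pd.read_csv(StringIO("OrderNo,Cust,Amt_GBP\n100,Acme,250\n101,Bolt,90\n"))
shipments = pd.read_csv(StringIO("order_id,customer,weight_kg\n100,Acme,12.5\n101,Bolt,3.0\n"))

# Transformation: harmonise names and keys so records can be cross-referenced.
orders = orders.rename(columns={"OrderNo": "order_id", "Cust": "customer", "Amt_GBP": "amount"})
merged = orders.merge(shipments, on=["order_id", "customer"], how="inner")

# Loading: write the consolidated, consistent data set into a warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    merged.to_sql("order_facts", conn, if_exists="replace", index=False)
    print(pd.read_sql("SELECT * FROM order_facts", conn))
```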

Why most Data Warehouse & Data Mining projects fail

Data consolidation, or fusion, is an all-encompassing problem cutting across the entire data infrastructure. The initial hurdle to overcome with any data consolidation effort is locating the data required to begin mining. Unfortunately, most efforts at building a DSS
infrastructure, including Data Warehouses, have proved to be big, complicated, and expensive. A number of industry reports state that the majority of Data Warehouse projects fail for the same reasons [7]. Hence, although some data accumulates in stores, it is not organised in a format conducive to data mining or to anything more than basic decision support functions. Much of the problem involves the data extraction, transformation and loading. How can a data administrator consistently reconcile a variety of heterogeneous data sources? Often placed under the banner of data integration, warehousing or other IT projects, the problem tends to lie unresolved, hence impeding Data Mining initiatives. The problem of creating and maintaining data warehouses remains one of the major obstacles to successful Data Mining. Before the analysts are able to apply any algorithm, they must spend an exceptional length of time, sometimes years, pulling together the required data sources.

Data Mining Techniques

Classification

Predictive modelling uses methods for predicting a single field, occasionally more than one, by using the remaining fields. If the variable being predicted is categorical (i.e. approve or reject a loan) this is known as classification. By placing the subject into a predefined class, further analysis can be done easily. Classification is a methodology which relies on a model that will classify the unclassified data. This model can be as basic as: if the applicant has high income and low debts then low risk; if low income and high debts then high risk. The most common form of classification is the 'Decision Tree' structure. Classification And Regression Trees (CART) is a statistical method to produce models with a tree-type structure. The classification aspect of this model involves mapping a variable to a predetermined category. The CART algorithm is built using a series of binary questions structured in a hierarchical manner. The representation of such structures closely resembles an inverted 'tree', hence they are referred to as decision trees. The record is assessed, commonly by applying a single variable against the question, and the binary decision will route it via one of the branches. These tests through the nodes of the tree resolve in the classification of the record when it reaches the node at the end of a branch, a 'leaf' node.

Figure 2. Example of a Decision Tree.

The structure of CART trees is created from the data being analysed. CART operates by finding the most suitable variable for splitting the data into two sub-categories at the 'root' node, the base of the tree. The outcome of the splitting methods applied is that two distinctly different subset categories are derived. The splitting process is then applied to these subsets, the 'child nodes', recursively. The initial tree will be far too large and the model will be considered to be overfitted. The tree must then undergo a complicated 'pruning' process to determine the final size. The pruning process is part of a wider series of testing methodologies termed 'scoring', which is discussed later in this paper. If the pruning process is over-ambitious the tree will be too small and may be insufficient to classify accurately, producing poor predictions. One of the major advantages of using CART methods is that mixed variables can be used in the algorithm during the classification process, as each node is a simple binary test. Unlike other classification methods that combine variables to create boundaries, the CART model assesses the individual variables against each test. Hence, the use of CART against mixed-variable datasets is common practice. For example the test may be for "Males over 6 feet tall", a mixture between a categorical and a continuous value which could not be resolved by simple mathematical techniques.
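The grow-then-prune sequence described above can be sketched with scikit-learn's CART-style decision tree (the loan-applicant data are invented, and the pruning strategy shown, cost-complexity pruning scored against held-back data, is one reasonable choice rather than the only one):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Invented applicant records: [annual_income_k, debt_k]; target 1 = low risk.
X = rng.uniform([10, 0], [120, 60], size=(400, 2))
y = ((X[:, 0] > 50) & (X[:, 1] < 30)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a deliberately overfitted tree, then derive candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Refit at each pruning strength and keep the tree that scores best on held-back data.
pruned = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train) for a in alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print(export_text(pruned, feature_names=["income", "debt"]))
```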

Estimation

Where classification deals with discrete results, estimation produces a continuous-valued output. Where the classification example gives a credit rating as high or low, estimation will give a value, within set boundaries, commonly a credit risk percentage. In practice, estimation is often used as a form of classification: by estimating the probability of a customer responding to an event, the analyst can rank the results into discrete classes. The deciding factor is the threshold between classes. In models that produce estimations, and predictions, the resultant variable is an expression of the other variables. This allows us to estimate a parameter of the object, e.g. a customer, based on the given values. The distinction is that the estimation function does not produce a categorical result but a quantitative one. The most commonly used method to estimate values or parameters of an object is the 'maximum likelihood' technique. Maximum likelihood estimation, MLE, uses the probability of a specific data value being observed in a dataset. The estimated value will be the one with the highest probability of arising within the dataset, the maximum likelihood estimate. The MLE method views the parameter being estimated as a fixed value [4, ch. 4.5]. The usual techniques employed in this method are 'Linear' and 'Logistic' Regression.
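The estimation-then-threshold idea can be shown with logistic regression in a few lines (the credit data below are synthetic and the 0.5 cut-off is an arbitrary illustrative choice): the model produces a continuous risk estimate, which the analyst then cuts into discrete classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Synthetic applicants: [annual_income_k, existing_debt_k]; 1 = defaulted.
X = rng.uniform([10, 0], [120, 60], size=(300, 2))
y = (X[:, 1] - 0.4 * X[:, 0] + rng.normal(0, 5, 300) > 0).astype(int)

model = LogisticRegression().fit(X, y)

applicant = np.array([[45.0, 30.0]])             # income 45k, debt 30k
risk = model.predict_proba(applicant)[0, 1]      # estimation: a continuous output
print(f"Estimated default risk: {risk:.1%}")

threshold = 0.5                                  # classification: a discrete output
print("Class:", "high risk" if risk >= threshold else "low risk")
```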

Prediction

Prediction is an extension of the classification and estimation methods with a minor difference. The predictive model is categorising or estimating a variable, based on the current inputs, thus predicting future behaviour or patterns. Previously we used credit risk as the example, where we could determine whether the customer was currently an acceptable or unacceptable risk. If we used a predictive model, we would make estimates as to the perceived future credit risk of that customer.

The creation of a predictive model relies heavily on highly accurate historical data. By careful analysis of the historical values a model can be produced that accurately emulates the present. Once the model is built it can be applied to current values; the result is a prediction of the future. The ultimate aim of any predictive model is to make predictions for objects whose outcome values are not yet known. The most well known of the methods used to predict variables is the non-linear modelling technique of 'Neural Networks'. The aim of these supervised classification models is to assign the correct category to new objects, based on the values supplied. As with all classification we are essentially interested in defining the boundaries between classes. When creating an estimation model you can choose to make simple assumptions about the boundaries, thus applying a linear plane to define the divide between two classes. The simple linear model produces what is known as a 'hyperplane':

Y=a0 + ajXj Figure 3. Linear function (see hand pg 169)

In the above simple linear function we notice that an extraordinary level of empirical knowledge is required, as the model is 'highly fitted'. The model can be improved whilst retaining the simple additive nature, extending beyond a linear function of the input variables. The inclusion of a function fj in the model will allow for a non-linear boundary, i.e. fj could be a log, square root or other transformation of the predicting variable X.

Y = a0 + Σj aj fj(Xj)

Figure 4. Additive function with non-linear transformations [4, p. 170].

As we can observe from the graphs (Figure 5), increasing the number of components of the prediction variables, X, enhances the accuracy of the estimation function. If the number of components, p, is low then the estimation problem will be fairly straightforward. However, as the input dimension increases, the number of interaction components increases as a combinatorial function of p. The number of possible interactions between the prediction
variables swells dramatically as p increases. The most practical method is to select a subset of the total interactions within the model. If data-driven selection routines are used to carry out the selection of interactions, the problem will multiply almost exponentially.

Figure 5. Example of Linear and Non-linear regression.

As with the linear estimation models, the basic function can be further generalised if we work under the assumption that Y is locally linear in the prediction variables (the X's). This method allows different dependencies to be applied to various regions of the variable space; this is termed a 'piecewise' linear model. The model structure includes both the boundaries, or locations, of the hyperplanes and the local parameters for each sub-plane. The piecewise model divides the straight line into k segments, thus giving an approximation of the 'true' curve. Commonly the resultant segments would be joined in a continuous line; occasionally this is not desirable and the pieces may be left disconnected. The results produced by such a model may prove to be problematic, as the predicted value can leap with only a minor change in the predictive input values. If the lack of continuity of the disconnected model is not acceptable, a further variation can be applied: enforcing a continuous line by constraining the values at the ends of the segments. This will generally lead to segment lines which are no longer straight; such curved-segment models are termed spline functions.

Figure 6. Piecewise linear regression.

The piecewise linear model, mentioned previously, is an example of a method of creating relatively complex models from a non-linear variable set. The piecing together of simple local components to build up the overall model is a recurring theme in Data Mining. This local-component theme is further explored in the Clustering and Grouping subsections.
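A minimal piecewise linear fit can be written with numpy alone (synthetic data; the break points below are fixed by hand, whereas a fuller treatment would fit them, or join the segments into a spline, as discussed above):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, x.size)          # the 'true' curve plus noise

# Divide the x axis into k segments and fit a straight line to each one.
k = 5
edges = np.linspace(x.min(), x.max(), k + 1)
prediction = np.empty_like(y)
for left, right in zip(edges[:-1], edges[1:]):
    mask = (x >= left) & (x <= right)
    slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
    prediction[mask] = slope * x[mask] + intercept   # one local linear component

rmse = np.sqrt(np.mean((prediction - y) ** 2))
print(f"Piecewise linear fit with k={k} segments, RMSE = {rmse:.3f}")
```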

Affinity Grouping or Association rules

As the name suggests, associated items are grouped together by this method. Affinity grouping is widely used by the retail industry to arrange goods within the store: as certain items are often purchased together, it is better if they are placed where they will be seen together. Another form of this technique is 'sequence analysis', a variation on affinity analysis. Using this method, you could begin to understand the order in which events occur. Sequence analysis may find that bank customers who have purchased a new home often buy a new car within 3 months. Armed with this knowledge a marketing strategy can be developed. The task of discovering patterns and rules within data is to find all associations between parameters that occur with a certain level of regularity and precision. "A pattern is a local concept, telling us something about a particular aspect of the data." [4] An association rule expresses the likelihood of certain events within the dataset. At its simplest, an association rule would be: if A=0 and B=1 then C=1, with probability p.

The probability (p) of the existence of this set of events is known as 'confidence'. The aim of most rule discovery routines is to find all associations that not only display a confidence rating greater than a determined threshold, but also satisfy the basic event constraints to a defined probability level. The probability of the rule conditions being met is referred to as 'accuracy', p(C=1|A=0,B=1). Hence we would search for associations between the variables with an accuracy of x and a confidence of y. By using the combination of accuracy and confidence we would have an increased possibility of discovering novel and interesting rules. Even though such rules are not always a good, or strong, indication of the data population as a whole and are viewed as a "weak form of knowledge; they are really just summaries of cooccurrence patterns in the observed data" [4], they provide a robust method for the discovery of simple patterns in data. The proliferation of association rule applications has come through 'market-basket' analysis. Rule discovery has given new insight into such data due to the nature of its size: a matrix involving millions of rows (shoppers) and thousands of columns (items). In these 'sparse' structures it is very difficult to observe relationships or associations directly. A commonly used method to discover relationships within the data is the Apriori algorithm. The algorithm works by finding the frequent sets of 1 parameter, i.e. the most common values or classes (milk, bread and cheese). Once these single-variable sets are found, the algorithm then searches for sets containing 2 parameters based on the values from the single-parameter results. The second pass will look for sets {X,Y} only if both {X} and {Y} are frequent. The process continues to sets of 3 and so on. The Apriori method is used to reduce the number of passes through the data as a way of speeding up the search. Sequence analysis is based on similar methods to the association techniques, with the main difference being the focus on the order of events. The sequence of the events is far more important when working with temporal rather than spatial data sets.
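The core of the Apriori idea fits in a few lines of plain Python (the toy baskets and the support threshold below are arbitrary illustrative choices): frequent single items are found first, and larger candidate sets are only built from sets that were themselves frequent.

```python
from itertools import combinations

baskets = [                                   # toy market-basket data
    {"milk", "bread", "cheese"},
    {"milk", "bread"},
    {"bread", "cheese"},
    {"milk", "cheese"},
    {"milk", "bread", "cheese", "beer"},
]
min_support = 3                               # an itemset must appear in >= 3 baskets

def support(itemset):
    return sum(itemset <= basket for basket in baskets)

# Pass 1: frequent single items.
items = {item for basket in baskets for item in basket}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Pass k: build candidates only from itemsets that were frequent at pass k-1.
while frequent[-1]:
    candidates = {a | b for a, b in combinations(frequent[-1], 2) if len(a | b) == len(a) + 1}
    frequent.append({c for c in candidates if support(c) >= min_support})

for level in frequent:
    for itemset in sorted(level, key=sorted):
        print(sorted(itemset), "support =", support(itemset))
```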

Clustering

Clustering is the segmentation of the data into a number of groups and subgroups. These groups and subgroups are referred to as 'clusters'. The items are grouped on the basis of similarity of characteristics; unlike classification techniques, clustering does not use predefined classes.

The analyst must determine the meaning of the resultant clusters and the value of the discovered associations. Clustering is often a prelude to another Data Mining technique being applied. For example, it would be advantageous to segment the customer base into clusters before beginning a marketing campaign analysis.
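A brief scikit-learn sketch of this sequence is shown below (the customer attributes, segment shapes and choice of three clusters are all invented for illustration): the algorithm only returns groups; deciding what each segment means is left to the analyst.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Invented customers: [age, annual_spend]; three latent segments.
customers = np.vstack([
    rng.normal([25, 300], [3, 50], (100, 2)),
    rng.normal([45, 1200], [5, 150], (100, 2)),
    rng.normal([65, 600], [4, 80], (100, 2)),
])

scaled = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Interpretation step: summarise each cluster so the analyst can name the segment.
for label in range(3):
    segment = customers[kmeans.labels_ == label]
    print(f"cluster {label}: {len(segment)} customers, "
          f"mean age {segment[:, 0].mean():.0f}, mean spend {segment[:, 1].mean():.0f}")
```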

Description and Visualisation

Simply put, the main purpose of most Data Mining projects is to describe the interactions within complicated data sets, so creating an increased understanding of the situation. The knowledge extracted pertaining to the customers, products or processes will provide the business with an insight into those areas. If the description of the data is well presented, it may in turn suggest an explanation for the behaviour. Data visualisation is a very powerful descriptive tool in Data Mining. It is not easy to produce a meaningful visual representation of the data, yet if one is found, the human mind is a far more practical tool for extracting meaning from visual data than a machine.

Scoring

None of the aforementioned techniques would be any good if we did not have a way of judging the validity of the model or algorithm. The techniques used to assess the discovery routines are referred to as 'scoring'.
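One common scoring approach, holding data back and cross-validating, is sketched below (synthetic data; scikit-learn assumed); comparing such scores across candidate models is what guides decisions like how far to prune a decision tree.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, 500) > 0).astype(int)

# Hold back a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0)

# Score 1: k-fold cross-validation on the training data.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validated accuracy:", round(cv_scores.mean(), 3))

# Score 2: accuracy on the held-back test set.
model.fit(X_train, y_train)
print("hold-out accuracy:", round(model.score(X_test, y_test), 3))
```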

Data Mining in the Workplace

Overview

The traditional approach to data analysis for decision support has been to combine domain expertise with statistical modelling techniques to develop bespoke solutions. With the increasing availability of large volumes of good quality, high-dimensional data, this paradigm is changing. As Inmon [10] suggests, there are three basic types of data user: "Tourists, Explorers and Farmers." The Tourist is a casual viewer of the data and normally has little, or no, knowledge of the tools and techniques used to define the results. The Farmer is a regular user of the data and will often be using the results to verify a hypothesis. "The Farmer finds small flakes of gold with every query." The methods of the Farmer will be predictable; often one would expect this type of user to work on smaller data samples. The Explorer is someone who "thinks outside the box", makes unpredictable searches within data and operates on large amounts of detailed historical information. This type of data user is the most likely to require or employ Data Mining and KDD techniques. Data Mining can provide them with information they did not even know they were looking for. Explorers can discover large pieces of business-data gold, but they can also miss the gold and dig up a lump of coal instead if they are not careful.

As with most new technologies, without a great penetration into the business world the methods will merely continue the improvement of theoretical analysis. In our commercial economy, we are beginning to find a trend towards 'vertical', or domain-specific, solutions as opposed to early efforts to develop new tools for Data Mining [11]. Modern customer relationship management and Web analysis solutions contain some level of embedded Data Mining technology, thus highlighting the trend. By changing the focus of Data Mining to increasingly target the end-user rather than the analyst, this steps away from the idea of Data Mining experts. Data Mining is concerned with allowing a simple, convenient and efficient method of discovering patterns in large data stores [4]. As with any business application, the Data Mining work must deliver measurable results and benefits. These benefits may be in the form of reduced operating costs, in the case of targeted marketing, or improved profits, as experienced with the credit card fraud algorithms. The more the applications embed increasingly automated Data Mining tools, the greater the likelihood that the user will only discover small grains of 'gold', rather than the nugget for which they yearn. The results may be
encouraging initially, with an increased uptake of automatic/semi-automatic technology, yet the body of knowledge may be diluted if it is not used with caution. A positive side effect of producing these emerging vertical applications, fortunately, is new, disciplined approaches to data collection and storage. Consistent data, as mentioned previously, is the prerequisite for Data Mining projects, and its absence commonly forces businesses to abandon initiatives [11].

Data Mining in Supply Chain

As Data Mining is an ongoing process, the supply chain model would continue to evolve as new nuggets of knowledge are extracted from the resultant data. "Before the full functional and economic benefits of a computer integrated manufacturing operation can be realized, a method of replacing the intelligence of the displaced equipment operators must be found." [15] A mixture of traditional 'expert' computer control system technology and the emerging Data Mining analysis could be an invaluable tool in the endeavour to address this type of problem. Information Technology has impacted supply chain management in a positive way. Valverde & Saadé [24][25][26] conducted a study that shows that e-supply chain management had a positive effect in the electronic manufacturing services industry, as the profits of the firms increased and internal communications were improved due to the implementation of e-supply chain management. Information Technology has also impacted the supply chain with the rapid rise of Business to Business (B2B) transactions over the internet, with the increasing use of e-procurement solutions by large organizations for purchasing [27] and collaboration [28]. RFID devices have been used to optimize the operation of warehouses, stores and transportation logistics [29][30][31].

Data Mining in the Supply Chain for the Printing Industry

The printing industry has been seeking to increase its ability to individually customise, or personalise, books. This has been available in a very limited fashion, the mainstay being merely 'inkjetting' data directly onto the printed copy.

Claudia Imhoff recently discovered the possibilities when "my daughter received a book for her birthday from her grandmother. The book looked like an ordinary book until we began reading it. The very first paragraph started out, "Jessica Imhoff lives in Colorado with her three cats, Sailor, Abby and Alice. One day her grandmother traveled from South Carolina to visit her..." and off they all went on their adventure." [20] The manufacturing process no longer relies on mass production, as the customisation of the book is the 'added value' of the process. By creating points in the text of the product at which to insert, or amend, words and phrases, inexpensive and effective customisation is permitted. The mass-produced copy of offset printing will not disappear. Yet the days of printing thousands of brochures or promotional materials with the intention of distributing them to all clients and prospects are giving way to an era of focused marketing. Printers will need to combine their printing expertise with 'variable-data' printing technologies to take advantage of what has become an extremely competitive environment. We can see examples of mass customisation in many printed items around us: the mass of promotional leaflets that arrive through the mail, most of which are based on previous purchases, and the regionalisation of national newspapers and TV listings magazines; some subscription magazines have begun to include various sections based on the customer segment to which you belong. These mass customisation and personalisation systems could work on a simple data query method, but this would not allow for focused campaigns or prediction of the most likely clients for new offerings or promotions. Data Mining must be the driver for such systems. 1214, software developers of variable digital printing applications, state that the use of these new applications "opens up the world of one-to-one marketing to businesses that do not have IT departments and data mining initiatives." [21]

Data Mining in Supply Chain Fraud

The concept of exchanging goods and services over the Internet has seen an exponential growth in popularity over the years [32]. This increase in internet transactions generates large data volumes, and the inability to analyse them enables fraudulent activities to go
unnoticed in supply chain management processes such as procurement, warehouse management and inventory management. This fraud increases the cost of supply chain management, and a fraud detection mechanism is necessary to reduce the risk of fraud in this business area. A study was carried out by Kraus & Valverde [33] in order to develop a data warehouse design that supports data mining forensic analytics, using Benford's law to detect fraud. The data warehouse with data mining technology had the main objective of being able to detect fraud in supply chain management transactions [33]. Benford's law is a data mining technique often used in forensic accounting for fraud detection and has been proven useful in supply chain fraud management.
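A minimal first-digit test in the spirit of Benford's law is sketched below (the simulated invoice amounts and the 0.02 flag threshold are my own illustrative choices, not the design used in [33]): observed leading-digit frequencies are compared with the logarithmic distribution the law predicts, and large deviations are flagged for investigation.

```python
import numpy as np

# Benford's law: P(leading digit = d) = log10(1 + 1/d), for d = 1..9
digits = np.arange(1, 10)
expected = np.log10(1 + 1 / digits)

# Simulated invoice amounts; genuine multiplicative data tend to follow Benford.
rng = np.random.default_rng(6)
amounts = np.exp(rng.normal(6, 1.5, 5000))

# Leading significant digit of each amount.
leading = (10 ** (np.log10(amounts) % 1)).astype(int)
observed = np.array([(leading == d).mean() for d in digits])

for d, o, e in zip(digits, observed, expected):
    flag = "  <-- investigate" if abs(o - e) > 0.02 else ""
    print(f"digit {d}: observed {o:.3f}, expected {e:.3f}{flag}")
```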

Data Mining for financial performance improvement of the supply chain

The advancement of RFID technology has been viewed by many as one of the most beneficial developments in the business world. Furthermore, the progress in this technology has motivated software and hardware manufacturers to leverage RFID capability and drive the adoption of RFID in the data center market. RFID technology holds promise in transforming supply chain management by providing real-time intelligence for tracking enterprise assets. As it stands, the objective of RFID is to manage the entire life cycle of an asset by determining the time of initial asset acquisition, the asset's physical location, the asset's movement within a data center and the time of the asset's ultimate decommission. In addition, RFID is also capable of managing the motion of devices in and between data centers, thus enhancing the ability to forecast data center capacity. Although this technology has been readily adopted and implemented in commercial sectors such as the retail environment, its introduction and implementation in the financial market sector has not occurred with similar speed and enthusiasm, suggesting the presence of some reluctance. However, financial institutions are under pressure from clients to provide real-time financial data and are looking at data centers integrated with RFID-based supply chain systems for this purpose. A study performed by Khan & Valverde [34] examined the contribution of RFID supply chain systems and data mining in data centers, and the motivation of financial institutions to use these data center and data mining technologies in order to become more competitive in the market.

Data mining for supply chain risk management

Data mining has a history of good results in risk management [35][36][37]. In recent times, companies have increasingly been forming global supply chains and favouring global sourcing practices to lower purchase prices. While global sourcing truly offered the expected benefits in the short run, it increased the risk of facing several challenges in the long run. One of the major issues is supplier financial distress leading to supplier bankruptcy due to slower demand, shrinking liquidity, and increasing pressure on cost [38]. In a study developed by Valverde [39], the Black-Scholes-Merton (BSM) model was used for default prediction, together with risk pooling management techniques, as a way to reduce the risk due to supplier bankruptcy. A data mining technique was used to populate a database with a sample of companies selected from the New York Stock Exchange, and historical stock price data from the Center for Research in Security Prices database were collected in order to calculate the probability of bankruptcy of a sample of suppliers from different industries using the BSM model. This data mining technique proved to be beneficial for evaluating suppliers and avoiding those that represent a high risk to the company due to their high risk of bankruptcy.
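The calculation at the heart of such an approach can be sketched in a few lines of Python (illustrative parameter values; this is the standard Merton-style distance-to-default formula rather than the exact procedure of [39]): the probability that the supplier's asset value falls below its debt at the horizon is read off the normal distribution.

```python
from math import log, sqrt
from scipy.stats import norm

def merton_default_probability(assets, debt, mu, sigma, horizon=1.0):
    """Probability that asset value falls below the debt level at the horizon,
    under the Black-Scholes-Merton assumption of lognormally evolving assets."""
    d2 = (log(assets / debt) + (mu - 0.5 * sigma ** 2) * horizon) / (sigma * sqrt(horizon))
    return norm.cdf(-d2)

# Illustrative supplier: assets 120M, debt 100M, 5% drift, 35% asset volatility.
print(f"1-year default probability: {merton_default_probability(120, 100, 0.05, 0.35):.1%}")
```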

Conclusion

By using Data Mining techniques with complex algorithms, the current traditional 'open' process manufacturing practices could be transformed into 'closed-loop' systems, thereby removing the need for the present high numbers of human machine operators, increasing throughput and reducing overheads. The future of commercial printing is likely to be radically different from the past few decades. The notion that printers can improve profitability by installing faster presses will become increasingly untenable because their clients' needs have changed in response to mass media and, increasingly, the Internet. Data mining has also proven to have high potential in the detection of fraud in the supply chain, the improvement of the financial performance of the corporation through better use of assets, and the reduction of supply chain risk.

Bibliography and References

[1] Kanter J. (1999) "Knowledge Management, Practically Speaking." ISM ACM Fall, pg. 7-15
[2] Berry M. and Linoff G. (2000) "Mastering Data Mining." - Wiley & Sons
[3] Spiegler (2000) "Knowledge Management: A New Idea or a Recycled Concept?" AIS Vol. 3 Article 14
[4] Hand D., Mannila H., Smyth P. "Principles of Data Mining" - MIT Press
[5] "Oracle9i Data Mining" - An Oracle white paper, Dec 2001
[6] Fayyad U., Piatetsky-Shapiro G., Smyth P. and Uthurusamy R. (1996) "Advances in Knowledge Discovery and Data Mining" - AAAI/MIT Press
[7] Inmon W., Kelley C. (1994) "Twelve Rules of Data Warehouses for a Client/Server World" - Data Management Review, May, pg. 6-16
[8] Coronel R. (2000) "Database Systems Design, Implementation & Management" - Thompson Learning, MA.
[9] "Microsoft SQL Server version 7.0 Technical Manual" (1998) - Microsoft Press
[10] Inmon W. (1999) "Tourists, Explorers and Farmers." Bill Inmon White Papers, www.BillInmon.com
[11] Fayyad U. and Uthurusamy R. (2002) "Evolving Data Mining into Solutions for Insights." - Communications of the ACM, Aug 2002, Vol. 45 No. 8, pg. 28-31.
[12] Zaima A. (2002) "Data Mining Primer for the Data Warehousing Professional" - A Teradata White Paper EB-3078, www.teradata.com
[13] Gupta V.R. (1997) "Data Warehousing with MS SQL Server" - Microsoft Press
[14] "Oracle9I Data Mining Concepts" - (June 2001) Release 9.0.1, Oracle Corp A90389-01 [15] Grier A., Hsiao R. (1990) "A Decision Support System For Correction Of Manufacturing Process Problems" - ACM Pg. 395-403 [16] Pazzani M., Hearst M. (2000) "Knowledge Discovery From Data?" - March IEEE Intelligent Systems Pg. 10-13 [17] Evans R., Fisher D. (1994) "Overcoming Process Delays With Decision Tree Induction" IEEE Expert 9 PG. 60-66 [18] Langley P., Herbert S. (1995) "Applications Of Machine Learning And Rule Induction" Communications of the ACM, Nov 1995 Vol. 38 No 11 Pg. 55-64 [19] Luger G. (2002) "Artificial Intelligence. Structures and Strategies for Complex Problem Solving" - Addison Wesley [20] Imhoff C. (2001) "Intelligent Solutions: Techno Wave – Mass Customization" - Data MiningReview.com (accessed Wednesday, June 19, 2002) [21] Hamilton A (2000) "A Practical Solution To Producing Personalized Communications For The Printing Industry." - Digital Works, 1214U white paper www.candcc.com 1214uwp.pdf [22] Soni S., Tang Z., Yang J. (2001) "Performance Study Of Microsoft Data Mining Algorithms" - Microsoft Technical Papers [23] Han J., Chee S.H.S., Chiang J. (1998) "Issues For On-Line Analytical Mining Of Data Warehouses"

-

White

Paper

Simon

Fraser

University

BC

Canada,

http://db.cs.sfu.ca/dmkd98.pdf [24] Valverde, R., & Saadé, R. G. (2015). The Effect of E-Supply Chain Management Systems in the North American Electronic Manufacturing Services Industry. Journal of theoretical and applied electronic commerce research, 10(1), 79-98.

[25] Grittner, D. H., & Valverde, R. (2012). An object-oriented supply chain simulation for products with high service level requirements in the embedded devices industry. International Journal of Business Performance and Supply Chain Modelling, 4(3-4), 246-270.
[26] Valverde, R. (Ed.). (2012). Information Systems Reengineering for Modern Business Systems: ERP, Supply Chain and E-Commerce Management Solutions. IGI Global.
[27] Stephens, J., & Valverde, R. (2013). Security of e-procurement transactions in supply chain reengineering. Computer and Information Science, 6(3), p1.
[28] Avédissian, A., Valverde, R., & Barrad, S. An Extension Proposition for the Agent-Based Language Modeling Ontology for the Representation of Supply Chain Integrated Business Processes.
[29] Adoga, I., & Valverde, R. (2014). An RFID Based Supply Chain Inventory Management Solution for the Petroleum Development Industry: A Case Study for Shell Nigeria. Journal of Theoretical and Applied Information Technology, 62(1), 199-203.
[30] Felix, F., & Valverde, R. (2014). An RFID Simulation for the Supply Chain Management of the UK Dental Industry. Journal of Theoretical and Applied Information Technology, 60(2), 390-400.
[31] Rathore, A., & Valverde, R. (2011). An RFID based E-commerce solution for the implementation of secure unattended stores. Journal of Emerging Trends in Computing and Information Sciences, 2(8), 376-389.
[32] Massa, D., & Valverde, R. (2014). A fraud detection system based on anomaly intrusion detection systems for e-commerce applications. Computer and Information Science, 7(2), p117.
[33] Kraus, C., & Valverde, R. (2014). A Data Warehouse Design for the Detection of Fraud in the Supply Chain by Using the Benford's Law. American Journal of Applied Sciences, 11(9), 1507-1518.
[34] Khan, N., & Valverde, R. (2014). The use of RFID based supply chain systems in data centers for the improvement of the performance of financial institutions. Engineering Management Research, 3(1), p24.
[35] Valverde, R. (2011). A Business Intelligence System for Risk Management in the Real Estate Industry. International Journal of Computer Applications, 27(2), 14-22.
[36] Valverde, R. (2011). A Risk Management Decision Support System for the Real Estate Industry. International Journal of Information, 1(3).
[37] Valverde, R. (2010). An Adaptive Decision Support Station for Real Estate Portfolio Management. Journal of Theoretical and Applied Information Technology, 12(2), 84-86.
[38] Valverde, R., & Talla, M. (2013). Risk Reduction of the Supply Chain Through Pooling Losses in Case of Bankruptcy of Suppliers Using the Black-Scholes-Merton Pricing Model. In: Some Recent Advances in Mathematics and Statistics. World Scientific, pp. 248-256.
[39] Valverde, R. (2015). An Insurance Model for the Protection of Corporations against the Bankruptcy of Suppliers. European Journal of Economics, Finance and Administrative Sciences, 76(1).