Statistical Methods for Data Mining and Knowledge Discovery

Jean Vaillancourt

UQO, Gatineau, QC J8X 3X7, Canada

Abstract. This survey paper aims mainly at giving computer scientists a rapid bird’s eye view, from a mathematician’s perspective, of the main statistical methods used in order to extract knowledge from databases comprising various types of observations. After touching briefly upon the matters of supervision, data regularization and a brief review of the main models, the key issues of model assessment, selection and inference are perused. Finally, specific statistical problems arising from applications around data mining and warehousing are explored. Examples and applications are chosen mainly from the vast collection of image and video retrieval, indexation and classification challenges facing us today.

1 Introduction

Data mining for the purpose of knowledge discovery has been around for decades and has enjoyed great interest and a flurry of research activity in just about every field within the scientific community. The purpose here is to broach the topic from a mathematician's perspective, with a view towards understanding what exactly can be said and concluded when using some of the more sophisticated statistical tools to extract knowledge from large databases. The statistical techniques currently used in data mining practice often come from the great statistical toolbox built out of problem solving issues arising in other fields and taught in the standard science curriculum. They require several hypotheses to be checked and tested. These hypotheses reflect in large part the basic character of the data under study, and it is vital to assess whether or not they are valid in order to draw any inference from our data mining endeavor. This is especially true when using the more sophisticated and complex statistical estimators and tests.

In this survey paper I first give a rapid bird's eye view of the main statistical methods used to extract knowledge from databases comprising various types of observations. After touching briefly upon the matters of supervision and data regularization, and a brief review of the main models, the key issues of model assessment, selection and inference are perused. Finally, specific statistical problems arising from applications around data mining and warehousing are explored. Ten years of collaboration with experts in image and video retrieval, indexation and classification, as well as in formal concept analysis, color my choice of examples and applications; however, the statistical methods discussed are, now as ever, universal.

In spite of this ambitious program, the style of this paper is purposefully light and equations will be kept to a minimum. Extensive references to the literature should yield ample compensation for this shortcoming, as the goal here is to supply the mathematical intuition behind the choice of methods rather than the precise formulas, which can be found elsewhere, for instance in the few main references brought forward next. No claim is made as to the novelty of the statistical methods described here. The originality of the paper lies instead in the choice and presentation of the material, inasmuch as it displays the aforementioned heavy personal bent towards applications in imaging.

The main sources used in producing this survey comprise first and foremost the excellent book [1] of Hastie, Tibshirani and Friedman. Friedman's take on the link between data mining and statistics [2] provides a complementary treatment of the interface between the two fields as it stood fourteen years ago, a treatment that is still relevant today, particularly in the light of Hand's paper [3] and Friedman's comments [4] on classifier methodology. The emphasis in these sources is put on supervised learning, as it affords a wealth of existing statistical techniques to be used in specific applications and to be compared in performance under precise conditions.

To understand the distinction between supervised and unsupervised learning, a simple classification example from the field of content-based image search may be helpful. The goal pursued in this field is the automatic identification and labeling of images containing a variety of objects, some of interest and some not, as well as noise from several sources linked to distance, media-related distortion or even physical features of the image collection device. The ultimate goal is of course the retrieval of images containing objects in a prespecified class with some degree of success. In [5] Sarifuddin et al. propose a framework whereby certain objects of interest are sought within a structured database of pictures, using similarity measures based solely on the color characteristics within the image. Here the target population is the database itself; the search is done on the whole database but interest lies in the top five to ten hits, so fluctuations are observed between the competing search algorithms. A training subset of images is used for computational purposes and learning is achieved through similarity measurements with the rest of the database. The algorithms proposed there have the feature of predicting a new top hit (the outcome of the learning experiment) whenever the training set (the input) is changed, with some measure of the error, thus yielding opportunities for making credible and scientifically defensible inferences on the whole database. This last ability for meaningful prediction expresses the presence of supervision and allows for a probability (or sample) space to be defined with rigor. Many image, video and metadata bases possess these features: collections of highway camera readings of licence plates on rear views of cars and lists of persons of interest in police work, standardized DNA databanks, virtual museum collections, topographic maps and handwritten zip codes on surface mail are but a few examples. They tend to be well structured for searching, grow (relatively) slowly and require frequent access.
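To make the supervised flavor of such a retrieval task concrete, the following minimal sketch ranks a small image collection against a query by comparing color histograms. The histogram features and the Euclidean ranking are generic stand-ins, not the similarity measures actually proposed in [5]; they merely illustrate how a training subset and a similarity score yield a reproducible top-k outcome that can be re-evaluated whenever the training set changes.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Flattened per-channel histogram of an RGB image (H x W x 3 array with values in [0, 255])."""
    channels = [np.histogram(image[..., c], bins=bins, range=(0, 255), density=True)[0]
                for c in range(3)]
    return np.concatenate(channels)

def top_k_hits(query, collection, k=5):
    """Return indices of the k images whose histograms are closest to the query's."""
    q = color_histogram(query)
    distances = [np.linalg.norm(q - color_histogram(img)) for img in collection]
    return np.argsort(distances)[:k]

# Toy example: 100 random "images" and one query drawn from the same generator.
rng = np.random.default_rng(0)
collection = [rng.integers(0, 256, size=(32, 32, 3)) for _ in range(100)]
query = rng.integers(0, 256, size=(32, 32, 3))
print(top_k_hits(query, collection, k=5))
```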

By contrast, just attempt searching the web (or some other database not specifically structured with content-based image search in mind) through your favorite search engine by way of some bit of text (in image search parlance this is called semantic information). An excellent survey of the most popular systems, including Diogenes, ImageRover, WebSeek and Atlas WISE, can be found in Kherfi, Ziou and Bernardi [6]. The result is a collection of images that does not constitute an outcome in the inferential sense above, since the database is for all intents and purposes infinite and the sample space is defined with a new ambient distribution upon every iteration of the algorithm, as every new search receives no quantifiable predictive benefit from the previous ones. This inability to predict new outcomes with statistical measurements of accuracy typifies unsupervised learning and data mining in general. Unsupervised learning offers limited avenues to measure rigorously the validity of inferences or to compare learning methods with some degree of scientific credibility. By enriching association rule analysis with inherent graphical and hierarchical structures, formal concept analysis (FCA) (Ganter and Wille [7]) provides a stepping stone towards this end through the systematic process of generating ontologies. Our main reference for FCA here is Valtchev, Missaoui and Godin [8]. It is as much a choice of convenience (it is still current as an overview) as a personal one.

The next section reviews the various choices afforded to the data analyst when selecting a statistical learning model. The matters of identifying the possibility of supervision, the presence in the data of linearity, and the need for grouping or regularization, and finally the key issues of model assessment and selection, as well as those of inference and testing, are briefly touched upon. The last section delves into ancillary issues relating to the organization of data into shapes amenable to statistical treatment, with an eye on image and video sources for illustrative purposes.

2 Statistical Teaser

Let us begin by reviewing some basic definitions and notation used throughout this paper. The terminology is described in the context of supervised learning, but we will keep the same terms when switching to unsupervised learning, even though that context will require some clarification when interpreting the results.

Good statistics ideally starts with good data collected with a view to an end. Data usually comes in the form of a set of input (independent or predictor) variables, which we control or are able to measure, and a set of output (dependent or response) ones, which are observed. Some of them will be quantitative measurements, some not (the latter are called factors, or qualitative or even categorical variables). Throughout this paper, variables will be denoted by uppercase letters (X for input and Y for output, with real, vector, matrix or even functional values as required by the context) and their values by lowercase ones (similarly). Predicting quantitative outputs is called regression and predicting qualitative outputs, (statistical) classification. In both cases, whatever the technique of choice (more on this later),
supervised learning is achieved through a predictive model of the form Y = f(X, ε), where ε is a random error and f is an unknown function to be chosen or estimated from the observed (raw) data (x_i, y_i) for i = 1, 2, . . . , n. The function will usually be selected, according to some optimization (scoring) principle, from a simple family allowing the algebraic or numerical isolation of the error.

Keep in mind that while a classification problem can always be reformulated as a regression one using indicator (dummy) variables, their number (one per class of values per variable) grows very rapidly. As a result many techniques are specially devised to deal with estimation in that context. We shall nevertheless focus on the regression side of statistics in this section in order to keep it short.

The classical least squares linear regression known to all is just the case f(X, ε) = g(X) + ε, with g in a restricted set of nice functions and estimated by some ĝ that minimizes the sum of squares of errors ∑_{i=1}^{n} ‖y_i − g(x_i)‖² as a function of g. Here ‖·‖ denotes some appropriate norm. The basic examples where g is itself a linear function or a polynomial (possibly in many variables, in which case the x_i's are vector, matrix or even function valued) are the most commonly used models around, even when the input is time dependent (where the theory is already rich and complex from a mathematical point of view, see Solo [9]) or time and space dependent, as are some current models for video segmentation (see for instance Cremers [10]). The target value in this case is simply the conditional expectation E(Y|X). This basic idea has been expanded to richer families of functions g than linear ones (such as piecewise polynomials, splines, wavelets or the directional mappings used in projection pursuit regression), but we shall not dwell on these methods in this least squares context since their implementation challenges are not key to our purpose here. All of these so-called regularization methods aim at approximating the true (unknown and nonlinear) function g of the input data in order to ensure a better fit of the model to the (usually highly nonlinear) output data. The interested reader may consult Wahba [11] for more on splines, Daubechies [12] on wavelets and our main reference [1] for a thorough review of projection pursuit and the statistical side of neural networks in general.
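As a concrete illustration of this least squares setup, the short sketch below fits the linear case g(x) = b0 + b1 x on simulated data, so that the resulting ĝ approximates the conditional expectation E(Y|X). It is only a minimal sketch of the formula above, with simulated data and NumPy's least squares solver standing in for whatever software the analyst actually uses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated supervised data: Y = g(X) + eps with g linear and eps a random error.
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Least squares estimate of g: minimize the sum of squares of errors over (b0, b1).
X = np.column_stack([np.ones(n), x])            # design matrix with an intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated (b0, b1):", beta_hat)          # should be close to (2.0, 0.5)

def g_hat(x_new):
    """Fitted regression function; estimates the conditional expectation E(Y | X = x_new)."""
    return beta_hat[0] + beta_hat[1] * x_new

print("g_hat(4.0) =", g_hat(4.0))
```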

Another direction for development, which has shown great results in several areas of application including satellite imaging, is the use of mappings f(X, ε) that are not linear in ε. Recall that segmentation algorithms are a crucial part of the automatic systems used in modeling and processing image data. These algorithms parse out each image into smaller ones (the content of which tends to be simpler), thus enhancing the discriminating power of the searching tools by increasing the spread (or variety) of visual features extracted and used for identification. These features normally include color, texture and shape, as well as some dynamical measurements in the case of video databases. The statistical comparison of unsupervised segmentation algorithms on structured image databases was initiated by Graffigne et al. in [13] and [14]. The two main classes of methods studied were the bayesian ones (in the variational context of energy minimization later used to great effect by Bentabet et al. in [15] and by Jodouin et al. in [16]), pioneered in the works of Grenander [17] and Geman and Geman [18]; and hierarchical Markov random field based methods (first used for image analysis by Besag in a series of papers starting with [19] through to [20]). In both cases the methods were used in order to decrease the large size of the optimization problem posed by image segmentation — to see just how slow convergence can be, even when a Gibbs sampler is used to accelerate the process, read Gibbs [21].

In much of the research done on image segmentation and restoration, the noise source is assumed to affect the image additively. This is simply not the case in SAR imaging, since the noise (speckle) does not behave additively — this is borne out both by analyzing the shortcomings of linear methods like Fourier transforms (see DeGraaf [22]) and by noticing the (undesirable) heteroscedastic behavior of the residuals when modeling with additive noise, a telltale sign of lack of fit from a statistical point of view. Proper modeling calls for nonlinear dependency on the noise, like the multiplicative noise used for instance in [15] and [16], where very good fit is attained. This last approach, when combined with hierarchical (multi-resolution) Markov methods, is state of the art and also affords rigorous mathematical tracking.

Refinements of this basic least squares regression method abound in the literature and lead to the construction of alternative (often better fitted) estimates to the above ĝ. The target value in most of the classes of alternatives mentioned below will usually no longer have the simple closed form of a conditional expectation and will require some numerical effort in order to reach an approximate value. As a first refinement, the square function in the previous example can be replaced by some other loss function, for instance through the addition of a penalty term like the weight decay used in projection pursuit, neural networks or shrinkage methods (like ridge regression), if the outputs are quantitative.

Explicit probabilistic modeling offers another collection of techniques. If the data collection is to be repeated often and rapidly, the observer may choose to weight the data unequally according to some a priori distribution on the outputs (as in the bayesian approach, useful with both types of outputs) or according to a well chosen mixture of distributions on the whole data (which has shown great success in many applications including image selection, as in Bouguila, Ziou and Vaillancourt [23], where the bayesian framework and a clever choice of mixtures are combined to great effect). Detecting the presence of a mixture in data has been addressed in Walther [24]. Mixture modeling offers a natural framework to locate common features like data clusters and to discriminate between classes of output values.

Likelihood based methods constitute a third refinement (actually more of an alternative) to classical least squares regression by imposing a statistical framework onto the model Y = f(X, ε) from the get-go. They constitute a large body of statistical literature (see Severini [25] for a good overview), give meaningful (statistically interpretable) results even with small sample sizes, are based on the likelihood principle respected even by bayesian posterior decision analysis (see section 4.4 in Berger [26]), and their asymptotics are tractable rigorously (see Prakasa Rao [27]). They perform very well in complex mathematical contexts like that of computer vision, as displayed in Amit and Geman [28] and further in Amit and Trouvé [29].
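To give the likelihood refinement a concrete face, the sketch below estimates the parameters of the same linear model by numerically maximizing a Gaussian log-likelihood in the intercept, slope and noise scale; with Gaussian errors this recovers the least squares fit, but the same recipe applies to any parametric error model the analyst is willing to write down. The use of scipy.optimize and the Gaussian error assumption are illustrative choices on my part, not prescriptions from the references above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.5, size=200)

def negative_log_likelihood(params, x, y):
    """Gaussian negative log-likelihood (up to a constant) for Y = b0 + b1*X + eps, eps ~ N(0, sigma^2)."""
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)                  # parameterize on the log scale to keep sigma positive
    resid = y - (b0 + b1 * x)
    return 0.5 * np.sum(resid**2) / sigma**2 + len(y) * log_sigma

result = minimize(negative_log_likelihood, x0=np.zeros(3), args=(x, y))
b0_hat, b1_hat, log_sigma_hat = result.x
print("MLE (b0, b1, sigma):", b0_hat, b1_hat, np.exp(log_sigma_hat))
```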

A fourth refinement consists in preselecting target areas in the space of input values and applying some local method, such as local regression, density estimation or the many methods relying on measures of similarity, nearest neighbors, clustering or kernel functions at each point of interest in that space. These methods require particular care and attention when used on high dimensional data, as they then tend to show much sensitivity to additional data and to lose both accuracy (minimal bias) and precision (minimal spread) against the other classes of methods. Combining them with judiciously chosen distributional restrictions from those in the previous paragraphs is the usual way out. This approach has been used successfully in the context of image retrieval, indexation and sorting. For example, the importance of statistical mixtures of Dirichlet distributions in computerized image searches was first brought forward in [23] after a series of papers (listed therein) on particular aspects of the subject. Similarity measures have also played a central role in proper mathematical frameworks for image selection, and the reader should consult Missaoui et al. [30] and Sarifuddin et al. [5] for some simple and convincing examples. Finally on this issue, everything you ever wanted to know about the basic statistical aspects of kernel estimators can be found in Devroye [31]. On the computational side of things, these methods often (but not always, see [23] and [30]) turn out to be impracticable because of excessive cost.

The fifth and final class of alternatives (and the most commonly handled in computer science practice) requires some systematic selection of a subset of the input data (for instance, when using shrinkage); of a boundary within the data to break the problem down to smaller size (using separating hyperplanes to find good linear boundaries, or support vector machines to find nonlinear ones, or else tree-based methods to break the space into a few partially ordered blocks); or of a small subset of certain combinations of the whole input data (principal component regression, projection pursuit and neural network methods fall in this class). All three approaches work by decreasing the size or the dimensionality of the space of input variables while preserving what the observer believes to be the core features of interest. With extremely large databases, they tend to be the wise way to go.

Armed with this rich collection of models, the experimenter now comes to the matter of model selection and assessment. Selection consists in estimating the performance of each model contemplated with the purpose of choosing the best; assessment, in estimating the predictive power of each model on new data. The key issue at this point is not to decide right away whether or not your model of choice is the best for the data at hand (no model ever outperforms all others in all situations anyway) but rather to check how good your data is to start with. A good training set often has more impact on the quality of the results of the inferential selection and assessment process than the sophistication and complexity of the models chosen for comparative purposes.
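The following sketch contrasts the global least squares fit from earlier with one of these local methods, a k-nearest-neighbor regression, on simulated data with a nonlinear g; the choice of k and of the absolute-distance metric is arbitrary here and only meant to show how a prediction at a point of interest is built from nearby observations.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)   # a nonlinear g with additive noise

def knn_predict(x_new, x, y, k=15):
    """Local method: average the responses of the k nearest observed inputs."""
    nearest = np.argsort(np.abs(x - x_new))[:k]
    return y[nearest].mean()

# Prediction at a point of interest, built only from its neighborhood.
print("k-NN estimate at x = 5.0:", knn_predict(5.0, x, y))
print("true g(5.0):", np.sin(5.0))
```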

If your data set of interest is very large, common practice consists in separating it into three parts: one serving the purpose of the training set, one used to select the model of choice through a validating estimation of the prediction error (the validation set) and one for the final assessment of the predictive ability of the chosen model on new data (the test set). This allows for accurate measurements of bias and variability. When the data set is not large enough for this, one needs to generate pseudo-observations to compensate. This is usually done by way of resampling techniques known as jackknifing and bootstrapping (see Efron [32] for the simplest contexts), as well as the derivatives of the bootstrap known as bagging (bootstrap aggregating, due to Breiman [33]) and boosting (see [1]). These last two methods are amenable to nonlinear function estimation and any choice of loss function. They display remarkable performance in tests and benchmarks against most other methods mentioned thus far. They also tend to be robust against many distributional alternatives, a reassuring characteristic when dealing with large heterogeneous data sets. The basic idea behind their use is that averaging amongst several predictors should decrease the variability of the results while keeping the bias to a minimum. Applying the various models to several data sets will of course make the results more credible, especially when combined through the construction of a random forest (a double randomization technique involving both bootstrapping of the data and random selection of the subset of most interesting variables), which can be rigorously analyzed (see Breiman [34]).

Model selection now becomes a matter of choosing a loss function relevant to the nature of the data (usually a measure of distance like the sum of squares, or a measure of entropy like the sum of log-likelihoods) and then estimating the parameters associated with each model under purview in order to minimize (at least approximately) the corresponding expected loss. A trade-off between accuracy (small bias and spread) and parsimony (as few parameters as possible) will usually be included in the loss function or the optimization scheme itself. The case for parsimony is convincingly made by Besse et al. [35], and we concur with them that simpler models, whether they involve neural networks, mixtures, support vector machines or any other sophisticated tools, combined with resampling, generally constitute a better choice from both efficiency and reliability standpoints than complex interpretative ones with large numbers of parameters to be estimated.
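As a minimal illustration of these resampling ideas, the sketch below first splits a simulated data set into training, validation and test parts, then bags a simple linear predictor by averaging fits over bootstrap resamples of the training set. The 60/20/20 split, the number of bootstrap replicates and the linear base predictor are arbitrary choices made for the example, not recommendations drawn from [32] or [33].

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=2.0, size=n)

# Three-way split: training / validation / test (here 60% / 20% / 20%).
idx = rng.permutation(n)
train, valid, test = idx[:600], idx[600:800], idx[800:]

def fit_linear(xs, ys):
    """Least squares fit of y = b0 + b1*x, returned as the coefficient pair."""
    X = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(X, ys, rcond=None)[0]

# Bagging: refit the predictor on B bootstrap resamples of the training set
# and average the resulting coefficient vectors.
B = 200
boot_coefs = []
for _ in range(B):
    sample = rng.choice(train, size=len(train), replace=True)
    boot_coefs.append(fit_linear(x[sample], y[sample]))
bagged = np.mean(boot_coefs, axis=0)

def predict(coefs, xs):
    return coefs[0] + coefs[1] * xs

# Validation error guides model choice; the test set gives the final assessment.
val_error = np.mean((y[valid] - predict(bagged, x[valid])) ** 2)
test_error = np.mean((y[test] - predict(bagged, x[test])) ** 2)
print("bagged coefficients:", bagged)
print("validation MSE:", val_error, "test MSE:", test_error)
```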

3 Unsupervised Learning and Statistics: Some Challenges

We now turn to the context of unsupervised learning and data mining, urging the reader to consult Besse et al. [35] as well as chapter 14 of Hastie, Tibshirani and Friedman [1]. At the risk of being repetitive, keep in mind that any information held by the scientist about the data prior to mining (or snooping) must be incorporated in the experimental design leading to data collection in order to have some hope of checking statistical hypotheses. Such information is usually available in the presence of supervision, and at least some of the statistical hypotheses can then be checked.

In (unsupervised) data mining one usually cannot afford this level of control over data collection, since the data warehouses are usually assembled before the experimentation is devised, a pity. The choice of the training set is often the only latitude left at our disposal, and it should be made with some care and attention towards ensuring that some statistical inference can be made. Nevertheless, as long as the conclusions drawn from the experiment are formulated with the proper reserve, selecting methods with a proven record of quality remains the sensible thing to do.

Many of the early developments in data mining methodology stemmed from incursions into exploratory data analysis (EDA) that predated data mining and were based on key ideas already in the statistical literature at the time. These incursions were championed in the mid sixties simultaneously, independently and along completely different methodological approaches by Tukey (see [36] for the history of this branch of EDA), using (and reinventing) robust statistics, and by the French school of EDA, inspired by classical geometry and initiated with the proposal by Escofier (in a 1965 doctoral thesis) of correspondence analysis as we know it today (see Benzécri [37] for a history of this branch of EDA).

For a statistician, unsupervised learning consists in estimating the probability distribution of the input variable X (usually valued in a space of large dimension) based on the (vector, matrix or even functional valued) observations x_1, x_2, . . . , x_n. There is no output variable, since the focus here is on understanding the data and discovering pertinent subsets rather than predicting outcomes. In low dimensions, nonparametric density estimators (see Devroye [31]) will usually provide sufficient insight into the data to satisfy the user. The large dimension of the space in data mining makes these density estimators unreliable; however, they can still be used to get estimates for the one dimensional marginal distributions (margins) of the input X. Once the margins have been estimated, the problem at hand becomes equivalent to the determination of an appropriate copula for the distribution of X given its margins. A copula is a multidimensional distribution on the unit cube of the space of values of X with uniform one dimensional margins. This provides a first strategy to extract knowledge from the observations, since copulas are currently well researched (if at times controversial, see Genest and Rémillard [38]). As copulas comprise all the information within a distribution given the margins, they are a powerful tool indeed, but determining them explicitly in high dimension remains difficult for now.

One gets around this issue by finding instead the most frequent values (statistical modes) of X within the database, since a large enough number of them will cover the most significant regions of the distribution. This approach greatly reduces dimensionality, is easier to implement and has become the very popular technique of association rule mining. Alternatively, one bypasses the probability model and, using a measure of distance or similarity on the data, looks for aggregated clouds of data points through one of the many clustering algorithms available. Again one is faced with classical methods that quickly find their limits in high dimensional space unless supported by one of the data reduction techniques mentioned in section 2 (our fifth class of alternatives). For example, Bouguila [39] combines clustering with mixtures to generate a rich class of models and then uses marginal likelihood for successful model selection.
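In the same spirit, though with a generic stand-in rather than the models of [39], the sketch below clusters unlabeled two-dimensional data with Gaussian mixtures and picks the number of components by the Bayesian information criterion; Bouguila's work relies on mixtures adapted to discrete data and on marginal likelihood, so this is only an analogy built with standard scikit-learn tools.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Unlabeled observations drawn from three well separated clouds.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ((0, 0), (4, 0), (2, 3))])

# Fit mixtures with 1..6 components and keep the one with the lowest BIC.
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(X))
print("chosen number of components:", best.n_components)
print("cluster sizes:", np.bincount(best.predict(X)))
```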

The first attempts at setting the probability distribution on the (oriented) relations between input observations, instead of on the observations themselves, led to the creation of statistical implicative analysis, recently surveyed by Gras and Kuntz in [40]. Since the number of such relations grows like the square of the size of the database, it suffers the same challenges as unaided clustering algorithms do. The twin needs to enrich the set of relations and to reduce the speed of growth of the pertinent or good subsets lead to formal concept analysis (FCA). According to Valtchev, Missaoui and Godin [8], FCA has demonstrated cost-effectiveness, adaptability and user-friendliness in a variety of settings. Incorporating statistical structure into FCA will likely be the next stage in making it step from a very useful mathematical tool for structuring knowledge to a privileged methodology for drawing rigorous inference from large and complex data warehouses.
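To anchor the FCA vocabulary used above, the sketch below extracts all formal concepts of a tiny binary context by brute-force closure, following the standard definitions in [7]; the toy objects and attributes are invented for illustration, and a real system would of course use one of the efficient algorithms discussed in [8] rather than this exponential enumeration.

```python
from itertools import combinations

# A tiny formal context: objects described by the attributes they possess.
context = {
    "img1": {"outdoor", "color"},
    "img2": {"outdoor", "grayscale"},
    "img3": {"indoor", "color"},
    "img4": {"outdoor", "color"},
}
attributes = set().union(*context.values())

def common_attributes(objects):
    """A' : attributes shared by every object in the set."""
    return set.intersection(*(context[o] for o in objects)) if objects else set(attributes)

def common_objects(attrs):
    """B' : objects possessing every attribute in the set."""
    return {o for o, a in context.items() if attrs <= a}

# A formal concept is a pair (A, B) with A' = B and B' = A; enumerate them by closing
# every subset of objects (feasible only for toy contexts like this one).
concepts = set()
objs = list(context)
for r in range(len(objs) + 1):
    for subset in combinations(objs, r):
        intent = common_attributes(set(subset))
        extent = common_objects(intent)
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```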

References

1. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning. Springer Series in Statistics (2001)
2. Friedman, J.H.: Data Mining and Statistics: What's the Connection? Keynote presentation at the 29th Symposium on the Interface: Computer Science and Statistics (1997), http://www-stat.stanford.edu/~jhf/
3. Hand, D.: Classifier technology and the illusion of progress. Statist. Sci. 21(1), 1–14 (2006)
4. Friedman, J.H.: Comment on classifier technology and the illusion of progress. Statist. Sci. 21(1), 15–18 (2006)
5. Sarifuddin, M., Missaoui, R., Vaillancourt, J., Hamouda, Y., Zaremba, M.: Analyse statistique de similarité dans une collection d'images. Revue des Nouvelles Technologies de l'Information 1(1), 239–250 (2003)
6. Kherfi, M.L., Ziou, D., Bernardi, A.: Image retrieval from the world wide web: issues, techniques and systems. ACM Computing Surveys 36(1), 35–67 (2004)
7. Ganter, B., Wille, R.: Formal concept analysis, mathematical foundations. Springer, Heidelberg (1999)
8. Valtchev, P., Missaoui, R., Godin, R.: Formal concept analysis for knowledge and data discovery: new challenges. In: Proc. Second Int. Conf. Formal Concept Analysis, Sydney, Australia, pp. 352–371 (2004)
9. Solo, V.: Topics in advanced time series analysis. Lecture Notes in Mathematics, vol. 1215, pp. 165–328. Springer, Heidelberg (1986)
10. Cremers, D.: Bayesian approach to motion-based image and video segmentation. In: Jähne, B., Mester, R., Barth, E., Scharr, H. (eds.) IWCM 2004. LNCS, vol. 3417, pp. 104–123. Springer, Heidelberg (2007)
11. Wahba, G.: Spline models for observational data. SIAM, Philadelphia (1990)
12. Daubechies, I.: Ten lectures on wavelets. SIAM, Philadelphia (1992)
13. Graffigne, C., Heitz, F., Perez, P., Preteux, F.J.: Hierarchical Markov random field models applied to image analysis: a review. In: Proc. SPIE, vol. 2568, pp. 2–17 (1995)
14. Graffigne, C.: Stochastic modeling in image segmentation. In: Proc. SPIE, vol. 3457, pp. 251–262 (1998)
15. Bentabet, L., Jodouin, S., Ziou, D., Vaillancourt, J.: Road vectors update using SAR imagery: a snake-based approach. IEEE Trans. on Geoscience and Remote Sensing 41(8), 1785–1803 (2003)
16. Jodouin, S., Bentabet, L., Ziou, D., Vaillancourt, J., Armenakis, C.: Spatial database updating using active contours for multi-spectral images: application with Landsat 7. ISPRS J. of Photogrammetry and Remote Sensing 57, 346–355 (2003)
17. Grenander, U.: Lectures in pattern theory, vol. I, II and III. Springer, New York (1981)
18. Geman, D., Geman, S.: Stochastic relaxation, Gibbs distributions and the bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6(6), 721–741 (1984)
19. Besag, J.: Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. B 36, 192–236 (1974)
20. Besag, J.: On the statistical analysis of dirty pictures. J. Roy. Statist. Soc. B 48, 259–302 (1986)
21. Gibbs, A.L.: Bounding the convergence time of the Gibbs sampler in Bayesian image restoration. Biometrika 87(4), 749–766 (2000)
22. DeGraaf, S.R.: SAR imaging via modern 2-D spectral estimation methods. IEEE Trans. on Image Processing 7(5), 729–761 (1998)
23. Bouguila, N., Ziou, D., Vaillancourt, J.: Unsupervised learning of a finite mixture model based on the Dirichlet distributions and its applications. IEEE Trans. Image Processing 13(11), 1533–1543 (2004)
24. Walther, G.: Multiscale maximum likelihood analysis of a semiparametric model, with application. Ann. Statist. 29(5), 1297–1319 (2001)
25. Severini, T.: Likelihood methods in statistics. Oxford Univ. Press, Oxford (2001)
26. Berger, J.O.: Statistical decision theory and bayesian analysis. Springer, Heidelberg (1980)
27. Prakasa Rao, B.L.S.: Asymptotic theory of statistical inference. John Wiley, Chichester (1987)
28. Amit, Y., Geman, D.: A computational model for visual selection. Neural Computation 11, 1691–1715 (1998)
29. Amit, Y., Trouvé, A.: POP: Patchwork of parts models for object recognition. Intern. J. Comp. Vision 75(2), 267–282 (2007)
30. Missaoui, R., Sarifuddin, M., Vaillancourt, J.: Similarity measures for an efficient content-based image retrieval. In: IEE Proc. Vision, Image and Signal Processing, vol. 152(6), pp. 875–887 (2005)
31. Devroye, L.: A course in density estimation. Birkhäuser Verlag, Basel (1987)
32. Efron, B.: The jackknife, the bootstrap and other resampling plans. SIAM, Philadelphia (1982)
33. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
34. Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001)
35. Besse, P., Le Gall, C., Raimbault, N., Sarpy, S.: Data mining et statistique, avec discussion. Journal de la Société Française de Statistique 142, 5–35 (2001)
36. Tukey, J.W.: Exploratory data analysis. Addison-Wesley, Reading (1977)
37. Benzécri, J.P.: Histoire et préhistoire de l'analyse des données. Dunod (1982)
38. Genest, C., Rémillard, B.: Comments on T. Mikosch's paper "Copulas: tales and facts". Extremes 9, 27–36 (2006)
39. Bouguila, N.: A model-based approach for discrete data clustering and feature weighting using MAP and stochastic complexity. IEEE Trans. Knowledge and Data Engineering 21(12), 1649–1664 (2009)
40. Gras, R., Kuntz, P.: An overview of the statistical implicative analysis (SIA) development. Studies in Computational Intelligence, vol. 127, pp. 11–40 (2008)