Treebased iterative input variable selection for ... - Wiley Online Library

18 downloads 988 Views 1MB Size Report
Dec 11, 2012 - In addition, advances in monitoring systems, from remote sensing .... the tree-based iterative input variable selection (IIS). IIS incorporates some ...
WATER RESOURCES RESEARCH, VOL. 49, 4295–4310, doi:10.1002/wrcr.20339, 2013

Tree-based iterative input variable selection for hydrological modeling S. Galelli1,2 and A. Castelletti2 Received 11 December 2012; revised 16 May 2013; accepted 29 May 2013; published 29 July 2013.

[1] Input variable selection is an important issue associated with the development of several hydrological applications. Determining the optimal input vector from a large set of candidates to characterize a preselected output might result in a more accurate, parsimonious, and, possibly, physically interpretable model of the natural process. In the hydrological context, the modeled system often exhibits nonlinear dynamics and multiple interrelated variables. Moreover, the number of candidate inputs can be very large and redundant, especially when the model reproduces the spatial variability of the physical process. The ideal input selection algorithm should therefore provide modeling flexibility, computational efficiency in dealing with high dimension data set, scalability with respect to input dimensionality and minimum redundancy. In this paper, we propose the tree-based iterative input variable selection algorithm, a novel hybrid model-based/model-free approach specifically designed to fulfill these four requirements. The algorithm structure provides robustness against redundancy, while the tree-based nature of the underlying model ensures the other key properties. The approach is first tested on a well-known benchmark case study to validate its accuracy and subsequently applied to a real-world streamflow prediction problem in the upper Ticino River Basin (Switzerland). Results indicate that the algorithm is capable of selecting the most significant and nonredundant inputs in different testing conditions, including the real-world large data set characterized by the presence of several redundant variables. This permits one to identify a compact representation of the observational data set, which is key to improving the model performance and assisting with the interpretation of the underlying physical processes. Citation: Galelli, S., and A. Castelletti (2013), Tree-based iterative input variable selection for hydrological modeling, Water Resour. Res., 49, 4295–4310, doi:10.1002/wrcr.20339.

1.

Introduction

[2] The problem of input variable selection arises every time one wants to model the relationship between a variable of interest, or predictand, and a subset of potential explanatory variables, or predictors, but there is uncertainty about which subset to use among a number, usually large, of candidate sets available [George, 2000]. The selection of the most relevant or important input variables is a recurrent problem in hydrologic and water resources applications involving the resolution of single or multiple regression problems, such as rainfall-runoff modeling, prediction in ungauged basins, water quality modeling, etc. The difficul-

Additional supporting information may be found in the online version of this article. 1 Singapore-Delft Water Alliance, National University of Singapore, Singapore. 2 Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy. Corresponding author: S. Galelli, Pillar of Engineering Systems & Design, Singapore University of Technology and Design, 20 Dover Drive, 138682, Singapore. ([email protected]) ©2013. American Geophysical Union. All Rights Reserved. 0043-1397/13/10.1002/wrcr.20339

ties in variable selection for these applications come primarily from three sources [May et al., 2008]: (i) the complexity of the unknown functional relationship between inputs and output; (ii) the number of available variables, which may be very large; and (iii) the cross correlation between candidate inputs, which induces redundancy [Maier et al., 2010]. If a linear relationship characterizes the underlying system, well established selection methods exist to obtain a candidate subset of the input variables [e.g., Miller, 1990]. However, the assumption of linear dependency between the inputs and the output is overly restrictive for most real physical systems [see, e.g., Tarquis et al., 2011]. In addition, advances in monitoring systems, from remote sensing techniques to pervasive real and virtual sensor networks [e.g., Hart and Martinez, 2006; Hill et al., 2011], has made available an increasingly larger amount of data at the local and global scale at progressively finer temporal and spatial resolution, thus not only increasing data set dimension from dozens to tens or hundreds of thousand but also adding considerably to data set redundancy. The objective of variable selection is threefold [Guyon and Elisseeff, 2003]: improving model performance by avoiding the interference of nonrelevant or redundant information and more effectively exploiting the data available for model calibration, providing faster and more cost-effective models, and assisting in interpretation of the

4295

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

underlying process by enabling a more parsimonious and compact representation of the observational data set. In fact, by reducing the input space dimension the generalization capability of the constructed model is maximized, while parsimonious and compact representations of the data allow for an easier interpretation of the underlying phenomenon [Kohavi and John, 1997]. [3] Input variable selection methods can be distinguished between model-based (or wrapper) and model-free (or filter) approaches (see Das [2001]; Guyon and Elisseeff [2003]; Maier et al. [2010], among others). The modelbased approach relies on the idea of calibrating and validating a number of models with different sets of inputs and to select the set that ensures the best model performance. The candidate inputs are evaluated in terms of prediction accuracy of a preselected underlying model. The problem can be solved by means of global optimization techniques used to define the combination of input variables that maximizes the underlying model performance, or by stepwise selection (forward selection/backward elimination) methods, where inputs are systematically added/removed until model performance is no longer improved. The main drawback of this approach stands in its computational requirements [Kwak and Choi, 2002; Chow and Huang, 2005], as a large number of calibration and validation processes must be performed to single out the best combination of inputs and so the method does not scale well to a large data set. Moreover, the input selection result depends on the predefined model class and architecture. Thus, the optimality of a selected set of inputs obtained with a particular model is not guaranteed for another one, and this restricts the applicability of the selected set [Maier et al., 2010]. On the other hand, model-based approaches generally achieve better performance since they are tuned to the specific interactions between the model class and the data. Unlike the model-based approach, in model-free algorithms the variable selection is directly based on the information content of the candidate input data set, as measured by interclass distance, statistical dependence, or information-theoretic measure (e.g., the mutual information index [Peng et al., 2005]). Computational efficiency is a strong argument in favor of model-free methods; however, the significance measure is generally monotonic and, thus, without a predefined cutoff criterion, the algorithm tends to select very large subsets of input variables, with high risk of redundancy. [4] In hydrological and water resources studies, input variable selection has been mainly used for two classes of problems: problems involving the identification of a nonlinear regression between time-independent explanatory factors and a relevant statistic of a given hydrological output characteristic and problems concerning the selection of the most relevant explanatory time varying variables to characterize, through a dynamic model, the development over time of a hydrological variable of interest. The first class is predominantly populated by statistical regionalization methods and the use of regional hydrological model parameters for the estimation of the hydrological response in ungauged watersheds. Traditionally, both model-free approaches, such as principal component analysis [e.g., Alc azar and Palau, 2010; Salas et al., 2011; Wan Jaafar et al., 2011], and model-based methods, such as stepwise

regression [e.g., Heuvelmans et al., 2006; Barnett et al., 2010], have long been used to relate runoff characteristics to climate and watershed descriptors. The use of variable selection for statistical downscaling of meteorological data can also be classified under this category [e.g., Traveria et al., 2010; Phatak et al., 2011, and references therein]. More recently, novel causal variable selection methods [Ssegane et al., 2012] have been demonstrated to outperform stepwise selection approaches in terms of accuracy in characterizing the physical process while showing a lower predictive potential. [5] The second class of problems comprises a larger variety of application fields ranging from rainfall predictions [e.g., Sharma et al., 2000] to streamflow modeling [e.g., Wang et al., 2009], evaporation estimation [Moghaddamnia et al., 2009], and water quality modeling [e.g., Huang and Foo, 2002]. Given the usually high number of candidate input variables in these problems, model-free methods are generally preferred over model-based approaches [Maier et al., 2010], which are mostly used to determine the optimal input set to specific classes of model [e.g., Noori et al., 2011, and references therein]. Starting from Sharma [2000], who expanded the mutual information index into the computationally more efficient and reliable partial mutual information (PMI), traditional nonlinear informationtheoretic-based selection criteria have been revisited and adapted to a number of increasingly complex hydrological applications. Bowden et al. [2005a] improve the PMI criterion by Sharma [2000] using artificial neural networks for salinity prediction in the River Murray, South Australia [Bowden et al., 2005b]. The method is further elaborated by May et al. [2008], who proposed an alternative termination criteria to make the PMI more efficient and, subsequently, by Fernando et al. [2009], who introduced the use of a more efficient estimator (shifted histograms) of the mutual information. Finally, Hejazi and Cai [2009] illustrate an improved, computationally efficient, minimum redundancy maximum relevance approach to select the most significant inputs among 121 candidates to predict the daily release from 22 reservoirs in California. [6] In this paper, we build on these previous works and propose a novel hybrid approach, combining model-free and model-based methods to input variable selection, called the tree-based iterative input variable selection (IIS). IIS incorporates some of the features of model-based approaches into a fast model-free method able to handle very large candidate input sets. The information-theoretic selection criterion of model-free methods is replaced by a ranking-based measure of significance [Wehenkel, 1998]. Each candidate input is scored by estimating its contribution, in terms of variance reduction, to the building of an underlying model of the preselected output. First, unlike information-theoretic selection, ranking-based evaluation does not require any assumption on the statistical properties of the input data set (e.g., Gaussian distribution) and, thus, can be applied to any sort of sample. Second, it does not rely on computationally intensive methods (e.g., bootstrapping) to estimate the information content in the data and, thus, is generally faster and more efficient. Nonparametric tree-based regression methods, namely extremely randomized trees [Geurts et al., 2006], are adopted as underlying model family since, thanks to their ensemble nature

4296

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

[e.g., Sharma and Chowdhury, 2011], they perform particularly well in characterizing strongly nonlinear relationships and provide more flexibility and scalability than parametric models (e.g., artificial neural networks). The ranking-based selection is embedded into a stepwise forward selection model-based approach evaluated by k-fold cross-validation [Allen, 1974]. Forward selection is appropriate because on most real-world hydrological data sets the number of significant variables is a small proportion of the total number of available variables. In such situations, forward selection is far less time consuming than backward elimination [Das, 2001]. In the rest of the paper, we first present the IIS algorithm and illustrate its building blocks. Then, we validate the IIS selection accuracy on the synthetic test problems used by Sharma [2000], Bowden et al. [2005b], May et al. [2008], and Hejazi and Cai [2009]. Following that, we demonstrate the algorithm on a real-world case study of streamflow prediction in the upper Ticino River Basin (Switzerland), a subalpine catchment characterized by extremely variable weather conditions, where both rainfall and snowmelt significantly contribute to the high flow variations. A comparison between the variables selected by the IIS and PMI-based input selection (PMIS) by May et al. [2008] is also provided. Finally, section 5 gives conclusions.

2.

Methods and Tools

[7] Given a sample data set D composed of N observations of the output variable y to be explained and n candidate inputs xi 2 X i ; i ¼ 1; . . . ; n, the model-based input variable selection problem can be formulated as follows: to i find the subset X i; y  X of input variables that provides the best performance of an underlying model mðÞ as measured by a suitable metric Dðy; mðX i; y ÞÞ. As detailed next, within the proposed hybrid selection approach, the subset X i; y is incrementally built using the information content in the data with a ranking-based procedure and then validated by the model-based forward selection process as illustrated below.

2.1.

Iterative Input Variable Selection

[8] The IIS algorithm is composed of three steps as in Figure 1: [9] Step 1. Given the sample data set D and n candidate inputs, the IIS algorithm runs an input ranking (IR) algorithm to sort the n candidate inputs according to a nonlinear statistical measure of significance (e.g., the explained variance). In principle, the first variable in the ranking should be the most significant in explaining the output. In practice, in the presence of several potentially significant, but redundant, inputs, their contribution to the output explanation is equally partitioned and they might not be listed in the very top positions. To reduce the risk for mis-selection, the first p variables xj 2 X p  X i ; j ¼ 1; . . . ; p in the ranking are individually evaluated in the following step. [10] Step 2. The relative significance of the first p-ranked variables is assessed against the observed output. To this end, p single-input-single-output (SISO) models f j ðÞ; j ¼ 1; . . . ; p, are identified with an appropriate model building (MB) algorithm and compared in terms of a suita-

Figure 1. Flowchart of the Iterative Input Variable Selection (IIS) algorithm. ble distance metric between the output y and each SISO model prediction f j ðxj Þ. The best performing input x among the p considered is added to the set X iy of the variables selected to explain y. [11] Step 3. A MB algorithm is then run to identify a multiinput single-output (MISO) model mðÞ mapping the variables X iy so far selected into the output y. [12] The procedure is repeated using the residuals ðy  mðX iy ÞÞ as the new output variable in steps 1 and 2, and these operations iterated until either the best variable returned by the IR algorithm is already in the set X iy or the performance of the underlying model mðÞ, as measured by a suitable metric Dðy; mðX iy ÞÞ, does not significantly improve. The algorithm is then terminated and X iy is the optimal set X i; y :DðÞ can be a suitable accuracy index (e.g., coefficient of determination, mean-squared error, etc.) or a more sophisticated metric balancing accuracy and parsimoniousness (e.g., Akaike information criterion, Bayesian information criterion, or Young identification criterion [see Young, 2011, and references therein]). A tabular version of the algorithm is available as Algorithm 1, while more details can be found in Galelli [2010].

Algorithm 1. Iterative Input Selection (IIS) algorithm. Require : The data set D, the set X i of candidate input variables and the output variable y to be explained. Ensure: The set X iy of variables selected to explain y. Initialize: Set the variable ^y equal to y. repeat -With an Input Ranking (IR) algorithm, select the set X^y r of the p most relevant input variables to explain ^y . repeat -With a Model Building (MB) algorithm, estimate a model ^f ðÞ that explains ^y using as input the j-th component xj of the set X^y r.

4297

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

-Compute the distance metric Dj ð^y ; ^f ðxj ÞÞ between ^y and the model output ^f ðxj Þ. until j ¼ p -Select the most relevant variable x in the set X^y r, and add x to the set X iy . ^ iy Þ that explains y using as input -Estimate a model mðX i the set X y , and compute the residuals ^y between y and ^ iy Þ. mðX ^ iy ÞÞ between y -Compute the distance metric Dðy; mðX ^ iy Þ. and the model output mðX - Evaluate the variation D of the distance metric between the current and the previous iteration. until D <  return X iy [13] The evaluation of the metric DðÞ follows a k-fold cross-validation approach [Allen, 1974]: the training data set is randomly split into k mutually exclusive subsets of equivalent size, and the MB algorithm is run k times. Each time the underlying model is validated on one of the k folds and calibrated using the remaining k – 1 folds. The estimated prediction accuracy is then the average value of the metric DðÞ over the k validations. The k-fold cross validation is aimed at estimating the ability of the model to capture the behavior of unseen or future observation data from the same underlying process, and, as such, it minimizes the risk of overfitting the data [Wan Jaafar et al., 2011, and references therein]. [14] The reevaluation of the ranking on the model residuals every time a candidate variable is selected ensures that all the candidates that are highly correlated with the selected variable, and thus may become useless, are discarded. This strategy reinforces the SISO model-based evaluation in step 2 against the selection of redundant variables and it is independent of the model building (MB) and input ranking (IR) algorithms adopted. The choice of these two algorithms is fundamental to equip the IIS algorithm with the other desirable properties of modeling flexibility, i.e., the ability of characterizing strongly nonlinear relationships, computational efficiency, and scalability with respect to input dimensionality, i.e., the ability of handling many input variables with different range of variability. Among the many MB and IR alternatives available, we combined the IIS algorithm with Extremely Randomized Trees (ExtraTrees), a tree-based method proposed by Geurts et al. [2006], which was empirically demonstrated to outperform other MB approaches in terms of the abovementioned properties. In addition, Extra-Trees, like several other treebased ensemble methods [Jong et al., 2004], can directly be used as an IR procedure, since their particular structure can be exploited to infer the relative importance of the input variables and to order them accordingly [Wehenkel, 1998; Fonteneau et al., 2008]. 2.2. Extra-Trees [15] Extra-Trees is a nonparametric tree-based regression method already experimented within a wide range of applications, such as image classification [Maree et al., 2005], bioinformatics [Maree et al., 2006], environmental modeling [Jung et al., 2009; Castelletti et al., 2010a], and water reservoirs operation [Castelletti et al., 2010b]. Tree-based regressors are all based on the idea of decision trees, which

are tree-like structures, composed of decision nodes, branches, and leaves, which form a cascade of rules leading to numerical values. The tree is obtained by first partitioning at the top decision node, with a proper splitting criterion, the set of the input variables into two subsets, thus creating the former two branches. The splitting process is then repeated in a recursive way on each derived subset, until some termination criterion is met, e.g., the numerical values belonging to a subset vary just slightly or only few elements remain. When this process is over, the tree branches represent the hierarchical structure of the subset partitions, while the leaves are the smallest subsets associated to the terminal branches. Each leaf is finally labeled with a numerical value. Tree-based methods include both deterministic (e.g., classification and regression trees [Breiman et al., 1984], M5 [Quinlan, 1992]) and randomized methods (e.g., Bagging predictors [Breiman, 1996], Random Subspace method [Ho, 1998], Random forests [Breiman, 2001], and PERT [Cutler and Guohua, 2001]), where the splitting process is performed somehow randomly. Extra-Trees belong to this second class of approaches and are an ensemble method as explained next. 2.2.1. Model Building [16] Given  a regressionproblem with an output variable y, n inputs x1 ; x2 ; . . . ; xn and a training data set D composed of N input-output observations, the Extra-Trees MB algorithm grows an ensemble of M trees using a top-down approach. For each tree, the decision nodes are split using the following rule: K alternative ‘‘cut directions,’’ i.e., input variables xi, with i ¼ 1; . . . ; K, candidate to be the argument of the node splitting criterion, are randomly selected and, for each one, a random cut-point si is chosen; the variance reduction (or equivalently the coefficient of determination) is computed for each cut direction and the cut direction maximizing this score is adopted to split the node. Let’s  j be the jth nonterminal node in a tree composed of  nonterminal nodes (i.e., j ¼ 1; . . . ; ), then the variance reduction var ð j Þ associated to node  j is defined as   var  j ¼ 1 

jDl ðxi Þj l i jDj var fyjD ðx Þg

r

i

ðx Þj þ jDjDj var fyjDr ðxi Þg

var fyjDg

; ð1Þ

where the terms Dl ðxi Þ and Dr ðxi Þ are the two subsets of D satisfying the conditions xi < si and xi  si , respectively, with si being the randomly selected cut point. The algorithm stops partitioning a node if its cardinality is smaller than a predefined threshold nmin and the node is a leaf in the tree structure. Each leaf is assigned with a value, obtained  as the average  of the outputs y associated to the inputs x1 ; x2 ; . . . ; xn that fall in that leaf. The estimates produced by each single tree are finally aggregated with an arithmetic average over the ensemble of M trees, and the associated variance reduction (or coefficient of determination) of the ensemble is var ¼

M X  X

  var  j jDj:

ð2Þ

¼1 j¼1

[17] The rationale behind the approach is that the combined use of randomization and ensemble averaging

4298

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

provides more effective variance reduction than other randomized methods, while minimizing the bias of the final estimate [Geurts et al., 2006]. This means that the final model structure does not require any of the postprocessing methods (e.g., pruning or smoothing) commonly used to reduce the output variance (and thus the risk of over-fitting) of other tree-based methods [Jothiprakash and Kote, 2011]. Increasingly high values of M not only reduce the variance of the final estimate but also significantly add to the computational requirements of the MB algorithm. K regulates the level of randomness in the tree build building process and its optimal default value for regression problems is equal to the number n of inputs, and so the number of cut directions randomly selected. Finally, the threshold nmin is used to balance bias and variance reduction. Large values of nmin lead to small trees, with high bias and small variance; conversely, low values of nmin lead to fully grown trees, which may overfit the data. The optimal tuning of nmin can depend on the level of noise in the training data set : the noisier are the outputs, the higher should be the optimal value of nmin. Although this tuning might require some experiments, Geurts et al. [2006] have empirically demonstrated that a value of nmin equal to 5 is a robust choice in a broad range of conditions. For a detailed analysis of the sensitivity of Extra-Trees performance to the parameters M, K, and nmin, with an application to a streamflow modeling problem, the reader is referred to Galelli and Castelletti [2013]. [18] From the computational point of view, the complexity of the Extra-Trees building procedure is on the order of N  log ðN Þ, while the computational time linearly increases with M and K and logarithmically decreases for increasing nmin, meaning that the approach still remains computationally efficient, although based on the construction of an ensemble of trees [Geurts et al., 2006]. This is because the splitting rule they adopt is very simple, if compared to other splitting rules that locally optimize the cut points, as, for example, the one in classification and regression trees [Breiman et al., 1984]. 2.2.2. Input Ranking [19] The particular structure of Extra-Trees can be exploited to rank the importance of the n input variables in explaining the selected output behavior. This approach, as originally proposed by Wehenkel [1998], is based on the idea of scoring each input variable by estimating the variance reduction it can be associated with by propagating the training data set D over the M different trees composing the ensemble. More precisely, the relevance Gðxi Þ of the ith input variable xi in explaining the output y can be evaluated as follows: M X  X

  G xi ¼

ð j ; xi Þ  var ð j ÞjDj

¼1 j¼1

var

;

ð3Þ

where ð j ; xi Þ is equal to 1 if the variable xi is used to split the node  j (and 0 otherwise), jDj is the number of samples in the considered subset D, and var ð j Þ is the variance reduction associatedto node  j (see  equation (1)). Finally, the input variables x1 ; x2 ; . . . ; xn are sorted by decreasing values of their relevance.

3.

Synthetic Test Problems

[20] To assess the competence of the IIS algorithm in selecting the most significant and nonredundant subset of inputs, the method is tested on six synthetic data sets with a priori known dependence attributes. The first five data sets are generated by the autoregressive models introduced by Sharma [2000] and subsequently used by Bowden et al. [2005a] and Hejazi and Cai [2009] for the evaluation of input variables algorithms in hydrological and water resources problems. Three models are autoregressive (AR) models of different orders, while the last two are nonlinear threshold autoregressive (TAR) models. The sixth data set set is produced by a nonlinear (NL) model introduced by Bowden et al. [2005a]. The model equations are as follows: ARð1Þ : xt ¼ 0:9  xt1 þ 0:866  et ;

ð4Þ

ARð4Þ : xt ¼ 0:6  xt1  0:4  xt4 þ et ;

ð5Þ

ARð9Þ : xt ¼ 0:3  xt1  0:6  xt4  0:5  xt9 þ et ;

ð6Þ

 TARð1Þ : xt ¼  TARð9Þ : xt ¼

0:9  xt3 þ 0:1  et 0:4  xt3 þ 0:1  et

if if

xt3  0 ; xt3 > 0

0:5  xt6 þ 0:5  xt10 þ 0:1  et 0:8  xt10 þ 0:1  et

if if

ð7Þ

xt6  0 ; xt6 > 0

ð8Þ NL : y ¼ ðx2 Þ3 þ cos ðx6 Þ þ 0:3  sin ðx9 Þ :

ð9Þ

[21] Similarly to the previous studies, in all the test models, et is a standard Gaussian random variable, with a zero mean and unit standard deviation. For the first five models, 520 data points are generated, with the first 20 points discarded to reduce the effect of initialization. The first 15 lags of the data (i.e., xt1 ; xt2 ; . . . ; xt15 ) are considered as candidate inputs. As for the nonlinear model, 15 standard Gaussian random variables (i.e., x1 ; x2 ; . . . ; x15 ) are generated, each one consisting of 500 data points. The data set generation process is repeated 30 times for each model to provide a statistic analysis of the IIS algorithm results. [22] The IIS algorithm is run on each independent instance with a different configuration of its parameters p, k, and ". The number p of SISO models evaluated at each iteration (step 2) and the number k of folds adopted for the k-fold cross-validation process is, respectively, set to 1, 5, 10 and 2, 4, 6, 8, 10. The metric DðÞ used to evaluate the SISO and MISO model performance (steps 2 and 3) is the coefficient of determination R2. Since R2 lies in the range [0,1], the algorithm tolerance " is varied in the range [0,0.1], with a variation step equal to 104. When " is equal to 0.1, the algorithm is stopped if the selection of a further variable leads to an increase of R2 lower than 0.1. On the other hand, when " is equal to 0, the algorithm is stopped if the selection of a further variable leads to a decrease of R2, namely the underlying MISO model is brought to an overfitting condition. Considering the values set for the parameters p, k, and ", a total of 15  103 different parameterizations are explored for each instance of each model.

4299

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

Figure 2. Average selection accuracy (in percentage) of IIS algorithm for linear data sets (i.e., AR(1), AR(4) and AR(9)) with 500 data points. [23] As for the Extra-Trees setting, default values for the three parameters M, K, and nmin are set according to Geurts et al. [2006] indications and the subsequent authors’ experiences [Castelletti et al., 2010a; Galelli and Castelletti, 2013]: the number M of trees in an ensemble is 500, the number K of alternative cut-directions is equal to the number of inputs (i.e., 1 and 15 for the SISO and MISO evaluations, respectively), and the minimum cardinality nmin for splitting a node is 5. [24] The overall results of this benchmarking study are summarized in Figures 2 and 3, which show the average selection accuracy achieved on the data sets generated with linear (i.e., AR(1), AR(4), and AR(9), Figure 2) and nonlinear models (i.e., TAR(1), TAR(2), and NL, Figure 3). Results for AR(1) and AR(4) models show that the IIS algorithm produces a correct specification (i.e., only the exact input variables are selected) for a very broad range of parameters’ values. For a value of " and k larger than 0.02 and 4, respectively, the algorithm achieves 100% accuracy. The value of " is key in determining the difference between correct and over specification. When the value of " is too small, the algorithm is not sufficiently sensitive to irrelevant inputs and tends to select a larger set of input variables. The value of k is less important in this selection process, although it is evident that small values of k, such as 2 or 4, can slightly decrease the algorithm accuracy. This is because the adoption of large folds (small values of k) does not allow the underlying model to capture the behavior of unseen observations. [25] These large ‘‘islands of high performance’’ are somehow reduced for the AR(9) model. In addition, in this case, the algorithm can achieve high accuracy (around 90% of correct specification), but for values of " included between 0.01 and 0.04. Similar to the previous models, a small value of " is associated with an over specification

behavior, while for a large value of ", the algorithm shows an under specification behavior and provides an incomplete set of inputs. This can be explained by considering the nature of the AR(9) model. While the AR(1) and AR(4) models are characterized by few significant variables (xt1 and xt1 ; xt4 , respectively), the AR(9) has a larger number of useful inputs (i.e., xt1 ; xt4 and xt9 ), which can be discarded when the algorithm sensitivity is reduced by assuming a large value of ". In all three cases, the value of p appears to be less important than k and ". This is because p is used to minimize the risk of selecting redundant inputs while the three data sets are essentially characterized by a high level of noise rather than redundant information. [26] The results on the datasets generated with linear models are in line with what was found by May et al. [2008], who obtained a selection accuracy of about 95% when applying the PMIS algorithm to data sets of the same size (500 samples). Interestingly, the PMIS algorithm shows the best performance when using as termination criterion the tabulated critical values of MI, which inherently make an assumption regarding the distribution of the data, and this may affect the input selection results in case of deviations from the assumed distribution. On the other hand, the use of Extra-Trees as underlying model does not require any assumption on the statistical properties of the input data sets. [27] The results obtained for the two nonlinear models (i.e., TAR(1) and TAR(2), Figure 3) are similar to those obtained for the AR(9) one. In addition, for these two models, the islands of good performance are bounded by values of " between 0.01 and 0.04 and are slightly increased by values of k larger than 4. When the IIS algorithm parameters are within these ranges, the accuracy is about 90% of correct specification. Again, similar to the AR(9) model, a small value of " is associated to an over specification

4300

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

Figure 3. Average selection accuracy (in percentage) of IIS algorithm for nonlinear data sets (i.e., TAR(1), TAR(9), and NL) with 500 data points. behavior, while for a large value of ", the algorithm tends to provide an incomplete set of selected variables. Results for the NL model (Figure 3) are somehow in line with what found for the other nonlinear models but with a significant reduction in the islands of high performance. The highest value of correct specification is around 60%, and most importantly, this performance is achieved for a narrow range of the algorithm parameters. In particular, the IIS algorithm provides an incomplete set of selected variables for values of " larger than 0.01. In such a difficult test, where different variables (i.e., x2, x6, and x9) are related by a strong nonlinear function, a small increase in the algorithm tolerance can easily lead to the mis-selection of at least one variable. Again, these results are in line with May et al. [2008], who found a selection accuracy of about 75% (average on the three nonlinear data sets) when using the tabulated critical values of MI and almost 100% with the Akaike information criterion. The application to nonlinear data sets also shows that the amount of available data (i.e., the amount of observations N) and the number k of folds have an important impact, as discussed in the supporting information, where the algorithm sensitivity to the data set length is described.

4.

Case Study—The Upper Ticino River Basin

[28] We also evaluated the IIS algorithm on a real-world case study, where it was used to identify the most relevant input variables for daily streamflow prediction in the upper Ticino River Basin (Switzerland). 4.1. Study Site and Data [29] The upper Ticino River Basin (Figure 4) extends for 1515 km2 from the alpine continental divide to the northern shores of Lake Maggiore (Switzerland). It is characterized

by significant variations in hydroclimatic conditions and by pronounced orographic heterogeneity. The altitudinal variation in the basin ranges from 220 to 3402 m a.s.l., with more than 80% of the area located higher than 1000 m a.s.l. and a mean slope exceeding the 27%. The dominance of a subalpine climate regime results in extremely variable weather conditions, which cause notable floods and drought periods. [30] The hydrometeorological observational data used in this study consist of time series covering a 4 year period (2004 and 2007–2009). As shown in Figure 4, the meteorological data are collected in five stations located in different parts of the basin, while the streamflow is measured at the Bellinzona station. The data set consists of daily streamflow (y),  precipitation  (p), maximum and minimum tempermin ature T max , maximum and minimum relative air ; T  moisture rmax ; rmin , net solar radiation (Rn, calculated from the amount of sun hours/overall solar radiation), and wind speed (u). Moreover, the daily potential evapotranspiration (ET) is computed for each meteorological station using the recorded meteorological data and the FAO-Penman-Monteith equation. For further details, see Table 1. 4.2. Candidate Input Variables [31] To assess the ability of the IIS algorithm in selecting the most significant input variables for predicting the daily streamflow at time t þ 1 (i.e., ytþ1 ), two different experiments are considered. 4.2.1. Experiment 1 [32] The candidate input variables considered are the precipitation pit and the potential evapotranspiration ET it with one and two temporal delay in the five measurement stations available, i ¼ S1; . . . ; S5 (see Figure 4). No a priori assumption is made on the spatial dynamics of the

4301

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

Figure 4. The upper Ticino River Basin (Switzerland) showing the measurement stations used in the input selection problem: S1, Robiei; S2, Piotta; S3, Acquarossa; S4, San Bernardino; S5, Magadino. hydrometeorological process (e.g., spatial aggregation). Moreover, following Young et al. [2006], the previous day streamflow yt is also considered. This results in a total of 21 candidate inputs. 4.2.2. Experiment 2 [33] The precipitation and streamflow variables are the same as in Experiment 1, while the potential evapotranspiration is substituted for by the variables appearing in the FAO-Penman-Monteith formula. Therefore, the input selection exercise considers the previous day streamflow yt and all the variables with one and two temporal delay in the five meteorological stations i ¼ S1; . . . ; S5, i.e., the i precipitation  pt , maximum and minimum temperature max ;i min ; Tt Tt , maximum and minimum relative air mois  and wind speed ture ðrtmax ;i ; rtmin ;i Þ, net solar radiation Rn;i t Table 1. Summary of the Hydrometeorological Observational Data Used in This Study (Daily Values)a Variable y pi ETi T max ;i T min ;i rmax ;i rmin ;i Rn;i ui a

Description Streamflow Precipitation Potential evapotranspiration Maximum temperature Minimum temperature Maximum relative air moisture Minimum relative air moisture Net solar radiation Wind speed at 2 m height

Unit of Measurement m3/s mm mm C C % % MJ/m2 m/s

The superscript i indicates the station as detailed in Figure 4.

 i ut . In this case, the number of candidate input variables is 71 (seven relevant variables with one and two temporal delay in five stations, plus the previous day streamflow). 4.3. IIS Algorithm Setting [34] The IIS algorithm parameters p, k, and " are set according to the results of the sensitivity analysis described in section 3. The number p and k of SISO models and folds adopted for the k-fold cross-validation process are both set to 5, while the algorithm tolerance " is equal to 0.001 (with the coefficient of determination R2 as the distance metric). As discussed in section 3, this latter value might lead to an over specification behavior, i.e., to the selection of one or more redundant variables. However, considering that the hydrological processes to be modeled are highly nonlinear and the correct set of input variables is unknown, we prefer taking a chance on selecting a larger set of inputs rather than missing one or more inputs. The Extra-Trees setting follows the empirical evaluations in Geurts et al. [2006] : M, the number of trees in an ensemble, is 500, K is assumed equal to the number of input variables considered (i.e., 1 for the SISO evaluation, and 21 and 71 for the MIMO evaluation in Experiments 1 and 2, respectively), nmin, the minimum cardinality for splitting a node, is 5. 4.4. Results and Discussion 4.4.1. Experiment 1 [35] The results obtained for the first input selection experiment are reported in Figure 5, which shows the cumulated performance R2 of the underlying model, as well as the contribution R2 of each selected variable, evaluated

4302

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

Figure 5. Performance improvement by each selected variable in Experiment 1. The continuous line is the cumulated performance of the underlying model. as the variation of the coefficient of determination R2 at each iteration of the IIS algorithm. The cumulated performance increases monotonically with the number of selected variables, up to the fifth, when the selection of an additional variable does not lead any further significant increase in the underlying model performance, and the algorithm tolerance " is reached. A good percentage of the streamflow process can thus be described by means of four variables, which relate to the main driving forces of the streamflow formation process: (i) the previous day streamflow yt can be seen as a proxy of the overall catchment condition, and, as such, it is characterized by a relevant contribution R2 of about S1 0.63; (ii) the measured precipitation pS4 t and pt , at San Bernardino and Robiei station, have a direct, positive impact on streamflow generation ; and (iii) the potential evapotranspiration ET S1 at Robiei station, together with t S1 the precipitation pS4 t and pt , provides an indirect representation of the soil moisture content available in the catchment. Moreover, ET S1 can negatively affect the output t dynamics and introduces the annual periodicity of the process being modeled, as it depends on a set of meteorological variables with a strong seasonal/annual pattern. Interestingly, the algorithm selects two stations for the measured precipitation (i.e., San Bernardino and Robiei) lying in the boundaries of the basin, and not in the middle, such as Piotta or Acquarossa (see Figure 4). This may be due to the large spatial heterogeneity of rainfall, which makes essential to use two instead of a single station representative of the basin. Indeed, the cross correlation between S1 the measured precipitation pS4 t and pt is only 0.70. In addition, the contribution of the precipitation (especially pS4 t ) is more prominent than the potential evapotranspiration ET S1 t . The catchment is indeed characterized by a short time of concentration (slightly lower than 24 h) and relatively fast dynamics, so the IIS algorithm privileges the precipitation, highly correlated to surface runoff (fast flow component), rather than the potential evapotranspiration, correlated to the interflow/baseflow linkage representing the low flow component. Moreover, all the selected variables have one temporal (i.e., day) delay only, while variables with a 2 days delay are completely discarded. Again, this might be due the short time of concentration of the Ticino catchment. This analysis is confirmed by the scatterplot and the trajectories produced by the underlying model during the variable selection process (see Figure 6). The

usage of the previous day streamflow yt only (1 input, Figure 6a) allows to describe the main trends in the streamflow dynamics, while the introduction of the measured precipitaS1 tion pS4 t and pt (2 and 3 inputs, Figures 6b and 6c) enhances the model predictive capabilities and permits to capture the sudden increases in the river streamflow (see, for example, the comparison in Figure 6e between the measured and predicted flow on the 26th September). Finally, the inclusion of the potential evapotranspiration ET S1 t provides a further, yet modest, improvement of the model performance (Figure 6d). [36] The model with the four inputs selected by the IIS algorithm is then compared against (i) a model fed by the previous day streamflow yt, plus the daily mean areal precipitation pt and potential evapotranspiration ETt (obtained through the Thiessen polygon method), (ii) a model fed by all the potential 21 candidate inputs, and (iii) a naive model, with the predicted streamflow ytþ1 equal to the previous day streamflow yt (i.e., ytþ1 ¼ yt ), used to assess the true utility of the developed models. The same Extra-Trees setting adopted for the four input model (i.e., M ¼ 500, nmin ¼ 5, and K equal to the number of considered inputs) ensures the best performance for these three further models also. The comparison is based on multi-assessment criteria, aimed at describing the models behavior under different flow conditions. These criteria are the coefficient of determination (R2) and the relative root-mean-squared error (RRMSE), namely normalized statistics providing a description of the models behavior over the whole range of flows; the root–mean-squared error (RMSE), which measures the goodness of fit relevant to high flows; the mean absolute error (MAE), which indicates the goodness of fit at moderate flow values. Results in Table 2 show that the model fed by the selected inputs can (slightly) outperform the first two models with respect to all the criteria: In particular, it is interesting to notice that the model fed by the daily mean areal precipitation pt and potential evapotranspiration ETt is less performing in high flow conditions (as shown by the higher value of RMSE). This effect is probably due to the fact that the spatial averaging of the Thiessen polygon method reduces the information content of the spatiotemporal variation of pt and ETt, which is fully exploited by the IIS algorithm. This is further demonstrated by an analysis of the scatterplot and the hydrograph (Figure 7), which show that the model fed by mean areal

4303

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

Figure 6. Comparison between the measured and predicted daily streamflow with the different inputs selected by the IIS algorithm in Experiment 1: (a–d), scatterplots and (e) specimen of the corresponding hydrograph for the period Sept.–Oct. 2004. precipitation and evapotranspiration has a tendency to underestimate the observed streamflow, while the model with the four selected inputs is characterized by a lower prediction error. Moreover, Table 2 shows that all the models based on Extra-Trees can largely outperform the naive model: The hydrological processes leading to the streamflow generation are strongly driven by an autoregressive process, but knowing this latter does not allow to predict the quick streamflow increases driven by precipitation and potential evapotranspiration correctly. 4.4.2. Experiment 2 [37] The results of the input selection process for the 71 candidates considered in the second experiment are Table 2. Comparison of the Model Performance, Evaluated in k-Fold Cross Validation (With k ¼ 5), Obtained in Experiment 1 Model inputs S1 S1 yt ; pS4 t ; pt ; ET t yt ; pt ; ET t 21 candidate inputs yt (naive model)

R2

RMSE (m3/s)

RRMSE

MAE (m3/s)

0.762 0.757 0.753 0.574

25.052 25.745 25.389 33.023

0.495 0.509 0.502 0.652

10.971 11.025 11.080 12.557

reported in Figure 8, together with the cumulated performance R2 of the underlying model. Similar to the first experiment, the cumulated performance increases monotonically with the number of selected variables, up to the sixth, when the overfitting effect prevails and the model performance decreases (i.e., the contribution R2 of the sixth variable is S1 negative). The presence of yt ; pS4 t , and pt in the first three positions empirically confirms that the previous day streamflow is a proxy of the overall catchment condition and that the precipitation is the main driving factor of the streamflow process in the Ticino catchment. With respect to Experiment 1, the potential evapotranspiration ET S1 t is replaced by the minimum temperature Ttmin ;S2 and the minimum relative air moisture rtmin ;S1 in Piotta and Robiei station. The selection of the minimum temperature Ttmin ;S2 can be related to the snowmelt process, which has a significant contribution to the streamflow generation in late spring. [38] The scatterplot and the trajectories produced by the underlying model during the variable selection process (Figure 9) show that the selection of the measured precipiS1 tation pS4 Figures 9b and 9c) and t and pt (2 and 3 inputs, the minimum temperature Ttmin ;S2 (4 inputs, Figure 9d) allows the model to predict the rapid increase in the river

4304

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

Figure 7. Comparison between the measured and predicted daily streamflow: (a) scatterplot and (b) hydrograph, with a model fed by the selected inputs in Experiment 1 (red line), all the potential inputs (black line) and the mean areal precipitation and evapotranspiration (green line). streamflow. On the other hand, the selection of the minimum relative air moisture rtmin ;S1 (Figure 9e) has a minimum impact on the model performance. The same multiassessment criteria adopted for the previous experiment show that this model can outperform a model fed by all the potential 71 candidate inputs (Table 3) in a wide range of flow conditions (as well as the naive model adopted in the previous experiment). This is also confirmed by the scatterplot and the hydrograph reported in Figure 10. 4.4.3. Discussion [39] The results obtained for Experiment 1 reveal that the IIS algorithm has the ability of selecting the most significant variables among an initial set of 21 candidates. This variable selection process permits to minimize the redundancy of the initial set, with consequent (modest) benefits for the underlying model. The MB algorithm is thus not

interfered by nonrelevant information and can effectively exploit the available data. As far as modeling flexibility is concerned, the underlying model is also empirically demonstrated to well approximate the nonlinear relationship of streamflow formation. In this respect, it is important to consider that the structural error characterizing all the models is due to the physical characteristics of the catchment. The catchment has indeed a time of concentration slightly lower than 24 h, so the models performance is somehow ‘‘bounded’’ by the usage of measured external forcing only. [40] Experiment 2, with a total of 71 candidate input variables, empirically confirms that the IIS algorithm scales well to large data sets. Interestingly, the substitution of the potential evapotranspiration ET it in each station for the raw meteorological information available at the same station is demonstrated to increase the underlying model

Figure 8. Performance improvement by each selected variable in Experiment 2. The continuous line is the cumulated performance of the underlying model. 4305

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

Figure 9. Comparison between the measured and predicted daily streamflow with the different inputs selected by the IIS algorithm in Experiment 2: (a–e) scatterplots and (f) specimen of the corresponding hydrograph for the period Sept.–Oct. 2004. performance (see the multiassessment criteria in Table 3 and the hydrograph in Figure 11). In particular, it is interesting to further comment on the selection of the minimum temperature Ttmin ;S2 . As shown in Figure 8, Ttmin ;S2 is selected at the fourth iteration of IIS, but it contributes to R2 more than other variables. In other words, the contribution of the single variables does not monotonically decreases during the variable selection process. The possible reason for this behavior is that a variable, when considered alone, can be quite useless in explaining the output, while it becomes much more informative when considered together with other variables [Guyon and Elisseeff, 2003]. Table 3. Comparison of the Model Performance, Evaluated in kFold Cross Validation (With k ¼ 5), Obtained in Experiment 2 Model Inputs min ;S2 min ;S1 S1 ; rt yt ; pS4 t ; pt ; Tt 71 candidate inputs yt (naive model)

R2 (m3/s)

RMSE

RRMSE

MAE (m3/s)

0.775 0.737 0.574

24.405 26.202 33.023

0.482 0.518 0.652

10.603 11.240 12.557

[41] Finally, the selection of few, significant, and nonredundant variables leads to a more parsimonious and compact representation of the observational data set, and this is key in assisting with the interpretation of the underlying physical processes. In the present study, both the input selection experiments explore whether the direct usage of originally measured variables without any a priori assumption (i.e., spatial aggregation and/or postprocessing of the meteorological data) is more convenient in terms of accuracy of the streamflow prediction and providing a physical insight of the streamflow formation process. For example, the first experiment shows that it is more convenient to select precipitation and evapotranspiration information in few relevant stations, rather than using the Thiessen polygon method to calculate the mean areal precipitation and evapotranspiration. 4.4.4. Comparison Against PMI-Based Input Selection [42] To further validate the results obtained with the IIS algorithm, Experiments 1 and 2 are solved by means of the PMIS algorithm [Sharma, 2000; May et al., 2008], which found successful application in different hydrological problems. More specifically, the PMIS algorithm is run with a

4306

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

Figure 10. Comparison between the measured and predicted daily streamflow : (a) scatterplot and (b) hydrograph, with a model fed by the selected inputs in Experiment 2 (blue line) and all the potential inputs (black line).

Figure 11. Comparison between the measured and predicted daily streamflow : (a) scatterplot and (b and c) hydrograph, with a model fed by the inputs selected in Experiments 1 and 2. 4307

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

bootstrap size of 100 and without specifying any termination criteria. This is done to study the relative importance for the PMIS algorithm of the inputs selected by the IIS approach. [43] The results obtained for Experiment 1 (first 10 iterations) are reported in Table 4 and show a good agreement between the two algorithms, with three inputs selected by the IIS algorithm chosen by the PMIS one at the first four iterations. In particular, (i) the previous day streamflow yt is selected at the first iteration and it is characterized by the highest value of mutual information I equal to 0.851; (ii) S1 S4 the measured precipitation pS2 t ; pt , and pt , at Piotta, Robiei, and San Bernardino stations, are subsequently selected because of their direct impact on streamflow generation ; and (iii) the potential evapotranspiration from the different stations is then included. The two algorithms thus follow the same order in selecting the most relevant variables, with the only difference that the measured precipitation pS2 t in Piotta station is not chosen by the IIS algorithm S4 (only pS1 t and pt in Robiei and San Bernardino stations are selected) and that the potential evapotranspiration ET S1 t in Robiei station is selected after eight iterations of the PMIS algorithm. This could be explained by the fact that the contribution of the potential evapotranspiration is low and very similar between the different stations (as shown by the value of the mutual information), so, the two algorithms tend to choose randomly among different inputs when this uncertainty arises. Table 4 also reports the performance, in terms of R2, of 10 models fed by the inputs iteratively selected by the PMIS algorithm. The same Extra-Trees setting adopted for the IIS algorithm (i.e., M ¼ 500; nmin ¼ 5, and K equal to the number of inputs) is used in order to perform a fair comparison. Results show that the value of R2 is slightly lower than the one obtained with the inputs selected by the IIS algorithm. This is because the optimal subset of inputs selected by the IIS algorithm depends on Extra-Trees as underlying model and, in turn, the ExtraTrees prediction performance depends on the particular inputs selected, so the interaction between the IIS algorithm and Extra-Trees is expected to enhance the prediction performance. [44] These findings are confirmed in Experiment 2, where with four input variables selected by the IIS algorithm chosen by the PMIS one at the first seven iterations Table 4. Variables Selected by the PMIS Algorithm During the First 10 Iterations in Experiment 1 and Corresponding Mutual Information Ia Variable yt pS2 t pS1 t pS4 t ET S4 t ET S2 t ET S3 t ET S5 t ET S1 t S5 pt

I

R2

0.851 0.078 0.013 0.032 0.012 0.019 0.012 0.016 0.015 0.007

0.567 0.712 0.724 0.730 0.746 0.752 0.758 0.756 0.771 0.764

a The last column shows the performance of an ensemble of Extra-Trees evaluated in k-fold cross validation (with k ¼ 5) and fed by the selected inputs.

Table 5. Variables Selected by the PMIS Algorithm During the First 10 Iterations in Experiment 2 and Corresponding Mutual Information Ia Variable

I

R2

yt pS2 t Ttmin ;S5 Ttmin ;S3 Ttmin ;S2 rtmin ;S1 pS1 t Ttmax ;S3 Ttmin ;S4 Ttmax ;S5

0.851 0.078 0.021 0.026 0.027 0.026 0.040 0.035 0.030 0.027

0.568 0.712 0.649 0.745 0.745 0.742 0.757 0.748 0.749 0.752

a

The last column shows the performance of an ensemble of Extra-Trees evaluated in k-fold cross validation (with k ¼ 5) and fed by the selected inputs.

(see Table 5). These variables are the previous day streamflow yt, the minimum temperature and humidity Ttmin ;S2 and rtmin ;S1 at Piotta and Robiei stations and the measured preS4 cipitation pS1 t at Robiei. Only the precipitation pt at San Bernardino is not selected. This difference between IIS and PMIS result can be explained by considering that the second experiment has a larger number of candidate input variables (i.e., 71), so there is more uncertainty about which subset to use. Overall, the experiments show that the two algorithms are in agreement when there is a clear correlation between candidate inputs and output. This is demonstrated by the fact that they both select the previous streamflow and measured precipitation in three stations (Piotta, Robiei, and San Bernardino) and that they discard variables with 2 days delay.

5.

Conclusions

[45] The paper presents a novel tree-based algorithm for input variable selection for hydrological and water resources modeling studies. The approach combines the advantages of model-based selection with the efficiency of model-free methods. The tree-based nature of the underlying model ensures modeling flexibility, computational efficiency, and scalability, while the iterative nature of the algorithm minimizes redundancy. The IIS algorithm accuracy and sensitivity to parameters p, k, and " is first evaluated on a synthetic test case study from the input variable selection literature. Following that, the algorithm is demonstrated for a real-world daily streamflow prediction problem in the upper Ticino River Basin. Results and comparison against PMIS algorithm indicate that the algorithm is capable of selecting the most significant and nonredundant inputs under different conditions, as the presence of noise (synthetic test case study) or several redundant variables (real-world problem) in the sample data set. This permits to identify more parsimonious and compact representations of the observational data set, with increased model performance and assisted interpretation of the underlying physical processes. Moreover, results show that the algorithm provides corrects results for a broad range of the parameters’ values, although its tolerance " may require some trial-and-error tuning. Although the IIS algorithm is essentially conceived for the development of tree-based

4308

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION

regression/classification methods, its nature allows its usage for the selection of the most significant input variables for any (linear or nonlinear) regression method. [46] Future research will concentrate on improving the predictive performance of Extra-Trees, which show a tendency to underestimate the flow peaks because of the averaging performed at the trees leaves [Galelli and Castelletti, 2013]. Another aspect that deserves further study is the physical interpretability of the selected variables. The IIS algorithm, similarly to the majority of the input selection algorithms available in literature, selects the most relevant variables on the basis of their numerical correlation, which does not necessarily mean physical causality. It is therefore interesting to combine the IIS algorithm with causality detection approaches [e.g., Bizzi et al., 2013]. [47] Acknowledgments. The research presented in this work was carried out as part of the Singapore-Delft Water Alliance (SDWA) MultiObjective Multiple-Reservoir Management research program (R-303-001005–272). The authors are grateful to Marcello Restelli and Rodolfo Soncini-Sessa for the essential contribution in developing the algorithm. The meteorological and hydrological data used in this study have been provided by the Swiss Federal Office for Meteorology (MeteoSwiss) and the Swiss Federal Office for Water and Geology (BWG).

References Alcazar, J., and A. Palau (2010), Establishing environmental flow regimes in a Mediterranean watershed based on a regional classification, J. Hydrol., 388(1-2), 41–51. Allen, D. (1974), The relationship between variable selection and data augmentation and a method for prediction, Technometrics, 16(1), 125–127. Barnett, F. A., S. T. Gray, and G. A. Tootle (2010), Upper Green River Basin (United States) streamflow reconstructions, J. Hydrol. Eng., 15(7), 567–579. Bizzi, S., B. Surridge, and D. Lerner (2013), Structural equation modelling: A novel statistical framework for exploring the spatial distribution of benthic macroinvertebrates in riverine ecosystems, River Res. Appl., 29(6), 743–759. Bowden, G., G. Dandy, and H. Maier (2005a), Input determination for neural network models in water resources applications. Part 1. Background and methodology, J. Hydrol., 301(1-4), 75–92. Bowden, G., G. Dandy, and H. Maier (2005b), Input determination for neural network models in water resources applications. Part 2. Case study: Forecasting salinity in a river, J. Hydrol., 301(1-4), 93–107. Breiman, L. (1996), Bagging predictors, Mach. Learn., 24(2), 123–140. Breiman, L. (2001), Random forests, Mach. Learn., 45(1), 5–32. Breiman, L., J. Friedman, R. Olsen, and C. Stone (1984), Classification and Regression Trees, Wadsworth & Brooks, Pacific Grove, Calif. Castelletti, A., S. Galelli, and R. Soncini-Sessa (2010a), A tree-based feature ranking approach to enhance emulation modelling of 3D hydrodynamic-ecological models, paper presented at the International Environmental Modelling and Software Society, Ottawa, Ont., Canada, 5–8 July. Castelletti, A., S. Galelli, M. Restelli, and R. Soncini-Sessa (2010b), Treebased batch-mode reinforcement learning for optimal water reservoir operation, Water Resour. Res., 46, W09507, doi:10.1029/2009WR 008898. Chow, T., and D. Huang (2005), Estimating optimal feature subsets using efficient estimation of high-dimensional mutual information, IEEE Trans. Neural Networks, 16(1), 213–224. Cutler, A., and Z. Guohua (2001), PERT—Perfect random trees ensembles, Comput. Sci. Stat., 33, 490–497. Das, S. (2001), Filters, wrappers and a boosting-based hybrid for feature selection, in Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pp. 74–81, Morgan Kaufmann, San Francisco, Calif. Fernando, T., H. Maier, and G. Dandy (2009), Selection of input variables for data driven models: An average shifted histogram partial mutual information estimator approach, J. Hydrol., 367(3-4), 165–176. Fonteneau, R., L. Wehenkel, and D. Ernst (2008), Variable selection for dynamic treatment regimes: A reinforcement learning approach, paper

presented at 8th European Workshop on Reinforcement Learning, Villeneuve d’Ascq, France, 30 June–4 July. Galelli, S. (2010), Dealing with complexity and dimensionality in water resources management, PhD thesis, Politec. di Milano, Milan, Italy. Galelli, S., and A. Castelletti (2013), Assessing the predictive capability of randomized tree-based ensembles in streamflow modelling, Hydrol. Earth Syst. Sci., 17, 2669–2684. George, E. (2000), The variable selection problem, J. Am. Stat. Assoc., 95(452), 1304–1308. Geurts, P., D. Ernst, and L. Wehenkel (2006), Extremely randomized trees, Mach. Learn., 63(1), 3–42. Guyon, I., and A. Elisseeff (2003), An introduction to variable and feature selection, J. Mach. Learn. Res., 3, 1157–1182. Hart, J. K., and K. Martinez (2006), Environmental sensor networks: A revolution in the earth system science?, Earth Sci. Rev., 78(3-4), 177–191. Hejazi, M., and X. Cai (2009), Input variable selection for water resources systems using a modified minimum redundancy maximum relevance (mMRMR) algorithm, Adv. Water Resour., 32(4), 582–593. Heuvelmans, G., B. Muys, and J. Feyen (2006), Regionalisation of the parameters of a hydrological model: Comparison of linear regression models with artificial neural nets, J. Hydrol., 319(1-4), 245–265. Hill, D., Y. Liu, L. Marini, R. Kooper, A. Rodriguez, J. Futrelle, B. Minsker, J. Myers, and T. McLaren (2011), A virtual sensor system for usergenerated, real-time environmental data products, Environ. Modell. Software, 26(12), 1710–1724. Ho, T. (1998), The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell., 20(8), 832–844. Huang, W., and S. Foo (2002), Neural network modeling of salinity variation in Apalachicola river, Water Res., 36(1), 356–362. Jong, K., J. Mary, A. Cornuejols, E. Marchiori, and M. Sebag (2004), Ensemble feature ranking, in Knowledge Discovery in Databases, edited by J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, pp. 267–278, Springer, Germany. Jothiprakash, V., and A. Kote (2011), Effect of pruning and smoothing while using M5 model tree technique for reservoir inflow prediction, ASCE J. Hydrol. Eng., 16(7), 563–574. Jung, M., M. Reichstein, and A. Bondeau (2009), Towards global empirical upscaling of FLUXNET eddy covariance observations: Validation of a model tree ensemble approach using a biosphere model, Biogeosciences, 6(10), 2001–2013. Kohavi, R., and H. John (1997), Wrappers for feature subset selection, Artif. Intell., 97(1–2), 273–324. Kwak, N., and C. Choi (2002), Input feature selection for classification problems, IEEE Trans. Neural Networks, 13(1), 143–159. Maier, H., A. Jain, G. Dandy, and K. Sudheer (2010), Methods used for the development of neural networks for the prediction of water resource variables in river systems: Current status and future directions, Environ. Modell. Software, 25(8), 891–909. Maree, R., P. Geurts, J. Piater, and L. Wehenkel (2005), Random subwindows for robust image classification, paper presented at IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Comput. Soc., Inst. of Electr. and Electron. Eng., San Diego, Calif., 20–25 June. Maree, R., P. Geurts, and L. Wehenkel (2006), Random subwindows and extremely randomized trees for image classification in cell biology, paper presented at International Workshop on Multiscale Biological Imaging, Data Mining and Informatics, Santa Barbara, Calif., 7–8 Sept. May, R., H. Maier, G. Dandy, and T. G. Fernando (2008), Non-linear variable selection for artificial neural networks using partial mutual information, Environ. Modell. Software, 23(10-11), 1312–1326. Miller, A. (1990), Subset Selection in Regressions, Chapman and Hall, London. Moghaddamnia, A., M. G. Gousheh, J. Piri, S. Amin, and D. Han (2009), Evaporation estimation using artificial neural networks and adaptive neuro-fuzzy inference system techniques, Adv. Water Resour., 32(1), 88–97. Noori, R., A. Karbassi, A. Moghaddamnia, D. Han, M. Zokaei-Ashtiani, A. Farokhnia, and M. G. Gousheh (2011), Assessment of input variables determination on the SVM model performance using PCA, Gamma test, and forward selection techniques for monthly stream flow prediction, J. Hydrol., 401(3-4), 177–189. Peng, H., F. Long, and C. Ding (2005), Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell., 27(8), 1226–1238.

4309

GALELLI AND CASTELLETTI: TREE-BASED INPUT SELECTION Phatak, A., B. Bates, and S. Charles (2011), Statistical downscaling of rainfall data using sparse variable selection methods, Environ. Modell. Software, 26(11), 1363–1371. Quinlan, J. (1992), Learning with continuous classes, paper presented at 5th Australia Joint Conference on Artificial Intelligence, Hobart, Tas., Australia, 16–18 Nov. Salas, J. D., C. Fu, and B. Rajagopalan (2011), Long-range forecasting of Colorado streamflows based on hydrologic, atmospheric, and oceanic data, J. Hydrol. Eng., 16(6), 508–520. Sharma, A. (2000), Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 1—A strategy for system predictor identification, J. Hydrol., 239(1–4), 232–239. Sharma, A., K. C. Luk, I. Cordery, and U. Lall (2000), Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 2—Predictor identification of quarterly rainfall using oceanatmosphere information, J. Hydrol., 239(1–4), 240–248. Sharma, A., and S. Chowdhury (2011), Coping with model structural uncertainty in medium-term hydro-climatic forecasting, Hydrol. Res., 42(2–3), 113–127. Ssegane, H., E. Tollner, Y. Mohamoud, T. Rasmussen, and J. Dowd (2012), Advances in variable selection methods. I: Causal selection methods ver-

sus stepwise regression and principal component analysis on data of known and unknown functional relationships, J. Hydrol., 438–439, 16–25. Tarquis, A., J. de Lima, W. Krajewski, Q. Cheng, and H. Gaonac’h (2011), Preface ‘‘nonlinear and scaling processes in hydrology and soil science,’’ Nonlinear Processes Geophys., 18(6), 899–902, doi:10.5194/npg-18– 899-2011. Traveria, M., A. Escribano, and P. Palomo (2010), Statistical wind forecast for Reus airport, Meteorol. Appl., 17(4), 485–495. Wan Jaafar, W. Z., J. Liu, and D. Han (2011), Input variable selection for median flood regionalization, Water Resour. Res., 47, W07503, doi:10.1029/2011WR010436. Wang, W.-C., K.-W. Chau, C.-T. Cheng, and L. Qiu (2009), A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series, J. Hydrol., 374(3-4), 294–306. Wehenkel, L. (1998), Automatic Learning Techniques in Power Systems, Kluwer, Boston, Mass. Young, P. (2011), Recursive Estimation and Time Series Analysis: An Introduction for the Student and Practitioner, Springer, Heidelberg, Germany. Young, P., A. Castelletti, and F. Pianosi (2006), The data-based mechanistic approach in hydrological modeling, in Topics on System Analysis and Integrated Water Resource Management, edited by A. Castelletti and R. Soncini-Sessa, pp. 27–48, Elsevier, Amsterdam.

4310