Water Resour Manage DOI 10.1007/s11269-014-0699-7

Identification of Homogeneous Rainfall Regimes in Northeast Region of India using Fuzzy Cluster Analysis

Manish Kumar Goyal & Vivek Gupta

Received: 30 January 2014 / Accepted: 26 May 2014
© Springer Science+Business Media Dordrecht 2014

Abstract Regionalization methods are often used in hydrology for frequency analysis of floods. Hydrologically homogeneous regions should be determined using cluster analysis rather than by grouping geographically close stations. In view of the ongoing environmental and climate changes in Northeast India, identification of homogeneous rainfall regions is essential for an effective flood frequency analysis of this region. The choice of a clustering approach appropriate to the data of the basin is also significant. In this study, total precipitation data of stations operated by the Indian Meteorological Department (IMD) in the basins of Northeast India are used for cluster analysis. Further, five cluster validity indices, namely Partition Coefficient, Partition Entropy, Extended Xie-Beni index, Fukuyama-Sugeno index and Kwon index, have been tested to determine their effectiveness in identifying the optimal partition provided by the fuzzy c-means (FCM) clustering algorithm. A comparison is also performed using the K-means clustering algorithm. Additionally, regional homogeneity tests based on the L-moments approach are used to check the homogeneity of the regions identified by both cluster analysis approaches. The regional homogeneity test results show that the regions defined by the FCM method are sufficiently homogeneous for regional frequency analysis.

Keywords Regionalization · Clustering · Flood frequency analysis

M. K. Goyal (*) · V. Gupta
Dept of Civil Engineering, Indian Institute of Technology, Guwahati, India

1 Introduction

It is necessary to estimate the frequency and magnitude of extreme events such as floods. Data are scarce at many sites of interest in India, so at-site frequency analysis may give erroneous results. To address this issue, several studies have suggested dividing the whole catchment into homogeneous regions (Rao and Srinivas 2006a; Lin and Chen 2003). In the past, geographical, political and physiographical boundaries were the basis for delineating regions (Natural Environmental Research 1975; Beable and McKercher 1982; Matlas et al. 1975), but the main disadvantage of treating a political or physiographic region as a homogeneous region is that it does not generally result in a hydrologically homogeneous region (Burn 1997; Dikbas et al. 2012), which may lead to less reliable statistical analysis. This makes the identification of homogeneous regions a significant issue in the analysis of regional aspects.

In recent years Northeast India has undergone many environmental changes due to various development and urbanization activities, so it is necessary to examine the parameters related to ecological and environmental equilibrium. One of the most important parameters of the hydrological regime is rainfall, which needs to be studied in space and time. However, the real issue is to properly understand the nature of the rainfall distribution and its variability on a local scale (Venkatesh and Jose 2007). The varied physiographic features and altitudinal differences in Northeast India give rise to various types of climate, ranging from near tropical to temperate and alpine, which make the rainfall irregular and complex in time and space (Das et al. 2009). Because of such variation in the precipitation pattern, homogeneity in rainfall distribution cannot be expected. Thus, it becomes essential to recognize several homogeneous regions of similar rainfall distribution (Venkatesh and Jose 2007; Dikbas et al. 2012).

The procedure of identifying a homogeneous region is called ‘regionalization’, and frequency analysis based on these homogeneous regions is called ‘regional frequency analysis’ (RFA). The approaches developed for regional flood frequency analysis (RFFA) include (i) the method of residuals (MOR), (ii) canonical correlation analysis (CCA), (iii) the region of influence (ROI) (Burn 1990a,b), (iv) the hierarchical approach (Gabriele and Aenell 1991) and (v) cluster analysis (Rao and Srinivas 2006a). The MOR approach uses the positive and negative signs of the residuals of a regional regression model that relates flood quantiles to the characteristics of the watershed at each gauged site.
In this method, the regions are often arranged to coincide with recognized geographic and/or meteorological boundaries, or with political or administrative areas. Bhaskar and O’Connor (1989) compared MOR with cluster analysis and found that the regions formed by MOR were close to geographical boundaries but differed from the regions formed by cluster analysis. The regions formed by cluster analysis were more similar in terms of their hydrological behavior.

Canonical correlation analysis (CCA) (Cavadias 1989, 1990) represents drainage basins as points in the spaces of pairs of flood-related uncorrelated canonical variables and pairs of basin-related uncorrelated canonical variables. Similar patterns of these points are considered as ‘regions’. The results of the CCA-based approach depend on at-site estimates of extreme quantiles, and because of the paucity of flood data, reliable estimation of extreme quantiles is not possible (Hosking and Wallis 1997, p. 147).

In the ROI approach (Burn 1990a,b) each site has its own region. All sites whose distance from the site of interest is less than a threshold are included in the region of that site. Distance is generally calculated in a weighted multi-dimensional attribute space. The choice of weights for attributes and sites is a significant problem, as no exact mathematical solution is available (Bobee and Rasmussen 1995).

Cluster analysis is an unsupervised multivariate analysis which classifies the given data into similar overlapping or non-overlapping groups. The classification of clustering algorithms is shown in Fig. 1. K-means clustering assigns all feature vectors to clusters with non-overlapping boundaries: a feature vector has a degree of membership of 1 in the cluster to which it belongs and 0 in every other cluster (Rao and Srinivas 2006a; Dikbas et al. 2012). A fuzzy clustering algorithm permits a feature vector to belong to all the clusters simultaneously with a certain degree of membership. The value of the fuzzy membership of a feature vector in a cluster specifies the strength with which it belongs to that cluster (Rao and Srinivas 2006b).

Identification of homogeneous rainfall regimes in Northeast India

Fig. 1 Classification of clustering algorithms

The objective of this study is to identify the homogeneous rainfall regions in Northeast India by fuzzy and K-means cluster analysis and then to examine the homogeneity of the regions formed, using an approach based on L-moments as elucidated by Hosking and Wallis (1997). Several cluster validation measures were also evaluated for determining the optimal partition in the fuzzy c-means and K-means algorithms. Five cluster validity indices, namely Partition Coefficient, Partition Entropy, Extended Xie-Beni index, Fukuyama-Sugeno index and Kwon index, were tested for determining the optimum number of clusters in the fuzzy clustering algorithm, while the Dunn index and the average silhouette width were used as cluster validity indices for K-means clustering.

2 Study Area & Data

The NE India region stretches between 21°50’ and 29°34’ N latitude and 85°34’ and 97°50’ E longitude and has a total geographical area of 26.2 million hectares, which is about 8 % of the total area of the country. Of the total geographical area of Northeast India, 28.3 % has an elevation of more than 1200 m, 17.9 % lies between 600 and 1200 m and about 10.8 % lies between 300 and 600 m above mean sea level. On average, the NE region receives about 2450 mm of rainfall. The Cherrapunji-Mawsynram range, located in NE India, receives rainfall as high as 11,500 mm annually (Das et al. 2009; Dash et al. 2012).

A total of 68 gauging stations with an observation period of 102 years were selected. Latitude, longitude, altitude, average total annual rainfall, coefficient of variation, maximum average annual rainfall and minimum average annual rainfall of the stations were used in the cluster analysis, as shown in Table 1. Variables with different units generally influence the clustering results, so the literature suggests normalizing the data with appropriate transformation functions (Cannarozzo et al. 2009; Lim and Voeller 2009; Dikbas et al. 2012). The data were normalized using the following transformation function before being used in the cluster analysis (Dikbas et al. 2012):

$$X^{N}_{ij} = \frac{X_{ij} - X_{i,\min}}{X_{i,\max} - X_{i,\min}} \tag{1}$$

where $X_{ij}$ is the ith attribute of the jth station; $X_{i,\min}$ is the minimum of the ith attribute over all stations; $X_{i,\max}$ is the maximum of the ith attribute over all stations; and $X^{N}_{ij}$ is the normalized ith attribute of the jth station.
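The transformation of Eq. 1 can be illustrated with a minimal Python sketch. The function name `min_max_normalize` and the sample values are hypothetical and are not taken from the study data; mapping a constant attribute to 0 is an assumption of this sketch, since Eq. 1 is undefined when the maximum equals the minimum.

```python
def min_max_normalize(stations):
    """Rescale every attribute (column) to [0, 1] following Eq. 1.

    `stations` holds one row of attribute values per station.
    """
    columns = list(zip(*stations))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in stations:
        normalized.append([
            # Assumption: a constant attribute (hi == lo) is mapped to 0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical stations: (annual rainfall in mm, altitude in m).
sample = [[1600.0, 30.0], [2800.0, 1500.0], [4000.0, 4500.0]]
scaled = min_max_normalize(sample)
```

After this rescaling every attribute spans the same range, so an attribute measured in large units (such as rainfall in mm) does not dominate the Euclidean distances used by the clustering algorithms.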

Table 1 Attributes considered in study

Attribute                            Range
Average annual precipitation (mm)    1602.10 to 4082.25
Coefficient of variation             −1817.62 to 1451.42
Maximum annual precipitation (mm)    1899.72 to 7272.82
Minimum annual precipitation (mm)    1117.907 to 2680.17
Latitude                             22.4833° N to 28.7° N
Longitude                            88.2° E to 96.5° E
Altitude (m)                         28.05 to 4544.568

3 Algorithm

3.1 K-means Clustering

K-means (MacQueen 1967) is an unsupervised learning algorithm for solving multivariate classification problems. The algorithm aims at minimizing the objective function given in Eq. 2:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{M} \left\| Y_i^{(j)} - C_j \right\|^2 \tag{2}$$

where $\| Y_i^{(j)} - C_j \|^2$ is the squared Euclidean distance between the ith data point and the jth cluster center. The algorithm first assigns the data at random to K clusters having K random centroids. These centroids should be placed carefully, because the initial guess can affect the final results; they should be placed as far as possible from each other. The distance of each data point from each of the K centers is then calculated, and each point is assigned to the cluster whose center is at the minimum distance from it. A new cluster center is then calculated for each cluster using Eq. 3:

$$Z_j = \frac{1}{n_j} \sum_{\forall Y_p \in C_j} Y_p \tag{3}$$

where $n_j$ is the total number of members in the jth cluster (in Eq. 3, $C_j$ denotes the set of data points assigned to the jth cluster and $Z_j$ its updated center). The data points are again assigned to the cluster with the nearest centroid, and the process is repeated until no further change in the cluster centroids is found. Although the K-means algorithm always converges, it is very sensitive to the initial guess of the centroids and may therefore become trapped in a local minimum instead of the global minimum. To address this problem, a replicate parameter is used, which runs the algorithm repeatedly and selects the best result (Dalton et al. 2009). The most commonly used distances are the Euclidean, correlation, city block and cosine distances.

3.2 Cluster Validation Indices for K-means Clustering

Cluster validity indices are used to examine the optimum number of clusters and the quality of the clusters formed. Some commonly used indices are the Dunn index (Dunn 1974b), the average silhouette width (Rousseeuw 1987) and the Davies and Bouldin index (Davies and Bouldin 1979). In this study the average silhouette width and the Dunn index are used for validation of clusters.
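The K-means procedure of Sect. 3.1, including the replicate runs, can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in the study: the function names are hypothetical, the initial centroids are drawn as K randomly chosen data points (an assumption; the text only requires a random start), and squared Euclidean distance is used throughout.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, rng=None):
    rng = rng or random.Random(0)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:          # no change: converged
            break
        labels = new_labels
        # Eq. 3: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                   # an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Eq. 2: value of the objective function for the final partition.
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, cost


def kmeans_replicates(points, k, replicates=10, seed=0):
    """Run K-means several times and keep the run with the lowest Eq. 2 value."""
    rng = random.Random(seed)
    runs = (kmeans(points, k, rng=rng) for _ in range(replicates))
    return min(runs, key=lambda run: run[2])


# Hypothetical normalized attribute vectors forming two obvious groups.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
labels, centers, cost = kmeans_replicates(points, k=2)
```

Keeping the run with the smallest objective value is one simple way to reduce the effect of a poor initial guess; it does not guarantee that the global minimum is found.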

The silhouette value can be given as

$$S(i) = \frac{b(i) - a(i)}{\max\left(a(i),\, b(i)\right)} \tag{4}$$

where

a(i)   average dissimilarity of the ith data point with all other data points within the same cluster.
b(i)   lowest average dissimilarity of the ith data point with any other cluster to which the ith data point does not belong. The cluster with the lowest average dissimilarity is said to be the neighboring cluster of the ith data point.

The Dunn index, as given by Dunn (1974b), is

$$C = \frac{d_{\min}}{d_{\max}} \tag{5}$$

where

d_min   minimum distance between points of different clusters.
d_max   largest distance between points of the same cluster.
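Both indices of Eqs. 4 and 5 can be computed directly from a hard partition. The sketch below uses hypothetical function names and Euclidean distance as the dissimilarity; a point that is alone in its cluster is given a silhouette value of 0, which is an assumption of this sketch. At least two clusters are required, and at least one cluster must contain two or more points.

```python
import math


def dist(p, q):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_silhouette(points, labels):
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            continue                      # singleton cluster: S(i) taken as 0
        a = sum(own) / len(own)           # a(i): mean distance within own cluster
        b = min(                          # b(i): mean distance to nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        total += (b - a) / max(a, b)      # Eq. 4
    return total / len(points)


def dunn_index(points, labels):
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    d_min = min(dist(points[i], points[j]) for i, j in pairs if labels[i] != labels[j])
    d_max = max(dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j])
    return d_min / d_max                  # Eq. 5
```

Larger values of both indices indicate more compact and better separated clusters, so the number of clusters can be chosen by evaluating them for several candidate values of K.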

3.3 Fuzzy c-means (FCM)

The fuzzy c-means algorithm was proposed by Dunn (1974a) and extended by Bezdek (Bezdek et al. 1984). For a data set having M objects in c classes, let $Y_k$ be the data vector of the kth object, k = 1, 2, …, M. The fuzzy c-means algorithm aims to minimize the objective function given in Eq. 6:

$$J(U, C) = \sum_{k=1}^{M} \sum_{i=1}^{c} u_{ik}^{\alpha} \left\| Y_k - C_i \right\|^2 \tag{6}$$

in which $u_{ik}$ is the membership value of the kth data point in the ith cluster, $\| Y_k - C_i \|^2$ is the squared Euclidean distance between data vector k and the center of the ith cluster, $C_i$ is the center of the ith cluster and α is generally called the fuzzifier, which can have any value greater than 1. In general, its value is set between 1 and 2.5 (Pal and Bezdek 1995).

3.4 Fuzzy c-means Algorithm Steps

1. Initially, a value for the number of clusters is assumed and the data vectors of the cluster centers are initialized randomly.

2. Then, the membership matrix is calculated using Eq. 7:

$$u_{ik}^{t+1} = \left[ \sum_{j=1}^{c} \left( \frac{\left\| Y_k - C_i \right\|}{\left\| Y_k - C_j \right\|} \right)^{\frac{2}{\alpha - 1}} \right]^{-1} \tag{7}$$

where i = 1, 2, …, c and k = 1, 2, …, M.

3. Using the updated membership values and Eq. 8, the new cluster centers are calculated:

$$C_i = \frac{\sum_{k=1}^{M} u_{ik}^{\alpha} \, Y_k}{\sum_{k=1}^{M} u_{ik}^{\alpha}} \tag{8}$$
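The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: function names are hypothetical, the initial centers are randomly chosen data points, the iteration stops when the largest change in any membership value falls below `eps`, and a data point that coincides with one or more centers shares its membership equally among them to avoid division by zero in Eq. 7.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def update_memberships(points, centers, alpha):
    """Eq. 7: u[i][k] is the membership of data point k in cluster i."""
    c = len(centers)
    u = [[0.0] * len(points) for _ in range(c)]
    for k, y in enumerate(points):
        d2 = [sq_dist(y, ctr) for ctr in centers]
        zeros = [i for i in range(c) if d2[i] == 0.0]
        for i in range(c):
            if zeros:  # assumption: point sitting on a center belongs fully to it
                u[i][k] = 1.0 / len(zeros) if i in zeros else 0.0
            else:
                # (d_i / d_j)^(2/(alpha-1)) equals (d_i^2 / d_j^2)^(1/(alpha-1))
                u[i][k] = 1.0 / sum((d2[i] / d2[j]) ** (1.0 / (alpha - 1.0))
                                    for j in range(c))
    return u


def update_centers(points, u, alpha):
    """Eq. 8: each center is the mean of the data weighted by u^alpha."""
    dim = len(points[0])
    centers = []
    for row in u:
        w = [m ** alpha for m in row]
        total = sum(w)
        centers.append([sum(wk * y[d] for wk, y in zip(w, points)) / total
                        for d in range(dim)])
    return centers


def fuzzy_c_means(points, c, alpha=2.0, eps=1e-3, max_iter=300, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, c)]   # step 1
    u = update_memberships(points, centers, alpha)        # step 2
    for _ in range(max_iter):
        centers = update_centers(points, u, alpha)        # step 3
        new_u = update_memberships(points, centers, alpha)
        change = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:                                   # assumed stopping rule
            break
    return u, centers


# Hypothetical normalized attribute vectors forming two obvious groups.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
u, centers = fuzzy_c_means(points, c=2)
```

Each column of the returned membership matrix sums to 1, so a station can be assigned to the cluster in which its membership is largest when a hard partition is required.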


3.5 Parameters of the FCM Algorithm

The results of the FCM algorithm are very sensitive to certain parameters. The number of clusters, c, the value of the fuzzifier, α, and the stopping criterion, є, are some of the parameters which control the FCM algorithm. The values of these parameters should be chosen carefully for good clustering results.

3.5.1 Number of Clusters

The number of clusters, c, has more influence on the partitioning than the other parameters. The optimal value of the objective function also decreases with an increase in the number of clusters (Rao and Srinivas 2006b). It is therefore very important to choose the optimal number of clusters in order to obtain well separated and compact clusters. FCM tries to divide the data into well separated and compact clusters. To address this issue, Bezdek (1981) introduced the concept of cluster validity. Validity measures generally assess the goodness of the obtained partition, and a number of validity indices have been proposed in the literature. For the FCM algorithm, Partition Coefficient (Bezdek 1974a), Partition Entropy (Bezdek 1974b), Extended Xie-Beni index (Xie and Beni 1991), Fukuyama-Sugeno index (Fukuyama and Sugeno 1989) and Kwon index (Kwon 1998) have been found to perform well in practice.

i) Partition Coefficient (VPC)

$$V_{PC}(U) = \frac{1}{M} \sum_{i=1}^{c} \sum_{k=1}^{M} u_{ik}^{2} \tag{9}$$

ii) Partition Entropy (VPE)

$$V_{PE}(U) = -\frac{1}{M} \left[ \sum_{i=1}^{c} \sum_{k=1}^{M} u_{ik} \log_a \left( u_{ik} \right) \right] \tag{10}$$

VPC may take values between 1/c and 1. The maximum value of VPC indicates good clustering. VPC = 1 indicates that there is no membership sharing between clusters, i.e. each data point belongs to exactly one cluster, and VPC = 1/c indicates equally shared clusters, i.e. the memberships of each data point in all the clusters are the same ($u_{ik} = 1/c \;\forall\, i, k$). VPE may take values between 0 and $\log_a(c)$. VPE = 0 indicates no membership sharing between clusters, and VPE = $\log_a(c)$ indicates equally shared clusters ($u_{ik} = 1/c \;\forall\, i, k$). The minimum value of VPE represents good clustering. VPC and VPE are not directly related to any property of the data (Xie and Beni 1991). VPC generally shows a monotonic decreasing tendency with an increase in the number of clusters, while VPE exhibits a monotonic increasing tendency with an increase in the number of clusters (Rao and Srinivas 2006b). Also, VPC and VPE are very sensitive to the value of the fuzzifier, α, as α→1 and α→∞ (Halkidi et al. 2001).

iii) Fukuyama and Sugeno index

$$V_{FS}(U, C; Y) = \sum_{k=1}^{M} \sum_{i=1}^{c} u_{ik}^{\alpha} \left\| C_i - Y_k \right\|_A^2 - \sum_{k=1}^{M} \sum_{i=1}^{c} u_{ik}^{\alpha} \left\| C_i - \bar{C} \right\|_A^2 \tag{11}$$


A minimum value of VFS indicates compact and well separated clusters; in other words, it indicates optimal partitioning.

iv) Extended Xie-Beni Index

Xie and Beni (1991) proposed a validity measure and extended it to account for the value of the fuzzifier. Pal and Bezdek (1995) called this index the Extended Xie-Beni index.

$$V_{XB,m}(U, C; Y) = \frac{\sum_{i=1}^{c} \sum_{k=1}^{M} u_{ik}^{\alpha} \left\| C_i - Y_k \right\|^2}{M \, \min_{i \neq j} \left\| C_i - C_j \right\|^2} \tag{12}$$

A minimum value of VXB,m indicates optimal clustering.

v) Kwon Index

VXB has a monotonically decreasing tendency as c→M (Kwon 1998). To address this issue, Kwon (1998) provided a new cluster validity index, VK, which has an ad hoc punishing function in the numerator.

$$V_K(U, C; Y) = \frac{\sum_{i=1}^{c} \sum_{k=1}^{M} u_{ik}^{\alpha} \left\| C_i - Y_k \right\|^2 + \frac{1}{c} \sum_{i=1}^{c} \left\| C_i - \bar{C} \right\|^2}{\min_{i \neq j} \left\| C_i - C_j \right\|^2} \tag{13}$$
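The five validity indices of Eqs. 9 to 13 can be evaluated from a membership matrix, the cluster centers and the data, as in the following sketch. The function names are hypothetical. Two assumptions are made because the text does not fix them: the norm-inducing matrix A is taken as the identity (plain Euclidean distance), and the mean vector appearing in the Fukuyama-Sugeno and Kwon indices is taken as the mean of the cluster centers.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance (assumption: A is the identity matrix)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def partition_coefficient(u):
    """Eq. 9; u[i][k] is the membership of point k in cluster i."""
    return sum(x * x for row in u for x in row) / len(u[0])


def partition_entropy(u, base=math.e):
    """Eq. 10; zero memberships contribute nothing (limit of x log x)."""
    return -sum(x * math.log(x, base) for row in u for x in row if x > 0.0) / len(u[0])


def _compactness(u, centers, points, alpha):
    return sum(m ** alpha * sq_dist(ctr, y)
               for row, ctr in zip(u, centers) for m, y in zip(row, points))


def _mean_center(centers):
    # Assumption: the mean vector is the average of the cluster centers.
    return [sum(col) / len(centers) for col in zip(*centers)]


def _min_separation(centers):
    return min(sq_dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])


def fukuyama_sugeno(u, centers, points, alpha=2.0):
    """Eq. 11: compactness minus membership-weighted separation."""
    c_bar = _mean_center(centers)
    spread = sum(m ** alpha * sq_dist(ctr, c_bar)
                 for row, ctr in zip(u, centers) for m in row)
    return _compactness(u, centers, points, alpha) - spread


def extended_xie_beni(u, centers, points, alpha=2.0):
    """Eq. 12."""
    return _compactness(u, centers, points, alpha) / (len(points) * _min_separation(centers))


def kwon_index(u, centers, points, alpha=2.0):
    """Eq. 13: Xie-Beni numerator plus a penalty on the spread of the centers."""
    c_bar = _mean_center(centers)
    penalty = sum(sq_dist(ctr, c_bar) for ctr in centers) / len(centers)
    return (_compactness(u, centers, points, alpha) + penalty) / _min_separation(centers)


# Hypothetical crisp partition of four points into two clusters.
points = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
centers = [[0.0, 0.5], [4.0, 0.5]]
crisp = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
```

In practice the indices would be evaluated for a range of candidate values of c, with the memberships and centers taken from an FCM run for each value, and the value of c that maximizes VPC or minimizes the other four indices would be retained.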

3.5.2 Fuzziness Parameter

In the FCM algorithm the fuzzifier, α, controls the extent of fuzziness in the results. A large value of α gives a fuzzier partition and a small value of α gives a less fuzzy partition. In the limit α = 1 the partition becomes hard or crisp ($u_{ik} \in \{0, 1\}$), i.e. each data point belongs to only one cluster, and as α → ∞ the partition becomes completely fuzzy ($u_{ik} = 1/c$), i.e. each data point has equal membership (1/c) in every cluster. Usually, α is taken in the range [1.5, 2.5] (Pal and Bezdek 1995).

3.5.3 Stopping Criterion

The FCM algorithm stops the iterations when the change between two successive iterations falls below the threshold є. Generally є is taken as 0.001.

4 L-Moment for the Data Samples

L-moments can be considered as another system for describing the shape of a probability distribution (Hosking and Wallis 1997). L-moments were developed by modification of the “probability weighted moments” of Greenwood et al. (1979). The sample probability weighted moments, as defined by Greenwood et al. (1979), can be given as

$$b_0 = n^{-1} \sum_{j=1}^{n} x_j \tag{14}$$

$$b_r = n^{-1} \sum_{j=r+1}^{n} \frac{(j-1)(j-2)\cdots(j-r)}{(n-1)(n-2)\cdots(n-r)} \, x_j \tag{15}$$

where $x_j$ is the jth value of the sample of size n arranged in ascending order.

L-moments are summary statistics for probability distributions and data samples. Like ordinary moments, L-moments provide measures of location, dispersion, skewness, peakedness and other features of the shape of probability distributions or data samples, but they are computed from linear combinations of the ordered data values (Hosking 1990). L-moments are specific linear combinations of probability weighted moments. The first few L-moments can be defined as

$$l_1 = b_0 \tag{16}$$

$$l_2 = 2b_1 - b_0 \tag{17}$$

$$l_3 = 6b_2 - 6b_1 + b_0 \tag{18}$$

$$l_4 = 20b_3 - 30b_2 + 12b_1 - b_0 \tag{19}$$

The coefficients are those of the shifted Legendre polynomials. The first L-moment represents the sample mean, a measure of location, while the second L-moment is a measure of the dispersion of the data values about their mean. L-moment ratios are calculated by dividing the higher order L-moments by the dispersion measure:

$$t_r = l_r / l_2 \tag{20}$$

where $t_r$ are dimensionless quantities independent of the units of measurement of the data. $t_3$ is generally called L-skewness; it is a measure of the skewness of the data samples about their mean. $t_4$ is generally called L-kurtosis; it is a measure of kurtosis. L-CV is similar to the coefficient of variation and can be defined as

$$t = l_2 / l_1 \tag{21}$$
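Equations 14 to 21 translate directly into a short computation. The following Python sketch uses hypothetical function names and assumes a sample of at least four values, sorted in ascending order before the weights of Eq. 15 are applied.

```python
def sample_pwms(data, order=4):
    """Sample probability weighted moments b_0 ... b_{order-1} (Eqs. 14 and 15)."""
    x = sorted(data)                      # ordered sample, ascending
    n = len(x)
    b = []
    for r in range(order):
        total = 0.0
        for j in range(r + 1, n + 1):     # j is the 1-based rank
            weight = 1.0
            for m in range(1, r + 1):
                weight *= (j - m) / (n - m)
            total += weight * x[j - 1]
        b.append(total / n)
    return b


def l_moment_ratios(data):
    """First two L-moments and the ratios L-CV, L-skewness, L-kurtosis."""
    b0, b1, b2, b3 = sample_pwms(data)
    l1 = b0                                         # Eq. 16
    l2 = 2 * b1 - b0                                # Eq. 17
    l3 = 6 * b2 - 6 * b1 + b0                       # Eq. 18
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0           # Eq. 19
    return {"l1": l1, "l2": l2, "t": l2 / l1, "t3": l3 / l2, "t4": l4 / l2}


# Hypothetical, evenly spaced sample: symmetric, so L-skewness is zero.
ratios = l_moment_ratios([1.0, 2.0, 3.0, 4.0, 5.0])
```

For this evenly spaced sample the first L-moment equals the mean (3), the second L-moment equals 1, and both L-skewness and L-kurtosis vanish, which gives a quick check of the coefficients in Eqs. 16 to 19.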

5 Discordancy and Regional Homogeneity Test

To examine the homogeneity of a region formed by clustered data, Hosking and Wallis (1993) proposed a discordancy measure, a homogeneity measure and a goodness of fit measure.

5.1 Discordancy Measure

For preliminary screening of the data, Hosking and Wallis (1993) proposed a discordancy measure (Di). It identifies those sites whose at-site L-moments are very different from those of the other sites in the region.


$$D_i = \frac{1}{3} \left( u_i - \bar{u} \right)^T A^{-1} \left( u_i - \bar{u} \right) \tag{22}$$

In this expression A is the sample covariance matrix, which is given by

$$A = \frac{1}{N-1} \sum_{i=1}^{N} \left( u_i - \bar{u} \right) \left( u_i - \bar{u} \right)^T \tag{23}$$

where $u_i = [t^{(i)}, t_3^{(i)}, t_4^{(i)}]^T$ is the vector containing the L-moment ratios of site i, $\bar{u}$ is the mean of the vectors $u_i$ and N is the number of sites.

Hosking and Wallis (1993) proposed that a station is not homogeneous with the region to which it belongs if the value of its discordancy measure Di is greater than a certain critical value (for more than 15 stations the critical value is 3). Di can also be used for identifying sites within a large geographical area which have gross errors in their data.

5.2 Heterogeneity Measure

The heterogeneity measure is used to examine the degree of heterogeneity within a region. It compares the between-site variation in sample L-moments with what would be expected for a homogeneous region.

(1) Heterogeneity measure based on L-CV

$$H_1 = \frac{V - \mu_V}{\sigma_V} \tag{24}$$

(2) A measure based on L-CV and L-skewness

$$H_2 = \frac{V_2 - \mu_{V_2}}{\sigma_{V_2}} \tag{25}$$

(3) A measure based on L-skewness and L-kurtosis

$$H_3 = \frac{V_3 - \mu_{V_3}}{\sigma_{V_3}} \tag{26}$$

In these measures, V denotes the weighted standard deviation of the at-site sample L-CV; V2 represents the weighted average distance from the site to the group weighted mean in the two-dimensional space of L-CV and L-skewness; V3 refers to the weighted average distance from the site to the group weighted mean in the two-dimensional space of L-skewness and L-kurtosis; μV, μV2 and μV3 denote the means of the Nsim values of V, V2 and V3, respectively; and σV, σV2 and σV3 represent the standard deviations of the Nsim values of V, V2 and V3, respectively. In this study the number of simulations, Nsim, is chosen as 500.
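The screening statistics of this section can be sketched numerically. The block below computes the discordancy measure of Eqs. 22 and 23 and the three weighted dispersion statistics V, V2 and V3 (Eqs. 27 to 29) from at-site L-moment ratios. Function names are hypothetical, and the simulation step that yields the means and standard deviations of the Nsim values is not reproduced here: `heterogeneity` simply assumes those two numbers are supplied. The discordancy computation needs enough sites for the covariance matrix to be invertible.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x


def discordancy(ratios):
    """Eqs. 22 and 23; `ratios` holds one [t, t3, t4] vector per site."""
    n = len(ratios)
    mean = [sum(u[d] for u in ratios) / n for d in range(3)]
    dev = [[u[d] - mean[d] for d in range(3)] for u in ratios]
    cov = [[sum(v[p] * v[q] for v in dev) / (n - 1) for q in range(3)]
           for p in range(3)]
    # (u_i - mean)^T A^{-1} (u_i - mean) / 3, using a solve instead of an inverse
    return [sum(a * b for a, b in zip(v, solve(cov, v))) / 3.0 for v in dev]


def dispersion_statistics(ratios, record_lengths):
    """Record-length weighted V, V2 and V3 about the regional average ratios."""
    total = sum(record_lengths)
    reg = [sum(n * u[d] for n, u in zip(record_lengths, ratios)) / total
           for d in range(3)]
    pairs = list(zip(record_lengths, ratios))
    v = math.sqrt(sum(n * (u[0] - reg[0]) ** 2 for n, u in pairs) / total)
    v2 = sum(n * math.hypot(u[0] - reg[0], u[1] - reg[1]) for n, u in pairs) / total
    v3 = sum(n * math.hypot(u[1] - reg[1], u[2] - reg[2]) for n, u in pairs) / total
    return v, v2, v3


def heterogeneity(v_obs, mu_v, sigma_v):
    """Eqs. 24 to 26; mu_v and sigma_v are assumed to come from the simulations."""
    return (v_obs - mu_v) / sigma_v
```

A useful check on the discordancy computation is that, with the covariance matrix defined as in Eq. 23, the values of Di over all N sites sum to N − 1.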


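$$V = \left\{ \frac{\sum_{i=1}^{N_R} n_i \left( t^{(i)} - t^{R} \right)^2}{\sum_{i=1}^{N_R} n_i} \right\}^{1/2} \tag{27}$$

$$V_2 = \frac{\sum_{i=1}^{N_R} n_i \left\{ \left( t^{(i)} - t^{R} \right)^2 + \left( t_3^{(i)} - t_3^{R} \right)^2 \right\}^{1/2}}{\sum_{i=1}^{N_R} n_i} \tag{28}$$

$$V_3 = \frac{\sum_{i=1}^{N_R} n_i \left\{ \left( t_3^{(i)} - t_3^{R} \right)^2 + \left( t_4^{(i)} - t_4^{R} \right)^2 \right\}^{1/2}}{\sum_{i=1}^{N_R} n_i} \tag{29}$$

In these equations, $n_i$ denotes the record length of peak flows at gauging station i, whose sample L-moment ratios L-CV, L-skewness and L-kurtosis are denoted by $t^{(i)}$, $t_3^{(i)}$ and $t_4^{(i)}$, respectively. The number of sites in the region is $N_R$, and $t^{R}$, $t_3^{R}$ and $t_4^{R}$ refer to the weighted regional average L-CV, L-skewness and L-kurtosis, respectively. A region can be regarded as ‘acceptably homogeneous’ if HM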

Identification of Homogeneous Rainfall Regimes in Northeast Region of India using Fuzzy Cluster Analysis Manish Kumar Goyal & Vivek Gupta

Received: 30 January 2014 / Accepted: 26 May 2014 # Springer Science+Business Media Dordrecht 2014

Abstract Regionalization methods are often used in hydrology for frequency analysis of floods. The hydrologically homogeneous regions should be determined using cluster analysis instead of the geographically close stations. In view of the ongoing environmental and climate changes in the Northeastern of India, regionalization of homogeneous rainfall region is essential to lay out an effective flood frequency analysis of this region. The choice of appropriate cluster approach used according to the data of the basin is also significant. In the context of this study, total precipitation data of stations operated by Indian Meteorological Department (IMD) in Northeastern of India basins for cluster analysis are used. Further, five cluster validity indices, namely Partition Coefficient, Partition Entropy, Extended Xie-Beni index, Fukuyama-Sugeno index and Kwon index have been tested to determine the effectiveness in identifying optimal partition provided by the fuzzy c mean clustering algorithm (FCM). A comparison is also performed using K- Mean clustering algorithm. Additionally, regional homogeneity tests based on L-moments approach are used to check homogeneity of regions identified by both cluster analysis approaches. It was concluded that regional homogeneity test results show that regions defined by FCM method are sufficiently homogeneous for regional frequency analysis. Keywords Regionalization . Clustering . Flood frequency analysis

1 Introduction It is necessary to estimate frequencies and magnitude of extreme events such as floods. There is paucity of data at many of sites of interest in India. Therefore, at-site frequency analysis may give erroneous results. To solve this issue, several studies suggested dividing the whole catchment in some homogeneous regions (Rao and Srinivas 2006a; Lin and Chen 2003). In past, geographical, political and physiographical boundaries are the basis of making the region (Natural Environmental Research 1975; Beable and McKercher 1982; Matlas et al. 1975), but the main disadvantage of using political or physiographic region as homogeneous region is that it does not generally result in hydrologically homogeneous region (Burn 1997; Dikbas M. K. Goyal (*) : V. Gupta Dept of Civil Engineering, Indian Institute of Technology, Guwahati, India e-mail: [email protected]

M.K. Goyal, V. Gupta

et al. 2012) which may lead to less reliable statistical analysis. This made identification of homogeneous regions a significant issue to allow the analysis of regional aspects. In recent years north eastern India has undergone lot of environmental changes due to various development and urbanization activities, so it is necessary to examine the parameters related to ecological and environmental equilibrium. One of the most important parameters for the hydrological regime is rainfall, which needs to be studied in space and time. However, the real issue is to understand the nature of the rainfall distribution and variability on a local scale properly (Venkatesh and Jose 2007). The varied physiological features and altitudinal differences in Northeast India gives rise to various type of climate ranging from near tropical to temperate and alpine, which make rainfall features as irregular and complex with respect to time and space (Das et al. 2009). Due to such deviation in the precipitation pattern, homogeneity in rainfall distribution cannot be expected. Thus, it becomes essential to recognize several homogeneous regions of similar rainfall distribution (Venkatesh and Jose 2007; Dikbas et al. 2012). The procedure of identifying a homogeneous region is called as ‘regionalization’ and the frequency analysis, based on these homogeneous regions, is called ‘regional frequency analysis’ (RFA). The approaches developed for regionalization flood frequency analysis (RFFA) such as (i) Method of residuals (MOR), (ii) The canonical correlation analysis (CCA), (iii) The region of influence (ROI) (Burn 1990a,b), (iv) hierarchical approach (Gabriele and Aenell 1991), (V) cluster analysis (Rao and Srinivas 2006a). The MOR approach uses the positive and the negative signs of residuals of regional regression model for flood quantile relating to the characteristics of watersheds at each gauged site. 
In this method, the regions are often arranged to be coincident with recognized geographic and/or meteorological boundaries, political or administrative areas. Bhaskar and O’Connor (1989) compared MOR with cluster analysis and it was found that regions formed by MOR method were close to geographical boundaries but were different from regions formed by the cluster analysis. Regions formed by cluster analysis were more similar in terms of their hydrological behavior. Canonical correlation analysis (CCA) (Cavadias 1989, 1990) represents drainage basins as points in the spaces of pairs of flood-related uncorrelated canonical variables and pairs of basin-related uncorrelated canonical variables. Similar patterns of these points are considered as ‘regions’. The results of CCA based approach depend on at-site estimates of extreme quantiles and because of paucity of flood data reliable estimation of extreme quantiles is not possible. (Hosking and Wallis 1997, p. 147). In ROI approach (Burn 1990a,b) each site has its own region. All those sites which are having a distance less than threshold from the site can be considered in the region of that site. Distance is generally calculated in weighted multi-dimensional attribute space. Choice of weights of attributes and sites is a significant problem as no exact mathematical solution is available (Bobee and Rasmussen 1995). Cluster analysis is unsupervised multivariate analysis which classifies the given data in to similar overlapping or non-overlapping groups. Classification of clustering algorithms is shown in Fig. 1. The K-means clustering assigns all feature vectors to various clusters which are having non-overlapping boundaries between them, if a feature vector belongs to a cluster then it will have a degree of membership 1else 0 for that cluster (Rao and Srinivas 2006a; Dikbas et al. 2012). Fuzzy clustering algorithm permits a feature vector to belong to all the clusters simultaneously with a certain degree of membership. 
The value of fuzzy membership of a feature vector in a cluster specifies the strength with which it belongs to the cluster (Rao and Srinivas 2006b).

Identification of homogeneous rainfall regimes in Northeast India

Fig. 1 Classification of clustering algorithms

The objective of this study is to identifying the homogeneous rainfall region in northeast India by fuzzy and K-means clustering analysis and then, examining the homogeneity of formed region by an approach based on L-moment as elucidated by Hosking and Wallis (1997). Several cluster validation measures were also evaluated for determining optimal partition in fuzzy c-means and K-means algorithm. Five cluster validity indices, namely Partition Coefficient, Partition Entropy, Extended Xie-Beni index, Fukuyama-Sugeno index and Kwon index have been tested for determining optimum number of clusters in fuzzy clustering algorithms while Dunn index and average silhouette width are used as cluster validity indices for K- means clustering.

2 Study Area & Data The NE India region stretches between 21°50’ and 29°34’ N latitude and 85°34’ and 97°50’ E longitude and has total geographical area of 26.2 million hectares, which contains 8 % of total area of the country. Out of the total geographical area of Northeast India, 28.3 % has an elevation more than 1200 m, 17.9 % between 600 and 1200 m and about 10.8 % between 300 m and 600 m above mean sea level. On an average, the NE region receives about 2450 mm of rainfall. The Cherrapunji-Mawsynram range, located in NE India receives rainfall as high as 11,500 mm, annually (Das et al. 2009; Dash et al. 2012). A total of 68 gauging stations with observation period of 102 years were selected. Latitude, longitude, altitude, average total annual rainfall, coefficient of variation, maximum average annual rainfall and minimum average annual rainfall data of the stations were used in the cluster analysis as shown in Table 1. Variable with different units generally influence the clustering results so literature suggest the data to normalize with appropriate transformation functions (Cannarozzo et al. 2009; Lim and Voeller 2009; Dikbas et al. 2012). The data were normalized by using following transformation functions before being used in cluster analysis (Dikbas et al. 2012) X Nij ¼

X ij− X i;min X i;max −X i;min

ð1Þ

Where Xij is the ith attribute of jth station; Xi,min is the minimum ith attribute in all stations; Xi,max is the maximum ith attribute in all stations and XNij is normalized ith attribute of jth station.

M.K. Goyal, V. Gupta Table 1 Attributes considered in study Attribute

Range

Average annual precipitation (mm)

1602.10 to 4082.25

Coefficient of variation

−1817.62 to 1451.42

Maximum annual precipitation (mm)

1899.72 to 7272.82

Minimum annual precipitation (mm) Latitude

1117.907 to 2680.17 22.4833°N to 28.7°N

Longitude

88.2° E to 96.5° E

Altitude (m)

28.05 to 4544.568

3 Algorithm 3.1 K-means Clustering K-means (MacQueen 1967) is an unsupervised learning algorithm for solving multivariate classification problem. This algorithm aims at minimizing an objective function as given in Eq. 2. Xk XM

ð jÞ ð2Þ J¼

Y −C j 2 j¼1

i¼1

i

Where ‖Yi(j) −Cj‖ is squared Euclidean distance between ith data point and jth cluster center. In this algorithm primarily the data is assigned at random to K-clusters having K- random centroids. These centroids should be placed very carefully because this initial guess can affect the final results. So, centroid should be placed as far as possible from each other. Now distance of each data point from each of the K-center is calculated and each point is assigned to the cluster whose center is at a minimum distance from data point. Now new cluster center is calculated for each cluster by using Eq.3 1 X ð3Þ Y Zj ¼ p ∀Y p ∈C j nj Where nj is the total number of members in jth cluster. Again the data points are assigned to their nearest centroid cluster. This process is repeated until no further change in cluster centroid is found. Although K-means algorithm converges always but it is very sensitive to initial guess of centroids, because of that it may trap in local minima in place of global minima. For solving this problem replicate parameter is used that runs the algorithm again and again and selects the best results (Dalton et al. 2009). The most commonly used distances are the Euclidian, correlation, city block distance and Cosine. 3.2 Cluster Validation Indices for K-means Clustering For examining the optimum number of clusters and quality of the formed cluster, cluster validity indices are used. Some of commonly used indices are: Dunn Index (Dunn 1974b), Average Silhouette width (Rousseeuw 1987) and Davies and Bouldin index (Davies and Bouldin 1979). In this study we have used Average Silhouette Width and Dunn index for validation of clusters.


The silhouette value can be given as

$$S(i) = \frac{b(i) - a(i)}{\max\left(a(i),\, b(i)\right)} \tag{4}$$

where a(i) is the average dissimilarity of the ith data point with all other data points within the same cluster, and b(i) is the lowest average dissimilarity of the ith data point with any other cluster to which the ith data point does not belong. The cluster with the lowest average dissimilarity is said to be the neighboring cluster of the ith data point. The Dunn index, as given by Dunn (1974b), is

$$C = \frac{d_{\min}}{d_{\max}} \tag{5}$$

where $d_{\min}$ is the minimum distance between points of different clusters and $d_{\max}$ is the largest distance between points of the same cluster.
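As a hedged illustration of Eqs. 4 and 5, the two indices can be computed from a labelled data set as sketched below. The function names are hypothetical, the Euclidean distance is assumed as the dissimilarity, and a silhouette value of 0 is assigned to points in single-member clusters (a common convention that the text above does not discuss).

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_silhouette_width(data, labels):
    """Mean of S(i) (Eq. 4) over all data points; needs at least two clusters."""
    clusters = sorted(set(labels))
    widths = []
    for i, (y, lab) in enumerate(zip(data, labels)):
        own = [dist(y, z) for j, (z, l) in enumerate(zip(data, labels))
               if l == lab and j != i]
        if not own:                    # single-member cluster (assumed S(i) = 0)
            widths.append(0.0)
            continue
        a = sum(own) / len(own)        # a(i): mean distance within own cluster
        b = min(                       # b(i): mean distance to neighboring cluster
            sum(dist(y, z) for z, l in zip(data, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

def dunn_index(data, labels):
    """Eq. 5: smallest between-cluster distance over largest within-cluster distance."""
    n = len(data)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    d_min = min(dist(data[i], data[j]) for i, j in pairs if labels[i] != labels[j])
    d_max = max(dist(data[i], data[j]) for i, j in pairs if labels[i] == labels[j])
    return d_min / d_max
```

For both indices, larger values point to more compact and better separated clusters, so candidate values of K can be compared by evaluating each partition in turn.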

3.3 Fuzzy c-means (FCM)

The fuzzy c-means algorithm was proposed by Dunn (1974a) and extended by Bezdek et al. (1984). For a data set having M objects in c classes, let $Y_k$ be the data vector for the kth object, k = 1, 2, …, M. The fuzzy c-means algorithm aims to minimize the objective function given in Eq. 6:

$$J(U, C) = \sum_{k=1}^{M} \sum_{i=1}^{c} u_{ik}^{\alpha} \left\| Y_k - C_i \right\|^2 \tag{6}$$

where $u_{ik}$ is the membership value of the kth data point in the ith cluster, $\| Y_k - C_i \|^2$ is the squared Euclidean distance between data vector k and the center of the ith cluster, $C_i$ is the center of the ith cluster, and α is generally called the fuzzifier; it can take any value greater than 1. In general, its value is set between 1.5 and 2.5 (Pal and Bezdek 1995).

3.4 Fuzzy c-means Algorithm Steps

1. Initially, the number of clusters is chosen and the data vectors of the cluster centers are assumed randomly.

2. The membership matrix is then calculated using Eq. 7:

$$u_{ik}^{(t+1)} = \left[ \sum_{j=1}^{c} \left( \frac{\| Y_k - C_i \|}{\| Y_k - C_j \|} \right)^{\frac{2}{\alpha - 1}} \right]^{-1} \tag{7}$$

where i = 1, 2, …, c and k = 1, 2, …, M.

3. Using the updated membership values and Eq. 8, the new cluster centers are calculated:

$$C_i = \frac{\sum_{k=1}^{M} u_{ik}^{\alpha}\, Y_k}{\sum_{k=1}^{M} u_{ik}^{\alpha}} \tag{8}$$
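The steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the function names and the `init_centers` argument are hypothetical, memberships are stored as `u[k][i]` (data point k, cluster i), and the loop is assumed to stop when the largest change in any membership value falls below `eps`.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def memberships(data, centers, alpha):
    """Eq. 7. Returns u[k][i]: membership of data point k in cluster i."""
    expo = 2.0 / (alpha - 1.0)
    u = []
    for y in data:
        d = [dist(y, ctr) for ctr in centers]
        if min(d) == 0.0:
            # Point sits exactly on a center: share full membership among
            # the coinciding centers to avoid division by zero.
            hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
            u.append([h / sum(hits) for h in hits])
        else:
            u.append([1.0 / sum((di / dj) ** expo for dj in d) for di in d])
    return u

def update_centers(data, u, alpha):
    """Eq. 8: membership-weighted means of the data vectors."""
    centers = []
    for i in range(len(u[0])):
        w = [row[i] ** alpha for row in u]
        centers.append(tuple(sum(wk * yk for wk, yk in zip(w, col)) / sum(w)
                             for col in zip(*data)))
    return centers

def fcm(data, c, alpha=2.0, eps=1e-3, max_iter=100, init_centers=None, seed=0):
    """Alternate Eq. 7 and Eq. 8 until the memberships stabilise."""
    # Step 1: random initial centers unless the caller supplies them.
    centers = init_centers or random.Random(seed).sample(data, c)
    u = memberships(data, centers, alpha)
    for _ in range(max_iter):
        centers = update_centers(data, u, alpha)       # step 3
        u_new = memberships(data, centers, alpha)      # step 2
        change = max(abs(a - b)
                     for ra, rb in zip(u_new, u) for a, b in zip(ra, rb))
        u = u_new
        if change < eps:        # assumed stopping rule, see lead-in
            break
    return u, centers
```

Each row of the returned membership matrix sums to 1, so a crisp partition can be obtained afterwards by assigning every point to the cluster in which its membership is largest.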


3.5 Parameters of the FCM Algorithm

The results of the FCM algorithm are very sensitive to certain parameters. The number of clusters, c, the value of the fuzzifier, α, and the stopping criterion, є, are among the parameters that control the FCM algorithm. The values of these parameters should be chosen carefully for good clustering results.

3.5.1 Number of Clusters

The number of clusters, c, has more influence on the partitioning than the other parameters. Optimal value of number of clusters also decreases with increase in number of clusters (Rao and Srinivas 2006b). It is therefore very important to choose the optimal number of clusters in order to obtain the well-separated and compact clusters into which FCM tries to divide the data. To address this issue, Bezdek (1981) introduced the concept of cluster validity. Validity measures generally assess the goodness of the obtained partition. Consequently, a number of validity indices have been proposed in the literature. For the FCM algorithm, the Partition Coefficient (Bezdek 1974a), Partition Entropy (Bezdek 1974b), Extended Xie-Beni index (Xie and Beni 1991), Fukuyama-Sugeno index (Fukuyama and Sugeno 1989) and Kwon index (Kwon 1998) have been found to perform well in practice.

i) Partition Coefficient ($V_{PC}$)

$$V_{PC}(U) = \frac{1}{M} \sum_{i=1}^{c} \sum_{k=1}^{M} u_{ik}^{2} \tag{9}$$

ii) Partition Entropy ($V_{PE}$)

$$V_{PE}(U) = -\frac{1}{M} \left[ \sum_{i=1}^{c} \sum_{k=1}^{M} u_{ik} \log_a (u_{ik}) \right] \tag{10}$$

$V_{PC}$ takes values between 1/c and 1, and the maximum value of $V_{PC}$ indicates good clustering. $V_{PC} = 1$ indicates that there is no membership sharing between clusters, i.e. each data point belongs to exactly one cluster, whereas $V_{PC} = 1/c$ indicates equally shared clusters, i.e. the memberships of each data point in all the clusters are the same ($u_{ik} = 1/c \;\forall\; i, k$). $V_{PE}$ takes values between 0 and $\log_a(c)$. $V_{PE} = 0$ indicates no membership sharing between clusters, and $V_{PE} = \log_a(c)$ indicates equally shared clusters ($u_{ik} = 1/c \;\forall\; i, k$). The minimum value of $V_{PE}$ represents good clustering. $V_{PC}$ and $V_{PE}$ are not directly related to any property of the data (Xie and Beni 1991). $V_{PC}$ generally shows a monotonic decreasing tendency with an increase in the number of clusters, while $V_{PE}$ exhibits a monotonic increasing tendency with an increase in the number of clusters (Rao and Srinivas 2006b). Also, $V_{PC}$ and $V_{PE}$ are very sensitive to the value of the fuzzifier, α, as α→1 and α→∞ (Halkidi et al. 2001).

iii) Fukuyama and Sugeno index

$$V_{FS}(U, C : Y) = \sum_{k=1}^{M} \sum_{i=1}^{c} u_{ik}^{\alpha} \left\| C_i - Y_k \right\|_A^2 - \sum_{k=1}^{M} \sum_{i=1}^{c} u_{ik}^{\alpha} \left\| C_i - \bar{C} \right\|_A^2 \tag{11}$$


The minimum value of $V_{FS}$ indicates compact and well-separated clusters; in other words, it indicates optimal partitioning.

iv) Extended Xie-Beni Index

Xie and Beni (1991) proposed a validity measure and extended it for the value of the fuzzifier. Pal and Bezdek (1995) called this index the Extended Xie-Beni index.

$$V_{XB,m}(U, C : Y) = \frac{\sum_{i=1}^{c} \sum_{k=1}^{M} u_{ik}^{\alpha} \left\| C_i - Y_k \right\|^2}{M \, \min_{i \neq j} \left\| C_i - C_j \right\|^2} \tag{12}$$

The minimum value of $V_{XB,m}$ indicates optimal clustering.

v) Kwon Index

$V_{XB}$ has a monotonically decreasing tendency as c→M (Kwon 1998). To address this issue, Kwon (1998) provided a new cluster validity index, $V_K$, which has an ad hoc punishing function in the numerator:

$$V_K(U, C : Y) = \frac{\sum_{i=1}^{c} \sum_{k=1}^{M} u_{ik}^{\alpha} \left\| C_i - Y_k \right\|^2 + \frac{1}{c} \sum_{i=1}^{c} \left\| C_i - \bar{C} \right\|^2}{\min_{i \neq j} \left\| C_i - C_j \right\|^2} \tag{13}$$
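To make the behaviour of the validity indices more concrete, the sketch below evaluates the partition coefficient (Eq. 9), the partition entropy (Eq. 10) and the extended Xie-Beni index (Eq. 12) for a given membership matrix. It is an illustration only: the function names are hypothetical, memberships are stored as `u[k][i]` (data point k, cluster i), the natural logarithm is assumed as the default base a, terms with zero membership are taken to contribute zero to the entropy, and the separation term is taken as the minimum squared distance between two distinct cluster centers.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def partition_coefficient(u):
    """Eq. 9: mean of the squared memberships; ranges from 1/c to 1."""
    return sum(x * x for row in u for x in row) / len(u)

def partition_entropy(u, base=math.e):
    """Eq. 10: ranges from 0 (crisp) to log_a(c) (equally shared)."""
    total = sum(x * math.log(x, base) for row in u for x in row if x > 0.0)
    return -total / len(u)

def xie_beni(u, centers, data, alpha=2.0):
    """Eq. 12: compactness divided by M times the smallest center separation."""
    compactness = sum(u[k][i] ** alpha * sq_dist(centers[i], data[k])
                      for k in range(len(data))
                      for i in range(len(centers)))
    separation = min(sq_dist(ci, cj)
                     for a, ci in enumerate(centers)
                     for b, cj in enumerate(centers) if a != b)
    return compactness / (len(data) * separation)

# A crisp partition reaches the bounds V_PC = 1 and V_PE = 0, while a
# completely shared partition gives V_PC = 1/c and V_PE = log(c).
crisp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
shared = [[0.5, 0.5]] * 4
print(partition_coefficient(crisp), partition_entropy(crisp))
print(partition_coefficient(shared), partition_entropy(shared))
```

In practice such functions would be evaluated for each candidate number of clusters c, and the value of c would be chosen where the indices agree best, bearing in mind the monotonic tendencies noted above.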

3.5.2 Fuzziness Parameter

In the FCM algorithm, the fuzzifier, α, controls the extent of fuzziness in the results. A large value of α gives a fuzzier partition and a small value of α gives a less fuzzy partition. As α approaches 1, the partitioning becomes hard or crisp ($u_{ik} \in \{0, 1\}$), i.e. each data point can belong to only one cluster, and as α→∞ the partition becomes completely fuzzy ($u_{ik} = 1/c$), i.e. each data point has equal membership (1/c) in every cluster. Usually, α is taken in the range [1.5, 2.5] (Pal and Bezdek 1995).

3.5.3 Stopping Criterion

The FCM algorithm stops iterating when the change between successive iterations falls below the stopping criterion є. Generally, є is taken as 0.001.
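The effect of the fuzzifier described in Section 3.5.2 can be illustrated with a small worked example based on Eq. 7. The sketch below is only an illustration: the function name and the chosen distances are hypothetical. It takes a single data point whose distances to two cluster centers are 1 and 2 and evaluates its memberships for a small, a moderate and a large value of α.

```python
def point_memberships(distances, alpha):
    """Eq. 7 for one data point, given its distances to all cluster centers."""
    expo = 2.0 / (alpha - 1.0)
    return [1.0 / sum((di / dj) ** expo for dj in distances)
            for di in distances]

# Hypothetical point at distance 1 from the first center and 2 from the second.
for alpha in (1.1, 2.0, 100.0):
    values = point_memberships([1.0, 2.0], alpha)
    print(alpha, [round(v, 3) for v in values])
```

For α = 2 the memberships are 0.8 and 0.2. For α close to 1 the membership in the nearer cluster approaches 1 (an almost crisp assignment), and for very large α both memberships approach 1/c = 0.5, in line with the limiting behaviour described above.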

4 L-Moment for the Data Samples

L-moments can be considered as another system for describing the shape of a probability distribution (Hosking and Wallis 1997). L-moments were developed by modifying the “probability weighted moments” of Greenwood et al. (1979). The sample probability weighted moments, as defined by Greenwood et al. (1979), can be given as

$$b_0 = n^{-1} \sum_{j=1}^{n} x_j \tag{14}$$


$$b_r = n^{-1} \sum_{j=r+1}^{n} \frac{(j-1)(j-2)\cdots(j-r)}{(n-1)(n-2)\cdots(n-r)}\, x_j \tag{15}$$

L-moments are summary statistics for probability distributions and data samples. Similar to ordinary moments, L-moments provide measures of location, dispersion, skewness, peakedness and other features of the shape of probability distributions or data samples, but they are computed from linear combinations of the ordered data values (Hosking 1990). L-moments are specific linear combinations of probability weighted moments. The first few L-moments can be defined as

$$l_1 = b_0 \tag{16}$$

$$l_2 = 2b_1 - b_0 \tag{17}$$

$$l_3 = 6b_2 - 6b_1 + b_0 \tag{18}$$

$$l_4 = 20b_3 - 30b_2 + 12b_1 - b_0 \tag{19}$$

The coefficients are those of the shifted Legendre polynomials. The first L-moment is the sample mean, a measure of location, while the second L-moment is a measure of the dispersion of the data values about their mean. L-moment ratios are calculated by dividing the higher-order L-moments by the dispersion measure:

$$t_r = l_r / l_2 \tag{20}$$

where the $t_r$ are dimensionless quantities independent of the units of measurement of the data. $t_3$ is generally called L-skewness and is a measure of the skewness of the data sample about its mean; $t_4$ is generally called L-kurtosis and is a measure of kurtosis. L-CV is similar to the coefficient of variation and can be defined as

$$t = l_2 / l_1 \tag{21}$$
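The chain from Eqs. 14 and 15 through Eqs. 16 to 21 can be sketched as follows. This is an illustrative sketch with a hypothetical function name; it assumes that $x_j$ in Eqs. 14 and 15 denotes the jth smallest observation, so the sample is sorted in ascending order first, and that the sample contains at least four values with a non-zero mean and dispersion.

```python
def sample_l_moments(sample):
    """Sample L-moments and L-moment ratios from Eqs. 14 to 21."""
    x = sorted(sample)                 # ordered data values, smallest first
    n = len(x)

    def b(r):
        """Probability weighted moment b_r (Eq. 14 for r = 0, Eq. 15 otherwise)."""
        total = 0.0
        for j in range(r + 1, n + 1):  # j is the 1-based rank
            weight = 1.0
            for m in range(1, r + 1):
                weight *= (j - m) / (n - m)
            total += weight * x[j - 1]
        return total / n

    b0, b1, b2, b3 = b(0), b(1), b(2), b(3)
    l1 = b0                                    # Eq. 16
    l2 = 2 * b1 - b0                           # Eq. 17
    l3 = 6 * b2 - 6 * b1 + b0                  # Eq. 18
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0      # Eq. 19
    return {
        "l1": l1,
        "l2": l2,
        "t": l2 / l1,      # L-CV (Eq. 21)
        "t3": l3 / l2,     # L-skewness (Eq. 20 with r = 3)
        "t4": l4 / l2,     # L-kurtosis (Eq. 20 with r = 4)
    }
```

For the symmetric sample 1, 2, 3, 4, 5 this gives $l_1 = 3$, $l_2 = 1$, an L-CV of 1/3 and an L-skewness of 0, which is a convenient check on the weights in Eq. 15.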

5 Discordancy and Regional Homogeneity Test

To examine the homogeneity of the regions formed from the clustered data, Hosking and Wallis (1993) proposed a discordancy measure, a homogeneity measure and a goodness-of-fit measure.

5.1 Discordancy Measure

For preliminary screening of the data, Hosking and Wallis (1993) proposed a discordancy measure ($D_i$). It identifies those sites whose at-site L-moments differ markedly from those of the other sites in the region.


$$D_i = \frac{1}{3} (u_i - \bar{u})^T A^{-1} (u_i - \bar{u}) \tag{22}$$

In this expression, A is the sample covariance matrix, which is given by

$$A = \frac{1}{N - 1} \sum_{i=1}^{N} (u_i - \bar{u})(u_i - \bar{u})^T \tag{23}$$

where $u_i = [t^{(i)}, t_3^{(i)}, t_4^{(i)}]^T$ is the vector containing the L-moment ratios of site i.

$\bar{u}$ is the mean of the vectors $u_i$. Hosking and Wallis (1993) proposed that a station is not homogeneous with the region to which it belongs if the value of its discordancy measure $D_i$ is greater than a certain critical value (for more than 15 stations, the critical value is 3). $D_i$ can also be used to identify sites within a large geographical area that have gross errors in their data.

5.2 Heterogeneity Measure

The heterogeneity measure is used to examine the degree of heterogeneity within a region. It compares the between-site variation in L-moments with what would be expected for a homogeneous region.

(1) Heterogeneity measure based on L-CV:

$$H_1 = \frac{V - \mu_V}{\sigma_V} \tag{24}$$

(2) A measure based on L-CV and L-skewness:

$$H_2 = \frac{V_2 - \mu_{V_2}}{\sigma_{V_2}} \tag{25}$$

(3) A measure based on L-skewness and L-kurtosis:

$$H_3 = \frac{V_3 - \mu_{V_3}}{\sigma_{V_3}} \tag{26}$$

In these measures, $V$ denotes the weighted standard deviation of the at-site sample L-CV; $V_2$ represents the weighted average distance from the site to the group weighted mean in the two-dimensional space of L-CV and L-skewness; $V_3$ refers to the weighted average distance from the site to the group weighted mean in the two-dimensional space of L-skewness and L-kurtosis; $\mu_V$, $\mu_{V_2}$ and $\mu_{V_3}$ denote the means of the $N_{sim}$ simulated values of $V$, $V_2$ and $V_3$, respectively; and $\sigma_V$, $\sigma_{V_2}$ and $\sigma_{V_3}$ represent the standard deviations of the $N_{sim}$ simulated values of $V$, $V_2$ and $V_3$, respectively. In this study, the number of simulations, $N_{sim}$, is chosen as 500.
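As a rough sketch of how Eq. 24 is applied, the code below computes the weighted standard deviation of the at-site L-CV and standardises it against a set of simulated values. Several points are assumptions made for illustration: the function names are hypothetical, the weights are taken to be the record lengths of the sites, the sample standard deviation (with divisor $N_{sim} - 1$) is used for $\sigma_V$, and the simulation of homogeneous regions that produces the simulated values is outside the scope of the sketch, so those values must be supplied by the caller.

```python
import math
import statistics

def weighted_sd_lcv(record_lengths, lcv):
    """V: record-length-weighted standard deviation of the at-site L-CV."""
    total = sum(record_lengths)
    regional = sum(n * t for n, t in zip(record_lengths, lcv)) / total
    spread = sum(n * (t - regional) ** 2 for n, t in zip(record_lengths, lcv))
    return math.sqrt(spread / total)

def heterogeneity(v_observed, v_simulated):
    """Eq. 24: observed dispersion standardised by the simulated dispersions."""
    mu = statistics.mean(v_simulated)
    sigma = statistics.stdev(v_simulated)   # sample standard deviation (assumed)
    return (v_observed - mu) / sigma
```

The same standardisation applies to Eqs. 25 and 26 once the corresponding dispersion measures have been computed for the observed region and for each simulated region.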


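$$V = \left\{ \frac{\sum_{i=1}^{N_R} n_i \left( t^{(i)} - t^{R} \right)^2}{\sum_{i=1}^{N_R} n_i} \right\}^{1/2} \tag{27}$$

$$V_2 = \frac{\sum_{i=1}^{N_R} n_i \left\{ \left( t^{(i)} - t^{R} \right)^2 + \left( t_3^{(i)} - t_3^{R} \right)^2 \right\}^{1/2}}{\sum_{i=1}^{N_R} n_i} \tag{28}$$

$$V_3 = \frac{\sum_{i=1}^{N_R} n_i \left\{ \left( t_3^{(i)} - t_3^{R} \right)^2 + \left( t_4^{(i)} - t_4^{R} \right)^2 \right\}^{1/2}}{\sum_{i=1}^{N_R} n_i} \tag{29}$$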

In these equations, $n_i$ denotes the record length of peak flows at gauging station i, whose sample L-moment ratios L-CV, L-skewness and L-kurtosis are denoted by $t^{(i)}$, $t_3^{(i)}$ and $t_4^{(i)}$, respectively. The number of sites in the region is $N_R$, and $t^{R}$, $t_3^{R}$ and $t_4^{R}$ refer to the weighted regional average L-CV, L-skewness and L-kurtosis, respectively. A region can be regarded as ‘acceptably homogeneous’ if HM