Alan T. Murray

Spatial Characteristics and Comparisons of Interaction and Median Clustering Models

Cluster analysis has been pursued from a number of directions for identifying interesting relationships and patterns in spatial information. A major emphasis is currently on the development and refinement of optimization-based clustering models for the purpose of exploring spatially referenced data. Within this context, two basic methods exist for identifying clusters that are most similar. An interesting feature of these two approaches is that one method approximates the relationships inherent in the other method. This is significant given that the approximation approach is invariably utilized for cluster detection in spatial and aspatial analysis. A number of spatial applications are investigated which highlight the differences in clusters produced by each model. This is an important contribution because the differences are in fact quite significant, yet these contrasts are not widely known or acknowledged.

The analysis of spatial information is an increasingly complex task. On one hand, identifying spatial patterns in data and establishing some sort of significance can be challenging. On the other hand, the virtual explosion of spatial data tracking hundreds or even thousands of attributes and relationships associated with geographic objects has basically overwhelmed simple techniques for summarizing and understanding information. Geographic information systems (GIS) have certainly contributed to current capabilities of information synthesis because of their ability to integrate numerous layers of spatial information and represent relationships visually. There are limits, however, to the extent that relationships may be understood or recognized using simple map-based summaries and displays. Given this, advanced modeling approaches are essential for assisting in the investigation of spatial information. An emerging area of research is the development of automated approaches for exploring spatial and aspatial data.
A few of the more well known fields are data mining, knowledge discovery in databases, and pattern spotting. A binding feature of these approaches is the notion of searching out potential relationships in information. Interestingly, these fields may be categorized by their application to certain information. In data mining and knowledge discovery the developed approaches are typically applied to generic databases with no explicit geographic attributes (see Fayyad, Piatetsky-Shapiro, and Smyth 1996; Estivill-Castro and Murray 1998). Alternatively, pattern-spotting approaches have been developed for spatially based applications as techniques for exploratory spatial data analysis (ESDA). Representative work in this area is discussed by Openshaw (1991), Anselin (1994), and Murray and Estivill-Castro (1998). ESDA involves the use of methods, approaches, and tools to develop a better understanding of relationships in spatial information. This ideally complements the basic functionality provided in GIS (Openshaw 1991; O'Kelly 1994; Batty and Xie 1994). Recent work in data mining and knowledge discovery has begun to migrate to the spatial domain (Zhang, Ramakrishnan, and Livny 1996; Ester, Kriegel, and Sander 1997), but there are fundamental issues of space and geographic association that are absent from this body of developmental research (Murray and Estivill-Castro 1998). Given this, it is the ESDA perspective and application focus that is of most interest in this paper.

Numerous applications of pattern identification approaches may be found in the literature (see Rosing and ReVelle 1986; Kaufman and Rousseeuw 1990; Anselin 1994; Zhang, Ramakrishnan, and Livny 1996, among others). A problem application that the author is currently investigating is associated with criminal activity in the Brisbane region of Queensland, Australia. Crime occurrence in suburbs throughout this region is tracked by offense types such as homicide, assault, robbery, and theft. Combining this with other information, like socioeconomic status and demographic characteristics of suburbs as well as regional features such as shopping centers, recreation areas, and public transport services, provides numerous layers of spatial information across which important patterns of activity may be present. How best to scrutinize spatial information when relationships are not obvious remains a perplexing task.

This research was supported in part by a grant from the Australian Research Council while the author was a research fellow at the University of Queensland in the Australian Housing and Urban Research Institute. The author thanks the anonymous reviewers for their constructive comments.

Alan T. Murray is an assistant professor of geography at Ohio State University.

Geographical Analysis, Vol. 32, No. 1 (January 2000) The Ohio State University

One of the fundamental approaches for pattern spotting and identification continues to be the use of clustering models. Murray and Estivill-Castro (1998) reviewed two basic optimization-based clustering models for exploring relationships in spatial information. The intent of these models is to identify observation groupings or clusters that are most alike. Thus, they are structured to minimize total grouping dissimilarity in the selection of a specified number of clusters (see Vinod 1969; Rao 1971; Kaufman and Rousseeuw 1990). Murray and Estivill-Castro (1998) made explicitly clear that there exist a number of alternative clustering models that may be utilized for ESDA. Further, the major gap in empirical work comparing alternative clustering models was made fairly obvious.

One optimization-based approach for clustering geographic data is characterized by its use of spatial locations as a means for identifying clusters. These spatial locations can be central points, as used in the spatial model of Cooper (1963) (commonly referred to as the multifacility planar location problem in the facility location literature), or medians, as utilized in the work of Hakimi (1964) (commonly referred to as the p-median problem in the facility location literature). The most notable feature of this modeling approach is that objects (or observations) are partitioned through the use of representative locations in space. That is, for cluster analysis the locations themselves (central points or medians) do not have a practical interpretation. Thus, they do not correspond to a selected facility site or switching station as is their interpretation in facility location. Rather, representative locations serve only to identify clusters and allow for variation within cluster groups to be estimated.
Given this, it should be recognized that the use of representative locations for identifying clusters is an approximation process, because explicit differences between cluster members are not being analyzed. A further point is that a comprehensive comparison of the central points and the median clustering models in Murray (2000) demonstrated significant functional and spatial similarities in produced clusters, which supports the contention that they are in fact very much related. The conclusion reached in Murray (2000) was that the median clustering problem is more appropriate for spatial application than the central points clustering approach. This finding was based upon strong similarities in identified clusters coupled with a significant difference in computational requirements.

The second optimization-based modeling approach for clustering geographic data accounts for explicit differences between observations, such as the work of Rao (1971). This modeling approach reflects the true spirit of clustering, which is to identify groups that are the most similar. This is referred to as the exchange or interaction clustering approach (Rosing and ReVelle 1986; Klein and Aronson 1991). The interaction clustering problem accounts for the explicit relationship between cluster members, referred to as the within-group difference. This is in contrast to using a location in space to relate cluster members. There is currently little understanding of the relationships between the explicit interaction clustering approach and the approximate representative location clustering methods (Murray and Estivill-Castro 1998). Such distinctions are essential for establishing the appropriateness of a particular approach under varying circumstances for exploring and synthesizing spatial information (ESDA).

This paper begins by formulating the interaction and median clustering problems for identifying groupings in spatial information. This will demonstrate that the two approaches are significantly different in terms of model intent and mathematical specification. The basis for comparing and contrasting solutions to cluster models is then reviewed. Next, a formal approach for comparing alternative spatial clusters is presented. Following this, application results are provided to illustrate comparative characteristics of the two clustering models. This paper ends with a discussion on comparing the two approaches as well as concluding comments.

CLUSTERING MODELS

The interaction and median clustering models are detailed in this section. A recent review discussing the relationship of these models to previous research may be found in Murray and Estivill-Castro (1998). The first clustering model to be presented is the interaction clustering problem. Rao (1971) was the first to specify this basic approach in the statistics literature, where the interest was to identify clusters that minimized average and total within-cluster differences between observations. This concept has been further developed and formalized by Rosing and ReVelle (1986) and Klein and Aronson (1991). The idea is that observations or spatial objects need to be clustered into a specified number of groups so that cluster membership is best matched. The following notation will be used in the specification of the spatial interaction clustering model:

i, j = indices of observations (total number = n);
k = index of clusters (total number = p);
$a_i$ = relative value of observation i;
$d_{ij}$ = spatial difference measure relating observations i and j;
$z_{ik}$ = 1 if observation i is in cluster k, 0 otherwise.


Interaction Clustering Problem (ICP)

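Minimize

[objective (1): expression not legible in the source]

subject to

$$\sum_{k} z_{ik} = 1 \quad \forall i \qquad (2)$$

$$\sum_{i} z_{ik} \geq 1 \quad \forall k \qquad (3)$$

$$z_{ik} \in \{0, 1\} \quad \forall i, k \qquad (4)$$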
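As an illustration only, the evaluation of a candidate grouping against the ICP can be sketched in Python. Because the expression for objective (1) is not reproduced here, the weighting below is an assumption: the sketch sums $a_i a_j d_{ij}$ once over every unordered pair of observations sharing a cluster. The function names `icp_objective` and `icp_feasible` and the toy data are hypothetical.

```python
from itertools import combinations


def icp_objective(a, d, labels):
    """Within-group difference of a grouping.

    ASSUMPTION: objective (1) is taken as the sum of a[i] * a[j] * d[i][j]
    over unordered pairs (i, j) assigned to the same cluster. The source
    does not show the exact weighting.
    """
    total = 0.0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:  # z_ik * z_jk = 1 only for co-members
            total += a[i] * a[j] * d[i][j]
    return total


def icp_feasible(labels, p):
    """Constraints (2)-(4): one cluster label per observation, no empty cluster."""
    return all(0 <= k < p for k in labels) and len(set(labels)) == p


# Hypothetical toy instance: four observations, two clusters.
a = [1.0, 2.0, 1.0, 3.0]
d = [[0, 1, 4, 5],
     [1, 0, 3, 4],
     [4, 3, 0, 1],
     [5, 4, 1, 0]]
labels = [0, 0, 1, 1]
value = icp_objective(a, d, labels)
```

Storing the grouping as one label per observation satisfies constraint (2) by construction, so only the non-empty cluster condition (3) needs an explicit check.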

The objective (1) of the ICP is to minimize the total weighted difference in the assignment of observations to clusters. It is worth pointing out that the relative value of each observation, $a_i$, could correspond to representative population totals or perhaps total observed occurrences of a particular event. The objective of the ICP is nonlinear because it accounts for the interaction between cluster group members (within-group differences). Constraint (2) ensures that each observation is assigned to a cluster. Constraint (3) requires at least one observation in each cluster. If the objective coefficients are non-negative, then constraint (3) is not necessary. In fact, this constraint is not included in the formulations of Rao (1971) or Klein and Aronson (1991). Constraint (4) imposes integer restrictions on the decision variables. The above formulation does show some relationship to the quadratic assignment problem (QAP), formulations of which may be found in Hubert (1987) and Burkard (1990) among others. Of course, those familiar with the QAP will observe two significant differences. First, the ICP lacks an equality sign in constraint (3). Further, as stated above, this constraint may not be necessary, which is certainly not true for the equality-based version of this constraint in the QAP. Second, the quadratic interaction in the objective function (1) of the ICP is not equivalent to the interaction traditionally represented in the QAP because the QAP accounts for two levels of interactions. As a result, the objective function of the QAP does not have related indices in the decision variables. In the QAP the quadratic term in the objective would have unique subscripts (that is, $z_{ik} z_{jt}$). This is not equivalent to the product stated in objective (1). The ICP, on the other hand, is structured to account only for one level of interaction, that which takes place within cluster groups.
Apparently, if the cluster group sizes are fixed (which they are not for the ICP), then the ICP could be structured as a QAP (Hubert 1998). With these differences noted, the two problems are related, as observed in ReVelle (1997), but they are not equivalent.

The next model to be presented is the median clustering problem. This model also attempts to identify observation clusters that are most similar, but does so using an approximation approach. Specifically, points in space are utilized for creating clusters. For the median approach, these potential points correspond to the set of observations. Thus, if p clusters are to be identified, then p observations will be selected (which are called medians) and clusters created based upon the grouping of observations with their most alike median. Vinod (1969) presented a formulation of this approach for statistical analysis. ReVelle and Swain (1970) formulated a similar model to that of Vinod (1969), but it was structured for spatial analysis. The following notation will assist in model formulation:

i = index of observations (total number = n);
j = index of potential medians (same as i);
$d_{ij}$ = distance between observation i and potential median j;
p = number of cluster medians to be selected.

Decision variables:

$x_j$ = 1 if cluster median j is selected, 0 otherwise;
$y_{ij}$ = 1 if observation i is assigned to cluster median j, 0 otherwise.

Median Clustering Problem (MCP)

Minimize

$$\sum_{i} \sum_{j} a_i d_{ij} y_{ij} \qquad (5)$$

Subject to

$$\sum_{j} y_{ij} = 1 \quad \forall i \qquad (6)$$

$$y_{ij} \leq x_j \quad \forall i, j \qquad (7)$$

$$\sum_{j} x_j = p \qquad (8)$$

$$x_j \in \{0, 1\} \quad \forall j; \qquad y_{ij} \in \{0, 1\} \quad \forall i, j \qquad (9)$$
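The median formulation can be illustrated with a short Python sketch. It assumes non-negative weights and distances, so that for a fixed set of selected medians the best assignment is simply each observation to its closest median. The exhaustive search over median sets is only a stand-in for small instances and is not the Lagrangian relaxation procedure used in this paper; all function names and the toy data are hypothetical.

```python
from itertools import combinations


def mcp_objective(a, d, medians):
    """Weighted assignment cost with every observation assigned to its
    closest selected median (optimal for fixed medians when a, d >= 0)."""
    return sum(a[i] * min(d[i][j] for j in medians) for i in range(len(a)))


def mcp_clusters(d, medians):
    """Label each observation with the index of its closest selected median."""
    return [min(medians, key=lambda j: d[i][j]) for i in range(len(d))]


def mcp_brute_force(a, d, p):
    """Enumerate all sets of p medians; practical only for very small n."""
    return min(combinations(range(len(a)), p),
               key=lambda medians: mcp_objective(a, d, medians))


# Hypothetical toy instance: four observations, two medians.
a = [1.0, 2.0, 1.0, 3.0]
d = [[0, 1, 4, 5],
     [1, 0, 3, 4],
     [4, 3, 0, 1],
     [5, 4, 1, 0]]
best_medians = mcp_brute_force(a, d, 2)
best_value = mcp_objective(a, d, best_medians)
best_labels = mcp_clusters(d, best_medians)
```

Only distances between observations and their own median enter the total, which is the sense in which the median model approximates within-group difference.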

The objective (5) of the MCP is to minimize total weighted assignment of observations to selected medians. Constraint (6) ensures that each observation is assigned to a median. Constraint (7) imposes the condition that an observation may only be assigned to a selected median. Constraint (8) specifies that p cluster medians are to be selected. Constraint (9) imposes integer restrictions on decision variables. It is worth mentioning that the hub location problem (see Campbell 1994) may be considered an extension of the MCP in the context of clustering analysis. Similarly, the planar hub location problem presented in O'Kelly (1992) may be considered an extension of the central points variant of the MCP discussed in Murray (2000). The hub location problem, interpreted for cluster analysis, approximates within-group difference as does the MCP approach. In addition, the hub location problem structures between-group variation as well. Thus, both within- and between-group variation may be optimized. The between-group interaction is not represented in the ICP or MCP approaches. Although both types of interaction are of concern in some clustering applications (Hubert
1987), the between-group difference extension is not of interest in this paper. Thus, the hub location problem is related to the MCP (and therefore the ICP and QAP), but is sufficiently outside the scope of this paper that a more detailed discussion is not given.

Based upon the formal specification of the ICP and the MCP, there can be little doubt that these two models are significantly different. To further illustrate this point, Figure 1 shows a cluster group of five observations labeled A-E. In Figure 1a the connections represent those structured in the ICP, so there are ten lines/arcs labeled 1-10. The connections shown in Figure 1a are all the possible interactions between the five observations in this cluster. Alternatively, assuming that observation C is the selected group median, Figure 1b depicts the relationships structured in the MCP, so there are four lines connecting observations A, B, D, and E to observation C. Because the intent of the MCP is to minimize total within-group differences, this is clearly only approximated, as only the differences between medians and observations are structured. Figure 1 clearly shows that the relationships accounted for in the two models are very different.

Although the intent of the two clustering models is the same, the way in which cluster membership is accounted for and optimized is clearly contrasted mathematically in the model formulations. With the exception of Murray and Estivill-Castro (1998), the clustering literature (for example, Kaufman and Rousseeuw 1990) does not specifically recognize that the MCP is an approximation approach for structuring within-group difference. Thus, the fact that within-group difference may be explicitly structured using the ICP is not well known. Observing application performance becomes an important issue given this distinction and the fact that the MCP is considered an approximation for the ICP.

FIG. 1. Relationships Structured in the Two Clustering Models: (a) ICP; (b) MCP

SOLVING FOR CLUSTERS

The use of either the MCP or the ICP for analysis requires approaches for solving these models. The MCP may be solved using a number of exact or approximate techniques (Murray and Church 1996; Rolland, Schilling, and Current 1997). In this paper, a Lagrangian relaxation with branch and bound code written in Fortran was utilized for solving the MCP [see Cornuejols, Fisher, and Nemhauser (1977), Narula, Ogbu, and Samuelsson (1977), and Murray and Gerrard (1997) for details of the application of this technique for the MCP]. Results reported in this paper for the MCP are optimal to within 0.001 percent.

While the development of solution approaches for the MCP continues to be quite active (see Murray and Church 1996; Rolland, Schilling, and Current 1997; Estivill-Castro and Murray 1998 among others), little research has focused on approaches for specifically solving the ICP. This is certainly not due to the success of current approaches for directly solving the ICP, as applications have only been solved for relatively small instances (see Rosing and ReVelle 1986; Klein and Aronson 1991). Part of the reason for this is that the ICP is an inherently difficult optimization problem to solve given the nonlinear objective (1). Exact approaches have been pursued for small problems ($n \leq 50$) by Klein and Aronson (1991) using a linear transformation of this formulation (see also Rosing and ReVelle 1986). Heuristic solution techniques for directly solving the ICP have not yet been proposed.

In order to obtain solutions for medium to large application instances of the ICP, an interchange heuristic is introduced here. The interchange process basically attempts to reassign observations from their current clusters to other clusters in order to find more efficient groupings. The interchange heuristic developed for the ICP is detailed as follows:

(a) Generate p clusters.
(b) Consider all observations unevaluated.
(c) Select an unevaluated observation and attempt to assign it to one of the other p-1 clusters. If this results in an improved grouping, then assign the observation to the best cluster as measured by the ICP objective (1). This observation is now considered evaluated.
(d) Have all observations been evaluated? If not, return to (c).
(e) If a pass through (b)-(d) has been completed with any change to cluster membership, return to (b).
(f) Terminate heuristic. Local optima have been reached.

The interchange heuristic for the ICP may be considered an extension of the traditional interchange heuristic developed for the MCP (see Murray and Church 1996) as well as a variant of the improvement-oriented heuristics for the QAP (see Armour and Buffa 1963; Burkard 1990). Interchange heuristics have proven to be very successful at identifying optimal or near optimal solutions, particularly for the MCP. There is every expectation that this would be the case for the ICP as well. Preliminary experience with the application of the interchange heuristic to the ICP indicates that it performs very well and maintains observed performance standards established for the MCP. Results presented in this paper for the ICP represent the best solution found after one hundred runs of the interchange heuristic.

CLUSTER COMPARISONS
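Before turning to comparisons, steps (a)-(f) of the interchange heuristic from the preceding section can be sketched in Python. This is a minimal sketch under stated assumptions, not the author's implementation: the form of objective (1) is assumed to be the sum of $a_i a_j d_{ij}$ over within-cluster pairs, the starting clusters are random, the objective is recomputed from scratch for each trial move, and the last member of a cluster is never moved so that no cluster becomes empty. All names are hypothetical.

```python
import random
from itertools import combinations


def icp_objective(a, d, labels):
    # ASSUMPTION: objective (1) sums a[i]*a[j]*d[i][j] over within-cluster pairs.
    return sum(a[i] * a[j] * d[i][j]
               for i, j in combinations(range(len(labels)), 2)
               if labels[i] == labels[j])


def interchange(a, d, p, seed=0):
    """One run of the interchange heuristic; assumes n >= p."""
    rng = random.Random(seed)
    n = len(a)
    # (a) Generate p clusters: random labels, each cluster used at least once.
    labels = list(range(p)) + [rng.randrange(p) for _ in range(n - p)]
    rng.shuffle(labels)
    best = icp_objective(a, d, labels)
    changed = True
    while changed:                          # (e) repeat passes while membership changes
        changed = False
        for i in rng.sample(range(n), n):   # (b)-(d) evaluate each observation once
            home = labels[i]
            if labels.count(home) == 1:     # keep every cluster non-empty
                continue
            best_k = home
            for k in range(p):              # (c) try the other p-1 clusters
                if k == home:
                    continue
                labels[i] = k
                value = icp_objective(a, d, labels)
                if value < best:
                    best, best_k = value, k
            labels[i] = best_k              # keep the best cluster found
            changed = changed or best_k != home
    return labels, best                     # (f) local optimum reached


def best_of_runs(a, d, p, runs=100):
    """Best grouping over repeated runs from different random starts."""
    return min((interchange(a, d, p, seed=s) for s in range(runs)),
               key=lambda result: result[1])


# Hypothetical toy instance: four observations, two clusters.
a = [1.0, 2.0, 1.0, 3.0]
d = [[0, 1, 4, 5],
     [1, 0, 3, 4],
     [4, 3, 0, 1],
     [5, 4, 1, 0]]
labels, value = interchange(a, d, 2, seed=1)
multi_labels, multi_value = best_of_runs(a, d, 2, runs=10)
```

Each accepted move strictly lowers the objective, so the loop must terminate. A practical code would update the objective incrementally rather than recompute it for every trial move.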

The ICP and MCP models as well as solution approaches for these models have now been detailed. The remaining issue is how to evaluate the resulting clusters generated by each model. There are basically two approaches which may be utilized for comparing alternative cluster groupings. The first is strictly analytical, evaluating the functional measure of a particular cluster grouping using the model objective. Evaluating a clustering as an ICP would involve the use of objective (1) and evaluating a clustering as an MCP would involve the use of objective (5). Using objectives (1) and (5), any identified grouping may be assessed. Further, this provides the basis for comparison of the two clustering models. Specifically, the best cluster solution for each model can be evaluated using the other cluster model. Thus, we can observe relative model quality differences between cluster groupings identified by the ICP and the MCP. This comparative approach was utilized in Murray (2000).

A second evaluation approach for comparing alternative clusters is to examine spatial variation between the groupings. With the use of GIS, this may be done in a visual and very understandable manner using an overlay process to contrast alternative spatial clusters. This is a valuable comparison technique because differences in spatial clusters are easy to comprehend and recognize (Murray 2000). However, such an evaluation approach does not enable one to attach any significance to observed differences or similarities. Given this, it is important to develop quantitative measures of spatial variation in addition to visual comparisons. This is discussed in the next section.

MEASURING SPATIAL VARIATION

In order to establish significance in comparing differences or similarities in cluster groupings, a measure of spatial variation is needed. An approach for doing this is to determine cluster similarity (or difference) between two alternative groupings. Specifically, we wish to compare the ICP and MCP groupings (or partitions) for associated cluster values, p. This may be structured as an optimization problem that identifies the maximum amount of cluster overlap between two groupings. Additional notation is defined as follows:

$C_k$ = observations in cluster k of first grouping (identified by the ICP);
$G_j$ = observations in cluster j of second grouping (identified by the MCP);
$o_{kj} = |C_k \cap G_j|$;
$x_{kj}$ = 1 if cluster k of first grouping is matched with cluster j of second grouping, 0 otherwise.

The interpretation of $o_{kj}$ is that it indicates the number of observations in cluster k of the ICP grouping which are also in cluster j of the MCP grouping. Based upon this notation, there are a number of conditions which hold true:

[conditions (10)-(13): expressions not legible in the source]

These conditions stipulate that groupings (or partitions) are mutually exclusive and exhaustive. Given these definitions and conditions, an optimization model for determining the relationship between two spatial cluster groupings may be formulated.

Cluster Overlap Assignment Problem (COAP)

Maximize

$$\sum_{k} \sum_{j} o_{kj} x_{kj} \qquad (14)$$

subject to

$$\sum_{k} x_{kj} = 1 \quad \forall j \qquad (15)$$

$$\sum_{j} x_{kj} = 1 \quad \forall k \qquad (16)$$

$$x_{kj} \in \{0, 1\} \quad \forall k, j \qquad (17)$$
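The overlap matching can be illustrated with a small Python sketch. It builds the overlap counts $o_{kj}$ from two label vectors and then finds the one-to-one matching of clusters with the largest total overlap by enumerating all permutations. Enumeration is an assumption made for brevity and is practical only for small p; it is not the transportation simplex code used in this paper. The function names and the toy groupings are hypothetical.

```python
from itertools import permutations


def overlap_matrix(first, second, p):
    """o[k][j] = number of observations in cluster k of the first grouping
    that are also in cluster j of the second grouping."""
    o = [[0] * p for _ in range(p)]
    for k, j in zip(first, second):
        o[k][j] += 1
    return o


def coap(first, second, p):
    """Maximum total overlap over all one-to-one cluster matchings.

    Returns (objective value, match) where match[k] is the cluster of the
    second grouping matched with cluster k of the first grouping.
    """
    o = overlap_matrix(first, second, p)
    best_value, best_match = -1, None
    for match in permutations(range(p)):   # p! candidate matchings
        value = sum(o[k][match[k]] for k in range(p))
        if value > best_value:
            best_value, best_match = value, match
    return best_value, best_match


# Hypothetical toy groupings of eight observations into three clusters.
first = [0, 0, 0, 1, 1, 2, 2, 2]
second = [1, 1, 2, 2, 2, 0, 0, 1]
value, match = coap(first, second, 3)
overlap_share = value / len(first)   # COAP objective over n
```

Dividing the objective by the number of observations gives the share of observations on which the two groupings agree once cluster labels are aligned.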

The objective (14) of the COAP is to maximize the total overlap of matched clusters between two spatial groupings. This indicates the amount of similarity between two cluster groupings. An upper bound on the value of objective (14) is n, the total number of observations. This would indicate complete overlap, where the identified clusters are exactly the same for the two cluster groupings. Constraint (15) ensures that each cluster in the second grouping, $G_j$, is assigned to a cluster in the first grouping, $C_k$. Constraint (16) ensures that each cluster in the first grouping is assigned to a cluster in the second grouping. Constraint (17) imposes integer requirements on decision variables. The COAP is nothing other than an assignment problem (Bazaraa, Jarvis, and Sherali 1990). The COAP may be solved using linear programming or the transportation simplex algorithm, among others. Optimal solutions reported in this paper were identified using a transportation simplex code written in Fortran by the author.

EMPIRICAL FINDINGS

Comparative results will be presented for three spatial applications across a range of p cluster values using a Pentium II/300 personal computer. The first application contains thirty-three observations, indicating the number of emergency service calls at each identified location throughout the Austin, Texas, region (Daskin 1982). Cluster model solutions are shown in Table 1 for p = 3-6. Solution times were less than 0.28 seconds per solution using the interchange heuristic for the ICP and less than 0.03 seconds using the Lagrangian relaxation approach for the MCP.

TABLE 1
Cluster Model Objectives for the 33-Observation Application

        ICP Solution                     MCP Solution
p       ICP              MCP             ICP              MCP
3       12,245,110.32    12,467.44       12,424,214.12    12,434.67
4        7,947,764.62    11,583.79        9,341,902.74    10,741.28
5        5,307,632.78    10,337.02        6,554,005.83     9,316.69
6        3,921,092.88     8,568.70        4,776,882.96     8,196.81

Table 1 reports the evaluation of the MCP and ICP solutions for each value of p. As indicated previously, each cluster solution is evaluated by the other cluster model as well. Thus, the column ICP Solution shown in Table 1 evaluates the clusters generated using the ICP and the column MCP Solution evaluates the clusters generated using the MCP. These columns further indicate the functional performance of the cluster configuration using both models. As an example, for p = 5 in Table 1 the best ICP grouping has an objective (1) value of 5,307,632.78 (under the ICP Solution column) and the best MCP grouping has an objective (5) value of 9,316.69 (under the MCP Solution column). The remaining entries for p = 5 are evaluations of the best cluster groupings using the alternative model. Specifically, the optimal MCP grouping for p = 5 evaluated as an ICP results in an objective (1) measure of 6,554,005.83 (under the MCP Solution column) and the best ICP grouping for p = 5 evaluated as an MCP gives an objective (5) value of 10,337.02 (under the ICP Solution column). In both instances the clusters evaluated using the other model are of inferior quality, according to the respective clustering model objectives, which is expected.

An interesting finding in Table 1 is that the ICP and MCP models do not produce clusters which evaluate well using the other model. For example, the cluster grouping identified for p = 6 using the ICP has an objective measure of 3,921,092.88 and the optimal MCP cluster grouping evaluated as an ICP has an objective (1) measure of 4,776,882.96. Functionally, the quality of the clusters identified by each model is significantly different. The spatial variation associated with the two clustering models is depicted in Figure 2 for p = 5. The cluster groupings do not appear to be particularly similar between the MCP and the ICP. For example, observations 16, 17, and 18 change cluster membership in the two modeling approaches.
Another interesting spatial difference is the change occurring for observations 11, 12, and 27. Assessing differences using the COAP gives a quantitative measure to spatial variation. Note that the upper bound for the COAP objective (14) is 33, the number of observations in this application and the maximum amount of cluster overlap possible. Table 2 indicates the percentage of cluster overlap between the ICP and MCP groupings for p = 3-6. The percentage overlap represents the COAP objective over the total number of observations. The time needed to solve the COAP was less than 0.01 seconds for each value of p. For p = 5, the COAP establishes that there is 75.76 percent similarity between the two cluster groupings, which corresponds with what is shown in Figure 2. Table 2 reveals that the most similarity occurs for p = 3, where there is 96.97 percent overlap, and the least similarity is for p = 4, where there is 66.67 percent overlap.

FIG. 2. Five Clusters for the Thirty-three-Node Application

TABLE 2
Spatial Similarity in the 33-Observation Clusters

p       COAP Objective    Overlap
3       32                96.97%
4       22                66.67%
5       25                75.76%
6       25                75.76%

The second application contains fifty-five observations, representing relative air travel volumes originating in the Washington, D.C., metro region (Swain 1971). The ICP and MCP results are given in Table 3 for p = 5-10. Solution times were less than 1.69 seconds per solution using the interchange heuristic for the ICP and less than 0.95 seconds using the Lagrangian relaxation approach for the MCP. The clusters do not appear to relate well to each other. The ICP cluster grouping for p = 9 in Table 3 has an objective measure of 90,533.14 and the optimal MCP cluster grouping evaluated using the ICP has an objective (1) measure of 131,965.44. Spatial variation between the two clustering models is shown in Figure 3 for p = 9. Again, it is difficult to see a great deal of similarity between the two cluster groupings. For example, the change in cluster membership of observations 6, 9, 10, 41, and 47 is quite marked. In one instance they form one cluster (ICP groupings) and in the other instance they are members of three different clusters (MCP groupings).

TABLE 3
Cluster Model Objectives for the 55-Observation Application

        ICP Solution               MCP Solution
p       ICP           MCP          ICP           MCP
5       244,559.16    3,235.28     304,687.25    2,944.20
6       185,371.83    2,935.57     267,585.75    2,649.55
7       136,818.94    2,502.47     171,181.47    2,420.79
8       111,418.84    2,309.82     143,546.69    2,217.85
9        90,533.14    2,176.19     131,965.44    2,071.19
10       75,524.41    2,036.07     107,155.42    1,927.45

FIG. 3. Nine Clusters for the Fifty-five-Node Application

The COAP findings reported in Table 4 summarize the cluster overlap between the ICP and MCP groupings for p = 5-10. Note that the maximum possible value of the COAP objective (14) is 55, which would indicate that the cluster groupings were exactly the same. The time needed to solve the COAP was less than 0.01
seconds for each value of p. For p = 9, the COAP establishes that there is 67.27 percent similarity between the two cluster groupings. The similarity between the clusters identified for the ICP and MCP ranges in Table 4 from 65.45 percent overlap for p = 5 to 83.64 percent overlap for p = 7.

TABLE 4
Spatial Similarity in the 55-Observation Clusters

p       COAP Objective    Overlap
5       36                65.45%
6       38                69.09%
7       46                83.64%
8       39                70.91%
9       37                67.27%
10      42                76.36%

The final application contains 152 observations. The identified locations correspond to coffee bean buying centers in Busoga, Uganda (Migereko 1983). Thus, the relative value of each observation, $a_i$, is the amount of coffee beans available. Table 5 reports results for the ICP and MCP for p = 5-15. Solution times were less than 60.00 seconds per solution using the interchange heuristic for the ICP and as high as 34.10 seconds using the Lagrangian relaxation approach for the MCP.

TABLE 5
Cluster Model Objectives for the 152-Observation Application

        ICP Solution                       MCP Solution
p       ICP                 MCP            ICP                 MCP
5       1,105,909,770.31    477,584.97     1,132,926,681.57    470,093.78
6         844,214,082.37    442,715.19       934,297,431.56    425,847.28
7         659,252,244.16    427,965.97       789,282,119.53    388,914.28
8         527,258,982.16    377,599.78       577,640,539.46    361,348.88
9         441,862,420.29    350,264.88       469,911,081.35    337,015.03
10        373,547,803.70    330,093.50       398,215,828.77    318,666.66
11        318,245,129.07    308,807.28       327,153,999.34    300,422.96
12        274,181,710.08    289,530.19       305,221,535.31    284,857.78
13        239,390,696.31    279,519.66       267,417,690.00    266,991.76
14        209,745,285.10    263,930.97       221,632,904.31    252,392.95
15        186,183,214.14    250,161.45       205,436,553.21    240,438.35

Once again there appears to be little similarity between the evaluated cluster groupings. The ICP cluster grouping for p = 6 in Table 5 has an objective measure of 844,214,082.37 and the optimal MCP cluster grouping evaluated using the ICP has an objective (1) measure of 934,297,431.56. For p = 15, the ICP cluster grouping has an objective of 186,183,214.14, whereas the MCP cluster grouping evaluated as an ICP has an objective of 205,436,553.21. Cluster groupings are depicted in Figure 4 for the two models for p = 6. The spatial variation is quite apparent between the two cluster groupings. For example, the MCP grouping of observations 51-57 and 61-66 is split into four different clusters in the ICP groupings. Reported in Table 6 are the COAP findings for p = 5-15 for the ICP and MCP models. Note that the upper bound for the COAP objective (14) is 152. The time needed to solve the COAP was again less than 0.01 seconds for each value of p. For p = 6, there is 73.03 percent similarity between the two cluster groupings. The spatial
14

/ Geographical Analysis

-------

ICP groupings

FIG.4. Six Clusters for the 152-Node Application

variation in Table 6 ranges from 73.03 percent overlap for p = 6 to 90.13 percent overlap for p = 11 for the ICP and MCP groupings. DISCUSSION

The functional comparisons presented in Tables 1, 3, and 5 suggest that there is little agreement in relative quality between the ICP groupings and the MCP groupings. The evaluated differences are summarized for the three applications in Figure 5. Displayed in Figure 5 are the ICP and the MCP objective function


TABLE 6
Spatial Similarity in the 152-Observation Clusters

 p    COAP Objective    Overlap
 5         134          88.16%
 6         111          73.03%
 7         119          78.29%
 8         121          79.60%
 9         130          85.53%
10         120          78.95%
11         137          90.13%
12         129          84.87%
13         121          79.60%
14         129          84.87%
15         132          86.84%
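As a hedged illustration of how the Overlap column in Table 6 relates to the COAP objective, the sketch below assumes that the overlap percentage is simply the COAP objective divided by the number of observations (152 here, the stated upper bound on the objective). The names `coap_objective`, `N_OBSERVATIONS`, and `overlap_percent` are illustrative and do not come from the paper. Under ordinary rounding, the ratio for p = 8 and p = 13 (121/152) comes out at 79.61, against the 79.60 printed in the table; the remaining entries agree.

```python
# Table 6 values as printed: p -> COAP objective.
coap_objective = {5: 134, 6: 111, 7: 119, 8: 121, 9: 130, 10: 120,
                  11: 137, 12: 129, 13: 121, 14: 129, 15: 132}

# Number of observations; also the upper bound on the COAP objective.
N_OBSERVATIONS = 152


def overlap_percent(objective: int, n: int) -> float:
    """Overlap as a percentage of all observations (assumed formula)."""
    return 100.0 * objective / n


# Recompute the Overlap column, rounded to two decimals.
overlap = {p: round(overlap_percent(z, N_OBSERVATIONS), 2)
           for p, z in coap_objective.items()}

for p, pct in overlap.items():
    print(f"p = {p:2d}: {pct:.2f}%")
```

The lowest and highest recomputed overlaps fall at p = 6 and p = 11, matching the range stated in the text.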

differences for each value of p between the cluster groupings identified for each model, across the three spatial applications. As an example, for the results given in Table 1 for p = 5, Figure 5 shows the ICP model difference between the two generated clusters as a percentage deviation from the best ICP solution, which is 23.5 percent. This indicates that the MCP-generated grouping evaluated as an ICP is 23.5 percent less efficient, as measured by the objective function, than the best ICP grouping. Alternatively, the MCP model difference is 10.9 percent, which indicates that the ICP-generated grouping evaluated as an MCP is 10.9 percent from the optimal MCP grouping. In Figure 5 there are instances where functional similarity does exist. For example, results for p = 3 in Table 1 and p = 5 and p = 11 in Table 5 show almost no functional differences for either the ICP or MCP models. These clusters have the most spatial similarity between identified groupings with 96.97 percent, 88.16 percent, and 90.13 percent overlap, respectively. However, in general, Figure 5 shows a significant deviation in functional differences (as shown in Tables 1, 3, and 5 already). An interesting relationship depicted in Figure 5 is that the ICP is much more sensitive to the functional quality of cluster groupings than is the MCP. That is, when an MCP grouping is evaluated as an ICP, it always results in a functional difference that is greater than when an ICP grouping is evaluated as an MCP. Perhaps the most significant fact summarized in Figure 5 is that there are instances of large differences between evaluated clusters for the ICP, such as the high of 45.8 percent for p = 9 in Table 2. These large and numerous occurrences of functional dissimilarity are very much a concern. The functional and spatial differences between the ICP and MCP are so significant that there is no reason to believe that the possibility of suboptimal ICP solutions could be responsible for the observed variation.
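The percentage deviations discussed in this section can be sketched as follows, assuming the deviation is the gap between the evaluated objective and the best objective, divided by the best objective. The inputs are the ICP objective values quoted in the text for the 152-observation application at p = 6 and p = 15; the function and variable names are illustrative and not taken from the paper.

```python
def percent_deviation(evaluated: float, best: float) -> float:
    """Deviation of an evaluated objective from the best objective, in percent."""
    return 100.0 * (evaluated - best) / best


# ICP objective values quoted in the text:
# p -> (best ICP grouping, MCP grouping evaluated as an ICP).
icp_values = {
    6: (844_214_082.37, 934_297_431.56),
    15: (186_183_214.14, 205_436_553.21),
}

# ICP model difference for each p, under the assumed formula.
icp_difference = {p: percent_deviation(mcp_as_icp, best_icp)
                  for p, (best_icp, mcp_as_icp) in icp_values.items()}

for p, dev in icp_difference.items():
    print(f"p = {p:2d}: MCP grouping is {dev:.2f}% above the best ICP objective")
```

Under this assumption, the MCP grouping evaluated as an ICP is roughly 10.7 percent worse than the best ICP grouping for p = 6 and roughly 10.3 percent worse for p = 15.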
After all, the quality of identified clusters is what is being evaluated and there is no agreement between the two approaches, either functionally or spatially. As discussed early in the paper, the MCP is in theory an approximation for the ICP. The presented formulations showed that the ICP and MCP models are quite different. This difference is further supported in the reported application results. Specifically, the summary presented in Figure 5, as well as the spatial differences illustrated in Figures 2-4, demonstrates that the MCP does not, in general, serve as a suitable approximation model for the ICP. There is no evidence to suggest that the MCP should be viewed as a model that replicates the interactions being modeled in the ICP. It is probably worth pointing out that the spatial applications utilized in this paper likely represent the most

FIG. 5. Functional Comparison of Application Results (vertical axis: 0% to 50%)

favorable situation for relating the two models, because the relationship measure dij is strictly a spatial distance measure.

CONCLUSIONS

This paper has presented a comparison of two clustering models for the analysis of spatial information. The interaction clustering problem (ICP) was shown to reflect the true spirit of cluster analysis by explicitly representing relationships between the spatial observations being analyzed, that is, within-group variation. Alternatively, the median clustering problem (MCP) was shown to be an approximation approach which utilized spatial observations to create cluster groupings. Although much of the rationale for the use of the MCP is based upon the inherent difficulty associated with solving the ICP, the application results do not suggest that the MCP is a good approximation model for the ICP. In fact, the use of the MCP to approximate the ICP resulted in clusters which were as much as 46 percent less efficient. This is of much concern when the intent of cluster analysis is to establish similarity between observations. The use of the MCP would obviously be problematic in this regard.

For the area of exploratory spatial data analysis (ESDA), which is based upon complementing and perhaps integrating with geographical information systems (GIS), the use of cluster analysis is an increasingly important modeling tool as it offers the capability to identify patterns in spatial information. Thus, a better understanding of the relationships and suitability of various cluster models is essential. The analysis presented in this paper suggests that the ICP should be the optimization-based clustering model relied upon, due to the inability of the MCP to serve as a suitable approximation model. Given this, it is important that continued development of solution techniques for the ICP be pursued, as the likely problem instances of thousands of observations are well beyond current capabilities, either by exact or heuristic solution methods.

LITERATURE CITED

Anselin, L. (1994). “Local Indicators of Spatial Association-LISA.” Geographical Analysis 27, 93-115.
Armour, G., and E. Buffa (1963). “A Heuristic Algorithm and Simulation Approach to Relative Location of Facilities.” Management Science 9, 294-309.
Batty, M., and Y. Xie (1994). “Modelling inside GIS: Part 1. Model Structures, Exploratory Spatial Data Analysis, and Aggregation.” International Journal of Geographical Information Systems 8, 291-307.
Bazaraa, M., J. Jarvis, and H. Sherali (1990). Linear Programming and Network Flows, 2d ed. New York: John Wiley & Sons.
Burkard, R. (1990). “Locations with Spatial Interactions: The Quadratic Assignment Problem.” In Discrete Location Theory, edited by P. Mirchandani and R. Francis, pp. 387-437. New York: John Wiley & Sons.
Campbell, J. (1994). “Integer Programming Formulations of Discrete Hub Location Problems.” European Journal of Operational Research 72, 387-405.
Cooper, L. (1963). “Location-Allocation Problems.” Operations Research 11, 331-43.
Cornuejols, G., M. Fisher, and G. Nemhauser (1977). “Location of Bank Accounts to Optimize Float: An Analytical Study of Exact and Approximate Algorithms.” Management Science 23, 789-810.
Daskin, M. (1982). “Application of an Expected Covering Model to Emergency Medical Service System Design.” Decision Sciences 13, 416-39.
Ester, M., H. Kriegel, and J. Sander (1997). “Spatial Data Mining: A Database Approach.” In Advances in Spatial Databases, edited by M. Scholl and A. Voisard, pp. 47-66. Berlin: Springer.
Estivill-Castro, V., and A. Murray (1998). “Discovering Associations in Spatial Data: An Efficient Medoid-Based Approach.” In Research and Development in Knowledge Discovery and Data Mining, edited by X. Wu, R. Kotagiri, and K. B. Korb, pp. 110-21. New York: Springer.


Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth (1996). “The KDD Process for Extracting Useful Knowledge from Volumes of Data.” Communications of the ACM 39(11), 27-34.
Hakimi, S. L. (1964). “Optimum Locations of Switching Centers and the Absolute Centers and Medians of a Graph.” Operations Research 12, 450-59.
Hubert, L. (1987). Assignment Methods in Combinatorial Data Analysis. New York: Marcel Dekker.
Hubert, L. (1998). Personal communication, August 7, 1998.
Kaufman, L., and P. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons.
Klein, G., and J. Aronson (1991). “Optimal Clustering: A Model and Method.” Naval Research Logistics 38, 447-61.
Migereko, D. (1983). “An Analysis of the Coffee Cooperative Marketing System in Busoga, Uganda: Transportation and Facilities Location.” Master's thesis. University of California, Santa Barbara.
Murray, A. (2000). “Spatial Analysis Using Clustering Methods: Evaluating Central Point and Median Approaches.” Journal of Geographical Systems, forthcoming.
Murray, A., and R. Church (1996). “Applying Simulated Annealing to Location Planning Models.” Journal of Heuristics 2, 49-71.
Murray, A., and V. Estivill-Castro (1998). “Cluster Discovery Techniques for Exploratory Spatial Data Analysis.” International Journal of Geographical Information Science 12, 431-43.
Murray, A., and R. Gerrard (1997). “Capacitated Service and Regional Constraints in Location-Allocation Modeling.” Location Science 5, 103-18.
Narula, S., U. Ogbu, and H. Samuelsson (1977). “An Algorithm for the p-Median Problem.” Operations Research 25, 709-13.
O'Kelly, M. (1992). “A Clustering Approach to the Planar Hub Location Problem.” Annals of Operations Research 40, 339-53.
O'Kelly, M. (1994). “Spatial Analysis and GIS.” In Spatial Analysis and GIS, edited by S. Fotheringham and P. Rogerson, pp. 65-79. London: Taylor & Francis.
Openshaw, S. (1991). “Developing Appropriate Spatial Analysis Methods for GIS.” In Geographical Information Systems: Principles and Applications, edited by D. Maguire, M. Goodchild, and D. Rhind, pp. 389-402. New York: Longman.
Rao, M. (1971). “Cluster Analysis and Mathematical Programming.” Journal of the American Statistical Association 66, 622-26.
ReVelle, C. (1997). “A Perspective on Location Science.” Location Science 5, 3-13.
ReVelle, C., and R. Swain (1970). “Central Facilities Location.” Geographical Analysis 2, 30-42.
Rolland, E., D. Schilling, and J. Current (1997). “An Efficient Tabu Search Procedure for the p-Median Problem.” European Journal of Operational Research 96, 329-42.
Rosing, K., and C. ReVelle (1986). “Optimal Clustering.” Environment and Planning A 18, 1463-76.
Swain, R. (1971). “A Decomposition Algorithm for a Class of Facility Location Problems.” Ph.D. dissertation. Cornell University, Ithaca, NY.
Vinod, H. (1969). “Integer Programming and the Theory of Grouping.” Journal of the American Statistical Association 64, 506-17.
Zhang, T., R. Ramakrishnan, and M. Livny (1996). “BIRCH: An Efficient Data Clustering Method for Very Large Databases.” SIGMOD 25, 103-14.