Evaluation Methods for Web Application Clustering

P. Tonella (1), F. Ricca (1), E. Pianta (1), C. Girardi (1), G. Di Lucca (2), A. R. Fasolino (3), P. Tramontana (3)

(1) ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy
(2) RCOST - Research Centre on Software Technology, Università del Sannio, Benevento, Italy
(3) Università di Napoli “Federico II”, Dipartimento di Informatica e Sistemistica, Napoli, Italy

Abstract

Clustering of the entities composing a Web application (static and dynamic pages) can be used to support program understanding. However, several alternative options are available when a clustering technique is designed for Web applications. The entities to be clustered can be described in different ways (e.g., by their structure, by their connectivity, or by their content), different similarity measures are possible, and alternative procedures can be used to form the clusters. The problem is how to evaluate the competing clustering techniques, in order to select the best one for program understanding purposes. In this paper, two methods for clustering evaluation are considered: the gold standard and the task oriented approach. The advantages and disadvantages of both are analyzed in detail. Definition of a gold standard (reference clustering) is difficult and prone to subjectivity. On the other hand, an evaluation based on the level of support given to task execution is expensive and requires careful experimental design. Guidelines and examples are provided for the implementation of both methods.

1 Introduction

Clustering is a general technique aimed at gathering the entities that compose a system into cohesive groups (clusters). Clustering has several applications in program understanding and software reengineering [2, 13, 21], and has been recently applied to Web applications [11, 12, 17]. When clustering techniques are applied to a given domain (such as Web applications), several decisions have to be made [2], which affect the final result:

Entity description: The entities to be clustered are described according to a set of properties that characterize them. Similarity between entities will be abstracted into similarity between entity properties. Different properties lead to different similarity relationships. In the sibling link approach, entities are described by means of internal features that characterize each entity in isolation. In the direct link approach, the connections among entities are used to characterize them.

Entity grouping: Entities (or, more generally, clusters of entities) are clustered together when they are very similar/cohesive. Several alternative measures of similarity (distance) / cohesion (coupling) are possible, given the descriptions of the entities. Moreover, different choices are available when two clusters (instead of two entities) are compared to determine the similarity (distance) / cohesion (coupling) between them, out of the measures computed for the enclosed entities.

Clustering algorithm: The steps that are followed to produce the final clustering can differ. Hierarchical approaches build a tree where all entities are partitioned into clusters at each tree depth. Higher tree levels are associated with larger clusters, up to the root, where all entities are in a single cluster. Leaves contain singleton clusters. Hierarchical approaches can proceed top-down (divisive algorithms) or bottom-up (agglomerative algorithms). Another family of approaches is based on the definition of a modularity metric to be optimized. Just one clustering (entity partition) is determined, associated with the maximum value of the metric selected to measure the quality of the clustering.

In the context of Web application clustering, we have so far identified at least three alternative approaches that can be used to cluster Web pages, based on:

Structure: The syntactic structure of the Web pages, as represented in the abstract syntax tree obtained by a parser, is the entity description used for clustering. The distance between two syntax trees is determined as the tree edit distance [3], and two pages are clustered together if they are at minimum distance, following the agglomerative clustering algorithm.

Connectivity: The hyperlinks connecting the Web pages are used as entity descriptions [11]. When a group of pages is highly connected by hyperlinks, the related cohesion metric is given a high score. The clusters determined are those that maximize a “quality of clustering” metric.

Keywords: The content of the Web pages is analyzed in order to determine a set of keywords that characterize each of them. Two pages are clustered together when they share a large number of keywords.

Given the clusters produced by these alternative approaches, the problem is how to evaluate them. In this paper, we consider two complementary methods that can be used to evaluate the output of different clustering techniques applied to the pages of a Web application. The results of such an evaluation are fundamental to identify the best techniques to use and to verify whether specific techniques are better than others for given purposes. However, the definition of an evaluation method is not a trivial task in itself, since there is no obvious way to assess the result of a clustering algorithm.

In the remainder of this paper, we describe the gold standard and the task oriented approach (Section 2) and we give a procedure to be followed for an evaluation (Section 3). In Section 4, an example of a hypothetical Web application is given, including the gold standard and the tasks that could be used during an evaluation of alternative clusterings. Finally, related work is discussed and conclusions are drawn in the last two sections.
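To make the keyword-based approach outlined in this section more concrete, the following is a minimal sketch rather than the technique actually evaluated here. It assumes that keyword sets have already been extracted for each page, that similarity is measured as the Jaccard ratio of shared keywords, and that clusters are merged bottom-up with a single-link rule until no pair reaches a threshold. The keyword sets, the threshold value, and the function names are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of keywords common to two pages (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_keywords(keywords, threshold=0.5):
    """Agglomerative grouping: repeatedly merge the two most similar
    clusters until no pair of clusters reaches the threshold."""
    clusters = [{page} for page in sorted(keywords)]

    def similarity(c1, c2):
        # Single link: similarity of the closest pair of pages.
        return max(jaccard(keywords[p], keywords[q]) for p in c1 for q in c2)

    while len(clusters) > 1:
        best, i, j = max(
            (similarity(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best < threshold:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical keyword sets for four pages.
page_keywords = {
    "search.html": {"search", "query", "product"},
    "advancedsearch.html": {"search", "query", "filter"},
    "login.php": {"user", "password"},
    "logout.php": {"user", "session"},
}
clusters_found = cluster_by_keywords(page_keywords, threshold=0.5)
```

With these inputs the two search pages share half of their combined keywords and are merged, while the login and logout pages stay apart because their similarity (one third) is below the chosen threshold. Lowering the threshold would merge them as well, which illustrates how the entity grouping decision affects the final result.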

2 Approach

Clustering techniques do not discover some hidden or unknown structure in a system, but rather impose a structure on the set of entities they are given in input. They (arbitrarily) ignore some features and favor others. The result is a higher level view of the system, according to a specific perspective. Such a view may give useful and interesting information about the organization of the system, or may be completely useless. However, there is no unique, predefined way to partition a system in a useful way, so that different clusterings of a Web application may be equally good and complementary for a Web designer who aims at understanding an existing system.

2.1 Gold standard

The gold standard approach (expert criterion in [2]) is a general evaluation method that is used to measure the performance of competing algorithms which approximate an “ideal solution” to a problem. The gold standard is the reference (“ideal”) solution to the problem. It is usually determined manually, by one or more experts, on a set of examples, and the alternative techniques are applied to such examples to see how close they are to the gold standard. The best technique is the one that gives the solution closest to the reference one.

One of the most widely used languages for the high level and detailed design of software systems is UML (Unified Modeling Language) [1, 4]. UML has been recently adapted to the Web application domain, with the definition of stereotyped elements that can be used to describe the typical components of a Web application [5]. In UML, the basic grouping mechanism that allows describing a system at a high level is called package, and the related view is called package diagram. A package is a grouping of model elements [1], which can be either leaf elements of the package decomposition (e.g., Web pages) or packages themselves. Guidelines and indications for the construction of the package diagram for Web applications are given in [5]. The general principles of “good design” should be followed. This means that packages should be independent units, whose internal elements contribute to a common system behavior. These general design principles can be used as indicators of a good vs. poor high level design of a system, but there is no unique way to decompose a system into packages, and the ability and experience of a designer are key factors in the production of a good design. Since clustering produces a grouping of the basic elements that constitute a Web application (pages, scripts, etc.), it makes sense to compare the output of clustering with the package diagram produced for the application under analysis.

However, the package diagram is not the only possible decomposition of a Web system that is meaningful to a Web developer and that can be used for Web application understanding. Alternative decompositions focused on specific aspects might be equally relevant. Consequently, while it might be useful to adopt the package diagram of a Web application as the gold standard, the evaluation of a clustering method cannot be limited to the ability to recover the package diagram:

• the package diagram available for a Web application depends on the design choices made when it was produced;
• alternative groupings of the Web application entities can be meaningful and useful, although different from the package diagram.

For these reasons the gold standard approach needs to be complemented by a second empirical evaluation method, the task oriented approach, described in the following.

2.2 Task oriented approach

Clusters are useful if they are able to support the daily activities of Web developers, by providing high level information about the organization of a Web application. When Web applications evolve, they are subject to the same process that characterizes software maintenance. First, the Web application is inspected to locate the requested change (program understanding phase). Then, the ripple effects of the change are assessed (impact analysis). Finally, the change is implemented and the application is retested. Availability of views which decompose the system into meaningful units may help during the first two phases. A change can be assigned to one (or a few) units, if they capture the functionality to be changed. Dependences with the other units are followed to determine the impact of the change.

The task oriented approach for the evaluation of alternative software engineering methods is based on the definition of an empirical study [8, 16, 20] for the assessment of the usefulness of the methods in a realistic setting, with a scientific and objective monitoring of the ongoing activities. The alternative methods are compared by having different groups of users (properly sampled) perform (close to) real world tasks, under controlled conditions. The only free variable should be the specific support method (clustering technique) adopted, with all the other influencing variables properly balanced or adjusted. In the context of Web application clustering, the research questions are:

Quest1: Do clustering methods give support to Web application understanding and modification and, if yes, which method is the best?

Quest2: Are there clustering methods that give better support to specific Web application understanding and modification tasks?

The dependent variable is the level of support given by clustering to different Web application maintenance activities. The independent variables are the clustering methods being used.

The task oriented approach does not require that a correct output of the clustering technique be defined. It is rather focused on the usage made of the clustering’s output. If the output of a clustering method is helpful in conducting some of the typical activities in program understanding and modification, then the view extracted by such a clustering method is considered meaningful, in that it represents an abstraction which makes sense for a Web developer and provides interesting information to her/him. The result of a task based evaluation allows assigning clustering techniques to task types. Some techniques may be useful when tasks in a given category are executed, while their support to tasks in other categories might be null.

Task oriented evaluations are expensive, because they require human intensive work in the definition and execution of the tasks, and in the assessment of the support provided by the alternative techniques (scoring). Moreover, the evaluation procedure needs to be carefully designed, in order to minimize the effects of a subjective evaluation, which are inherent in this approach. However, a task oriented evaluation is a fundamental complement to the gold standard approach, because it might be the case that some (or even all) clustering techniques are unable to produce the reference design decomposition represented in the package diagram, but nonetheless they produce a useful view that can effectively support program understanding. Moreover, a task oriented evaluation gives valuable indications on the actual benefits associated with the adoption of a support technique (this information is not provided by the gold standard approach). Finally, it allows determining which technique is more suited for which task (this is also not provided by the gold standard approach).

3 Evaluation Procedure

3.1 Gold standard

Evaluation of a set of alternative clustering algorithms consists of the following steps (to be executed in this order):

• Construction of the package diagram (if not available).
• Computation of clusters by means of alternative techniques.
• Clustering evaluation.

The package diagram has to be constructed in case it is not provided with the Web application (in our experience, a very frequent case). It cannot be derived from the organization of the code into directories, because this method relies on the quality of such an organization, which is often poor (for example, it is common to divide files by type: HTML, images, scripts, etc., or by language). The package diagram has to be constructed before computing the clusters, in order to avoid being influenced by the output of clustering.

3.1.1 Construction of the package diagram

Construction of the package diagram follows the usual design principles adopted when a complex system is decomposed into independent components, containing elements that are cohesive from a semantic, behavioral point of view. The package diagram gives the main components into which a system is logically divided. In turn, each component (package) is hierarchically decomposed into subcomponents (enclosed packages). The guidelines given in [5] can be followed in the definition of the package diagram. In summary, they prescribe that packages in a package diagram be:

• Comprehensible: semantics and responsibilities should be clear.
• Cohesive: logically similar entities are aggregated.
• Loosely coupled: intra-package connections are prevalent over inter-package connections.
• Hierarchically shallow: the suggested number of nesting levels is two or three.

To minimize the subjectivity that is unavoidable in the process of high level design, several different experts can perform the task of package diagram definition in parallel, and then they can compare the produced diagrams. In case of disagreement, they can meet and discuss the rationale for the different design choices. At the end of this process, an agreed diagram for the given Web application is obtained. Consensus is reached by deciding on a set of design principles that are considered important for the given application, and by defining a way to enforce them during package diagram definition.

3.1.2 Clustering evaluation

The clusters produced by the alternative techniques are compared with the gold standard by means of a proper similarity measure, such as the overlap (number of common elements divided by the total number of elements) [10]. The comparison method proposed in [9] could be properly adapted for the purposes of the present work. In [2, 6], the quality of a clustering with respect to the gold standard is assessed in terms of precision and recall. Precision and recall are defined by comparing the intra pairs (pairs of entities placed in the same cluster) in the gold standard and those in the clustering under test. Precision is the percentage of intra pairs in the test clustering that are also in the gold standard, and recall is the percentage of intra pairs in the gold standard that are also in the test clustering.

Another approach to measure the distance between a given clustering and the gold standard is described in [18]. The MoJo (Move-Join) metric proposed in [18] is determined as the minimum number of operations needed to transform the clustering into the gold standard, where the two allowed operations are moving an entity from a cluster to another one (or to a new singleton cluster) and joining two existing clusters.

An important property of a clustering method is its stability [19]. A clustering is stable if small changes in its input (the Web page attributes) produce small changes in the resulting decomposition. Stability should also be considered when alternative methods are contrasted.

3.2 Task oriented approach

Evaluation of a set of alternative clustering algorithms consists of the following steps (to be executed in this order):

• Task definition.
• Computation of clusters by means of alternative techniques.
• Task execution.
• Clustering evaluation.

Tasks are defined before computing the clusters, so that there is no bias toward the result of clustering (of course, a valid alternative is that two different groups work on task definition and cluster computation separately).

3.2.1 Task definition

The tasks used for clustering evaluation should be those typical of the activities performed by a Web developer during the evolution of a Web application. They are specific instances of the general maintenance process which includes program understanding, impact analysis, and change implementation as main, high level activities. The level of granularity of a task should be such that a Web developer can actually perform it, when she/he is given the source code of the Web application. Thus, generic tasks such as “understand the organization of the system” are not allowed. The best method for task definition would be interviewing the developer of the considered Web application and collecting a list of real world tasks. If this is not possible, tasks should be determined by playing the role of the Web developer, and trying to identify real chunks of work that could actually be requested for the evolution of the Web application. Navigation in the Web application from a Web browser could be a useful way to identify areas of improvement, which could trigger task definition.

3.2.2 Task execution and clustering evaluation

Independent variables are the different clustering methods. Thus, if N clustering methods are being evaluated, N + 1 groups of Web programmers are necessary in order to perform the identified tasks under all alternative conditions: each of the N groups accesses the output of a clustering method, while 1 group (control group) performs the tasks without any support from clustering. To minimize the effects of influencing factors external to the experimental design, the N + 1 groups of Web programmers should be balanced in terms of programming experience, knowledge of the Web application under analysis, etc. Moreover, they should contain people with the typical profile of real world Web developers. The environment in which the study is run is also required to be that commonly available to Web developers.

The dependent variable in this experiment is the level of support that each clustering technique gives to Web application maintenance. The null hypothesis for this experiment is that there is no difference among the groups using different (or no) clustering techniques. The null hypothesis can be rejected if there is a statistically significant difference among the groups. To measure this, it is necessary to define a proper set of metrics that capture differences in the execution of the given tasks. One example of such metrics is the time necessary to complete each task. A second possible metric is a subjective assessment on an ordinal scale of the level of difficulty encountered during the execution of each task. An example of an ordinal scale that can be adopted for this purpose is the following:

• [Hard] The task was very difficult.
• [Average] The task was completed with an ordinary effort.
• [Easy] The task was trivial to complete.

Different scores are expected to be given by Web programmers using or not using clustering results (rejection of the null hypothesis).

In order to eliminate the well known effect of the learning curve, it is necessary that the groups of Web programmers that will access the results produced by clustering be properly trained in advance. They need to familiarize themselves with the notion of clustering, with the clustering tool they will use, and with its graphical user interface. The interpretation and usage of clustering information has to be explained in detail, and the training sessions should include the execution and usage of clustering on several example applications (of course, different from the Web application that will be used in the study).

Figure 1. Package diagram of an e-commerce application. [Packages shown: Search, User management, Data base access, Shopping Cart, Payment.]

4 Example

4.1 Gold standard

The high level architecture of an e-commerce application to be used as a benchmark for clustering evaluation is provided in Figure 1 in the form of a package diagram. Packages contain groups of related pages:

Search: search.html, advancedsearch.html, general-search.php, search-help.html, search-error.inc

User management: registration.php, login.php, logout.php, check-password.php, conditions.html, privacy.html

Data base access: query.php, db-lib.php, insert-record.php, remove-record.php

Shopping cart: add-to-cart.php, del-from-cart.php, show.php

Payment: order.php, validate-credit-card.php, total.php, disclaimer.html, security.html, exec-transaction.php, confirmation.inc

The package Search depends (dashed line) on the package Data base access in that the result of a search is a list of items found in a database. The static HTML page search.html includes a form which triggers the execution of the script general-search.php. This script exploits the functions in query.php (an included script file) to execute the appropriate query on the target database (interfaced through db-lib.php).

The package User management is responsible for user registration and authentication (login and logout procedures). Account data and passwords are retrieved from a database (dependence on package Data base access). The shopping cart (package Shopping cart) is filled in with items obtained through a search (dependence on Search). Only authorized users can insert items into the shopping cart (dependence on User management), and information about referenced items is stored in the database (dependence on Data base access). Finally, the package Payment performs the transaction necessary to complete an order. The transaction depends on package Shopping cart for the items to be ordered, on package User management for user authentication, and on the package Data base access for data about items (prices, etc.), user and transaction (credit card information, etc.).

The package diagram in Figure 1, derived from the design documents of the chosen application and validated by an expert, can be used as the gold standard for the evaluation of alternative clustering methods.
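As an illustration of how this gold standard could be compared with a clustering under test, the sketch below encodes the packages listed above and computes precision and recall over intra pairs, following the definition recalled in Section 3.1.2. The candidate clustering, its cluster names, and the function names are hypothetical, chosen only to show the computation. They are not the output of any of the techniques discussed in this paper.

```python
from itertools import combinations

# Gold standard: the packages of the e-commerce example and their pages.
GOLD = {
    "Search": ["search.html", "advancedsearch.html", "general-search.php",
               "search-help.html", "search-error.inc"],
    "User management": ["registration.php", "login.php", "logout.php",
                        "check-password.php", "conditions.html", "privacy.html"],
    "Data base access": ["query.php", "db-lib.php", "insert-record.php",
                         "remove-record.php"],
    "Shopping cart": ["add-to-cart.php", "del-from-cart.php", "show.php"],
    "Payment": ["order.php", "validate-credit-card.php", "total.php",
                "disclaimer.html", "security.html", "exec-transaction.php",
                "confirmation.inc"],
}


def intra_pairs(clustering):
    """All unordered pairs of entities that share a cluster."""
    pairs = set()
    for members in clustering.values():
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


def precision_recall(test, gold):
    """Precision: share of test intra pairs also found in the gold standard.
    Recall: share of gold intra pairs also found in the test clustering."""
    test_pairs, gold_pairs = intra_pairs(test), intra_pairs(gold)
    common = test_pairs & gold_pairs
    precision = len(common) / len(test_pairs) if test_pairs else 0.0
    recall = len(common) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical clustering under test: it splits User management in two
# and merges Shopping cart with Payment.
candidate = {
    "c1": GOLD["Search"],
    "c2": ["registration.php", "login.php", "logout.php", "check-password.php"],
    "c3": ["conditions.html", "privacy.html"],
    "c4": GOLD["Data base access"],
    "c5": GOLD["Shopping cart"] + GOLD["Payment"],
}
precision, recall = precision_recall(candidate, GOLD)
```

For this candidate, 47 of its 68 intra pairs also appear among the 55 intra pairs of the gold standard, giving a precision of 47/68 (about 0.69) and a recall of 47/55 (about 0.85). Merging two packages lowers precision, while splitting a package lowers recall.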

4.2 Tasks

In the following, a set of tasks is described with reference to the e-commerce application presented in the previous section.

1. Introduce a security check for all pages related to buying.
2. Remove the list of hyperlinks at the bottom of pages and replace them with a menu in a new left frame.
3. Add links to similar products in each page describing a product.
4. Advertise the service of a given bank in each page related to the payment.
5. Introduce a stricter error check in pages for product ordering.
6. In pages for product search, add a browsing functionality to access an index of the products possibly related to the query.

The empirical study for clustering assessment consists of providing the subjects with the source code of the application, as well as a browser to navigate it. While the control group has no further information, the other groups of subjects are also provided with the results of one of the clustering methods. In order to make such results usable, a preliminary training session is conducted with these subjects on clustering, on its support to program understanding, and on the tool used to visualize and explore the clusters. Comparison among the groups can be obtained by measuring the time to complete each task and by requesting a subjective evaluation of the difficulty of each task. Moreover, the quality of the maintenance interventions produced by the different groups can be assessed by experts inspecting the changed application.
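One possible way to compare the completion times of two groups is sketched below. The paper does not prescribe a specific statistical test, so the exact two-sided permutation test on the difference of means used here is an assumption. The completion times (in minutes), the group sizes, and the function name are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of means.

    Under the null hypothesis the group labels are exchangeable, so the
    p-value is the share of relabelings whose mean difference is at least
    as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-9:
            extreme += 1
    return extreme / total


# Hypothetical completion times (minutes) for one task.
control_times = [30, 34, 38, 41]      # no clustering support
supported_times = [20, 22, 25, 27]    # one clustering method available
p_value = permutation_p_value(control_times, supported_times)
```

Enumerating all relabelings is feasible only for small groups, since the number of splits grows combinatorially. With larger groups a random sample of relabelings, or a standard rank-based test, would be a more practical choice. The same procedure can be applied to the subjective difficulty scores once they are coded as numbers, keeping in mind that means of ordinal scores should be interpreted with care.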

5 Related work

While an extensive literature exists on clustering algorithms and applications in software modularization [2, 7, 11, 12, 13, 17, 21], only a few works have considered clustering evaluation. Clustering evaluation for optimizing algorithms based on the dependency relationships among entities is discussed in [14, 15]. When the entity features used for clustering are in the form of (weighted) relationships between entities, it is possible to measure the similarity between two alternative clusterings by counting the number of intra- and inter-cluster edges that are common to both clusterings, or by determining the number of intra-cluster edges that must be turned into inter-cluster edges when transforming the first clustering into the second one [14]. The presence of a “consensus” among alternative clustering approaches can be detected by means of the method described in [15]. The result is a new, “average” partition of the entities based on the frequency of the cases where alternative algorithms agree. While this method can be used to assess the agreement among alternative techniques, it does not answer our main research question, i.e., whether clustering is able to produce a view that is meaningful and useful for Web developers.

The comparison of the output of a clustering algorithm with a reference partition of the system (gold standard) is discussed in [9]. The proposed recall metric takes into account the case of two groupings which overlap to a large degree, as well as the case of a grouping being a partial subset of another one. The basic underlying computation is the overlap, measured as the number of common elements divided by the total number of elements [10]. Other similarity/distance metrics for the comparison of the output of clustering are described in [2, 18]. Clustering stability is formally defined in [19]. Usage of a controlled experiment is proposed in [9] for the evaluation of semi-automatic or manual clustering methods. We propose its usage also for the assessment of the usefulness of clustering, in the context of a task oriented evaluation.
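The edge-based comparison of two clusterings mentioned earlier in this section can be illustrated with a small sketch. It is a loose, unweighted reading of the idea of counting edges that are classified in the same way (intra-cluster or inter-cluster) by both clusterings, and it should not be taken as a reimplementation of the measure defined in [14]. The three edges come from the Search example of Section 4.1, while the second clustering and the function name are hypothetical.

```python
def edge_agreement(edges, clustering_a, clustering_b):
    """Share of edges that both clusterings classify in the same way,
    i.e. intra-cluster in both or inter-cluster in both."""

    def is_intra(edge, clustering):
        source, target = edge
        return clustering[source] == clustering[target]

    if not edges:
        return 1.0
    same = sum(
        1 for e in edges
        if is_intra(e, clustering_a) == is_intra(e, clustering_b)
    )
    return same / len(edges)


# Dependencies described for the Search package in Section 4.1.
edges = [
    ("search.html", "general-search.php"),
    ("general-search.php", "query.php"),
    ("query.php", "db-lib.php"),
]
# Clustering A follows the package diagram; clustering B is hypothetical.
by_package = {
    "search.html": "Search",
    "general-search.php": "Search",
    "query.php": "Data base access",
    "db-lib.php": "Data base access",
}
hypothetical = {
    "search.html": "k1",
    "general-search.php": "k2",
    "query.php": "k2",
    "db-lib.php": "k2",
}
agreement = edge_agreement(edges, by_package, hypothetical)
```

Here only the edge between query.php and db-lib.php is intra-cluster in both clusterings, and no edge is inter-cluster in both, so the agreement is one third. A weighted variant would sum edge weights instead of counting edges.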

6 Conclusions

Two alternative approaches for the evaluation of the results produced by Web application clustering have been compared. The gold standard approach is appealing because it can be fully automated, once a gold standard has been agreed upon. However, its main disadvantage is that the gold standard, which in our case corresponds to the UML package diagram, depends on design choices that are made during the process of high level design and during the assignment of responsibilities to components. Thus, it depends on the design rationale, as well as on the designer’s ability and experience. Moreover, the fact that clustering is unable to reproduce a reference package diagram does not mean that the views produced by clustering are not meaningful and useful.

The task oriented approach is an empirical evaluation method that requires expensive user involvement and measurements. Moreover, statistical significance can be achieved only at the price of involving a large number of properly balanced subjects. On the other hand, this approach has several remarkable advantages over the gold standard. In addition to indicating the best technique, it discriminates, on a task basis, between horizontal techniques (supporting all task kinds) and vertical techniques (supporting a single task very strongly), highlighting the presence of complementary techniques. Finally, it gives information on the actual usefulness of each technique.

The implementation of both approaches for the evaluation of a set of clustering methods is essential to answer the research question which originated this work: can clustering support Web understanding and modification? The ability of a clustering technique to recover the package diagram of a Web application is a strong indicator of a positive answer, since package diagrams are known to be an extremely valuable support to software and Web application evolution. However, in case of a negative answer, the outcome of a task oriented empirical study could still indicate that the views extracted by means of clustering are useful and meaningful, although not coincident with (or close to) the package diagram.

References

[1] Unified modeling language (UML) specification, version 1.4. Technical report, Object Management Group (OMG), September 2001.
[2] N. Anquetil and T. C. Lethbridge. Experiments with clustering as a software remodularization method. In Proc. of the 6th Working Conference on Reverse Engineering (WCRE’99), pages 235–255, Atlanta, Georgia, USA, October 1999. IEEE Computer Society.
[3] M. J. Atallah (editor). Algorithms and Theory of Computation Handbook. CRC Press, Boca Raton, Florida, USA, 1999.
[4] G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modeling Language – User Guide. Addison-Wesley Publishing Company, Reading, MA, 1998.
[5] J. Conallen. Building Web Applications with UML. Addison-Wesley Publishing Company, Reading, MA, 2000.
[6] J. Davey and E. Burd. Evaluating the suitability of data clustering for software remodularization. In Proc. of the Seventh Working Conference on Reverse Engineering (WCRE’00), pages 268–277, Brisbane, Australia, November 2000. IEEE Computer Society.
[7] M. Harman, R. Hierons, and M. Proctor. A new representation and crossover operator for search-based optimization of software modularization. In Proc. of the AAAI Genetic and Evolutionary Computation Conference 2002 (GECCO), pages 1359–1366, New York, USA, July 2002.
[8] B. Kitchenham. A method for evaluating software engineering methods and tools. Technical Report TR96-09, DESMET project UK DTI, 1996.
[9] R. Koschke and T. Eisenbarth. A framework for experimental evaluation of clustering techniques. In Proc. of the International Workshop on Program Comprehension (IWPC). IEEE Computer Society, 2000.
[10] A. Lakhotia and J. M. Gravely. Toward experimental evaluation of subsystem classification recovery techniques. In Proc. of the 2nd Working Conference on Reverse Engineering (WCRE), pages 262–269, Toronto, Canada, July 1995. IEEE Computer Society.
[11] G. A. Di Lucca, A. R. Fasolino, U. De Carlini, F. Pace, and P. Tramontana. Comprehending web applications by a clustering based approach. In Proc. of the 10th International Workshop on Program Comprehension (IWPC), pages 261–270, Paris, France, June 2002. IEEE Computer Society.
[12] G. A. Di Lucca, M. Di Penta, and A. R. Fasolino. An approach to identify duplicated web pages. In Proc. of the 26th Annual International Computer Software and Applications Conference (COMPSAC), pages 481–486, Oxford, England, August 2002. IEEE Computer Society.
[13] S. Mancoridis, B. S. Mitchell, Y. Chen, and E. R. Gansner. Using automatic clustering to produce high-level system organizations of source code. In Proc. of the International Workshop on Program Comprehension, pages 45–52, Ischia, Italy, 1998.
[14] B. S. Mitchell and S. Mancoridis. Comparing the decompositions produced by software clustering algorithms using similarity measurements. In Proc. of the International Conference on Software Maintenance (ICSM), Florence, Italy, November 2001. IEEE Computer Society.
[15] B. S. Mitchell and S. Mancoridis. Craft: A framework for evaluating software clustering results in the absence of benchmark decompositions. In Proc. of the Working Conference on Reverse Engineering (WCRE), Stuttgart, Germany, October 2001. IEEE Computer Society.
[16] S. L. Pfleeger. Experimental design and analysis in software engineering. SIGSOFT NOTES, Parts 1 to 5, 1994 and 1995.
[17] F. Ricca and P. Tonella. Using clustering to support the migration from static to dynamic web pages. In Proc. of the International Workshop on Program Comprehension (IWPC), pages 207–216, Portland, Oregon, USA, May 2003. IEEE Computer Society.
[18] V. Tzerpos and R. Holt. MoJo: a distance metric for software clusterings. In Proc. of the Working Conference on Reverse Engineering (WCRE), pages 187–193, Atlanta, Georgia, USA, July 1999. IEEE Computer Society.
[19] V. Tzerpos and R. Holt. On the stability of software clustering algorithms. In Proc. of the International Workshop on Program Comprehension (IWPC), pages 211–220, Limerick, Ireland, June 2000. IEEE Computer Society.
[20] A. von Mayrhauser and A. Vans. Comprehension processes during large scale maintenance. In Proceedings of the International Conference on Software Engineering, pages 39–48, Sorrento, Italy, May 1994. IEEE Computer Society Press.
[21] T. Wiggerts. Using clustering algorithms in legacy systems remodularization. In Proc. of the 4th Working Conference on Reverse Engineering (WCRE), pages 33–43. IEEE Computer Society, 1997.