International Journal of Software Engineering and Knowledge Engineering, Vol. 22, No. 4 (2012) 525-548. © World Scientific Publishing Company. DOI: 10.1142/S0218194012500131


AN EMPIRICAL STUDY OF SOFTWARE METRICS FOR ASSESSING THE PHASES OF AN AGILE PROJECT

GIULIO CONCAS*, MICHELE MARCHESI†, GIUSEPPE DESTEFANIS‡ and ROBERTO TONELLI§
Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, Cagliari, 09123, Italy
*[email protected] †[email protected] ‡[email protected] §[email protected]
http://www.diee.unica.it

Received 13 September 2011
Revised 7 November 2011
Accepted 25 January 2012

We present an analysis of the evolution of a Web application project developed with object-oriented technology and an agile process. During the development we systematically performed measurements on the source code, using software metrics that have been proved to be correlated with software quality, such as the Chidamber and Kemerer suite and Lines of Code metrics. We also computed metrics derived from the class dependency graph, including metrics derived from Social Network Analysis. The application development evolved through phases, characterized by different levels of adoption of some key agile practices, namely pair programming, test-based development and refactoring. The evolution of the metrics of the system, and their behavior in relation to the agile practices adoption level, is presented and discussed. We show that, in the reported case study, a few metrics are enough to characterize with high significance the various phases of the project. Consequently, software quality, as measured using these metrics, seems directly related to agile practices adoption.

Keywords: Software metrics; software evolution; agile methodologies; object-oriented metrics; SNA metrics applied to software.

1. Introduction

Software is an artifact that can be easily measured, being readily available and composed of unambiguous information. In fact, since the inception of software, many kinds of metrics have been proposed to measure its characteristics. The main goals of software metrics are to measure the effort needed to develop software, or to measure its quality. Effort metrics are relatively simple and well understood. They cover the requirements phase, with metrics such as "Function Points" [2] and the like,


up to the design and coding phases, with metrics ranging from the simple "Lines of Code" (LOC) to more complex ones like Cyclomatic Complexity [25]. While the effectiveness of effort metrics in predicting and measuring the actual costs of software development is still debated, in this paper we will not focus on this kind of metric, but only on quality metrics. Software quality metrics aim to measure how "good" a piece of software is, especially from the point of view of being error-free and easy to modify and maintain. They tend to measure whether software is well structured, not too simple and not too complex, with cohesive modules that minimize coupling. Many quality metrics have been proposed, depending also on the paradigm and languages used: there are metrics for structured programming, object-oriented programming, aspect-oriented programming, and so on. In this paper we will focus on object-oriented (OO) metrics, since the OO paradigm is nowadays by far the most popular among developers.[a]

In dealing with software metrics, however, the main point is not to come up with new, sensible metrics able to measure software, but to empirically demonstrate their usefulness in practice. Empirical proofs of the value of metrics to assess software quality are mainly based on finding correlations between specific metrics and the fault-proneness of software modules, that is, the number of faults that were found and fixed. Unfortunately, considering software quality as just inversely related to the number of faults has its drawbacks. The first is that the relationship between a fault and a software module is typically declared when the module is modified to fix the fault; however, a module is often modified as a consequence of an error, not because it is itself wrong. Moreover, simply relating quality to the (absence of) faults does not account for other characteristics that are very important in software development, such as ease of maintenance, but that are much more difficult to relate to software metrics.

In this work we present the possible use of OO metrics to indirectly assess the quality of the developed software, by showing significant changes in time as the development proceeds along different phases. In these phases, various specific "agile" development practices were used, or their use was discontinued. In this context, we assess the ability of some metrics to discriminate among the phases of the project, and therefore the usage of specific practices. We present results on an industrial case study, and discuss their implications and relationships with previous research. We understand that the presented evidence is anecdotal, but with real software projects it is very difficult to plan multi-project studies of this kind, because software houses tend to be very secretive about their projects.

[a] The relative diffusion of programming languages is continuously monitored by several Web sites. Among them, lang-index.sourceforge.net monitors the usage of languages in Sourceforge Open Source projects; there, in November 2011, the share of OO languages was greater than 55%. Tiobe's monthly Programming Community Index (www.tiobe.com/index.php/content/paperinfo/tpci), published since 2001, ranks the top 50 languages based on searching the Web with certain phrases that include language names and counting the number of hits returned; there, the combined rating of OO languages in November 2011 was 55.3%.


We hope that other researchers will try to replicate the presented results on similar projects whose data they can access.

The target of our research is the evolution of a software project consisting of the implementation of FLOSS-AR, a program to manage the Register of Research of universities and research institutes. FLOSS-AR was developed with a full object-oriented (OO) approach and released under the GPL v.2 open source license. It is a Web application, implemented as a specialization of an open source project, jAPS (Java Agile Portal System) [18], a Java framework for Web portal creation. Throughout the project we collected metrics on the software product under development. We used the Chidamber and Kemerer (CK) OO metrics suite [8], as well as complexity metrics computed from the class dependency graph [10]. The project was developed following an agile process [5, 6] with varying adoption levels of some key agile practices, namely Pair Programming (PP), Test-Driven Development (TDD) and refactoring [5], which were recorded during the project. We show how some metrics computed on the developed code seem able to discriminate, in a statistically significant way, among the various phases of the project, which in turn are characterized by the adoption, or non-adoption, of the above-mentioned agile practices (PP, TDD, refactoring). In this way, the quality of an ongoing project might be controlled using these metrics.

This paper is organized as follows: in Sec. 2 we present the CK, graph-theoretical and SNA metrics computed on the software; in Sec. 3 we discuss prior literature on software metrics; in Sec. 4 we present the phases of the development; in Sec. 5 we present and discuss the results, relating software quality, as resulting from the metric measurements, to the adoption of agile practices; Sec. 6 deals with the threats to the validity of the paper, which is concluded in Sec. 7.

2. Software Metrics

In this section we briefly introduce all the metrics studied in our work, used as a starting point to choose the metric subset best suited to discriminate between the various project phases. For a more detailed description, references with their definitions and possible uses are given. The metrics we computed throughout the project are the OO metrics suite by Chidamber and Kemerer [8], graph-theoretical metrics, and Social Network Analysis (SNA) metrics.

The Chidamber and Kemerer (CK) metrics suite is perhaps the most studied among OO metrics suites, and its relationship with software fault-proneness has already been validated by many researchers. The CK metrics are: Number Of Children (NOC) and Depth of Inheritance Tree (DIT), related to inheritance; Weighted Methods per Class (WMC) and Lack of Cohesion in Methods (LCOM), pertaining to the internal class structure; and Coupling Between Objects (CBO) and Response For a Class (RFC), related to relationships among classes. Several papers related CK metrics to software quality, not always agreeing on which metrics are the most correlated with lack of faults and ease of maintenance; see Sec. 3 for a survey of the related literature.
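To make the computation of a CK metric concrete, the sketch below implements the LCOM count as defined by Chidamber and Kemerer (method pairs sharing no instance variable minus pairs sharing at least one, floored at zero; see Table 1). It is a minimal illustration: the toy class, its method names and variable sets are invented for the example, and this is not the tool used in the study.

```python
from itertools import combinations

def lcom(method_vars: dict[str, set[str]]) -> int:
    """LCOM: No. of method pairs sharing no instance variable, minus
    No. of pairs sharing at least one; zero if the difference is negative."""
    p = q = 0  # p: non-sharing pairs, q: sharing pairs
    for (_, vars_a), (_, vars_b) in combinations(method_vars.items(), 2):
        if vars_a & vars_b:
            q += 1
        else:
            p += 1
    return max(p - q, 0)

# Toy class: each method mapped to the instance variables it accesses.
print(lcom({"getX": {"x"}, "setX": {"x"}, "render": {"canvas"}}))  # 2 - 1 = 1
```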


As presented and discussed in the next section, among the CK metrics, WMC and CBO are those that have been found to be most correlated with software quality. RFC and LCOM were sometimes, but not always, found to be correlated with fault-proneness or with the maintenance effort related to a class. DIT was sometimes found to be correlated, but was also often found not correlated, or exhibiting too little variation in its values. NOC is the CK metric least related to software quality. In general, the lower the values of the CK metrics, the better the quality of the system. Note that a recent work on the evolution of the Eclipse Java system shows that the cohesion/coupling metrics do not behave as expected in some cases [3]. For instance, in that paper, cohesion metrics were found to decrease after restructurings that should have increased cohesion, and similar results were found regarding coupling. However, [3] studies coupling and cohesion at the package and plugin level, while all our analysis is made at the class level.

The second kind of metrics we analyzed are derived from network theory applied to the software graph. In fact, it is possible to build a directed graph, called the class graph, from the source code of an OO system, the nodes of the graph being the classes (or interfaces), and the edges being the dependencies between classes. In this graph, we can define the Fan-In (or in-degree) of a class as the number of edges directed toward the class; the in-degree is a measure of how much the class is used by other classes in the system. The Fan-Out (or out-degree) of a class is the number of edges directed away from the class; it counts how many other classes of the system are used by the class. Fan-In and Fan-Out measure the number of different classes using, or used by, the target class. These metrics can also be weighted by the number of times another class uses, or is used by, the target class, thus yielding weighted Fan-In/Fan-Out. As an example, if class A uses class B three times (for instance defining an instance variable of type B, and two local variables of type B in two methods), A's Fan-Out is increased by one, while its weighted Fan-Out is increased by three. Fan-In and Fan-Out, weighted or not, are the graph-theoretical metrics we considered. They are related to complex network theory, because it is well known that in complex networks their distribution is fat-tailed, and often a power-law [28].

We also consider the class LOC metric, that is, the number of lines of code of the class. It is good OO programming practice to create small and cohesive classes, so the class LOC metric should also be kept reasonably low in a "good" system.

Graph-theoretical metrics can be related to the CK metrics pertaining to the relationships among classes. The CK CBO metric, being the count of the number of other classes to which a given class is coupled, denotes class dependency on other classes in the system, and is therefore strictly related to the sum of the Fan-In and Fan-Out of a class node in the class graph, because links represent dependencies between classes. Similarly, the CK RFC metric is computed as the sum of the number of methods of a class and the number of external methods called by them; this latter quantity is strictly related to the weighted Fan-Out of the class node.
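As an illustrative sketch of the graph-theoretical metrics just defined (not the authors' actual analyzer), the fragment below builds a class graph from an already-extracted dependency list, using parallel edges to preserve per-use counts; the class names and the use of networkx are our own assumptions.

```python
import networkx as nx

# Hypothetical output of a source-code parser: one pair per single use of
# a class by another class (repeated pairs yield the weighted variants).
dependencies = [
    ("A", "B"), ("A", "B"), ("A", "B"),   # class A uses class B three times
    ("A", "C"),                           # ...and class C once
    ("C", "B"),
]

g = nx.MultiDiGraph()        # parallel edges preserve the per-use counts
g.add_edges_from(dependencies)

for cls in sorted(g.nodes):
    fan_in = len(set(g.predecessors(cls)))    # distinct classes using cls
    fan_out = len(set(g.successors(cls)))     # distinct classes cls uses
    wfi, wfo = g.in_degree(cls), g.out_degree(cls)
    print(f"{cls}: FI={fan_in} WFI={wfi} FO={fan_out} WFO={wfo}")
# As in the example above, A has FO=2 (B and C) but WFO=4 (3 uses + 1 use).
```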


The third group of metrics we used are SNA metrics [30]. These metrics also come from complex network theory. They were introduced for sociological analysis, and have recently been used on software graphs as well. There are several variations of SNA metrics. We decided to restrict the analysis to SNA metrics that account for the directionality of edges, and that can be considered meaningful in a software engineering context. These metrics are: in- and out-Reach Efficiency, in- and out-Two Step Reach, in- and out-number of weak components, and in- and out-Closeness. These and other metrics are fully explained in [12].

The studied SNA metrics have an interpretation from the OO software development point of view. Recall that the nodes of the network are classes or interfaces, while a directed edge represents a dependency between two classes: the class the edge comes from somehow uses the class the edge points to. High reach efficiency indicates that the primary contacts of a class are influential in the network. A high REI means that the classes using a given class are in turn used by many other classes; it is a measure of the degree of reuse of a class, not only directly but also in two steps. A high REO means that a class uses other classes, which in turn use further classes; it is a measure of two-step dependence on the rest of the system. Both these metrics are related to coupling, and should be kept at relatively low values to minimize coupling among the classes of the system. Weak Components is a normalized measure of how many disjoint sets of other classes are coupled to a given class. In general, it is an indirect measure of coupling: the higher the WC, the lower the coupling among the classes coupled to a given class. Closeness-In is a measure of how easy it is for a class to be reached, directly or indirectly, by other classes that need its services. Similarly, Closeness-Out is a measure of how many dependency steps are needed to reach all the other (reachable) classes of the system. The two closeness measures are related to the "small-world" property of a software network. For a single class, the hypothesis is that the more central a class is, the more defects it will have. For ensemble measures over the whole system, such as the mean or a percentile of CI or CO, the hypothesis is that a smaller value of centrality denotes a smaller coupling among classes. Note that these measures can vary greatly for entire ensembles of classes if a link is added to a set of classes that were not previously connected, or if such a link is removed.

Table 1 summarizes the metrics we computed for the system under development. Throughout the project, we computed and analyzed the evolution of a set of source code metrics including the CK suite of quality metrics, the total number of classes, the lines of code of classes (LOC), and the above-described metrics derived from the analysis of the software graph. All the cited metrics are measurements made on single classes, so there is a value of each metric for every class (and interface) of the system. However, we are mainly interested in measures of the whole system, able to give a synthetic picture of its quality.

Table 1. The metrics used to study the system.

Metric  Type   Description
NOC     CK     Number of Children - No. of immediate subclasses.
DIT     CK     Depth of Inheritance Tree - No. of superclasses, up to the root.
WMC     CK     Weighted Methods per Class - No. of methods of the class (weight = 1).
LCO     CK     Lack of Cohesion in Methods (LCOM) - No. of method pairs not sharing any instance variable minus No. of pairs sharing at least one; zero if negative.
CBO     CK     Coupling Between Objects - No. of other classes that depend on the given class, or on which the given class depends (excluding inheritance).
RFC     CK     Response For a Class - No. of methods plus No. of dependencies on other classes (excluding inheritance).
FI      Graph  Fan-In - No. of other classes that depend on the given class.
WFI     Graph  Weighted Fan-In - No. of times all other classes depend on the given class.
FO      Graph  Fan-Out - No. of other classes on which the given class depends.
WFO     Graph  Weighted Fan-Out - No. of times the given class depends on other classes.
REI     SNA    Reach Efficiency In - Percentage of nodes within two-step distance from a node, following arcs from head to tail, divided by the No. of nodes within one step.
REO     SNA    Reach Efficiency Out - Percentage of nodes within two-step distance from a node, following arcs along their direction, divided by the No. of nodes within one step.
WC      SNA    Weak Components - No. of disjoint sets of nodes within one step from a node, not considering the node itself, divided by the No. of nodes within one step.
CI      SNA    Closeness-In - Reciprocal of Farness-In, defined as the sum of the lengths of all shortest paths from the node to all other nodes, following arcs from head to tail, divided by the No. of reachable nodes.
CO      SNA    Closeness-Out - Reciprocal of Farness-Out, defined as the sum of the lengths of all shortest paths from the node to all other nodes, following arcs along their direction, divided by the No. of reachable nodes.
LOC     Dim.   Lines Of Code - No. of lines of code of the class, excluding comments and blank lines.
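The sketch below illustrates our reading of the Closeness and Reach Efficiency definitions in Table 1 on a directed class graph; it is a minimal illustration using networkx, not the tool used in the study, and the exact normalizations follow our interpretation of the definitions above.

```python
import networkx as nx

def closeness_out(g: nx.DiGraph, node) -> float:
    """Closeness-Out: reciprocal of Farness-Out, i.e. of the sum of
    shortest-path lengths from `node` divided by the reachable nodes."""
    lengths = nx.single_source_shortest_path_length(g, node)
    del lengths[node]                      # exclude the node itself
    if not lengths:
        return 0.0                         # no reachable classes
    return len(lengths) / sum(lengths.values())

def closeness_in(g: nx.DiGraph, node) -> float:
    # Following arcs from head to tail is a walk on the reversed graph.
    return closeness_out(g.reverse(copy=False), node)

def reach_efficiency_out(g: nx.DiGraph, node) -> float:
    """REO: percentage of nodes within two steps of `node`, divided by
    the number of nodes within one step (the direct contacts)."""
    lengths = nx.single_source_shortest_path_length(g, node, cutoff=2)
    one_step = sum(1 for d in lengths.values() if d == 1)
    two_step = sum(1 for d in lengths.values() if d in (1, 2))
    if one_step == 0:
        return 0.0
    return (100.0 * two_step / (g.number_of_nodes() - 1)) / one_step
```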

To this purpose, we computed statistics of the metric values over all the classes of the system during its development, and used some of these statistics as measures of the whole system. More about this in the section on results.

3. Related Work

Several papers related CK metrics to software quality, not always agreeing on which metrics are the most correlated with lack of faults and ease of maintenance. In a study of two commercial systems, Li and Henry studied the link between CK metrics and maintenance effort [22]. Basili et al. found that many of the CK metrics were associated with the fault-proneness of classes [4]. In another study, on three industrial projects, Chidamber et al. reported that WMC, CBO and RFC look highly correlated with each other, and that higher values of the CK coupling and cohesion metrics (CBO and LCOM) were associated with reduced productivity and increased rework/design effort [9]. Subramanyam and Krishnan studied a large system written in C++ and Java, and found a good correlation between the number of defects and


WMC, CBO and DIT [32]. Gyimothy et al. systematically studied the open-source Mozilla system, finding that above all CBO, and then RFC, LCOM, WMC and DIT, show a fair correlation with defects [17]. Succi et al. reported a broad empirical exploration of the distributions of CK metrics across several Java and C++ projects, confirming that some metrics are fairly correlated, and that the NOC and DIT metrics generally exhibit a low variance, so they are less suitable for a systematic assessment based on metric computation [33].

Recently, some papers were published on the use of OO metrics to assess the quality of software developed using agile methodologies. Giblin et al. presented a case study comparing the source code produced using agile methods with the source code produced for a similar type of application by the same team using a more traditional methodology. They made extensive use of specific OO metrics, and concluded that agile methods guided the developers to produce better code in both quality and maintainability [16]. Kunz et al. presented a methodological work discussing cost estimation approaches for agile software development, and a quality model making use of distinct metrics for quality management in agile software development [20]. Melis et al. used software process simulation to assess the effect of the use of PP and TDD on effort, size, quality and released functionalities [26]. They found that increasing the usage of these practices significantly diminishes product defectiveness, and increases programming effort. Dybå and Dingsøyr reported a systematic review of other empirical studies of agile software development, including, in their Sec. 4.7, some other empirical evaluations of product quality [14]. These studies include a paper by Layman et al. [21] on an industrial project before and after the adoption of Extreme Programming, reporting a 65% decrease in pre-release defect rate, and a 35% decrease in post-release defect rate, after XP adoption; a paper by Macias et al. [24] comparing 20 student projects using Waterfall and XP methodologies, reporting no significant differences in external and internal quality factors; and a paper by Wellington et al. [34] comparing the development of 4 systems by 20 student teams using plan-driven and XP methodologies, reporting that XP code shows consistently better quality metrics, among which a decrease of 40% in the WMC average value.

One of the most studied agile development practices in the literature is TDD. Here we report papers studying the influence of TDD on software quality and OO metrics. Canfora et al. [7] studied a set of 28 professional developers, asked to develop a test project. They found that TDD improves unit testing but slows down the overall process. Nagappan et al. [27] studied industrial projects carried out in various contexts, using Java, C++ and .NET. The results indicated that the pre-release defect density decreased between 40% and 90% compared to similar projects that did not use the TDD practice; the teams experienced a 15-35% increase in initial development time after adopting TDD. Janzen and Saiedian [19] studied various projects, industrial and academic, also analyzing the effect of the use of TDD on OO metrics computed on the developed software. They found that test-first


programmers consistently produced classes with lower values of the WMC metric; CBO and the Fan-Out of the studied classes did not show a significant difference between software developed with or without TDD; and the LCOM* metric (a normalized LCOM, constrained to the [0, 1] interval) also showed no significant difference. Siniaalto and Abrahamsson [31] studied 5 small-scale case projects (5-9 KLOCs each), mainly performed by students. They found that WMC, CBO, RFC, NOC and LCOM do not significantly differ between software developed with or without TDD; however, they also found significantly lower values of RFC in TDD software, as well as significantly higher values of DIT. Concas et al. published a paper using the same empirical data as this paper, limiting their study to CK and LOC metrics (class LOC and method LOC) and describing in greater detail the agile practices used in the project [11]. They found that all considered metrics but LCOM are able to discriminate very well between the first two phases of the project (the initial "Agile" phase and the "cowboy coding" phase), while only a few metrics maintain the ability to discriminate between subsequent phases, and no metric is able to discriminate between all pairs of consecutive phases at a significance level greater than 95%.

A few papers have been published regarding the relationships of graph-theoretic and SNA metrics with software quality. Among these, Zimmermann and Nagappan [35] computed and studied many SNA metrics, on both the directed and undirected software graphs of the binary modules of the Windows Server 2003 operating system and their dependencies. They found that some SNA metrics could identify 60% of the binaries that the Windows developers considered critical, twice as many as those identified by complexity metrics (size, No. of functions, parameters and globals, Fan-In, Fan-Out). Concas et al. [12] presented an extensive analysis of software metrics for 111 object-oriented systems written in Java, including SNA metrics, finding systematic non-normal behavior in their distributions, and studying the correlations among metrics. Concas et al. [13] studied the application of CK and SNA metrics to the Eclipse and Netbeans open source systems, and performed an analysis of their correlation with defects found in classes; they found that the metrics most correlated with defects are LOC, RFC and CBO.

4. Project Phases

Besides a first exploratory phase at the beginning of the project, in which the team studied the functionalities of the underlying open source Web portal management system (jAPS) and how to extend it, without producing code, the project evolved through four main phases, each characterized by an adoption level of the key agile practices of pair programming, TDD and refactoring. In particular:

- Pair Programming was one of the keys to the success of the project. All development tasks were assigned to pairs, not to single programmers. Given a


task, each pair decided which part of it to develop together, and which part to develop separately. The integration was typically done working together. Sometimes the developers paired with external programmers belonging to the jAPS development community, and this helped to quickly grasp the needed knowledge of the framework.
- Regarding TDD, developers had the requirement that all code must have automated unit tests and acceptance tests, and must pass all tests before it can be released. The choice whether to write tests before or after the code was left to the programmers.
- Refactoring was practiced mainly to eliminate code duplication and improve hierarchies and abstractions. Unfortunately, data on the specific refactorings applied were not recorded. The developers had a fair knowledge of Fowler's book [15], so several refactorings cited there were applied.

A full account of the agile practices used in the project, and preliminary results on the use of CK metrics for discriminating among phases, is reported in [11]. To give empirical evidence for these phases, we asked each of the five members of the development team to define, in their judgement, the system evolution phases with respect to PP, TDD and refactoring usage, and the dates when these phases started and ended. Four out of five members cited the four phases; only one proposed three phases, merging Phases 3 and 4 into a single phase. Regarding the dates defining the boundaries between phases, all agreed that week 17 marked the end of Phase 2, obviously related to the date of the public presentation of the system. The end of Phase 1 was attributed to weeks from 8 to 11, with median equal to, and mean close to, week 10. The end of Phase 3 was attributed to weeks 20 and 21, with the majority saying 21. The resulting phases, which we will consider in the remainder of the paper, are summarized below:

- Phase 1 (Initial Agile): a phase characterized by the full adoption of all practices, including testing, refactoring and pair programming. It lasted ten weeks, leading to the implementation of a key set of the system features. In practice, specific classes to model and manage the domain of research organizations, roles, products and subjects were added to the original classes managing the content management system, user roles, security, the front end and basic system services. The new classes include service classes mapping the model classes to the database, and allowing their presentation and user interaction.
- Phase 2 (Cowboy Coding): a critical phase, characterized by a minimal adoption of pair programming, testing and refactoring, because a public presentation was approaching and the system still lacked many of the features of competitors' products. The team therefore rushed to implement them, compromising quality. This phase lasted seven weeks, and included the first release of the system after two weeks.
- Phase 3 (Refactoring): an important refactoring phase, characterized by the full adoption of the testing and refactoring practices and by the adoption of a rigorous pair


programming rotation strategy. The main refactorings performed were "Extract Superclass", to remove duplications and extract generalized features from the classes representing research products and the corresponding service classes, and "Extract Hierarchy", applied to a few "big" classes, such as an Action class that managed a large percentage of all the events occurring in the user interface. This phase was needed to fix the bugs and the bad design resulting from the previous phase. It lasted four weeks and ended with the second release of the system.
- Phase 4 (Mature Agile): like Phase 1, this is a development phase characterized by the full adoption of the entire set of practices, until the final release, after eight weeks.

5. Results and Discussion

In this section we analyze the evolution of the FLOSS-AR source code metrics. At regular intervals of one week, the source code was checked out from the CVS repository and analyzed by a parser that calculated the metrics. The parser and the analyzer were developed by our research group as a plug-in for the Eclipse IDE. We thus gathered 30 "snapshots" of the system, one for each development week.

5.1. Correlations of the metrics of a given system

To study how the 19 metrics, each computed for all classes of a given system, are correlated, we calculated the cross-correlation values of the various considered metrics on the last release of the system under study. We used Kendall's nonparametric measure of rank correlation [29], because Pearson's correlation coefficients were highly influenced by outliers, while the computation of Spearman's rank correlation coefficient suffered from the many tied values found in integer data. The results are reported in Table 2, where coefficients whose absolute value is above 0.6 are marked with an asterisk. Correlation tests made on other snapshots of the system yield very similar results.

We found a high correlation between several pairs of metrics. RFC is fairly correlated with WMC, CBO and LOC, while LCO is correlated with WMC. FI and WFI are the most correlated metrics (τ = 0.90), but FI and WFI are also correlated with REI, CI and REO, the latter being strongly anti-correlated. FO is correlated with CBO, RFC, WFO and REI, while WC is correlated with CBO. Finally, CI is correlated with WFO, which in turn is also correlated with LOC and RFC. Reach Efficiency, and in particular REO, tends to be anti-correlated with most other metrics. These correlations do not mean that some metrics can easily be substituted by others. However, they are a good starting point for reducing the number of metrics to study.
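As a sketch of this computation, the Kendall correlation matrix of the per-class metric values can be obtained with pandas; the file name and the column subset below are hypothetical, since the actual output format of the parser is not described in the paper.

```python
import pandas as pd

# One row per class, one column per metric, for a single weekly snapshot
# (hypothetical file name; any per-class metric table works the same way).
snapshot = pd.read_csv("metrics_week_30.csv")

cols = ["LOC", "WMC", "LCO", "CBO", "RFC", "FI", "WFI", "FO", "WFO"]
# Kendall's tau is rank-based, hence robust to the outliers that distort
# Pearson and to the many tied integer values that trouble Spearman.
kendall = snapshot[cols].corr(method="kendall")
print(kendall.round(2))

# Pairs with |tau| > 0.6 are the redundancy candidates discussed below.
print(kendall.where(kendall.abs() > 0.6).round(2))
```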

Table 2. The Kendall rank cross-correlation coefficients of the considered metrics, computed on all classes of the last version of the FLOSS-AR system (coefficients whose absolute value is above 0.6 are marked with *).

Metric  LOC    WMC    LCO    NOC    DIT    CBO    RFC    FI     WFI    FO     WFO    REI    REO    WC     CI     CO
LOC     1.00   0.57   0.39   0.00   0.06   0.41   0.69*  0.02   0.05   0.48   0.60   0.20   0.06   0.35   0.02   0.06
WMC     0.57   1.00   0.60   0.11   0.09   0.37   0.68*  0.24   0.28   0.20   0.30   0.04   0.14   0.35   0.16   0.06
LCO     0.39   0.60   1.00   0.00   0.17   0.28   0.44   0.22   0.25   0.10   0.14   0.08   0.17   0.24   0.14   0.07
NOC     0.00   0.11   0.00   1.00   0.06   0.08   0.06   0.36   0.35   0.05   0.03   0.21   0.24   0.21   0.29   0.18
DIT     0.06   0.09   0.17   0.06   1.00   0.02   0.01   0.20   0.21   0.24   0.29   0.17   0.29   0.04   0.16   0.14
CBO     0.41   0.37   0.28   0.08   0.02   1.00   0.57   0.33   0.32   0.56   0.44   0.13   0.15   0.74*  0.17   0.06
RFC     0.69*  0.68*  0.44   0.06   0.01   0.57   1.00   0.08   0.12   0.57   0.57   0.22   0.04   0.46   0.07   0.06
FI      0.02   0.24   0.22   0.36   0.20   0.33   0.08   1.00   0.91*  0.18   0.13   0.58   0.59   0.42   0.59   0.12
WFI     0.05   0.28   0.25   0.35   0.21   0.32   0.12   0.91*  1.00   0.16   0.11   0.56   0.54   0.39   0.61*  0.11
FO      0.48   0.20   0.10   0.05   0.24   0.56   0.57   0.18   0.16   1.00   0.77*  0.52   0.28   0.39   0.11   0.06
WFO     0.60   0.30   0.14   0.03   0.29   0.44   0.57   0.13   0.11   0.77*  1.00   0.40   0.22   0.35   0.08   0.10
REI     0.20   0.04   0.08   0.21   0.17   0.13   0.22   0.58   0.56   0.52   0.40   1.00   0.40   0.00   0.37   0.12
REO     0.06   0.14   0.17   0.24   0.29   0.15   0.04   0.59   0.54   0.28   0.22   0.40   1.00   0.26   0.34   0.16
WC      0.35   0.35   0.24   0.21   0.04   0.74*  0.46   0.42   0.39   0.39   0.35   0.00   0.26   1.00   0.22   0.09
CI      0.02   0.16   0.14   0.29   0.16   0.17   0.07   0.59   0.61*  0.11   0.08   0.37   0.34   0.22   1.00   0.09
CO      0.06   0.06   0.07   0.18   0.14   0.06   0.06   0.12   0.11   0.06   0.10   0.12   0.16   0.09   0.09   1.00

From the correlations studied, and from common knowledge of OO metrics, as specified below, the following metrics can be considered candidates to be dropped, or substituted by other metrics:

- NOC and DIT: it is well known that most authors consider these metrics the least correlated with faults [17]. In our case, we found that the mean and 90th percentile of the NOC and DIT metrics show small variations across the snapshots, and look less useful than other metrics for discriminating among the phases. The only large variation in DIT is between weeks 17 and 18, thus between Phase 2 and Phase 3, when a new abstract superclass, "EntityManager", was introduced to generalize a large part of the behavior of 18 existing classes. This led to a jump in DIT, and a corresponding drop in WMC, CBO, RFC, FI and FO, because many dependencies between each of the 18 subclasses and other classes were pushed up the hierarchy, to the new class. Overall, inheritance links contribute only about 4% of all links of the software graph. For this reason, despite the importance of inheritance in OO development, the NOC and DIT metrics were not considered to discriminate among the phases of the presented case study.
- WMC: the information carried by this metric is also found in LOC (the more methods in a class, the more lines of code) and RFC (which includes WMC in its computation).
- CBO: it is well correlated with RFC, FI and FO, as known from the literature [10, 33], so we will not consider it.
- WFI: FI is an almost perfect substitute, because it is strongly correlated with WFI and exhibits correlations very similar to those of WFI with all other metrics; moreover, it is simpler to compute.
- FO, WFO: these metrics are well represented by the RFC metric. Moreover, their averages over all the classes of the system are the same as the averages of FI and WFI, respectively. This is because their average is the average number of in-links and out-links over all system classes; since each in-link corresponds to one out-link, their total numbers, and hence their averages, are the same. This is true for both weighted and non-weighted links.

We decided to keep all the SNA metrics, because they are not yet well studied in the software field, so they deserve to be investigated in more depth. Note that we performed the analysis of variations in metric statistics reported in the following also for the metrics considered substituted by others, confirming that their behavior is consistent with that of their substitutes. In this way, the paper is simpler, without losing information. In the end, we analyze the behavior of the following nine metrics as the system development evolved: LCOM, RFC, FI, REI, REO, WC, CI, CO and LOC.

5.2. Metric statistics across system snapshots and their correlations

The total number of classes in the system (including abstract classes and interfaces), which is a good indicator of its size, increases over time, though not linearly. The project started with 362 classes, those of jAPS release 1.6. At the end of the project, after 30 weeks, the system had grown to 514 classes, due to the development


Fig. 1. Total No. of classes during the evolution of the FLOSS-AR system.

of new features that constituted the specialized system. Figure 1 shows the evolution of the number of classes during development, together with the four main phases of development and the weeks of the three releases of the system.

We computed key statistics of these metrics (mean, standard deviation, median, 90th percentile) for each of the 30 snapshots analyzed. Remember that these metrics are always positive, and none of them is normally distributed; they all follow "fat-tailed" distributions, often power-laws [10, 23, 13], so the statistics must focus mainly on the extreme tail. We found that the best statistics to account for the behavior of a metric in the whole system are the mean, a rough measure of the overall behavior of the metric across all classes of the system, and the 90th percentile, which gives information on the tail. The standard deviation gives information only on how the values are spread, not on the values themselves, while the median is skewed toward values that are too low, and tends to be fairly constant.
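The effect of fat tails on these statistics is easy to see numerically; in the following minimal illustration the per-class Fan-In values are invented, not measured on FLOSS-AR.

```python
import numpy as np

# Invented per-class Fan-In values with a fat tail, as typical for
# software networks: most classes have few users, a few have many.
fi = np.array([0, 0, 1, 1, 1, 2, 2, 3, 4, 5, 8, 13, 34, 89])

print("mean           :", round(fi.mean(), 2))    # pulled up by the tail
print("median         :", np.median(fi))          # stuck at low values
print("90th percentile:", np.percentile(fi, 90))  # tracks the tail
print("std deviation  :", round(fi.std(), 2))     # spread only
```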

We computed the Kendall cross-correlation coefficients of the mean and 90th percentile of the metrics on the 30 weekly snapshots of the FLOSS-AR system under study, to assess how these metrics were related across the development. We show these cross-correlations in Tables 3 and 4, marking with an asterisk those whose absolute value is above 0.7. Note that the 90th percentile of the LCOM metric is constant across the snapshots, so we had to drop it from Table 4. This correlation is different from the correlation computed class by class for a single snapshot of the system, shown in Table 2. A high positive value of the class-by-class cross-correlation between two metrics means that, when one is above (below) average for a class, the other is likely to be above (below) average as well for the same class. In Tables 3 and 4, we instead refer to the correlation among the average and 90th percentile values of the metrics, respectively, measured at weekly time steps during the development.

Table 3. The Kendall rank cross-correlation coefficients of the averages of the nine considered metrics, computed on the 30 weekly snapshots of the FLOSS-AR system (* marks absolute values above 0.7).

Metric  LOC    LCO    RFC    FI     REI    REO    WC     CI     CO
LOC     1.00   0.64   0.87*  0.71*  0.52   0.40   0.63   0.06   0.42
LCO     0.64   1.00   0.54   0.43   0.42   0.41   0.50   0.23   0.34
RFC     0.87*  0.54   1.00   0.79*  0.58   0.38   0.62   0.14   0.48
FI      0.71*  0.43   0.79*  1.00   0.76*  0.46   0.64   0.07   0.69
REI     0.52   0.42   0.58   0.76*  1.00   0.67   0.52   0.13   0.87*
REO     0.40   0.41   0.38   0.46   0.67   1.00   0.41   0.38   0.57
WC      0.63   0.50   0.62   0.64   0.52   0.41   1.00   0.06   0.44
CI      0.06   0.23   0.14   0.07   0.13   0.38   0.06   1.00   0.21
CO      0.42   0.34   0.48   0.69   0.87*  0.57   0.44   0.21   1.00

Table 4. The Kendall rank cross-correlation coefficients of the 90th percentiles of the eight considered metrics, computed on the 30 weekly snapshots of the FLOSS-AR system; LCOM has been dropped because it is constant over all snapshots (* marks absolute values above 0.7).

Metric  LOC    RFC    FI     REI    REO    WC     CI     CO
LOC     1.00   0.30   0.27   0.67   0.71*  0.56   0.05   0.57
RFC     0.30   1.00   0.37   0.08   0.06   0.55   0.51   0.04
FI      0.27   0.37   1.00   0.35   0.27   0.61   0.65   0.38
REI     0.67   0.08   0.35   1.00   0.84*  0.39   0.02   0.90*
REO     0.71*  0.06   0.27   0.84*  1.00   0.38   0.10   0.82*
WC      0.56   0.55   0.61   0.39   0.38   1.00   0.51   0.40
CI      0.05   0.51   0.65   0.02   0.10   0.51   1.00   0.00
CO      0.57   0.04   0.38   0.90*  0.82*  0.40   0.00   1.00

In this case, a high positive value of the cross-correlation means that, at a given development step, when one metric, averaged over all classes, is above (below) its average value over the whole development, the other is likely to be above (below) its own average by a similar percentage at the same time step.

As can be seen in Tables 3 and 4, many metrics are fairly correlated with each other. The most correlated with the other metrics, for both means and 90th percentiles, are LOC, RFC, Fan-In and Closeness-Out, the latter being anti-correlated with the other metrics. The least correlated metric is Closeness-In. Regarding the 90th percentiles, the correlations substantially confirm those of the means, but are typically lower.

These results often do not match those reported in Table 2, in the sense that if two metrics are fairly correlated (or not correlated at all) when computed class by class, this does not imply that their means or 90th percentiles are correlated (or not correlated) in the same way when computed across a sequence of snapshots of the system under development, and vice versa. In about 40% of the cases we even observe an inversion of the sign of the correlation. This is quite counter-intuitive, but the two correlations have different meanings. If the slopes of the regression between two correlated quantities, computed across classes of the same snapshot, vary across different snapshots, the resulting correlation of means or 90th percentiles can be very different from the correlations referring to a single snapshot.


5.3. Discriminating amongst development phases using aggregate metrics

As reported in Sec. 4, the development of the system evolved through four distinct phases. What differentiates the various phases is the level of adoption of the agile practices, namely PP, TDD and refactoring. We also know that these practices were applied, or not applied, together; consequently, it is not possible to separate their individual effects using the data reported for this case study. So we speak of "key agile practices", considering them as applied together. In this subsection we show and discuss how aggregate statistics of OO and network metrics exhibit specific patterns of evolution as system development proceeds.

In Fig. 2 we show the behavior of the mean values of the three metrics that seem to discriminate among development phases better than the others: Fan-In, Closeness-In and Closeness-Out. All the values are normalized to the maximum value reached by each metric. FI and CO look the best at discriminating between Phases 1 and 2, while CI is the best at discriminating between Phases 2 and 3. Phases 3 and 4 are less well discriminated, but this is reasonable, because Phase 3 is a refactoring phase and Phase 4 is a subsequent development phase that continues on the same path, without aggressive refactoring. In Fig. 3 we show the behavior of the mean values of other metrics that are still "good" at discriminating among phases: LCOM, RFC, REI and REO.

Fig. 2. The evolution of the mean value of FI, CI and CO metrics.


Fig. 3. The evolution of the mean value of LCOM, RFC, REI and REO metrics.

In particular, LCOM exhibits a strong growth in Phase 2, when good OO and agile practices were abandoned; this growth is only partially corrected in Phases 3 and 4. Figure 4 shows the behavior of the 90th percentiles of the FI, WC and CI metrics, the best at discriminating between phases. Note the different behavior with respect to the means reported in Figs. 2 and 3. For the sake of brevity, we do not report the behavior of the other metrics, because they look less significant than the reported ones.

The evolution of most aggregate statistics of the studied metrics shows significantly different values and trends depending on the specific phase, as shown in Figs. 2-4. Our hypothesis is that this variability is due to the different levels of adoption of the key agile practices. In fact, to our knowledge, the only external factors that might have had an impact on the project are precisely the differences among the phases, as reported in Sec. 4. Regarding internal factors, the only relevant one at play was team experience, concerning both the application of the agile practices and knowledge of the system itself. The project duration was relatively short, so we estimate that the latter factor significantly affected only Phase 1.

We performed a Kolmogorov-Smirnov (KS) two-sample test to assess whether those measurements significantly differed from one phase to the next. The KS test determines whether two datasets belong to different distributions, making no assumption on the distribution of the data.[b] For each computed metric, we compared the measurements

[b] Since the metrics computed at a given weekly snapshot depend also on the state of the system in the previous snapshot, the assumption underlying the KS test that the samples are random and mutually independent can be challenged. However, we used the KS test to assess the difference between measurements in different phases as if they were independent sets of points, and we believe that to a first approximation the KS test result is still valid.
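A sketch of such a phase-by-phase comparison with SciPy's two-sample KS test follows; the weekly values are invented for illustration, with each list standing for the snapshots of one phase.

```python
from scipy.stats import ks_2samp

# Invented weekly means of one metric (e.g. FI), grouped by phase.
phase1 = [1.10, 1.12, 1.11, 1.15, 1.18, 1.22, 1.25, 1.27, 1.30, 1.31]
phase2 = [1.35, 1.40, 1.44, 1.47, 1.52, 1.55, 1.58]

stat, p = ks_2samp(phase1, phase2)
confidence = 100 * (1 - p)   # read like the entries of Tables 5 and 6
print(f"KS statistic = {stat:.3f}, confidence = {confidence:.3f}%")
```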


Fig. 4. The evolution of the 90th percentile of FI, WC and CI metrics.

belonging to every pair of phases; we were of course most interested in the ability to discriminate between subsequent phases. The results are shown in Tables 5 and 6 for the means and 90th percentiles, respectively. The cases with significance levels greater than 99% are marked with an asterisk.

Table 5. Confidence level (%) that the mean of the metrics taken in a pair of phases significantly differs, according to the K-S two-sample test (* marks significance above 99%).

Metric  Phases 1-2  Phases 1-3  Phases 1-4  Phases 2-3  Phases 2-4  Phases 3-4
LOC     99.990*     99.340*     99.985*     85.113      96.402      64.019
LCO     91.310      99.340*     99.985*     62.319      84.722      64.019
RFC     99.990*     99.340*     99.985*     95.250      99.386*     64.019
FI      99.990*     99.340*     99.985*     95.250      88.606      99.213*
REI     99.990*     99.340*     99.985*     62.319      82.414      99.213*
REO     99.990*     99.340*     99.985*     98.770      99.924*     35.531
WC      99.990*     99.800*     99.998*     33.939      56.976      59.580
CI      99.825*     99.340*     99.985*     98.770      99.924*     91.128
CO      99.825*     99.340*     99.985*     62.319      82.414      99.213*

Regarding the metric means (Table 5), Phase 1 differs very significantly from every other phase in all cases except LCOM between Phases 1 and 2, which still reaches a significance higher than 90%. Phase 2 is less clearly differentiated from Phases 3 and 4: the REO and CI metrics appear to discriminate best, with a KS significance greater than 98%, with RFC and FI following at 95%. Phases 3 and 4 can be discriminated effectively by the means of the FI, REI and CO metrics.

Table 6. Confidence level (%) that the 90th percentile of the metrics taken in a pair of phases significantly differs, according to the K-S two-sample test (* marks significance above 99%).

Metric  Phases 1-2  Phases 1-3  Phases 1-4  Phases 2-3  Phases 2-4  Phases 3-4
LOC     99.947*     99.340*     99.985*     98.770      99.924*     64.019
RFC     86.416      91.964      0.000       0.000       84.722      91.128
FI      99.947*     91.964      99.985*     98.770      99.924*     0.483
REI     99.947*     99.340*     99.985*     95.250      99.924*     99.213*
REO     99.746*     99.340*     99.985*     98.770      99.924*     50.702
WC      99.529*     99.340*     97.032      0.000       0.118       8.198
CI      99.529*     0.000       0.000       95.250      99.386*     0.000
CO      99.529*     99.340*     99.985*     45.215      99.924*     99.213*

The 90th percentiles are slightly less able to discriminate among phases. Phase 1 is still well differentiated from the other phases, and especially from Phase 2, except for the RFC

metric. CI discriminates Phase 1 from Phase 2 very well, but totally fails to discriminate Phases 3 and 4 from Phase 1. It looks like a very powerful indicator of Phase 2, when the good agile practices were dropped by the developers. Phase 2 is discriminated from Phase 3 by the LOC, FI and REO metrics, almost at the 99% significance level. The same metrics discriminate Phase 2 from Phase 4 at an even higher level. Finally, Phases 3 and 4 are well discriminated by the REI and CO metrics, confirming the results of the means. On the contrary, FI, which was a good discriminator in the case of the mean, is totally unable to discriminate between Phases 3 and 4 when its 90th percentile is used. These results confirm the differences in trends and values of the various metrics across the phases that are patent in Figs. 2-4.

5.4. Aggregate metrics behavior across development phases

During the development of the FLOSS-AR system, Phase 1 is characterized by a steady growth of the number of classes. All metrics but LCOM and CO are stable during the first five weeks of this phase; then their means tend to grow, in particular for FI, REI, REO, WC and, to a lesser extent, RFC and LOC. The means of LCOM and CO, on the contrary, tend to increase during the first few weeks and then stabilize. The 90th percentiles of the metrics tend to be quite constant during Phase 1, except in the case of CO. This means that no significant additions to the tails of the distributions (classes with extreme values of the metrics) were made. Regarding the large variation of the CO 90th percentile, recall that the CO of a class is related to the number of steps needed to reach all the other (reachable) classes, following edges along their direction: the lower the average number of these steps for a class, the higher its CO value. The large variations might be explained by the addition, or deletion, of links in such a way that some classes substantially increased or decreased their closeness to other classes in the system, a phenomenon clearly possible in a small-world network such as a software network. The starting values of all these metrics are those of the original jAPS framework, constituted by 367 classes and evaluated by code inspection as a project with a fairly good OO architecture.


The increase of the RFC and FI means (the latter, as noted, highly correlated with CBO) denotes a worsening of software quality. Note that Phase 1 is characterized by a rigorous adoption of agile practices, but we should consider two factors:


(1) The knowledge of the original framework was initially quite low, so the first new classes added to it in the initial phase had a sub-optimal structure, and it took time to evolve towards an optimal configuration;
(2) Some agile practices take time to be mastered, and our developers were junior.

In general, we might conclude that in Phase 1 the team steadily added new features, and consequently new classes, to the system. In the first half of the phase, however, these classes substantially kept the structure of the original system to which they were added. As the system grew, this structure was slowly impaired, due to the factors mentioned above.

Phase 2 is characterized by a strong push to release new functionalities and by giving up the use of pair programming, testing and refactoring. In this phase we observe a growth in the means of all metrics but CO, and particularly in the metrics related to coupling and complexity, with an explosive growth of LCOM. This seems to confirm that in Phase 2 quality was compromised to add several new features. The 90th percentiles substantially confirm the behavior of the corresponding means. It is worth noting that the 90th percentiles of several metrics exhibit an even steeper change passing from Phase 1 to Phase 2. This happens for FI, LOC, WC, CI and CO, the latter with a steep decrease in value.

Phase 2 is followed by Phase 3, a phase in which the team, adopting a rigorous pair programming rotation strategy together with testing and refactoring, was able to refactor the system, increasing its cohesion and decreasing its coupling, and thus reducing the values of several metrics known to be anti-correlated with quality, such as LCOM, RFC, FI and LOC. In this phase, no new features were added to the system. The number of classes increased during this phase, because refactoring required splitting classes that had grown too much, and refactoring hierarchies, adding abstract classes and interfaces. The transition from Phase 2 to Phase 3 is marked by a significant decrease of Fan-In and CI, patent in both the mean and the 90th percentile behavior. After this decrease, the FI and CI means tend to increase again at the end of Phase 3. The CO mean has a trend opposite to CI, as also happens in Phase 2 (but not in Phases 1 and 4). REO behaves similarly to CO, while RFC and LCOM were reduced, mainly at the end of the phase. There is also a slight decrease of LOC (not shown), mainly due to the addition of abstract classes to the hierarchies, which factor out common features and reduce the code of many classes. Note that the values of the metrics at the end of Phase 3 seem to reach an equilibrium.

Phase 4 is the last development phase. It is characterized by the adoption of all the key agile practices, and by the creation of other classes associated with new features. In this phase most metrics do not change significantly, although, in the end, the values of most of them are slightly lower than at the beginning of the phase, maybe because the team became more effective in the adoption of the agile practices compared to the initial Phase 1. Only REO tends to grow at the end of the whole development.


Table 7. The metrics and statistics best suited to discriminate between the various phases.

Phases  Metric (Statistic)      Discussion
1→2     CI (90th perc.)         A steep increase of the 90th perc. of CI looks like a very good marker of a phase where "good" agile practices were abandoned.
1→2     FI (mean)               An increase of the FI mean is also a good discriminator of Phase 2.
1→2     LCO (mean)              The LCOM mean starts low, but then increases very significantly during the middle of Phase 2.
1→2     CO (mean)               The CO mean significantly decreases during Phase 2.
2→3     CI (mean & 90th perc.)  When agile practices are resumed, we found an immediate, steep decline of CI (both mean and 90th perc.) that persisted in Phase 3.
2→3     FI (mean & 90th perc.)  FI is confirmed as another good marker able to discriminate between Phases 2 and 3, though to a lesser extent than CI.
2→3     REO (mean)              The REO mean in Phase 3 is consistently and significantly greater than in the previous phase.
3→4     REI (mean)              The REI mean steadily increases at the end of Phase 4, showing a good discrimination ability with respect to Phase 3.
3→4     CO (mean)               The CO mean decreases at the end of Phase 3, and continues to decrease in Phase 4, showing a fair discrimination ability.
3→4     FI (mean)               The FI mean increases at the end of Phase 3, and then remains almost constant in Phase 4, showing a mild discrimination ability.

Table 7 summarizes these observations, highlighting which metrics look best suited to discriminate between the various phases. In conclusion, Fan-In appears to be the only metric able to discriminate fairly well between all the various phases, especially when considering its mean. Other good discriminators are CI (especially its 90th percentile) for the first phases, and the REI and CO means for the last phases. For this case study, a combination of the FI mean, the CI 90th percentile and the REI mean would be able to discriminate among the various phases fairly well.
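As an illustration only, and not the study's actual analysis scripts, the per-snapshot statistics used throughout this section (the mean and the 90th percentile of a metric's per-class distribution) could be computed as follows; the input values are invented.

    # Illustrative sketch: per-snapshot summary statistics of one metric.
    import numpy as np

    def snapshot_stats(per_class_values):
        """Return (mean, 90th percentile) of a metric over all classes."""
        values = np.asarray(per_class_values, dtype=float)
        return float(values.mean()), float(np.percentile(values, 90))

    # e.g. hypothetical Fan-In values of all classes in a single snapshot
    mean_fi, p90_fi = snapshot_stats([1, 1, 2, 3, 3, 4, 8, 15])
    print(f"FI: mean={mean_fi:.2f}, 90th perc.={p90_fi:.2f}")

Tracking these two numbers across the 30 snapshots yields exactly the kind of time series whose jumps mark the phase transitions summarized in Table 7.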

6. Threats to Validity

The presented work is based on a single empirical case study. This fact yields several obvious threats to validity, which we discuss in this section.

The first issue is that what we presented is just one anecdotal case study, since we were not able to find other case studies with a comparable amount of source code data and, above all, with information about the variations in the agile practices adopted throughout the development. From a single case study, it is clearly impossible to safely generalize to other cases. We believe, however, that the case study is of great anecdotal interest, and might be used by practitioners as a starting point to analyze the relationships between software metric trends and the practices used to improve software quality.


Related to this issue, the adoption of the agile practices used to identify the phases of the project was assessed through a survey among the developers. The details of the actual adoption of agile practices (the kinds of refactorings applied, the exact percentage of time spent in pair programming, etc.) were not explicitly recorded during the project. This vagueness is another threat to the validity of the results.

Another threat to the validity of the presented results is that we studied a small-to-medium-sized project, and the findings might be difficult to generalize to larger, more critical projects. This issue is related to the previous one. However, modern development processes tend to split large projects into a set of loosely coupled, smaller developments, whose magnitude is not so different from the one presented here. When this is the case, the objection no longer applies.

Another threat concerns the specific OO language (Java) and programming environment (Eclipse) used to develop the system. Again, generalization to other languages and programming styles is not guaranteed. We can observe that, on the one hand, we are interested in OO metrics and in software graphs built from an OO architecture. The OO paradigm is currently the most used programming paradigm, and we believe that focusing on it is not really limiting. On the other hand, many popular OO languages, especially C++ and C#, are very similar to Java. In a previous study, the distributions and correlations of CK metrics in 100 Java and 100 C++ projects were found to be fairly similar [33]. So, we believe that the presented results can generalize to them. For other OO languages, like Python and Ruby, this might not be true, because their programming styles are very different from Java's.

The last threat, and perhaps the biggest, is that at least some of the findings might have been obtained just by chance. The number of samples used in the statistical analysis, one for each snapshot, is 30 per metric/statistic. The sample groups pertaining to the four phases used to discriminate between metrics contain between 4 and 10 values. These numbers, compared to the total number of metrics and statistics tested to discriminate among phases (16 original metrics, and 4 statistics for each of them), are small. So, the discrimination ability of some metric/statistic might be due to statistical variation, and not significant at all. To answer this objection, we can highlight the following:

(1) Regarding the statistics, we immediately found that the median and the standard deviation were not able to discriminate anything, and dropped them, so we remained with two statistics. We chose the 90th percentile because it is neither too close to the median nor too close to the extreme portion of the tail.
(2) We did not use seven of the original metrics (DIT, NOC, WMC, etc.), following information found in the literature about their inability to discriminate software quality, and because of their strong correlations with other metrics. Dropping them is not a "selection of the fittest" in a statistical sense, but bears a specific meaning.

(3) As shown in Tables 5 and 6, most of the remaining metrics actually used are able to strongly discriminate among various pairs of phases. Only the pairs involving Phases 2 and 3, and Phases 3 and 4, are discriminated by just a few metrics/statistics. This makes it very unlikely that the discrimination ability is due to chance; the sketch below illustrates how such a pairwise check can be run.
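As a hedged illustration of such a pairwise discrimination check, the following sketch applies a Mann-Whitney U test, one possible nonparametric choice for small groups of this size, and not necessarily the test used to build Tables 5 and 6. The sample values are invented.

    # Hypothetical sketch: does a metric's snapshot statistic differ
    # between two phases? Values below are invented for illustration.
    from scipy.stats import mannwhitneyu

    phase2_fi_mean = [3.1, 3.4, 3.6, 3.9, 4.2]  # FI mean per snapshot, Phase 2
    phase3_fi_mean = [2.6, 2.4, 2.3, 2.2]       # FI mean per snapshot, Phase 3

    stat, p_value = mannwhitneyu(phase2_fi_mean, phase3_fi_mean,
                                 alternative="two-sided")
    print(f"U = {stat}, p = {p_value:.3f}")  # small p: the phases differ

A nonparametric test is a natural fit here because, with only 4 to 10 snapshots per phase, normality assumptions cannot be checked; even so, with so many metric/statistic combinations tested, individual p-values should be read with the multiple-comparison caveat raised above.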


7. Conclusions

We presented a case study of the agile development of a medium-sized project written in the Java OO language, matching the different use of key agile practices in the four phases of the project with OO and graph-related metrics. In Phase 1 we observed a deterioration of the quality metrics, which significantly worsened during Phase 2; Phase 3 led to a significant improvement in quality, and Phase 4 kept this improvement. The only external factors that changed across the phases were the adoption of the pair programming, TDD and refactoring agile practices, which were abandoned during Phase 2, were used again at full strength during Phase 3 with the aim of improving system quality without adding new features, and continued through Phase 4. As regards internal factors, in Phase 1 the team was clearly less skilled in the use of agile practices and in the knowledge of the original framework than in subsequent phases.

We studied the aggregate variation of several source code metrics, specific to OO systems and to the directed software graph built from the OO software structure. We find that an appropriate combination of a few metrics, namely the average Fan-In, the 90th percentile of Closeness-In, and the average Reach-Efficiency-In, is able to discriminate among the various phases, and hence among the development practices used to code the system. The adoption of "good" agile practices is always associated with "better" values of these metrics: when pair programming, TDD and refactoring are used, the quality metrics improve; when these practices are discontinued, the metrics worsen significantly.

For the empirical case study analyzed, we validated the usefulness of software metrics in monitoring the quality of the ongoing development. This might be useful for software practitioners. Clearly, it is not possible to draw definitive conclusions from observing a single, medium-sized project. Unfortunately, it is not easy to find other case studies, because they must include not only tracking of the source code produced during development, a task easily accomplished with modern configuration management systems, but also accurate tracking of the development practices, and of possible other external and internal factors, throughout the project. We hope that this paper might spur similar studies by researchers with access to suitable data, able to confirm or to disprove our findings.

Acknowledgments

This work was partially funded by Regione Autonoma della Sardegna (RAS), Regional Law No. 7, 2007 on Promoting Scientific Research and Technological Innovation in Sardinia, call 14/2/2009, and by the RAS Integrated Facilitation Program (PIA) for Industry, Artisanship and Services, call 14/10/2008, project No. 265, Advanced Technologies for Software Measuring and Integrated Management (TAMIGIS).




References

1. Agile Manifesto, URL: www.agilemanifesto.org.
2. A. J. Albrecht, Measuring application development productivity, in Proc. IBM Application Development Symposium, Monterey, CA, October 1979, pp. 83-92.
3. N. Anquetil and J. Laval, Legacy software restructuring: Analyzing a concrete case, in Proc. 15th European Conference on Software Maintenance and Reengineering (CSMR'11), Oldenburg, Germany, 2011.
4. V. R. Basili, L. C. Briand and W. L. Melo, A validation of object oriented design metrics as quality indicators, IEEE Trans. Software Eng. 22 (1996) 751-761.
5. K. Beck and C. Andres, Extreme Programming Explained: Embrace Change, Second Edition (Addison-Wesley, 2004).
6. B. Boehm and R. Turner, Balancing Agility and Discipline (Addison-Wesley Professional, 2003).
7. G. Canfora, A. Cimitile, F. Garcia, M. Piattini and C. A. Visaggio, Evaluating advantages of test driven development: A controlled experiment with professionals, in Proc. Int. Symposium on Empirical Software Engineering (ISESE'06), Rio de Janeiro, Brazil, 21-22 September 2006, pp. 364-371.
8. S. Chidamber and C. Kemerer, A metrics suite for object-oriented design, IEEE Trans. Software Eng. 20 (1994) 476-493.
9. S. Chidamber and C. Kemerer, Managerial use of metrics for object oriented software: An exploratory analysis, IEEE Trans. Software Eng. 24 (1998) 629-639.
10. G. Concas, M. Marchesi, S. Pinna and N. Serra, Power-laws in a large object-oriented software system, IEEE Trans. Software Eng. 33 (2007) 687-708.
11. G. Concas, M. Di Francesco, M. Marchesi, R. Quaresima and S. Pinna, Study of the evolution of an agile project featuring a web application using software metrics, in Proc. 9th Int. Conf. on Product Focused Software Process Improvement (PROFES'08), Frascati, Italy, 23-25 June 2008.
12. G. Concas, M. Marchesi, A. Murgia, S. Pinna and R. Tonelli, Assessing traditional and new metrics for object-oriented systems, in Proc. Workshop on Emerging Trends in Software Metrics (ICSE'10), Cape Town, South Africa, May 2010.
13. G. Concas, M. Marchesi, A. Murgia and R. Tonelli, An empirical study of social networks metrics in object-oriented software, Advances in Software Engineering, Vol. 2010, 2010.
14. T. Dybå and T. Dingsøyr, Empirical studies of agile software development: A systematic review, Information and Software Technology 50 (2008).
15. M. Fowler, Refactoring: Improving the Design of Existing Code (Addison-Wesley, 1999).
16. M. Giblin, P. Brennan and C. Exton, Introducing agile methods in a large software development team: The impact on the code, in Proc. 11th Int. Conf. on Agile Processes in Software Engineering and Extreme Programming (XP2010), Trondheim, Norway, June 2010, pp. 58-72.
17. T. Gyimothy, R. Ferenc and I. Siket, Empirical validation of object-oriented metrics on open source software for fault prediction, IEEE Trans. Software Eng. 31 (2005) 897-910.


18. JAPS: Java agile portal system, URL: http://www.japsportal.org.
19. D. S. Janzen and H. Saiedian, Does test-driven development really improve software design quality?, IEEE Software, March/April 2008, pp. 77-84.
20. M. Kunz, R. R. Dumke and A. Schmietendorf, How to measure agile software development, in Proc. Int. Conf. on Software Process and Product Measurement (IWSM-Mensura 2007), Palma de Mallorca, Spain, November 5-8, 2007, pp. 95-101.
21. L. Layman, L. Williams and L. Cunningham, Exploring extreme programming in context: An industrial case study, in Proc. Agile Development Conference (ADC'04), Salt Lake City, Utah, June 2004, pp. 32-41.
22. W. Li and S. Henry, Object oriented metrics that predict maintainability, J. Systems and Software 23 (1993) 111-122.
23. P. Louridas, D. Spinellis and V. Vlachos, Power laws in software, ACM Trans. Software Engineering and Methodology 18(1) (2008).
24. F. Macias, M. Holcombe and M. Gheorghe, A formal experiment comparing extreme programming with traditional software construction, in Proc. Fourth Mexican International Conference on Computer Science (ENC 2003), Tlaxcala, Mexico, September 2003.
25. T. J. McCabe, A complexity measure, IEEE Trans. Software Eng. 2 (1976) 308-320.
26. M. Melis, I. Turnu, A. Cau and G. Concas, Evaluating the impact of test-first programming and pair programming through software process simulation, Software Process Improvement and Practice 11 (2006) 345-360.
27. N. Nagappan, E. M. Maximilien, T. Bhat and L. Williams, Realizing quality improvement through test driven development: Results and experiences of four industrial teams, Empirical Software Engineering 13 (2008) 289-302.
28. M. E. J. Newman, The structure and function of complex networks, SIAM Review 45 (2003) 167-256.
29. A. V. Prokhorov, Kendall coefficient of rank correlation, in Encyclopaedia of Mathematics, ed. M. Hazewinkel (Springer Verlag, Heidelberg, 2001).
30. J. Scott, Social Network Analysis: A Handbook (SAGE Publications, London, UK, 2000).
31. M. Siniaalto and P. Abrahamsson, Does test-driven development improve the program code? Alarming results from a comparative case study, in Balancing Agility and Formalism in Software Engineering, eds. B. Meyer, J. R. Nawrocki and B. Walter, Lecture Notes in Computer Science, Vol. 5802 (Springer, 2008), pp. 143-156.
32. R. Subramanyam and M. S. Krishnan, Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects, IEEE Trans. Software Eng. 33 (2007) 687-708.
33. G. Succi, W. Pedrycz, S. Djokic, P. Zuliani and B. Russo, An empirical exploration of the distributions of the Chidamber and Kemerer object-oriented metrics suite, Empirical Software Engineering 10 (2005) 81-103.
34. C. A. Wellington, T. Briggs and C. D. Girard, Comparison of student experiences with plan-driven and agile methodologies, in Proc. 35th ASEE/IEEE Frontiers in Education Conference, Indianapolis, Indiana, 19-21 October 2005.
35. T. Zimmermann and N. Nagappan, Predicting defects using network analysis on dependency graphs, in Proc. 30th Int. Conf. on Software Engineering (ICSE'08), Leipzig, Germany, 10-18 May 2008, pp. 531-540.