Assessing the Reproducibility and Accuracy of Functional Size Measurement Methods through Experimentation*

Silvia Abrahão¹, Geert Poels², and Oscar Pastor¹

¹ Department of Information Systems and Computation, Valencia University of Technology, Camino de Vera s/n, 46022 Valencia, Spain
{sabrahao, opastor}@dsic.upv.es

² Faculty of Economics and Business Administration, Ghent University, Hoveniersberg 24, 9000 Ghent, Belgium
[email protected]

* This work has been partially supported by the CICYT Project under grant TIC2001-3530-C02-01.

Abstract

A number of Functional Size Measurement (FSM) methods have been proposed in the literature, but so far there has been little systematic evaluation of these methods. This paper describes a controlled experiment that compares Function Point Analysis, a standard FSM method supported by the International Function Point Users Group (IFPUG FPA), and OO-Method Function Points (OOmFP), a recently proposed FSM method for sizing object-oriented software systems that are developed using the OO-Method approach. The goal is to investigate whether OOmFP results in better size assessments within the context of an OO-Method development process. The methods are compared using two criteria defined in ISO/IEC 14143-3: reproducibility and accuracy. The results show that, within its context, OOmFP is more consistent and accurate than IFPUG FPA.

Keywords: FSM methods, IFPUG Function Point Analysis, OO-Method Function Points, Empirical Validation.

1. Introduction

Functional Size Measurement (FSM) methods are intended to measure the size of software by quantifying the functional user requirements that the software delivers. A FSM method measures the logical external view of the software from the user's perspective by evaluating the amount of functionality to be delivered. The capability to accurately quantify the size of software in an early phase of the development project, and to control the functionality delivered during the software lifecycle, is critical for software project managers to develop accurate project estimates and obtain early project indicators.

The most commonly used FSM method is IFPUG Function Point Analysis (IFPUG FPA) [1], which is based on the method proposed by Alan Albrecht [2]. This technique was developed specifically to measure the amount of data that each function accesses as an indicator of functional size. One of the most serious practical limitations of existing FSM methods is their inability to cope with the measurement of object-oriented systems. A number of approaches have been proposed in the literature to address this issue [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], but so far none of these has been widely accepted in practice. A new method, OO-Method Function Points (OOmFP) [16], [17], has been proposed to overcome these difficulties in the context of an automated software production method called OO-Method [18]. OOmFP was designed to conform to the IFPUG FPA counting rules [1], since it redefines the IFPUG counting rules in terms of the concepts used in OO-Method.

Because of the lack of generally accepted and sufficiently rigorous validation processes for FSM methods, it is difficult to evaluate new proposals, both on a practical and on a theoretical level. The absence of systematic evaluation might partly explain the low adoption rate of the proposed FSM methods for OO systems. To provide better guidance, the ISO published a series of standards for FSM. Part 1 [19] of the standard provides the concepts of FSM and establishes a basis against which existing and new FSM methods can be compared. Part 2 [20] provides a process for checking whether a candidate FSM method conforms to the concepts described in Part 1. Part 3 [21] provides a framework for verifying the statements of an FSM method and/or for conducting tests using performance criteria such as repeatability, reproducibility and accuracy. However, this standard does not offer any support for comparing different FSM methods in order to decide which one is the best. Hence, how can we determine whether a particular FSM method is more effective than another?

Empirical studies can help determine the effectiveness of proposed theories and methods [22], [23]. This is not only useful for researchers to prove their theories, but also for practitioners to select a new technology. Effectiveness can be defined as the capability of a FSM method to achieve its objectives. Evaluating the effectiveness of a FSM method requires measuring the quality of the measurement results (outputs). This paper reports on an empirical study that investigated whether OOmFP is more effective than IFPUG FPA in sizing OO software systems that are developed according to OO-Method. To this end, we chose the reproducibility and accuracy criteria specified in ISO 14143-3 [21]. The study was conducted through a controlled experiment carried out at the Valencia University of Technology.

This paper is organized as follows. Section 2 gives an overview of related work on empirical studies for functional size measurement. Section 3 describes the main principles of IFPUG FPA and OOmFP. Section 4 describes our experimental procedure for comparing OOmFP and IFPUG FPA. The experimental results are presented and discussed in Section 5. Finally, Section 6 concludes by analyzing our findings.

2. Related Work

In their work as ISO editors for the Guide to the Verification of FSM methods (ISO 14143-3) [21], Jacquet and Abran [24] suggest a process model for software measurement methods. This process describes a step in which the results provided by the measurement method must be validated and verified. By validated we mean "does the value that is produced adequately represent the functional size of the measured application?", and by verified we mean "is the value that is produced the result of a correct application and interpretation of the measurement rules?". However, no procedure for validating the application or the results of a FSM method is provided.

A number of empirical studies that investigate the ability of a FSM method to produce a size value that can be used for effort [25] or cost [26] prediction have been published in the literature. For instance, Moser et al. [26] present an empirical study using 36 projects which demonstrates that the System Meter method, which explicitly takes reuse into account, predicts effort substantially better than a model based on function points. Some empirical studies that investigate the inter-rater reliability of FPA have also been published [27], [28]. In these works, field experiments were conducted to address the issues of inter-rater reliability and inter-method reliability. Inter-rater reliability is described as "whether two individuals performing a measurement for the same system would generate the same result". An empirical study that focuses on the clarification of counting guidelines and that deals with the ambiguity of measurement rules is presented in [29]. Morris and Desharnais [30] propose a method for checking the validity of the data collected (from 300 projects) during a function point count using IFPUG FPA. This study identifies common errors made during the assessments and describes reasons for variances from the expected results.

However, as far as we know, there is no study that contains a rigorous empirical evaluation of the effectiveness of FSM methods or the suitability of existing methods for measuring the functional size of OO systems. For the proposals that present FPA variants or extensions for OO systems, the only validation of any kind consists of proofs of concept presented by the developers of the methods themselves, mostly in the same paper that proposes the new FSM method. An example is Laranjeira [9], who applied his method to two projects. Some methods never transcend the stage of original proposal, and the paper in which they are proposed does not even offer an illustrative example of how the rules should be applied [31], [10], [11], [15].

The work presented here is different in the sense that we evaluate OOmFP using a controlled experiment. OOmFP was designed to be compliant with IFPUG FPA. At the same time, it is meant to improve upon IFPUG FPA for the measurement of OO systems that are developed using the OO-Method approach. For this reason, IFPUG FPA is treated as the benchmark FSM method against which the new proposal, OOmFP, is evaluated.

3. Overview of IFPUG FPA and OOmFP

First, we present IFPUG FPA, which is a standard FSM method; then we present OOmFP, which is a specific method for sizing OO systems developed using the OO-Method approach.

3.1. IFPUG Function Point Analysis

In IFPUG Function Point Analysis (IFPUG FPA), the functional size is calculated by counting the instances of five types of components: External Inputs (e.g., transaction types entering data into the system), External Outputs (e.g., report types producing information output), External Inquiries (e.g., types of inquiries supported), Internal Logical Files (data maintained by the system), and External Interface Files (e.g., data accessed by the system but not maintained by it). These components are weighted according to their complexity (low, average or high), and their weights are summed. This value is then adjusted using fourteen general system characteristics to produce a functional size measure.

Fetcke et al. [32] propose a generalized representation for FSM methods. According to this representation, FSM requires two steps of abstraction: first, the elements that contribute to the functional size of the system are identified (identification step); second, these elements are mapped into numbers (measurement step). In the identification step, given a User Requirements Specification, the data models and process models are specified, and windows, screens and reports are designed. This is an implicit step that is not described as part of the measurement process; however, the IFPUG Counting Practices Manual [1] suggests that measurement may be based on one or more of the following components: user requirements, data models (Entity-Relationship Diagrams), process models (Data Flow Diagrams), and designs of windows, screens, and reports. In some examples, the requirements specification document stands alone as the basis for the measurement, but most of the examples presented in the manual include a data or process model and designs of windows, screens, and reports. Significant elements for measurement are identified using these documents, models and designs.

In the measurement step, the previously identified elements are captured in an abstract model that is specified according to the IFPUG FPA Measurement Abstract Model. This 'meta-model' describes all elements that contribute to the functional size of the system, according to the IFPUG view. The construction of the abstract model is an implicit process that is performed by applying the IFPUG measurement rules, specifically those that are required to identify and classify the five types of components and to rate their complexity. The abstract model is then used to assign a numerical value to the system representing its functional size (i.e., translating complexity-rated and classified functions into function point values and aggregating the values) [33].
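To make the counting arithmetic concrete, the sketch below computes an unadjusted function point count from classified components and then applies the value adjustment factor derived from the fourteen general system characteristics. The complexity weights and the adjustment formula (0.65 + 0.01 × total degree of influence) are the standard IFPUG FPA values; the component counts and characteristic ratings are invented for illustration, and the function names are ours, not part of any measurement tool.

```python
# Standard IFPUG FPA complexity weights per component type (low, average, high).
WEIGHTS = {
    "EI":  {"low": 3, "average": 4,  "high": 6},   # External Inputs
    "EO":  {"low": 4, "average": 5,  "high": 7},   # External Outputs
    "EQ":  {"low": 3, "average": 4,  "high": 6},   # External Inquiries
    "ILF": {"low": 7, "average": 10, "high": 15},  # Internal Logical Files
    "EIF": {"low": 5, "average": 7,  "high": 10},  # External Interface Files
}

def unadjusted_fp(counts):
    """counts maps (component type, complexity) to the number of identified functions."""
    return sum(WEIGHTS[ctype][cplx] * n for (ctype, cplx), n in counts.items())

def adjusted_fp(ufp, gsc_ratings):
    """Apply the value adjustment factor: VAF = 0.65 + 0.01 * sum of the 14 GSC ratings (0-5)."""
    assert len(gsc_ratings) == 14
    return ufp * (0.65 + 0.01 * sum(gsc_ratings))

# Hypothetical count for a small system (illustrative numbers only).
counts = {("EI", "low"): 5, ("EI", "average"): 2, ("EO", "average"): 3,
          ("EQ", "low"): 4, ("ILF", "average"): 3, ("EIF", "low"): 1}
ufp = unadjusted_fp(counts)              # 5*3 + 2*4 + 3*5 + 4*3 + 3*10 + 1*5 = 85
print(ufp, adjusted_fp(ufp, [3] * 14))   # 85 and 85 * 1.07 = 90.95
```

Note that the comparison reported in this paper works with unadjusted function points, so the adjustment step is shown only for completeness.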

3.2. OO-Method Function Points

OO-Method Function Points (OOmFP) was designed to conform to the IFPUG FPA counting rules. It redefines the IFPUG counting rules in terms of the conceptual modeling primitives defined in OO-Method. Figure 1 shows a representation of the OOmFP measurement procedure. As in IFPUG FPA, there are two main steps, identification and measurement, but they are now clearly separated. In fact, the identification step is, strictly speaking, not part of OOmFP (although it is required before measurement can be performed), as it refers to system modeling using the OO-Method approach.

Figure 1. An Abstraction of OOmFP Measurement Procedure

Therefore, given a User Requirements Specification, an OO-Method conceptual schema is built in the identification step. This schema includes at least² an Object Model (i.e., the structural model of the domain) and a Presentation Model (i.e., patterns for describing the system-user interaction). Significant elements for measurement are identified using these models. In the measurement step, an abstract model of requirements (i.e., identifying, classifying and weighting transactional and data functions) is specified according to the OOmFP Measurement Abstract Model. This is done by applying the OOmFP measurement rules. Finally, the functional size is quantified (i.e., translating complexity-rated and classified functions, and aggregating the values). As shown in the figure, the measurement step starts from the elements identified in an OO-Method conceptual schema. As a consequence, the functional size of the OO system is calculated in the problem space. For more information on the method usage and measurement rules, refer to [16] and [17].

² OOmFP includes additional measurement rules for the other conceptual model views (Dynamic and Functional Models) that specify the behaviour of an OO system. These rules measure model components that contribute to other aspects of functional size (valid object lives, inter-object interaction, and semantics associated with change of state).

4. Experiment Design

In this section, we describe the experiment we carried out to empirically evaluate the proposed FSM method (OOmFP) in comparison with IFPUG FPA. The experimental design was guided by the framework for experimental software engineering suggested by Wohlin et al. [34]. The goal of the experiment was to investigate which method (OOmFP or IFPUG FPA) provides the best functional size assessment of object-oriented systems when measuring the user requirements; in other words, which method has the highest effectiveness. In terms of the Goal/Question/Metric (GQM) template for goal-oriented software measurement [35], the goal pursued in this experiment is: to analyze functional size measurements for the purpose of evaluating OOmFP and IFPUG FPA with respect to their reproducibility and accuracy from the point of view of the researcher. The context of the experiment is the OO-Method approach to the conceptual modeling and development of object-oriented systems as performed by students in the Department of Computer Science at the Valencia University of Technology. In order to measure effectiveness, we selected the following criteria described in Part 3 of the ISO/IEC standard for functional size measurement [21]:





• Reproducibility: Does OOmFP produce more consistent measurements of functional size when used by different people than IFPUG FPA? Reproducibility refers to the use of the method on the same product in the same environment by different subjects. The results obtained should be identical.
• Accuracy: Does OOmFP produce more accurate measurements of functional size from requirements specifications than IFPUG FPA?

4.1. Planning

4.1.1. Selection of subjects. The subjects that participated in this study were 22 students in the final year of Engineering in Computer Science with a major in Information Systems. Final-year students were used as proxies for practitioners for the following reasons:
• Accessibility: recruiting practitioners is difficult given time and cost constraints. The costs and benefits of empirical studies using students are discussed in [36].
• Similarity: final-year students are the closest to practitioners [37]. Students are the next generation of professionals and are, therefore, close to the population under study [38].
The students were between 22 and 24 years old and had similar backgrounds (they had attended the same courses on project development and management). The subjects were chosen for convenience, i.e., they were the students enrolled in the Software Development Environments course from February to June of 2003. This course was selected because it is an advanced unit, in which students learn advanced software development techniques, and because the necessary preparation and the task itself fitted well into its scope.

4.1.2. Variables selection. There is a single independent variable: the method used by the subjects to size an OO system. It has two levels, corresponding to the two FSM methods being compared: OOmFP [16] and IFPUG FPA [1]. The dependent variables measure how well subjects perform the experimental task. Two dependent variables are distinguished:
• Reproducibility: the agreement between the measurement results of different subjects.
• Accuracy: the agreement between the measurement results and the true value.

4.1.3. Experimental Design. The treatments correspond to the two levels of the independent variable: the use of OOmFP versus the use of IFPUG FPA to size an OO system. We chose a within-subject design to control for differences in human ability. In a within-subject design, each subject contributes an observation for each treatment, which is an additional advantage given the small sample size. In order to cancel out possible learning effects due to similarities in the treatments (i.e., the relatedness of both FSM methods) and the use of the same requirements specification, the treatments were counterbalanced. The subjects were randomly assigned to two groups of equal size, and the tests were presented to each group in a different order:
• Experimental Group 1: IFPUG FPA first and OOmFP second (n=11)
• Experimental Group 2: OOmFP first and IFPUG FPA second (n=11)

4.1.4. Instrumentation. The instrumentation used in this experiment includes the experimental object and the training materials. The experimental object was a requirements specification document for building a new Project Management System (PMS) for a fictitious company. This document describes the requirements for the system using the IEEE 830 standard [39]. A software requirements specification is described in terms of functionality (what the software is supposed to do from the user's perspective), external interfaces (the interaction of the software with people, hardware and other software), performance (such as availability, response time, etc.), quality attributes (such as portability, maintainability, security, etc.), and design constraints imposed by implementation (required standards, implementation language, policies for database integrity, operating environments, etc.). The PMS should support the following transactions: project maintenance (create, change, delete, task assignment), types of task maintenance (create, change, delete), task maintenance (create, change, delete), user maintenance (create, change, delete, change password), and inquiries (users, projects and type of task). An example of a functional requirement is: when the company starts a new project, the employee in charge of the project must enter the data into the system. A project is created with the following data: identification number, description, name of employee in charge, start date, estimated duration, estimated final date, current duration (sum of the duration of the tasks performed until now), cost (sum of all task costs associated with the project), situation (0 = under development, 1 = small delay, = 10% more than the estimated time, 3 = finished), observations. The following training materials were prepared for all subjects: a set of instructional slides describing each FSM method and the procedure for applying it; a working example showing how the methods could be applied; and a measurement guideline summarizing the measurement rules of the methods. The working example for the IFPUG FPA training session included a requirements specification document of a Banking System as well as the specification of an ER model and the screens of the system. The working example for OOmFP included the requirements specification of a Library System and a complete specification of an OO-Method conceptual model.

4.1.5. Experimental Tasks. Before the experiment took place, each subject was asked to specify an OO-Method conceptual model of the PMS system starting from the requirements specification document.
This modeling is a preparatory task that was organized as a course exercise, but it was not part of the actual experiment. The IFPUG FPA treatment included a further modeling task and a measurement task. The modeling task involved the specification of an ER diagram and screen prototypes based on the requirements specification document. It reflects the implicit identification step in the IFPUG FPA measurement process.³ The subsequent measurement task involved applying the IFPUG FPA measurement rules. The OOmFP treatment included only a measurement task (i.e., applying the OOmFP measurement rules), as the OO-Method conceptual model of the PMS system had already been developed before the experiment.

4.1.6. Hypothesis formulation. In order to evaluate whether the proposed method (OOmFP) is more effective than IFPUG FPA, we tested the following hypotheses:

Hypothesis 1: This hypothesis tests the relationship between the independent variable and reproducibility.
• Null hypothesis, H0a: OOmFP produces less consistent or equally consistent assessments than IFPUG FPA.
• Alternative hypothesis, H1a: OOmFP produces more consistent assessments than IFPUG FPA.

The effectiveness of a FSM method depends, amongst other factors, on the inter-rater reliability of the measurements [28]. We postulate that the closer the measurements obtained by different raters, the more effective the FSM method is.

Hypothesis 2: This hypothesis tests the relationship between the independent variable and accuracy.
• Null hypothesis, H0b: OOmFP produces less accurate or equally accurate assessments than IFPUG FPA.
• Alternative hypothesis, H1b: OOmFP produces more accurate assessments than IFPUG FPA.

Even when the obtained measurements are (nearly) identical, they can be far from the true value for functional size. Therefore, another dimension of effectiveness is related to the accuracy of the method in producing the 'right' value. This second hypothesis requires an objective standard against which to evaluate accuracy. Thus, we can only compare OOmFP and IFPUG FPA if there is some third (and supposedly 'right') way of assessing functional size. To provide this standard, the system was sized with IFPUG FPA by a Certified Function Point Specialist (CFPS), and we take this count as our reference. The CFPS certification is a formal recognition of a level of expertise in the area of Function Point Analysis; a CFPS is acknowledged as having the skills necessary to perform consistent and accurate function point counts and a command of the most recent counting practices. In this experiment, the CFPS was given the original requirements documentation in order to apply FPA. The CFPS was not familiar with OO-Method, and hence no OO-Method conceptual model was provided. Once the function point count was obtained from the CFPS, we compared how close the IFPUG FPA and OOmFP measurements were to it.⁴

³ Though subjects were supposed to start from the requirements specification document of the PMS system, their previous experience with the OO-Method conceptual modeling of this system might of course influence their performance of this step.
⁴ Although there is, of course, no certainty that the CFPS produced the 'true' value of functional size, the certification is our best guarantee for the confidence that we have in the functional size measurement that was obtained.

4.2. Experiment Operation

4.2.1. Preparation. At the time the experiment was performed, the subjects were taking a course in conceptual modeling using OO-Method. We also ran separate training sessions on IFPUG FPA and OOmFP for each treatment group prior to the experimental tasks. However, the subjects were not aware of what aspects we intended to study, nor were they informed of the stated hypotheses. The training sessions on IFPUG FPA and OOmFP were run before the subjects applied the corresponding method. Two sessions of two hours were needed for each method. In the first session, we explained the measurement rules of the method and demonstrated their application using some toy examples. In the second session, we demonstrated the application of the measurement rules on a complete case study.

4.2.2. Execution. The subjects received all the materials described above, and we explained how to do the experimental tasks. The experimental tasks were run online as part of a course [34]. We gave the subjects a maximum of about five hours (two sessions of two hours and thirty minutes) to perform the experimental tasks. The sequence of the training sessions and experimental tasks can be seen in Table 1. Each subject had to do each test alone. They were allowed to use the training materials when performing the tasks, but we made sure that no interaction whatsoever occurred between subjects.

Table 1. Summary of Training Sessions and Experimental Tasks

Step   Experimental Group 1                Experimental Group 2
1      Training session in IFPUG FPA       Training session in OOmFP
2      Sizing with IFPUG FPA               Sizing with OOmFP
       - Modeling Task                     - Measurement Task
       - Measurement Task
3      Training session in OOmFP           Training session in IFPUG FPA
4      Sizing with OOmFP                   Sizing with IFPUG FPA
       - Measurement Task                  - Modeling Task
                                           - Measurement Task

4.2.3. Data Recording and Validation. The dependent variables were measured using three different data collection forms. The first and second forms were used to record the outputs of the IFPUG FPA and OOmFP functional size measurements, respectively. The third form was used to record the outputs of the CFPS. Once the data were collected, we verified whether the tests were complete. As two students completed only the OOmFP test, we took into account the responses of twenty subjects.

5. Analysis and Interpretation

In general, the results are described in terms of their statistical significance, measured by alpha (α), which represents the probability that the result could have occurred by chance (a Type I error). In this study, we define the following levels of significance:
• Not significant: α > 0.1
• Low significance: α < 0.1
• Medium significance: α < 0.05
• High significance: α < 0.01
• Very high significance: α < 0.001

We first tested hypothesis H1a, related to the effectiveness of the FSM methods in terms of reproducibility. Table 2 shows descriptive statistics for the functional size assessments (presented in Appendix A) obtained with the IFPUG FPA and OOmFP methods. Note that the values of the column "Size in IFPUG-FPA" are unadjusted function point values.

Table 2. Descriptive statistics for functional size

Reproducibility            IFPUG FPA   OOmFP
Number of observations     20          20
Minimum                    114         148
Maximum                    174         173
Mean                       132.40      159.55
Standard deviation         17.443      7.96
25th percentile            121         151.75
50th percentile (median)   126         159
75th percentile            144.50      164.50

In order to measure the degree of variation between assessments produced by different subjects using the same method (reproducibility), we used a practical statistic similar to that proposed by Kemerer [28]. This statistic is calculated as the absolute value of the difference between the count produced by a subject and the average count (for the same FSM method) produced by the other subjects in the sample, relative to this average count. Reproducibility measurements (REP) were thus obtained for each observation by applying the following equation:

REP_i = | Average other assessments − Subject assessment_i | / Average other assessments
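As an illustration of how the REP statistic can be computed, the sketch below (our own illustration, not the authors' analysis script) takes the list of assessments obtained with one method and returns the REP value for each subject, using the leave-one-out average defined above. The REP values in Appendix A are computed over all 20 subjects; the example uses only the first five OOmFP assessments to keep the output short.

```python
def reproducibility(assessments):
    """Return REP_i = |avg of the other subjects' assessments - assessment_i| / avg of the others."""
    reps = []
    for i, a_i in enumerate(assessments):
        others = assessments[:i] + assessments[i + 1:]   # leave out subject i
        avg_others = sum(others) / len(others)
        reps.append(abs(avg_others - a_i) / avg_others)
    return reps

# First five OOmFP size assessments from Appendix A (illustrative subset).
print([round(r, 3) for r in reproducibility([150, 158, 172, 155, 151])])
```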

The obtained REP_i values for both methods are presented in Appendix A. The differences in reproducibility assessments obtained with the two methods were then examined using the Kolmogorov-Smirnov test to ascertain whether their distribution was normal. The skewness of the distribution is close to one (1.230), indicating a right-skewed distribution with most observations concentrated in the left tail. Because of this indication of a non-normal distribution, we used the Wilcoxon signed rank test for the difference in median reproducibility assessments, which is a non-parametric alternative to the paired samples t-test.

Table 3. 1-tailed Wilcoxon signed rank test for differences in median reproducibility assessments (IFPUG FPA versus OOmFP; α = 0.05)

                   Reproducibility
Mean Rank          5.25
Sum of Ranks       10.50
z                  -3.409
1-tailed p-value   .000

The result of the one-tailed test (see Table 3) allows us to reject the null hypothesis (H0a), meaning that we can empirically corroborate that OOmFP produces more consistent assessments than IFPUG FPA.

Next, we tested hypothesis H1b, related to the effectiveness of the FSM methods in terms of accuracy. The value obtained by the CFPS was 153 unadjusted function points. In order to compare the assessments produced by both methods with the assessment of the Certified Function Point Specialist (CFPS), the Magnitude of Relative Error (MRE) was used [40], [41], [42]. Accuracy measurements were obtained by applying the following equation for each observation:

MRE_i = | CFPS assessment − Subject assessment_i | / CFPS assessment
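The accuracy analysis can be reproduced along the following lines. This is a minimal sketch under our own assumptions: it uses SciPy's Wilcoxon signed rank test as a stand-in for whatever statistics package the authors used, takes the CFPS reference value of 153 unadjusted function points from the text, and, for brevity, uses only the first five subjects of Appendix A.

```python
from scipy.stats import wilcoxon

CFPS = 153  # unadjusted function points obtained by the Certified Function Point Specialist

def mre(assessments, reference=CFPS):
    """Return MRE_i = |reference - assessment_i| / reference for each assessment."""
    return [abs(reference - a) / reference for a in assessments]

# Size assessments of subjects 1-5 from Appendix A (illustrative subset).
mre_oomfp = mre([150, 158, 172, 155, 151])
mre_ifpug = mre([129, 125, 115, 124, 143])

# One-tailed paired test: are the IFPUG FPA errors larger than the OOmFP errors?
stat, p_value = wilcoxon(mre_ifpug, mre_oomfp, alternative='greater')
print(mre_oomfp, mre_ifpug, p_value)
```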

The obtained MRE_i values for both methods are presented in Appendix A. Again, the differences in accuracy assessments were examined using the Kolmogorov-Smirnov test to ascertain whether the distribution was normal. As the distribution was non-normal, we used the Wilcoxon signed rank test for the difference in median accuracy assessments. In order to evaluate the significance of the observed difference, we applied the statistical test with a significance level of 5%. The result of the test (see Table 4) allows us to reject the null hypothesis (H0b), meaning that we can empirically corroborate that OOmFP produces more accurate assessments than IFPUG FPA.

Table 4. 1-tailed Wilcoxon signed rank test for differences in median accuracy assessments (IFPUG FPA versus OOmFP; α = 0.05)

                   Accuracy
Mean Rank          4.75
Sum of Ranks       9.50
z                  -3.442
1-tailed p-value   .000

This analysis reveals that the observed differences in both reproducibility and accuracy are statistically significant at a very high level (α < 0.001).

5.1. Validity Evaluation

In this section, we discuss several issues that can affect the validity of the empirical study and how we attempted to alleviate them.



5.1.1. Threats to Conclusion Validity. To control the risk that the variation due to individual differences is larger than the variation due to the treatment, we selected a homogeneous group of subjects.



5.1.2. Threats to Construct Validity. The dependent variables that we used are criteria proposed in the ISO/IEC 14143-3 [21].

5.1.3. Threats to Internal Validity. The following issues were considered:
• Differences among subjects. Error variance due to differences among subjects is reduced by using a within-subjects design. We also used the counterbalancing procedure, in which subjects were randomly assigned to two groups. This procedure cancelled out both a possible learning effect due to similarities in the treatments and a confounding effect that the order of learning and applying the methods might introduce.
• Knowledge of the universe of discourse. We used the same requirements specification document for all subjects. It specified the requirements of a Project Management System for a fictitious company. This is a known universe of discourse.
• Fatigue effects. On average, each subject took three hours per session to solve the experimental tests. We ran the sessions of each treatment group on the same day, in parallel, so fatigue was not very relevant.
• Persistence effects. In order to avoid persistence effects, the experiment was carried out by subjects who had never done a similar experiment.
• Subject motivation. We motivated students by explaining to them that functional size measurement is widely used in practice by companies to develop project estimates and obtain early project indicators.

5.1.4. Threats to External Validity. The greater the external validity, the more the results of an empirical study can be generalized. We identified three threats that limit the ability to generalize our results:
• Materials and tasks used. We tried to use a representative requirements specification of a real case in the MIS functional domain.⁵ However, more empirical studies are needed, using other user requirements specifications within this functional domain.
• Subjects. We are aware that more experiments with practitioners must be carried out in order to be able to generalize these results.
• OO-Method approach. We used a functional size measurement method for object-oriented systems that follows the OO-Method concepts. Thus, in order to generalize the results obtained in the context of OO-Method to other OO systems, we need to include the evaluation of the convertibility criteria defined in [21] in future experiments.

⁵ Defined in the ISO/IEC 14143-1 as a class of software based on the characteristics of Functional User Requirements.

6. Conclusions and Further Work

This paper described a controlled experiment that compares IFPUG FPA and OOmFP. The goal was to determine whether OOmFP provides significantly better size assessments within the context of an OO-Method development process. We have corroborated that, within its context, OOmFP produces more consistent and accurate assessments than IFPUG FPA. The differences in both reproducibility and accuracy were found to be highly significant (at the 5% significance level). Therefore, the empirical evidence gathered through this experiment demonstrates the effectiveness of OOmFP for sizing OO systems from OO-Method conceptual schemas.

Despite the promising results obtained in this experiment, we are aware that more experimentation is needed in order to confirm them. Several threats to the validity of this study have been identified in the paper; one of them is the choice of experimental objects (limited to a single requirements specification). Further work includes the replication of the experiment with practitioners and the measurement of other criteria, such as the convertibility of FSM methods.

References

[1] IFPUG, "Function Point Counting Practices Manual, Release 4.1", International Function Point Users Group, Westerville, Ohio, USA, 1999.
[2] A. J. Albrecht, "Measuring application development productivity", IBM Application Development Symposium, 1979, pp. 83-92.
[3] G. Antoniol and F. Calzolari, "Adapting Function Points to Object Oriented Information Systems", 10th Conference on Advanced Information Systems Engineering (CAiSE'98), 1998, pp. 59-76.
[4] ASMA, "Sizing in Object-Oriented Environments", Australian Software Metrics Association (ASMA), Victoria, Australia, 1994.
[5] T. Fetcke, A. Abran, and T. H. Nguyen, "Function point analysis for the OO-Jacobson method: a mapping approach", FESMA'98, Antwerp, Belgium, 1998, pp. 395-410.
[6] R. Gupta and S. K. Gupta, "Object Point Analysis", IFPUG 1996 Fall Conference, Dallas, Texas, USA, 1996.
[7] IFPUG, "Function Point Counting Practices: Case Study 3 - Object-Oriented Analysis, Object-Oriented Design (Draft)", 1995.
[8] J. Kammelar, "A Sizing Approach for OO-environments", 4th International ECOOP Workshop on Quantitative Approaches in Object-Oriented Software Engineering, Cannes, 2000.
[9] L. Laranjeira, "Software Size Estimation of Object-Oriented Systems", IEEE Transactions on Software Engineering, vol. 16, 1990, pp. 510-522.
[10] D. J. Ram and S. V. Raju, "Object Oriented Design Function Points", First Asia-Pacific Conference on Quality Software, 2000.
[11] H. M. Sneed, "Estimating the Development Costs of Object-Oriented Software", 7th European Software Control and Metrics Conference, Wilmslow, UK, 1996, pp. 135-152.
[12] T. Uemura, S. Kusumoto, and K. Inoue, "Function Point Measurement Tool for UML Design Specification", 5th International Software Metrics Symposium (METRICS'99), Florida, USA, 1999, pp. 62-69.
[13] S. A. Whitmire, "3D Function Points: Specific and Real-Time Extensions of Function Points", Pacific Northwest Software Quality Conference, 1992.
[14] S. A. Whitmire, "Applying Function Points to Object-Oriented Software Models", Software Engineering Productivity Handbook, J. Keyes, Ed., McGraw-Hill, 1992, pp. 229-244.
[15] H. Zhao and T. Stockman, "Software Sizing for OO Software Development - Object Function Point Analysis", GSE Conference, Berlin, Germany, 1995.
[16] S. Abrahão and O. Pastor, "Estimating the Applications Functional Size from Object-Oriented Conceptual Models", International Function Point Users Group Annual Conference (IFPUG'01), Las Vegas, USA, 2001.
[17] O. Pastor, S. M. Abrahão, J. C. Molina, and I. Torres, "A FPA-like Measure for Object-Oriented Systems from Conceptual Models", 11th International Workshop on Software Measurement (IWSM'01), Montréal, Canada, 2001, pp. 51-69.
[18] O. Pastor, J. Gómez, E. Insfrán, and V. Pelechano, "The OO-Method Approach for Information Systems Modelling: From Object-Oriented Conceptual Modeling to Automated Programming", Information Systems, vol. 26, 2001, pp. 507-534.
[19] ISO, "ISO/IEC 14143-1 - Information Technology - Software Measurement - Functional Size Measurement - Part 1: Definition of Concepts", 1998.
[20] ISO, "ISO/IEC 14143-2 - Information Technology - Software Measurement - Functional Size Measurement - Part 2: Conformity evaluation of software size measurement methods to ISO/IEC 14143-1:1998", 2002.
[21] ISO, "ISO/IEC 14143-3 - Information Technology - Software Measurement - Functional Size Measurement - Part 3: Verification of functional size measurement methods", 2003.
[22] W. F. Tichy, "Should Computer Scientists Experiment More?", IEEE Computer, vol. 31, 1998, pp. 32-40.
[23] M. V. Zelkowitz and D. R. Wallace, "Experimental Models for Validating Technology", IEEE Computer, vol. 31, 1998, pp. 23-31.
[24] J. P. Jacquet and A. Abran, "From Software Metrics to Software Measurement Methods: A Process Model", 3rd Int. Standard Symposium and Forum on Software Engineering Standards (ISESS'97), Walnut Creek, USA, 1997.
[25] S. G. MacDonell, "Comparative review of functional complexity assessment methods for effort estimation", Software Engineering Journal, vol. 9, 1994, pp. 107-116.
[26] S. Moser, B. Henderson-Sellers, and V. B. Misic, "Cost Estimation Based on Business Models", Journal of Systems and Software, vol. 49, 1999, pp. 33-42.
[27] G. C. Low and D. R. Jeffery, "Function Points in the estimation and evaluation of the software process", IEEE Transactions on Software Engineering, vol. 16, 1990, pp. 64-71.
[28] C. F. Kemerer, "Reliability of Function Points Measurement", Communications of the ACM, vol. 36, 1993, pp. 85-97.
[29] C. F. Kemerer and B. S. Porter, "Improving the Reliability of Function Point Measurement: An Empirical Study", IEEE Transactions on Software Engineering, vol. 18, 1992, pp. 1011-1024.
[30] P. Morris and J. M. Desharnais, "Post Measurement Validation Procedure for Function Point Counts", Forum on Software Engineering Standards Issues (SES'96), Montréal, Canada, 1996.
[31] E. Rains, "Function Points in an ADA Object-Oriented Design", OOPS Messenger, vol. 2, 1991, pp. 23-25.
[32] T. Fetcke, A. Abran, and R. Dumke, "A Generalized Representation for Selected Functional Size Measurement Methods", 11th International Workshop on Software Measurement, Montréal, Canada, 2001, pp. 1-25.
[33] G. Poels, "Why Function Points Do Not Work: In Search of New Software Measurement Strategies", Guide Share Europe Journal, vol. 1, 1996, pp. 9-26.
[34] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering: An Introduction, Kluwer Academic Publishers, 2000.
[35] V. R. Basili and H. D. Rombach, "The TAME Project: Towards Improvement-Oriented Software Environments", IEEE Transactions on Software Engineering, vol. 14, 1988, pp. 758-773.
[36] J. Carver, L. Jaccheri, S. Morasca, and F. Shull, "Issues in Using Students in Empirical Studies in Software Engineering Education", 9th International Software Metrics Symposium (Metrics'03), Sydney, Australia, 2003, pp. 239-249.
[37] M. Höst, B. Regnell, and C. Wohlin, "Using Students as Subjects - A Comparative Study of Students and Professionals in Lead-Time Impact Assessment", Empirical Software Engineering, vol. 5, 2000, pp. 201-214.
[38] B. A. Kitchenham, S. Pfleeger, et al., "Preliminary Guidelines for Empirical Research in Software Engineering", IEEE Transactions on Software Engineering, vol. 28, 2002, pp. 721-734.
[39] ANSI/IEEE, "Standard 830-1998, IEEE Recommended Practice for Software Requirements Specifications", The Institute of Electrical and Electronics Engineers, New York, NY, IEEE Computer Society Press, 1998.
[40] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models, The Benjamin/Cummings Publishing Company, Inc., 1986.
[41] L. C. Briand, K. El Emam, D. Surmann, I. Wieczorek, and K. D. Maxwell, "An assessment and comparison of common software cost estimation modeling techniques", 21st International Conference on Software Engineering, 1999, pp. 313-322.
[42] R. K. Smith, J. E. Hale, and A. S. Parrish, "An Empirical Study Using Task Assignment Patterns to Improve the Accuracy of Software Effort Estimation", IEEE Transactions on Software Engineering, vol. 27, 2001, pp. 264-271.

Appendix A

Table 5. Functional size assessments, reproducibility and accuracy values

Subject  Size in OOmFP  Size in IFPUG-FPA  REPi in OOmFP  REPi in IFPUG-FPA  MREi in OOmFP  MREi in IFPUG-FPA
1        150            129                0.06           0.03               0.02           0.16
2        158            125                0.01           0.06               0.03           0.18
3        172            115                0.08           0.14               0.12           0.25
4        155            124                0.03           0.07               0.01           0.19
5        151            143                0.06           0.08               0.01           0.07
6        148            121                0.08           0.09               0.03           0.21
7        158            114                0.01           0.15               0.03           0.25
8        148            118                0.08           0.11               0.03           0.23
9        160            145                0.00           0.10               0.05           0.05
10       165            159                0.04           0.21               0.08           0.04
11       151            121                0.06           0.09               0.01           0.21
12       170            122                0.07           0.08               0.11           0.20
13       163            165                0.02           0.26               0.07           0.08
14       171            174                0.08           0.34               0.12           0.14
15       163            132                0.02           0.00               0.07           0.14
16       158            114                0.01           0.15               0.03           0.25
17       173            147                0.09           0.12               0.13           0.04
18       160            128                0.00           0.03               0.05           0.16
19       163            125                0.02           0.06               0.07           0.18
20       154            127                0.04           0.04               0.01           0.17