Assessing the Cost-Effectiveness of Inspections by Combining Project Data and Expert Opinion

Lionel C. Briand, Carleton University, Sys. and Comp. Engineering, 1125 Colonel By Drive, ME4462, Ottawa, Canada, K1S 5B6, +1 (613) 520 2600, [email protected]
Bernd Freimut, Fraunhofer Institute for Experimental Software Engineering, Sauerwiesen 6, D-67661 Kaiserslautern, Germany, +49 (6301) 707 253, [email protected]
Ferdinand Vollei, Siemens AG, Corporate Technology, Otto-Hahn-Ring 6, D-81730 Munich, Germany, +49 (89) 636 49313, [email protected]

ABSTRACT
There is general agreement among software engineering practitioners that software inspections are an important technique for achieving high software quality at a reasonable cost. However, there are many ways to perform such inspections and many factors that affect their cost-effectiveness. It is therefore important to be able to estimate this cost-effectiveness in order to monitor it, improve it, and convince developers and management that the technology and related investments are worthwhile. This work proposes a rigorous but practical way to do so. In particular, a meaningful model to measure cost-effectiveness is selected and a method to determine the cost-effectiveness by combining project data and expert opinion is proposed. To demonstrate the feasibility of the proposed approach, the results of a large-scale industrial case study are presented.

Keywords
Software inspection, cost-effectiveness, expert knowledge elicitation, Monte-Carlo simulation

1 INTRODUCTION
It has long been recognized that the return on investment of software development technologies needs to be assessed [12]. This entails the development of models that take into account both the costs of using a technology and its benefits across the whole development life cycle in a realistic manner. Such models must not only represent the economic impact of a technology accurately but must also be usable at reasonable cost. Models based on parameters that cannot be estimated without resorting to


prohibitive and complex data collection are not useful from a practical standpoint.
The systematic, rigorous, and documented inspection of life cycle artifacts, before any testing can be performed, has long been perceived as a key software development technology for achieving high quality at a reasonable cost. Despite clear success stories [1], inspections do not bring substantial benefits under all circumstances. It is clear that, as for any technology, mostly success stories are selectively reported in the literature. Therefore, the cost-effectiveness of inspections needs to be demonstrated in each and every organization that tries to introduce them.
The problem is even more acute when considering that systematic and rigorous inspections usually encounter fierce resistance as they are introduced. For example, Jalote and Haragopal [10] report on the "not applicable here" syndrome and how, through experiments, they convinced developers to use inspections. Inspections are usually supported by management but are sometimes perceived as burdensome by developers. In most organizations, productivity is still perceived and measured in terms of lines of code per unit of effort. However, inspections require effort without producing any line of code. Consequently, inspections are sometimes not perceived as bringing tangible benefits. Studies [13] have shown that, despite having demonstrated their potential benefits for more than two decades, inspections are still not widely used across the software industry. Therefore, the case for systematic and rigorous inspections needs to be made in every organization that intends to use them. Furthermore, their benefits have to be observed and communicated.
The benefits of inspections consist of qualitative aspects, such as improving learning and communication, as well as quantitative aspects, such as saving effort. To capture the latter, it is important to assess the cost-effectiveness of inspections if we want to be able to monitor the factors that affect it and its variations across inspections and projects, and then improve it. This implies a new requirement: cost-effectiveness models must allow


comparisons across inspections in a meaningful way.
The contribution of this paper is twofold. First, we present a practical methodology for assessing inspections' cost-effectiveness that combines project data and expert estimates within a rigorous framework, using well-defined procedures. It is important to point out that no experiments are necessary, as only project data that can realistically be collected are required. There is, therefore, no additional cost incurred besides common project data collection and interviews. Interviews only have to take place once, and the cost-effectiveness of all inspections taking place in the organization can then be estimated. This methodology is used here for inspections but could be adapted to any situation where a new technology needs to be assessed and complete data collection is not possible. Second, we present the results of a comprehensive case study in which we performed a cost-benefit analysis of the inspection practices of a large Siemens AG division working on mobile communication. We then analyze the plausibility of the results and report lessons learned.
This paper is organized as follows. Section 2 gives the context of the industrial case study. Section 3 introduces the cost-effectiveness model. Section 4 discusses the usage of expert opinion. Section 5 presents a method for assessing the cost-effectiveness by combining project data and expert opinion. Section 6 presents the cost-effectiveness results from the case study. Section 7 summarizes the lessons learned with respect to expert knowledge elicitation in the case study. Sections 8 and 9 discuss future work directions and the main conclusions of our work, respectively.

2 THE CASE STUDY CONTEXT
The methodology we present in this paper was developed specifically for a development organization and validated in a carefully designed case study. This work took place in a business unit of Siemens AG, Germany, that develops products and services for mobile communication and intelligent networks.
In this particular business unit, inspections are performed throughout the entire life cycle to ensure the quality of all software artifacts. Thus, inspections are performed after each of the development phases: analysis, design, and coding. Given the substantial investment in software quality through systematic inspections, the quality assurance team's objective is to continuously improve their effectiveness and efficiency. To achieve this objective, the authors proposed to the quality assurance team a practical and rigorous way to measure the economic impact of inspections.
In this environment, where software quality is a major objective, most of the data recommended in [4], [5] is collected during inspections. This encompasses information like the number of participants, the size of the work product, the inspection effort, the number of defects found (classified into major and minor defects), and the estimated rework effort.

3 MODELING EFFICIENCY
In order to achieve the goals stated in the introduction, we need a cost-effectiveness model that fulfills the following requirements:
• Cost-effectiveness can be compared across inspections, regardless of the number of defects present in the inspected artifact and the subsequent defect detection activities taking place.
• Cost-effectiveness must be stated in terms of effort savings, in order to make it relevant to quality assurance engineers and managers.
To fulfill these requirements we selected the Kusumoto model [14]. As thoroughly discussed in [3] and [14], this model captures the economic impact of inspections and enables the meaningful comparison of inspections. However, we re-express and generalize the model to multiple inspection activities in order to relax some of its assumptions. Generally, the model defines cost-effectiveness (CE) as

$$CE = \frac{\text{savings from inspections} - \text{cost of inspections}}{\text{potential defect cost without inspections}}$$

Thus, it is intuitive, as it calculates the net savings of inspections and compares these net savings to the potential defect cost saved. The potential defect cost is the cost that would have been incurred if the detection activity to be assessed had not taken place. Through this comparison, cost-effectiveness can be compared across inspections, as it is independent of the particular defect population present in the inspected artifact and of the subsequent defect detection activities, i.e., it is normalized to remove these effects [14]. Table 1 shows the parameters of our generalized Kusumoto model. In the remainder of the paper we will refer to this model as the CE model (for Cost-Effectiveness).

A: ordered set or sequence of defect detection (inspection or testing) activities, [a_1, ..., a_n]
dε_i: average defect cost in activity a_i. For inspections these costs include the correction of defects; for tests these costs include the isolation of faults, the correction of faults, and re-testing.
iε_i: average inspection cost per defect in activity a_i
n_i: number of defects found in activity a_i
N_i: total number of defects present in activity a_i
p_i = n_i / N_i: effectiveness of defect detection activity a_i
g_{i,i+1}: a defect found in activity a_i would on average result in g_{i,i+1} defects in the following defect detection activity (the original Kusumoto model assumed this value to be equal to 1 for all i)
r_{f,j}: assuming we are assessing the cost-effectiveness of inspection activity a_f, this is the remaining proportion of defects detected in a_f that would reach activity a_j had a_f not taken place

Table 1 CE Model Parameters

Using these parameters, the CE model can be expressed as follows:

$$CE(a_f, A) = \frac{\sum_{i=f+1}^{n} d\varepsilon_i \cdot n_f \cdot r_{f,i} \cdot p_i \;-\; (d\varepsilon_f + i\varepsilon_f) \cdot n_f}{\sum_{j=f+1}^{n} d\varepsilon_j \cdot N_f \cdot r_{f,j} \cdot p_j}$$

with

$$r_{f,j} = \begin{cases} 1 & \text{if } j = f+1 \\ \prod_{k=f+1}^{j-1} (1 - p_k) \cdot g_{k,k+1} & \text{otherwise} \end{cases}$$
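To make the computation concrete, here is a minimal Python sketch of the CE formula above. The function name, the list-based parameter encoding, and the 0-based activity indexing are our own illustration, not part of the original study.

```python
def ce(f, d_eps, i_eps, n, N, p, g):
    """Generalized Kusumoto cost-effectiveness model (sketch).

    f     -- index of the inspection activity a_f being assessed
    d_eps -- d_eps[i]: average defect cost in activity a_i
    i_eps -- i_eps[i]: average inspection cost per defect in a_i
    n     -- n[i]: number of defects found in a_i
    N     -- N[i]: total number of defects present in a_i
    p     -- p[i] = n[i] / N[i]: effectiveness of a_i
    g     -- g[i]: propagation factor g_{i,i+1}
    """
    def r(j):
        # r_{f,j}: proportion of a_f's defects still present in a_j
        if j == f + 1:
            return 1.0
        prod = 1.0
        for k in range(f + 1, j):  # k = f+1, ..., j-1
            prod *= (1.0 - p[k]) * g[k]
        return prod

    last = len(d_eps)
    savings = sum(d_eps[i] * n[f] * r(i) * p[i] for i in range(f + 1, last))
    cost = (d_eps[f] + i_eps[f]) * n[f]
    potential = sum(d_eps[j] * N[f] * r(j) * p[j] for j in range(f + 1, last))
    return (savings - cost) / potential
```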

However, expressed in the form above, the model requires data that are difficult to obtain, e.g., the propagation factor g_{i,i+1}, and still makes assumptions that need to be relaxed. We therefore reformulate it so that it is made more operational and is expressed in terms of parameters that can either be easily obtained through data collection during project performance or through the elicitation of expert opinion. The use of expert opinion is justified and further discussed in the next section.



4 USING EXPERT OPINION
There are many reasons why expert opinion may be needed. One of these reasons, which we face here, is that information regarding a phenomenon cannot be collected by any other affordable means (measurements, observations, experimentation). The question then, especially when looking at the problem from a scientific perspective, is whether expert data are valid data.
First of all, data refers to information measured on well-defined measurement scales, which can be nominal, ordinal, interval, or ratio. Thus, data refers to information in either quantitative or qualitative form. By this definition, expert judgment is data [15]. The next question is whether expert data are valid data. It might be argued that expert data are soft data in the sense that they incorporate the assumptions and interpretations of the experts [15]. Specifically, expert data are subject to bias, uncertainty, and incompleteness. However, these problems can be prevented and controlled. Generally, the aim of expert knowledge elicitation techniques is to prevent these problems to the maximum extent possible, to detect bias when it occurs, and to model uncertainty. Elicitation techniques achieve this by carefully selecting experts, designing the mode of data collection and the interview procedure, and quantifying the uncertainty in experts' responses. Thus, expert judgment can be defined as data gathered formally, in a structured manner, and in accordance with research on human cognition and communication [15]. We will show in Section 5 how we have addressed these issues for the CE model presented above.
To summarize, we believe that expert judgment is valid data and is comparable to other data. Expert judgment has been used in similar ways and with success in other fields such as nuclear engineering (risk models) [9] and policy decision making [19], and in software engineering for cost estimation purposes [2], [6], [7].

5 A COST-EFFECTIVENESS ASSESSMENT PROCEDURE
The purpose of this section is to provide an overview of the procedure we devised and followed in our case study to assess cost-effectiveness.

Step 1: Make the CE Model Operational in Context
The objective of this step is to instantiate the CE model for the projects in the case study and assess cost-effectiveness. When instantiating the CE model for a particular project, its underlying assumptions have to be assessed and relaxed if necessary. A minimal set of assumptions is, however, necessary to obtain a cost-effectiveness model that can be operationalized under realistic constraints. For the CE model, these assumptions are:
1. The effectiveness of the last defect detection activity equals 1, i.e., all defects are detected.
2. A defect found in activity a_i would on average result in g_{i,i+1} defects in the following defect detection activity a_{i+1}.
3. Defects introduced after the defect detection activity to be assessed must not be considered in the CE model for that activity.
The practical implication of the first assumption is that all defects will eventually be found by the last defect detection activity. In reality this activity should be the operation of the software. But in order to determine the model parameters after completion of the project and to provide timely feedback on the cost-effectiveness of a project's inspections, the last activity is assumed to be the last testing activity, e.g., acceptance testing. In our case study, the first assumption is an acceptable approximation, as the products delivered undergo substantial quality assurance and are of high quality.
The second assumption means that only defects causing a fault or failure in subsequent activities are considered. In the context of the case study, defects found in inspections are classified as major and minor defects. Major defects are those that would result in a defect in subsequent activities; other defects are denoted as minor defects. Therefore, in our context, only major defects should be used to determine the model's parameters.


To fulfill the third assumption, two practical implications have to be taken into account when instantiating the model for a particular sequence of defect detection activities. First, a separate model has to be instantiated for each inspection activity. In our case study, since inspections take place at the end of analysis, design, and coding, we actually need to develop three models. Second, the different origins of defects (i.e., analysis, design, code) have to be differentiated in order to determine the model parameters for each detection activity. For example, when assessing the cost-effectiveness of design inspections, only analysis and design defects should be used to estimate the parameters. Thus, instead of determining the overall number of defects detected in a particular defect detection activity, the number of analysis or design defects must be determined. Similarly, instead of determining the average defect cost in a defect detection activity, only the average defect cost for analysis and design defects has to be determined. In general, when instantiating the CE model for analysis, design, and code inspections, its parameters have to be determined for each defect origin separately.
It is therefore very likely that some of the parameters required by the model are not collected in most environments. In this case the model has to be reformulated so that it consists of parameters that can either be derived from existing measurement data or obtained from expert opinion. A key concept in this step is the decomposition [16] of the model. This concept stresses the importance of putting the expert-based parameters in a form that allows the expert to concentrate on estimating something tangible and easy to envisage. Parameters that are usable for expert elicitation may represent physical quantities, counts, proportions, but also probabilities. Additionally, based on our understanding of the development process in place, we have to determine who in the development team is a good expert to estimate each of the parameters, i.e., who has the required background and experience.
This step of the assessment procedure yields 1) reformulated instantiations of the CE models for each type of inspection, 2) a list of parameters to be obtained by expert knowledge elicitation, along with the roles and experience that qualify people as experts, and 3) a list of parameters to be obtained from project data.

Case Study Example

Below we present an operationalization of the CE model for assessing design inspections in the context of our case study. For design inspections, the sequence of detection activities consists of design inspection (DI), code inspection (CI), unit test (UT), and system test (ST); thus A = [DI, CI, UT, ST]. The parameters of Table 1 are given a subscript indicating the activity they refer to (e.g., DI for design inspection) or the defect origin they account for (e.g., AD for analysis and design defects combined).

$$CE(DI, A) = \frac{\sum_{i \in \{CI,UT,ST\}} d\varepsilon_{i,AD} \cdot n_{DI,AD} \cdot r_{DI,i} \cdot \frac{n_{i,AD}}{N_{i,AD}} \;-\; (d\varepsilon_{DI} + i\varepsilon_{DI}) \cdot n_{DI,AD}}{\sum_{j \in \{CI,UT,ST\}} d\varepsilon_{j,AD} \cdot N_{DI,AD} \cdot r_{DI,j} \cdot \frac{n_{j,AD}}{N_{j,AD}}}$$

with

$$r_{f,j} = \begin{cases} 1 & \text{if } j = f+1 \\ \prod_{k=f+1}^{j-1} \left(1 - \frac{n_{k,AD}}{N_{k,AD}}\right) \cdot g_{k,k+1} & \text{otherwise} \end{cases}$$
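Under the same caveats as the sketch in Section 3 (purely hypothetical numbers, our own encoding), this instantiation amounts to calling that function with A = [DI, CI, UT, ST] and all counts and costs restricted to analysis and design (AD) defects:

```python
# Activities: 0=DI, 1=CI, 2=UT, 3=ST. All values refer to AD defects only
# and are purely hypothetical; note p[3] = 1, per assumption 1 of Step 1.
d_eps = [0.5, 1.0, 3.0, 8.0]   # average cost per AD defect (hours)
i_eps = [0.3, 0.0, 0.0, 0.0]   # inspection cost per defect (only the DI entry is used)
n = [40, 12, 10, 6]            # AD defects found per activity
N = [68, 28, 16, 6]            # AD defects present per activity
p = [ni / Ni for ni, Ni in zip(n, N)]
g = [1.0, 1.0, 1.0, 1.0]       # propagation factors g_{i,i+1}

print(ce(0, d_eps, i_eps, n, N, p, g))  # CE of design inspections, ~0.44
```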

In our case study, as in most cases, the parameters of this model for design inspections (and similarly the parameters for analysis and code inspections) could not all be derived from the existing project data, nor could such data be collected in the near future. The model requires n_{i,AD}, the number of analysis and design defects detected in a_i. In most activities, however, only the total number of defects was collected. Thus, the experts had to be asked for the percentage breakdown of the defect origins in each activity a_i. With this information it was then possible to determine n_{i,AD}. Similarly, the average defect effort spent on analysis and design defects, dε_{i,AD}, is required. For inspections, only an overall estimate for the correction of all defects is available. This estimate contains both the effort for major and minor defects and, moreover, does not distinguish between the different defect origins. Thus, the experts had to be asked about the "typical" defect cost for a (major) defect of a particular origin detected in inspection activity a_i (we will see in a later section how such information can be elicited and formalized). The correction effort for each defect (along with the defect origin) was only available during unit test. However, this effort data did not consider the effort for isolating faults after a failure was detected and did not consider the time to re-test a correction. For system test, none of the required effort data were available. Thus, the experts had to be asked about this as well. Moreover, the model requires an average effort. But it might be difficult for experts (developers, testers) to estimate average effort values. It should be easier to estimate the effort for correcting a single defect. The experts can assign a meaning to this effort, since they experience it every time they correct a defect. Therefore, they were asked about the effort distributions for correcting a single defect. With these transformations in mind, the parameters for the design CE model were obtained following the procedure shown in the tables below.


Table 2 shows the parameters required by the model and indicates how they can be estimated.


dε_{i,AD}: average defect cost for defects originating in analysis or design and detected in activity a_i (data source: expert opinion)
iε_i: average inspection cost in activity a_i (data source: project data)
n_{i,AD}: number of defects originating in analysis or design and found in activity a_i (data source: n_i is available as project data; to compute n_{i,AD}, expert opinion is required)
g_{i,i+1}: propagation factor: a defect found in activity a_i would on average result in g_{i,i+1} defects in activity a_{i+1} (data source: expert opinion)

Table 2 Parameters of the CE model for design inspections

To determine the expert-based parameters listed in Table 2, the information shown in Table 3 is elicited from the experts.

dε_{i,o,e}: defect cost for one defect of origin o detected in activity a_i, as estimated by expert e
p_{i,o}: the percentage of defects of origin o in activity a_i
g_{i,i+1,e}: propagation factor for one defect from a_i to a_{i+1}, as estimated by expert e

Table 3 Information to be elicited from expert opinion

With this information, the parameters can be obtained as shown in Table 4 (E denotes the number of experts):

n_{i,o}: number of defects originating in o and found in activity a_i; transformation: n_{i,o} = n_i · p_{i,o}
dε_{i,A}: average defect cost for defects that originated in analysis and were detected in activity a_i; transformation: avg(dε_{i,A,1}, ..., dε_{i,A,E})
dε_{i,D}: average defect cost for defects that originated in design and were detected in activity a_i; transformation: avg(dε_{i,D,1}, ..., dε_{i,D,E})
dε_{i,AD}: average defect cost for defects originating in analysis or design and detected in activity a_i; transformation: (n_{i,A} · dε_{i,A} + n_{i,D} · dε_{i,D}) / (n_{i,A} + n_{i,D})
g_{i,i+1}: propagation factor: a defect found in activity a_i would on average result in g_{i,i+1} defects in activity a_{i+1}; transformation: avg(g_{i,i+1,1}, ..., g_{i,i+1,E})

Table 4 Computation of parameters from expert data
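Expressed in code, the transformations in Table 4 are direct. The sketch below, with hypothetical values and our own variable names, aggregates per-expert point estimates for one activity a_i; in the actual study the effort answers are distributions rather than point values (see Step 2).

```python
# Hypothetical per-expert estimates for one activity a_i; names are ours.
p_A_experts = [0.30, 0.25, 0.35]    # proportion of analysis defects in a_i
p_D_experts = [0.20, 0.25, 0.15]    # proportion of design defects in a_i
de_A_experts = [2.0, 3.0, 2.5]      # cost of one analysis defect in a_i (hours)
de_D_experts = [1.0, 1.5, 1.2]      # cost of one design defect in a_i (hours)
n_i = 120                           # total defects found in a_i (project data)

def avg(xs):
    return sum(xs) / len(xs)

n_A = n_i * avg(p_A_experts)        # n_{i,A} = n_i * p_{i,A}
n_D = n_i * avg(p_D_experts)        # n_{i,D} = n_i * p_{i,D}
de_A = avg(de_A_experts)            # average analysis-defect cost
de_D = avg(de_D_experts)            # average design-defect cost

# dε_{i,AD}: cost averaged over the combined AD defect population
de_AD = (n_A * de_A + n_D * de_D) / (n_A + n_D)
```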

Step 2: Preparation for Expert Knowledge Elicitation

The objective of this step is to prepare the means of the elicitation. These means consist of 1) questionnaires that are used to capture the experts' estimates, 2) an interview procedure that guides the interviewer in performing the elicitation, and 3) the selection of experts.
Based on the information identified in Step 1, questionnaires have to be developed that capture the information to be obtained from the experts through well-formed questions [18]. Along with the questionnaire, an interview procedure is to be defined to guide the interviewer during the course of the interviews. This aims at making the interviews more systematic and consistent. Following the discussion in Section 4, the aim of expert knowledge elicitation techniques is to reduce the impact of bias to the maximum extent possible and to take into account the uncertainty in the experts' answers. Moreover, in a literature survey on the use of expert opinion in risk assessment, Mosleh et al. [16] conclude that the methods by which expert opinion is elicited can have a significant impact on the accuracy of the resulting estimates. Thus, designing the questionnaire and interview procedure carefully and selecting appropriate response modes (i.e., the format in which the experts have to encode their answers) is of crucial importance. In any elicitation procedure many specific issues have to be addressed. In this paper we concentrate on two particularly important issues: uncertainty and bias.

Uncertainty
Two major kinds of questions were asked of the experts in this case study. One set of questions concerned the effort for correcting defects of various origins in all defect detection activities. The other set addressed the percentage breakdown of the various defect origins for all defect detection activities. It is impossible to ask the expert for a single value for these questions. One reason is that subjective estimates are inherently uncertain. This uncertainty stems from the expert's lack of knowledge of the exact value of a parameter. Providing an exact answer might be impossible since experts may not know the exact value or since the parameter value may actually vary with circumstances. For example, it is obvious that the cost of defect correction does not warrant a unique value but a probability distribution. To capture this uncertainty, a probability distribution can be selected as response mode. With such a distribution the experts are able to quantify their uncertainty. The probability distribution most often used in expert opinion elicitation is a triangular distribution, as shown in Figure 1. (In the only reported software engineering experience comparing the use of different distributions for expert estimation, Höst and Wohlin [6], [7] report that triangular distributions worked well for capturing expert estimates.) Thus, the expert is asked to provide a range, given by minimum and maximum values, in which the estimate can lie, as well as the most likely value. This is shown in the left side of Figure 1.
The use of a non-parametric distribution, such as the triangular distribution, is often recommended over parametric distributions in the domain of expert elicitation, as its parameters have an intuitive appeal to the experts and are easy to respond to [20]. Moreover, the triangular distribution is often considered appropriate when little is known about the actual distribution of a variable [20]. In the case study, this response mode was used to ask for the percentage of defects of a particular origin in a defect detection activity. For the estimation of effort data, however, the maximum or minimum values can be misleading. This is due to the fact that outliers (i.e., extreme low and high values that occur very seldom) can get excessive weight when using a triangular distribution [20]. Therefore, we asked for practical maximum and minimum values. These are values that have some reasonable chance of occurring. It was also explained to the experts that the range between these practical extreme values should contain 90% of all possible effort values. This is shown in the right side of Figure 1. To help experts visualize these response modes during the elicitation, slides similar to Figure 1 should be shown to them. Thus, they have a clear understanding of the information (e.g., maximum, minimum, most likely values) they are to provide.

[Figure 1 shows two triangular distributions, each with its most likely value at the peak. On the left, a distribution over a percentage scale running from the minimum (0%) to the maximum (100%). On the right, an effort distribution over defect frequency whose practical minimum and practical maximum each cut off a 5% tail.]

Figure 1 Capturing uncertainty using distributions

To illustrate how uncertain information was actually captured during interviews, an excerpt from a questionnaire is shown below.

Questions regarding the effort for fixing defects in code inspections
As a code author, it is your task to fix defects detected in code inspections. This involves deciding whether an issue is a real defect, finding a solution for the fix, and changing the code accordingly. How long does it typically take to find a solution for the fix and change the code accordingly?
1. Suppose you determined several major defects that were introduced during coding.
• In which range, according to your experience, can the effort for fixing one of these defects lie? __ h to ___ h
• What would you estimate as the most likely effort for fixing one of these defects? __ h
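Because the practical extremes are defined to contain 90% of the probability mass, the true support of the triangular distribution has to be recovered before the answers can be sampled. The sketch below does this numerically with SciPy, under our assumption that the practical minimum and maximum are treated as the 5th and 95th percentiles; the function name and the example values are ours, not from the study.

```python
from scipy.optimize import fsolve

def triangular_from_practical(p05, mode, p95):
    """Recover the support (a, b) of a triangular(a, mode, b) distribution
    whose 5th/95th percentiles match the elicited practical min/max.
    Assumes p05 < mode < p95."""
    def cdf(x, a, b):
        if x <= mode:
            return (x - a) ** 2 / ((b - a) * (mode - a))
        return 1.0 - (b - x) ** 2 / ((b - a) * (b - mode))

    def equations(v):
        a, b = v
        return (cdf(p05, a, b) - 0.05, cdf(p95, a, b) - 0.95)

    # Start from a support somewhat wider than the practical range.
    a, b = fsolve(equations, (p05 - (mode - p05), p95 + (p95 - mode)))
    return a, b

# Hypothetical answer: fixing a defect takes 0.5 h to 4 h in 90% of cases,
# with 1 h most likely.
print(triangular_from_practical(0.5, 1.0, 4.0))  # -> approximately (0.0, 5.0)
```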

Bias
During elicitation, the experts perform four cognitive tasks [15]. The expert must comprehend the wording and the context of the question. Then s/he must remember relevant information to answer each question. By processing this information the expert identifies an answer, which is said to be "internal" as it is in the expert's own representation mode. At this stage, people typically use mental shortcuts called heuristics to help integrate and process the information [11]. Finally, the internal answer has to be translated into the response mode requested by the interviewer. In each of these steps, especially in applying the heuristics, systematic errors can occur, which would distort the estimate [8]. To obtain reliable data it is therefore necessary to anticipate these biases and to design and monitor the elicitation accordingly. This involves the following two activities in the design of the elicitation:
1. Anticipate which biases are likely to occur in the planned elicitation. For this step, the available research literature on expert opinion elicitation is used to determine which sorts of bias can occur under which circumstances [15]. The potential biases identified as relevant for our case study are listed in Table 5.
2. Re-design the planned elicitation to make it less prone to the anticipated biases. For example, based on the list in Table 5, an initial draft version of the interview questionnaires was modified to account for the listed potential biases. Initially, the interview procedure guided the interviewer to simply ask for maximum and minimum values of the parameters to be estimated. This could have introduced bias due to overconfidence. This type of bias occurs when experts intuitively minimize the uncertainty in their answers. To address this, the interview procedure was changed to require the experts to think of possible scenarios under which a maximum or minimum value could occur. Thus, they process more information to provide an estimate for these values, which, based on existing research, typically results in better estimates. A second modification was made to address bias due to inconsistency. This can occur if people get tired, forget information they were provided with, and therefore change assumptions or definitions in the course of the interview. To counter-balance these phenomena, the questions were re-phrased so that the key concepts and definitions were repeated in each question. Additionally, breaks were planned in the interview procedure to prevent fatigue. Finally, when the expert mentioned scenarios, the interviewer asked about additional scenarios that were possible and, based on his experience in pilot interviews, suggested additional scenarios. This addresses the bias of availability, which occurs when experts focus only on a few recent scenarios. Two additional activities for monitoring bias in the elicitation will be described in Step 3 of this procedure.

Wishful thinking. Definition: people's hopes influence their judgement. Cause: what should happen influences thinking (possible gain from response).
Inconsistency. Definition: people are inconsistent in their solving of problems. Cause: unintentional change of assumptions through fatigue, memory problems, confusion.
Overconfidence. Definition: people underestimate the amount of uncertainty in their answers. Cause: people are uncomfortable with uncertainty in their lives and try to minimize it; people think that being an expert implies the ability to give exact answers.
Availability. Definition: people retrieve events with different ease from long-term memory. Cause: people do not receive scenarios triggering other, less accessible memory associations.
Social pressure. Definition: body language, facial expressions, and word choice of the interviewer change the response. Cause: people have a desire to gain acceptance.

Table 5 Bias definitions and sources in the case study

Finally in this second step, experts have to be identified, according to precise criteria, and motivated to do the job well. Several experts are necessary to estimate each parameter, as the multiplicity of answers helps cancel out random error [8]. This does not help, though, if the procedure induces systematic bias. In our case study, we selected four to six experts to estimate each parameter, depending on the model and parameter. In total, 23 experts were selected. Experts should be selected according to their role in the development process and their level of experience in the organization under study. In particular, a letter should be sent by the QA management and/or higher management to each potential interviewee in order to stress the importance and motivations of the study, as well as to officially guarantee the anonymity of the data collected from experts. To complete the selection process, phone calls are made to schedule the meetings and to answer upfront questions the experts might have about the study and their contribution. Using some of these experts, the interview procedure and the questionnaire are usually tried in a pilot test. The purpose of these pilot tests is to gain feedback and optimize the elicitation accordingly. Among the most important goals of pilot testing is to determine whether the experts are able to answer the questions and whether there are sources of confusion and bias that might have been overlooked. As a result, this step produces validated questionnaires supporting the elicitation of the information identified in Step 1 and the corresponding interview procedure.

Step 3: Performing Interviews and Screening Answers
An interview is scheduled with each expert. Face-to-face interviews are preferable, as the experts are more motivated and the interviewer has more control over the elicitation. However, depending on the number of questions that have to be answered by the expert, the interview can also be performed over the telephone. During the interview, the interviewer guides the expert through the questionnaire using the precisely defined interview procedure. Intensive use of visual aids should be made during the interviews, especially with respect to the visualization of the response mode. During the interview, the elicitation has to be monitored for the occurrence of bias. Therefore, the following two activities are performed:
1. Make the experts aware of the potential intrusion of bias. Experimental studies in probability elicitation have shown that substantial improvement in the quality of assessments can be obtained through elicitation training [9]. The experts need to be informed about the biases they are likely to exhibit. In particular, they should be informed about the definitions and causes of these biases [15]. Therefore, the general concept of bias is introduced at the start of the elicitation interviews and the expected biases are presented. In this case study we did this using Table 5. This table was printed on a slide and was kept on the table for the duration of the interview so that the experts had a visual anchor to the presented information. Additionally, the experts should be familiarized with the elicitation procedure, especially the response modes used in the questionnaire.
2. Monitor the elicitation for the occurrence of bias. During the elicitation, the interviewer monitors the expert's body language and verbalized thoughts. If these signs indicate some undesired situation, the interviewer has to react accordingly. For example, if the expert leans back in his chair during the interview, the interviewer should propose a break. As another example, phrases like "we had a case..." could point to a potential outlier in the effort distribution. Whenever the expert uses such phrases, the interviewer should inquire about the representativeness of the case.
Once all interviews are completed, the answers of the experts are compared. If significant differences are observed, this needs to be investigated further. In the case study this was done in the following way: if an expert's answer differed significantly from the other experts' answers, and if that expert had not experienced the phenomena related to the questions in the recent past or showed difficulties with the questions during the interview, then the answer was left out for the remainder of the analysis. If the expert did not differ from the other experts in terms of experience, he was simply debriefed over the phone to determine possible reasons for his different answers. Depending on the explanations provided to us, e.g., how representative the expert's experience was, it was then decided whether or not to include the answer in the analysis.
This step produces, for each question (i.e., for each expert-opinion-based parameter in the model), a set of answers from the experts in a predefined response mode.
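The screening described above was done qualitatively, by debriefing. Purely as an illustration, a simple quantitative rule one might use to pre-select answers for such a debriefing is to flag experts whose most likely value lies far outside the spread of their colleagues' answers; the heuristic and its threshold below are our own, not the study's.

```python
def flag_outliers(most_likely, k=2.0):
    """Flag experts whose most likely value deviates from the median of all
    answers by more than k times the median absolute deviation (MAD)."""
    xs = sorted(most_likely)
    med = xs[len(xs) // 2]                       # crude median
    mad = sorted(abs(x - med) for x in xs)[len(xs) // 2]
    if mad == 0:
        return []
    return [e for e, x in enumerate(most_likely) if abs(x - med) > k * mad]

# Hypothetical most likely effort answers (hours) from five experts:
print(flag_outliers([1.0, 1.2, 0.9, 1.1, 4.0]))  # -> [4], the fifth expert
```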

Step 4: Compute Cost-Effectiveness
The objective of the fourth step is to determine the cost-effectiveness of inspections in the environment under study, according to the CE model. Step 1 produces the instantiated CE models for each inspection phase and their operationalized parameters (as in Table 2 to Table 4). Together with the experts' answers, the cost-effectiveness is then computed as illustrated in Figure 2.

[Figure 2 sketches the computation: interview answers (e.g., "Proportion of design defects in design inspections?", yielding p_DI,D) are aggregated, and project data (e.g., "Number of defects in design inspections?", yielding n_DI = 200) determine model parameters such as n_DI,D = n_DI · p_DI,D; both feed the CE model f(x1, x2), whose Monte Carlo simulation produces the cost-effectiveness distribution.]

Figure 2 Steps in the computation of cost-effectiveness

For those parameters that can be determined from measurement data, the values are computed from either inspection or test data. For those parameters that are to be estimated by expert knowledge elicitation, the estimates of the different experts for each question have to be aggregated. Since the estimates of the experts are actually probability distributions, we also estimate a probability distribution for the cost-effectiveness. This distribution captures the inherent uncertainty in the cost-effectiveness of inspections. This uncertainty has two sources: the uncertainty in the experts' estimates and the inherent variation of cost-effectiveness across inspections. An important methodological point in presenting the results of any expert opinion study is to make explicit the underlying uncertainty of the results [9]. A probability distribution of the cost-effectiveness takes this aspect into account automatically.
Monte Carlo (MC) simulation is a convenient way of performing the aggregation of the experts' data [19] and the computation of the cost-effectiveness distribution. Although for simple distributions analytical models can be developed to derive the cost-effectiveness distribution, the Monte Carlo procedure is simpler, is identical regardless of the type of distributions used, and can easily be run using low-cost commercial simulation packages. Additionally, though this is not discussed here, it is also practical for determining the sensitivity of the model to its parameters [20].
The simulation procedure proceeds as follows. During one simulation run, a value for each input parameter is sampled from the experts' probability distributions. The set of sampled values forms a scenario, which is used as input to the model to compute the corresponding cost-effectiveness value. Repeating this procedure 1000 times provides 1000 cost-effectiveness values, which form a distribution. In this study we aggregated the answers concerning effort estimates according to the formulae given in Table 4. The answers concerning proportion estimates were aggregated during the MC simulation by selecting one expert per simulation run and parameter (with all experts having equal probabilities of being selected) and sampling from the probability distribution of the selected expert. As an example, the cost-effectiveness distributions for our case study are shown and discussed in Section 6.
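A minimal NumPy sketch of one way to implement this simulation follows. The triangular parameters, the number of experts, and the ce_model placeholder are all hypothetical; in the study the model is the instantiated CE formula of Step 1, and the effort answers were first aggregated according to Table 4.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical elicited triangular distributions (min, most likely, max),
# one tuple per expert, for two illustrative input parameters.
p_DI_D = [(0.4, 0.6, 0.8), (0.5, 0.7, 0.9), (0.3, 0.5, 0.8)]  # a proportion
de_UT = [(0.5, 1.0, 3.0), (0.8, 1.5, 4.0)]                    # an effort (hours)

def sample(per_expert):
    """One draw: pick an expert uniformly at random, then sample from that
    expert's triangular distribution (the scheme used for proportions)."""
    lo, mode, hi = per_expert[rng.integers(len(per_expert))]
    return rng.triangular(lo, mode, hi)

def ce_model(p, de):
    # Placeholder standing in for the instantiated CE formula of Step 1.
    return (de * p - 0.8) / (de * 1.7)

# 1000 runs: each scenario is one sampled value per input parameter.
ce_values = np.array([ce_model(sample(p_DI_D), sample(de_UT))
                      for _ in range(1000)])
print(ce_values.mean(), np.percentile(ce_values, [5, 50, 95]))
```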

6 COST-EFFECTIVENESS IN THE CASE STUDY
Figure 3 shows the results of the simulation performed using our extended and generalized cost-effectiveness model. The upper part shows the cost-effectiveness distributions; the lower part shows the savings distributions. As discussed earlier, the savings can be seen as high-level parameters of the CE model; they can be obtained from the same MC simulation that produces the cost-effectiveness.
We can see that the distributions of cost-effectiveness, as well as the savings, for analysis, design, and code inspections are clearly ordered. Analysis inspections are considerably more cost-effective than their design counterparts, which are in turn markedly better than code inspections. Despite the uncertainty modeled during expert knowledge elicitation, the distribution patterns are quite clear in this respect. These patterns confirm what is usually acknowledged by software engineering professionals, i.e., that earlier inspections are probably more beneficial. In addition, the QA engineers of the organization where the data was collected confirmed that they suspected the benefits of code inspections for some parts of the system to be limited, a suspicion confirmed by our results. We can see clearly that the code inspection cost-effectiveness distribution tails near the 0 threshold. Thus, the cost-effectiveness of code inspections should be further investigated in the future. All these observations support the plausibility of our model.
However, one might claim that since the cost-effectiveness distributions are partly based on parameters estimated by experts, overall consistency with expert opinion is not surprising. But recall that these results were obtained from measurement data and expert data that were only indirectly related to cost-effectiveness, in ways that are far from obvious and predictable. In other words, experts only estimated low-level parameters that are only indirectly related to the cost-effectiveness distributions we obtained here. Those distributions are not just the representation of a subjective expert perception of cost-effectiveness but are based on information that can be measured and information that was deemed assessable by experts. In general, because what we model is by definition not directly measurable, it must be acknowledged that one important recurring

difficulty with such models is their validation. It is only through indirect means (as shown above) and through practice that economic models measuring the cost-effectiveness of technologies can be validated and refined.
These results have important practical consequences. First, they can be used to compare future inspections in order, for example, to assess whether a change in inspection procedures affects their cost-effectiveness. That is, the distribution of cost-effectiveness of new projects could be compared with that of old projects. In other words, the distributions generated here could be used as baselines for future comparisons. Thus, changes to the inspection process can be assessed in terms of improved cost-effectiveness. Second, we see here that earlier inspections, namely analysis and design inspections, have much higher benefits. We can see from Figure 3 that the net savings of the three types of inspections follow the same pattern as cost-effectiveness. Any process improvement initiative should therefore focus on these earlier inspections first. Our results help focus the QA team's improvement efforts where the gain is likely to be most substantial.

[Figure 3 shows, for analysis, design, and code inspections, the simulated cost-effectiveness distributions (upper part) and the savings distributions (lower part).]

Figure 3 Results of simulations with CE model

7 LESSONS LEARNED
During the performance of the interviews and the analysis of the interview data, we drew lessons that can roughly be divided into the following categories: motivation of experts, selection of experts, and application of the elicitation techniques.

Motivation of experts
Overall, the experts were highly motivated for the interviews. They showed great interest in the study and the kind of questions they were asked. The fact that the experts (i.e., the developers and testers) were involved in order to draw conclusions about the inspection practices in the business unit was very well received. In some cases, the experts expressed their satisfaction and interest to the QA team. The experts were also very interested in receiving a copy of the final case study report to see the final results of their efforts.

Selection of experts
The QA management selected the experts following the rationale that they should be proactive and motivated developers and testers. This might also explain the positive atmosphere reported above. The experts had various levels of experience in the business unit, ranging from 3 to 18 years. Some of the experts with 3 to 4 years of experience had difficulties answering some of the questions regarding relatively rare events, such as the correction effort for analysis defects that are detected late in the system testing phase. Thus, experts should have a sufficient level of experience to be able to answer the questions. In a particular context, this should be determined during pilot testing. Additionally, experts should currently perform the activity about which they are interviewed. In some cases, people who had worked a long time as developers but were currently performing different tasks provided estimates that differed significantly from the estimates of other experts.

Application of elicitation techniques
One of the most important lessons learned was that the way in which questions are asked can influence the estimate. One question addressed the practical maximum effort for correcting defects. During the first pilot test of the elicitation we asked about the different scenarios in the following way: "Under what scenario would you expect a maximum value for the effort, and what would be the maximum value then?" followed by "What would then be a value for the practical maximum?" This question leads the expert to think of "pathological" instances of defect corrections and their abnormally large associated effort, which had a significant impact on the practical maximum estimate. During the debriefing after pilot testing, the experts concurred with this observation. Therefore, in the subsequent interviews the question was phrased as "Under what scenario would you expect a practical maximum value for the effort, and what would be the practical maximum value then?" Compared to the experts in the interviews, the estimates of the pilot testers showed significantly higher maximum values. This is a clear indicator that the first way of posing the questions very likely introduced bias.
A second observation is that people had difficulty estimating some of the effort values. It was difficult to estimate the correction effort for analysis or design defects detected in late testing phases. This could be attributed to the fact that the tasks for correcting these defects are not as easy to visualize and remember and are not performed as often as, for example, the correction of code defects. Thus, better support for these kinds of questions should be sought. It was also difficult to estimate the effort to correct one defect after inspections. Some experts had difficulty including the fixed effort occurring once per inspection in their estimates (e.g., checking the document into the configuration system, performing administrative tasks). Thus, in the future the model should be refined so that the effort spent on correcting defects and the effort spent on administrative tasks are elicited separately.
A third observation is that the answers of the experts can differ significantly because of their varying experience. One of the experts provided large estimates for defect correction in test. It turned out that this expert was responsible for a very complex part of the system. Such differences have to be taken into account when aggregating the answers of the experts.
Finally, it was observed that, when estimating the percentage breakdown of defects, experts usually tried to recall the number of observations and infer the percentage from this information. In eliciting subjective probabilities, it has been found that indicating a probability of, say, 5% as "5 in one hundred" yields better performance [17]. Thus, percentages of defects should be elicited in the following way: "Suppose you have one hundred defects. How many of these defects are of origin, say, analysis?"

8 FUTURE WORK
Future research activities can be summarized by the question "How can we better elicit expert judgment to get more precise estimates of inspection cost-effectiveness?" This encompasses the identification of factors that can and should be elicited from software development experts with high reliability and using an optimal elicitation technique. It involves a sensitivity analysis of the model to identify model parameters that have a high impact on the resulting cost-effectiveness distribution. Special care can then be taken when eliciting these factors. Equally important is to identify quantities software experts can estimate with high accuracy. Of particular interest in this context would be studies assessing the accuracy of the experts' estimates against actual values. In other fields, as well as in software engineering, such studies are sparse [16], [6], [7]. This would also require an investigation of whether elicitation techniques alternative to the ones used in the presented method should be employed (e.g., alternative response modes). To improve data analysis, methods to help detect suspicious or biased data should be investigated. Since bias can never be fully avoided, its detection and the application of corrective actions during analysis should complement bias prevention.

9 CONCLUSIONS
The contribution of this paper is twofold. First, based on the existing research on expert knowledge elicitation, we have provided a method to estimate the cost-effectiveness of inspections under realistic constraints by combining measurement data and expert opinion. It is important to note that such a method can easily be tailored to the evaluation of technologies other than inspections. Second, we performed a case study from which we identified lessons learned and important research directions. Several outcomes from the case study are worth noting:
1. The quantitative results provide further evidence that early inspections are very beneficial, as the environment under study is representative of today's telecom software development.
2. It is not clear that inspections are always beneficial, under all circumstances, as is sometimes suggested with enthusiasm in the literature. In our study, for example, the usefulness of code inspections should be further investigated, as substantial gains have not been clearly demonstrated.
3. The mere fact of involving the developers and testers in assessing a technology like inspections is already beneficial. They feel they contribute to the evaluation and betterment of current development practices, as opposed to being imposed new ones.

ACKNOWLEDGEMENTS
We would like to thank the developers and testers who placed their time and experience at our disposal. Additionally, we would like to thank Mr. Karl for his crucial support during the case study.

REFERENCES
1. A. F. Ackerman, L. S. Buchwald, and F. H. Lewski, Software Inspections: An Effective Verification Process, IEEE Software, vol. 6, no. 3, pp. 31-36, 1989.
2. L. Briand, K. El Emam, and F. Bomarius, COBRA: A Hybrid Method for Software Cost Estimation, Benchmarking, and Risk Assessment, in Proceedings of the 20th International Conference on Software Engineering, pp. 390-399, 1998.
3. L. Briand, K. El Emam, O. Laitenberger, and T. Fussbroich, Using Simulation to Build Inspection Efficiency Benchmarks for Development Projects, in Proceedings of the 20th International Conference on Software Engineering, pp. 340-349, 1998.
4. R. G. Ebenau and S. H. Strauss, Software Inspection Process, McGraw-Hill, 1993.
5. T. Gilb and D. Graham, Software Inspection, Addison-Wesley, 1993.
6. M. Höst and C. Wohlin, An Experimental Study of Individual Subjective Effort Estimations and Combinations of the Estimates, in Proceedings of the 20th International Conference on Software Engineering, pp. 332-339, Kyoto, Japan, 1998.
7. M. Höst and C. Wohlin, A Subjective Effort Estimation Experiment, International Journal of Information and Software Technology, vol. 39, no. 11, pp. 755-762, 1997.
8. E. Hofer, On surveys of expert opinion, Nuclear Engineering and Design, vol. 93, no. 2-3, pp. 153-160, 1986.
9. S. C. Hora and R. L. Iman, Expert opinion in risk analysis: the NUREG-1150 methodology, Nuclear Science and Engineering, vol. 102, pp. 323-331, Aug. 1989.
10. P. Jalote and M. Haragopal, Overcoming the NAH Syndrome for Inspection Deployment, in Proceedings of the 20th International Conference on Software Engineering, pp. 371-378, 1998.
11. D. Kahneman, P. Slovic, and A. Tversky, eds., Judgment under Uncertainty: Heuristics and Biases, Cambridge University Press, 1982.
12. B. Kitchenham, L. Pickard, and S. L. Pfleeger, Case studies for method and tool evaluation, IEEE Software, vol. 12, no. 4, pp. 52-62, 1995.
13. D. H. Kitson and S. M. Masters, An analysis of SEI software process assessment results: 1987-1991, in Proceedings of the 15th International Conference on Software Engineering, pp. 68-77, 1993.
14. S. Kusumoto, K. Matsumoto, T. Kikuno, and K. Torii, A new metric for cost-effectiveness of software reviews, IEICE Transactions on Information and Systems, vol. E75-D, no. 5, pp. 674-680, 1992.
15. M. A. Meyer and J. M. Booker, Eliciting and Analyzing Expert Judgment: A Practical Guide, Academic Press, 1991.
16. A. Mosleh, V. M. Bier, and G. Apostolakis, The elicitation and use of expert opinion in risk assessment: a critical review, in Probabilistic Safety Assessment and Risk Management: PSA '87, vol. 1 of 3, pp. 152-158, 1987.
17. J. M. van Noortwijk, A. Dekker, R. M. Cooke, and T. A. Mazzuchi, Expert judgment in maintenance optimization, IEEE Transactions on Reliability, vol. 41, pp. 437-432, Sept. 1992.
18. A. N. Oppenheim, Questionnaire Design, Interviewing and Attitude Measurement, Pinter Publishers, 1992.
19. T. Saaty, The Analytic Hierarchy Process, McGraw-Hill, 1990.
20. D. Vose, Quantitative Risk Analysis: A Guide to Monte Carlo Simulation Modelling, John Wiley & Sons, 1996.
