Performance Variability in Software Product Lines: Proposing Theories from a Case Study

Empirical Software Engineering manuscript No. (will be inserted by the editor)

Performance Variability in Software Product Lines: Proposing Theories from a Case Study

Varvana Myllärniemi · Juha Savolainen · Mikko Raatikainen · Tomi Männistö

Received: date / Accepted: date

Abstract Context. In software product line research, product variants typically differ in their functionality, and quality attributes are not purposefully varied. Objective. The goal is to study purposeful performance variability in software product lines, in particular the motivation to vary performance and the strategy for realizing performance variability in the product line architecture. Method. The research method was a theory-building case study that was augmented with a systematic literature review. The case was a mobile network base station product line with capacity variability. The data collection, analysis and theorizing were conducted in several stages: the initial case study results were augmented with accounts from the literature. We constructed three theoretical models to explain and characterize performance variability in software product lines; the models aim to be generalizable beyond the single case. Results. We describe capacity variability in a base station product line. Thereafter, we propose theoretical models of performance variability in software product lines in general. Performance variability is motivated by customer needs and characteristics, by trade-offs and by varying operating environment constraints. Performance variability can be realized by hardware or software means; moreover, the software can either realize performance differences in an emergent way, through impacts from other variability, or by utilizing purposeful varying design tactics. Conclusions. The results point out two differences compared with the prevailing literature.

V. Myllärniemi, Aalto University, Finland. E-mail: [email protected]. P.O. Box 15400, FI-00076 Aalto, FINLAND. Tel. +358-50-3504626, Fax +358-9-8554058
J. Savolainen, Danfoss Power Electronics A/S, Denmark. E-mail: [email protected]
M. Raatikainen, Aalto University, Finland. E-mail: [email protected]
T. Männistö, University of Helsinki, Finland. E-mail: [email protected]


Firstly, when the customer needs and characteristics enable price differentiation, performance may be varied even with no trade-offs or production cost differences involved. Secondly, due to the dominance of feature modeling, the literature focuses on the impact management realization. However, performance variability can be realized through purposeful design tactics that downgrade the available software resources and by having more efficient hardware.

Keywords Case study · Software product line · Variability · Software architecture

1 Introduction

Companies that develop software products face the diversity of customer needs. Instead of offering a single product as a compromise of the varying needs, companies offer several products with slightly varying capabilities. Software product lines have gained popularity as an approach to efficiently developing such varying products. A software product line is a set of software-intensive products that share a common, managed set of features, a common software architecture and a set of reusable assets (Bosch 2000; Clements and Northrop 2001). Instead of developing products independently, the products of a product line are developed by reusing existing product line assets in a prescribed way; these assets include software components, requirements, test cases, and other reusable artifacts. In a software product line, the architecture and the assets must be able to cover commonality and variability. Commonality represents those aspects that are shared among the products. Thus, commonality enables reuse, and consequently increases development efficiency. Variability is the ability of a system to be efficiently extended, changed, customized or configured for use in a particular context (Svahnberg et al 2005). Thus, variability represents those aspects that enable product differentiation and customization for different customer needs.
Variability manifests itself at many different levels (van Gurp et al 2001): in the requirements, in the architecture, and in the implementation. One of the key challenges in product line engineering is the efficient management and realization of variability. Consequently, variability has been a focus of intense research in recent years. However, the variability of quality attributes, and in particular the variability of performance, has received less research attention. From the research point of view, and from the point of view of industrial cases reported in the research, product variants seem to differ from each other mostly through their functional capabilities (Galster et al 2014), and performance is kept more or less similar, or at least its variability is not purposeful and explicitly managed.

There are certain aspects of purposeful performance variability in software product lines that call for investigation. Firstly, performance is continuous (Regnell et al 2008): instead of being either included in or excluded from a product variant, performance is measured as different shades of product goodness. Thus, different customer needs may often be addressed with the same product: if customer A requires that the response time of function X be at most 500ms, and customer B requires at most 1000ms, a product with a response time of 500ms will satisfy both needs. By contrast, functional variants often cannot be ordered or substituted for each other. For the product line owner, all additional and explicitly managed variability, for example, the ability to produce both 500ms and 1000ms variants, adds to the cost and complexity of the product line. Therefore, the motivation to purposefully vary performance should be studied. Secondly, if performance is decided to be varied as a software product line, the differences between the products must be realized as variability in the product line architecture


and assets. Because of the architectural nature of performance and other quality attributes, the realization of performance variability may crosscut the product line architecture. It has been argued that quality attribute variability that affects the software architecture is difficult to realize and hence is to be avoided (Hallsteinsen et al 2006a). Thus, the strategy for realizing performance variability needs to be studied. Finally, the literature on performance variability, and on quality attribute variability in general, mostly lacks empirical evidence, at least evidence that is explicitly reported and drawn from industrial product lines (Myllärniemi et al 2012). Even if studies address quality attribute variability, they often do not describe the study context or the research design (Galster et al 2014). By contrast, there is even evidence of industrial contexts in which quality variability was not needed (Galster and Avgeriou 2012). The lack of empirical evidence also applies to software product line engineering in general (Ahnassay et al 2013). Further, the existing empirical evidence on software product line engineering tends to focus on constructing and evaluating methods, techniques and approaches (Ahnassay et al 2013), that is, it is mostly about validating artifacts (Hevner et al 2004) or prescriptive design theories (Gregor 2006). Thus, there is a need to observe the phenomenon in its real-life context (Yin 1994), that is, to study how quality attribute variability exhibits in industrial product lines. To address the above issues, this paper presents a theory-building case study (Yin 1994; Urquhart et al 2010). The goal is to study the motivation to purposefully vary performance and the realization of performance variability in software product lines. The research questions are as follows:

RQ1 Which characteristic of performance is decided to be varying?
RQ2 Why is performance decided to be varied?
RQ3 What is the strategy for realizing performance variability within the product line architecture?
RQ4 Why is the realization strategy chosen?

To answer the research questions, the study was conducted as a single-case case study (Yin 1994; Runeson and Höst 2009) that was augmented with a systematic literature review (Wohlin and Prikladniki 2013). We conducted a post mortem case study in the domain of 3G (3rd generation) mobile telephone networks. The case company is Nokia, formerly Nokia Solutions and Networks, and the product line in the focus of this study was a base station in the 3G radio access network. This software product line was designed to exhibit purposeful capacity variability. In addition to describing the results as capacity variability for base station product lines, we propose a number of theoretical models to address performance variability in software product lines in general. To build the theoretical models, we adopted a grounded theory approach to iteratively collecting and analyzing data (Urquhart et al 2010): the analysis and synthesis utilized the case account as a basis, and augmented the theory categories, boundaries and example instantiations from the literature. The case study was partly explanatory, partly descriptive (Runeson and Höst 2009); consequently, the resulting theoretical proposals include both describing and explaining models (Gregor 2006). The scope of this study was purposeful variability, which means that unintended, indirect quality variability (Niemelä and Immonen 2007) was not addressed. Further, the focus was on performance, which includes subattributes such as response time, memory consumption, and capacity. Finally, the scope included both software product lines and software-intensive product lines; software-intensive here implies that the product line encompasses both hardware and software.
One of the contributions is to describe and explain capacity variability in a base station product line, thus accumulating the reported empirical evidence on quality attribute variability in industrial product lines. Another contribution is the combined analysis of the existing literature and the case account. The main contribution is to propose theoretical models that describe and explain performance variability in software product lines. By building the results into these models through analytical generalization (Yin 1994), we aim at generalizing the results beyond the setting or domain of this single case study. The proposed theoretical models indicate two fundamental differences between the prevailing research approaches and the case account. Firstly, performance variability is motivated by customer needs and characteristics, by design trade-offs and by varying operating environment constraints. Typically, the literature explains performance variability as a way to resolve the trade-offs and constraints stemming from the solution domain. Due to price differentiation to the customers and their evolving needs, performance variability may be motivated by the problem domain alone, that is, performance may be varied even with no trade-offs involved and even when the cost to produce the variants is the same. Secondly, due to the dominance of feature modeling, the existing literature focuses on realizing differences in performance by managing the impact and interactions of software features on product performance; we formulate this as the impact management realization. By contrast, the case company realized variability through both hardware and software; the software utilized a purposeful design tactic to downgrade the available resource with minimum dependencies to other variability. This realization enabled varying capacity separately from other variability and upgrading it at runtime as the needs evolved. This paper has been extended from an earlier publication (Myllärniemi et al 2013); all content from that publication has been thoroughly revised and updated.
As a novel contribution, this paper presents the following:

– An extended review of the previous work.
– An extended data collection and analysis that also includes accounts from the literature.
– Proposed theoretical models that answer the research questions in more general terms, beyond the domain of the case study.

The rest of this paper is organized as follows. Section 2 lays out the basic theoretical concepts related to the research questions and reviews previous work. Section 3 describes the research method. Section 4 describes the results as capacity variability in a base station product line. Section 5 describes the results as theoretical models of performance variability in software product lines. Section 6 discusses the validity of the results and the lessons learned, while Section 7 concludes.

2 Background and Review of Previous Work

In the following, Section 2.1 and Section 2.2 describe the theoretical foundation for our research topic. Thereafter, Section 2.3 and Section 2.4 present a review of the previous work on the research topic.

2.1 Performance as a Quality Attribute

Quality attributes, such as performance, security and availability, play a significant role in many industrial software products. In such a system, failing to meet one quality requirement may render the whole system useless. Quality attributes can be defined as characteristics that affect an item’s quality (IEEE Std 610.12-1990 1990). However, due to this vague definition, quality attributes are often defined via attribute taxonomies (ISO/IEC 9126-1 2001; ISO/IEC 25010 2011; Boehm et al 1978; McCall et al 1977), which define the constituent subattributes in more concrete terms. In the context of software products, special focus is on external quality attributes (ISO/IEC 9126-1 2001), or on product quality attributes (ISO/IEC 25010 2011). This is because external quality attributes can be used to distinguish products from each other in the eyes of the customers. Finally, quality attributes can be divided into those that are observable at runtime and those that are not (Bass et al 2003). The former relate to the dynamic properties of the computer system or to the quality properties in use (ISO/IEC 25010 2011), whereas the latter relate to the static properties of the software (ISO/IEC 25010 2011).

Quality requirements, also known as non-functional requirements (Mylopoulos et al 1992; Berntsson Svensson et al 2012), characterize quality attributes: a quality requirement is a requirement that a quality attribute is present in software (IEEE Std 1061-1998 1998). For software products, the quality requirements and their target values need to be decided by taking into consideration the benefit to the market, the cost of achieving them, and even the competitor products (Regnell et al 2008). Yet, there are several practical challenges; for example, quality requirements tend to be neglected and overlooked (Berntsson Svensson et al 2012). Performance is an important quality attribute in the industry (Berntsson Svensson et al 2012). Poor performance may imply lost revenue, decreased productivity, increased development and hardware costs, and damaged customer relations (Smith and Williams 2002). Performance can be defined as the degree to which a system or component accomplishes its designated functions within given constraints, such as speed, accuracy, or memory usage (IEEE Std 610.12-1990 1990).
Performance is relative to the amount of resources used to meet those constraints; example resources include other software products and the software and hardware configuration of the system (ISO/IEC 25010 2011). Thus, performance is a dynamic property of the computer system (ISO/IEC 25010 2011), which also implies that it is observable at runtime (Bass et al 2003) and can be used to distinguish products from each other. Performance can be divided into the subattributes of time behavior, resource utilization and capacity (ISO/IEC 25010 2011). Time behavior can refer to the latency of responding to an event, or to the throughput of processing events in a given time interval (Bass et al 2003; Barbacci et al 1995). Time behavior requirements may be defined relative to the amount of resources needed to meet the constraints, and to the load of the system (ISO/IEC 25010 2011; Bass et al 2003). Resource utilization refers to the amount of resources the system uses when performing its functions (ISO/IEC 25010 2011); typical examples of resources include both static and dynamic memory. Capacity means the degree to which the maximum limits of a product or system parameter meet requirements (ISO/IEC 25010 2011); as a concrete example, capacity can be defined as the maximum achievable throughput without violating the specified latency requirements (Barbacci et al 1995).

The software architecture is critical to the realization of many quality attributes (Bass et al 2003): a significant part of the quality attributes is determined by the choices made during architecture design. This is also true for performance: most performance failures are caused by not considering performance early in the design (Smith and Williams 2002). Performance is affected by several software architecture design decisions: the type and amount of communication among components; the functionality that has been allocated to these components; and the allocation of shared resources (Bass et al 2003).
In many systems, the functionality is decentralized, which means that performing a given function is likely to require collaboration among many different components (Smith and Williams 2002). Due to this architectural nature, performance should be designed in and evaluated at the architectural level (Bass et al 2003). To address performance during design, performance tactics and patterns encapsulate reusable solutions (Bass et al 2003; Smith and Williams 2002). Performance tactics include decreasing resource demand; increasing or parallelizing resources; and enhancing resource arbitration (Bass et al 2003). Further, there are many design decisions that improve other quality attributes at the expense of performance: for example, most of the availability tactics (Bass et al 2003) increase overhead and complexity. Such situations are called trade-offs, and they are usually resolved by finding a global, multi-attribute optimum (Barbacci et al 1995). The way these trade-offs are resolved during the architectural design shapes the quality attributes of the system; later, in the implementation, individual quality attributes cannot be easily improved.
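The notion of resolving a trade-off by finding a global, multi-attribute optimum can be illustrated with a minimal sketch. All names, scores and weights below are hypothetical and purely illustrative; they are not taken from the paper or from the cited sources.

```python
# Hypothetical sketch of multi-attribute trade-off resolution (cf. Barbacci et al 1995):
# each candidate architecture scores differently on performance and availability,
# and the design with the best weighted overall utility is selected.

candidates = {
    # Replicated services improve availability but add communication overhead.
    "replicated": {"performance": 0.6, "availability": 0.9},
    # A single optimized service maximizes throughput but is a single point of failure.
    "monolithic": {"performance": 0.9, "availability": 0.5},
}

# Relative importance of each quality attribute (illustrative weights summing to 1).
weights = {"performance": 0.7, "availability": 0.3}

def utility(scores):
    """Weighted sum of normalized quality attribute scores."""
    return sum(weights[attr] * value for attr, value in scores.items())

best = max(candidates, key=lambda name: utility(candidates[name]))
print(best)  # monolithic: 0.9*0.7 + 0.5*0.3 = 0.78 beats 0.6*0.7 + 0.9*0.3 = 0.69
```

With these particular weights the performance-oriented design wins; shifting weight toward availability flips the choice, which is exactly why the weighting decision shapes the quality attributes of the resulting system.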

2.2 Variability in Software Product Lines

To manage and represent variability in product lines, features and feature modeling (Kang et al 1990, 2002; Czarnecki et al 2005) have become the de facto standard in the research community. A feature can be seen as a characteristic of a system that is visible to the end-user (Kang et al 1990), or in general, as a system property that is relevant to some stakeholder and is used to capture commonalities or discriminate among product variants (Czarnecki et al 2005). A feature model then represents the variability and relations of features. In addition to managing variability through features, the variability of the architecture (Thiel and Hein 2002) and the implementation (Svahnberg et al 2005) needs to be managed. Variability management typically focuses on functional variability. Quality attribute variability in software product lines has been studied to some degree (Myllärniemi et al 2012; Etxeberria et al 2007). However, quality attribute variability is typically not the main contribution but only supports other, more central aspects of a study (Myllärniemi et al 2012). Further, quality attribute variation can be both purposeful and unintentional. This is because any variability in the product line may also cause indirect variation in the quality attributes (Niemelä and Immonen 2007). At least the following combinations of quality attributes, variability, and product lines can be identified. Firstly, software product lines can have purposeful quality attribute variability: this is the focus of our study. The product line has the ability to deliberately create quality attribute differences between the products, that is, the products exhibit purposefully different quality attributes to serve different needs. The products are developed as a product line, which means that the product line and its assets need to explicitly manage and realize quality attribute variability.
Secondly, there can be product lines that do not have purposeful quality attribute variability. Either all products in the product line exhibit more or less similar quality attributes, or the quality differences between the products are unintentional. For example, the product line architecture may be designed to address a common, ”worst case” quality requirement (Hallsteinsen et al 2006a). Thirdly, there can be products with purposefully different quality attributes that are not developed as a product line, for example, under one product line architecture. This kind of approach is well tailored to specific needs, but at the same time is costly, since the level of reuse will be lower. This may be an option if the needed architectural solutions are very different or conflicting.
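The feature-modeling concepts above can be made concrete with a small sketch: a toy feature model with one mandatory feature (commonality), optional features (variability), and one cross-tree constraint, from which the valid product variants are enumerated. The feature names and the constraint are invented for illustration, not drawn from the paper.

```python
from itertools import combinations

# Hypothetical feature model: one mandatory feature (commonality) and
# optional features (variability), plus a cross-tree "requires" constraint.
mandatory = {"Core"}
optional = {"Encryption", "KeyStore", "Logging"}
requires = {"Encryption": "KeyStore"}  # Encryption cannot be selected without KeyStore

def valid(variant):
    """A variant is valid if it contains all mandatory features and
    satisfies every requires-constraint whose source feature is selected."""
    return mandatory <= variant and all(
        target in variant for feat, target in requires.items() if feat in variant
    )

# Enumerate all products of this tiny product line.
variants = []
for r in range(len(optional) + 1):
    for combo in combinations(sorted(optional), r):
        candidate = mandatory | set(combo)
        if valid(candidate):
            variants.append(candidate)

print(len(variants))  # 6 valid variants out of the 8 optional-feature combinations
```

The two excluded combinations are exactly those that select Encryption without KeyStore, showing how a single cross-tree constraint prunes the product space.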


2.3 Performance Variability in Software Product Lines

In the following, we briefly review the related work on performance variability in software product lines. Most of the studies that address quality attribute variability discuss quality attributes in general, and the contribution is not limited to specific quality attributes such as performance. Only a handful of studies focus on specific quality attributes: as an example, Mellado et al (2008) propose a process of security requirements engineering for software product lines. Instead, a typical case is to propose a method or a construct that is promised to be applicable to all quality attributes, and then utilize a concrete example with specific quality attributes. However, questions have been raised about whether a blanket solution can cover all quality attributes equally well (Myllärniemi et al 2012; Berntsson Svensson et al 2012). Nevertheless, when looking at the examples utilized in the studies, it seems that performance, and in particular time behavior and resource consumption, are two quality attributes that are often proposed to be varied in software product lines (Myllärniemi et al 2012).

Performance variability can be represented and managed in many different ways and on many different levels; this is also noted in two literature reviews (Myllärniemi et al 2012; Etxeberria et al 2007). Some approaches focus more on representing and managing performance variability in the problem space, as performance requirements or goals, or as performance options that can be selected during application engineering. Other approaches focus more on how performance variability is realized, either through architectural tactics, or through the interplay of features and feature impacts in the software product line. Firstly, performance variability can be represented by capturing how the features in a feature model impact performance: when features are varied, so is performance.
A feature impact characterizes how a particular feature contributes to performance: for example, feature Credit Card contributes 50ms to the overall response time (Soltani et al 2012). Such feature impacts can be represented as feature attributes (Benavides et al 2005) or listed as the properties of features (White et al 2009; Soltani et al 2012); moreover, the impacts can be either quantitative or qualitative (for example, feature Encryption has a negative contribution to the response time (Jarzabek et al 2006)). Representing the impact of features has been used for response time (Soltani et al 2012), CPU consumption (White et al 2009; Guo et al 2011), memory consumption (Tun et al 2009; Guo et al 2011), and speed (Bagheri et al 2010). However, the challenge is to characterize or measure the impact per feature; it has been claimed that time behavior can only be characterized per variant and not per feature (Siegmund et al 2012b). Further, the feature impacts may depend on the presence of other features. This is called feature interaction (Siegmund et al 2013, 2012a; Sincero et al 2010; Etxeberria and Sagardui 2008): a certain combination of features may create a bigger memory footprint, compared with simply aggregating the memory footprints of individual features. Feature interactions may occur when the same code unit participates in implementing multiple features; when a certain combination of features requires additional code; or when two features share the same resource (Siegmund et al 2012b, 2013). Feature interactions have been addressed for memory footprint, main memory consumption and time behavior (Siegmund et al 2012a, 2013).

Another way to represent performance variability is to capture varying performance directly as ”quality attribute features” or as other feature-like entities (Lee and Kang 2010; Etxeberria and Sagardui 2008): this makes it straightforward to select the desired performance variant for a product.
However, the realization of different performance variants must also be addressed, for example, by characterizing qualitatively how each functional or technological feature contributes to the quality attribute features (Lee and Kang 2010; Etxeberria and Sagardui 2008). More in the problem domain, there are also approaches that represent performance variability as softgoals. Softgoals represent requirements that do not have clear-cut satisfaction criteria; instead, they are satisfied when there is sufficient positive evidence and little negative evidence (Mylopoulos et al 2001). Typically, performance is represented as softgoals, which are in turn operationalized as varying goals (Yu et al 2008) or varying tasks (González-Baixauli et al 2007); these varying goals and tasks are then mapped to solution-space features. Alternatively, the softgoals can be operationalized directly as varying features (Jarzabek et al 2006). The operationalization captures the qualitative impact, such as hurt or help, from the varying goals, tasks or features onto the performance softgoals. However, since performance is one of the quality attributes for which it is relatively easy to define clear-cut satisfaction criteria with quantifiable measures, softgoals are more commonly used for the variability of other quality attributes, such as security or usability. Yet, even performance requirements may be represented in a less specific form as softgoals during the early stages of product line development, for example, as High performance, and later converted into features with clear-cut satisfaction criteria (Jarzabek et al 2006). Finally, it is also possible to attach performance information directly to variation points, that is, to an orthogonal variability model (Roos-Frantz et al 2012). In addition to representing variability, one must be able to derive or optimize products with specific characteristics. However, algorithms that take quantitative impacts or optimization into account are computationally very expensive.
Earlier CSP-based solvers exhibited solution times exponential in the size of the problem (Benavides et al 2005). White et al (2009) showed that finding an optimal variant that adheres to both the feature model constraints and the system resource constraints is an NP-hard problem. Therefore, approximation algorithms (Guo et al 2011; White et al 2009) as well as HTN planning (Soltani et al 2012) have been proposed. In contrast to the multitude of approaches that study performance variability at the feature level, there is much less explicit discussion on the realization of performance variability in the product line architecture. In principle, varying architectural design decisions can be represented as varying features: for example, the feature Euclidean describes a certain variant of a face recognition algorithm (White et al 2009). This is because features can be used to capture design decisions (Jarzabek et al 2006), domain technology or implementation techniques (Lee and Kang 2010). Thus, it is not always clear-cut whether an approach focuses on features or on architectural design decisions in realizing performance variability. Nevertheless, different architectural tactics have been used to analyze how varying performance requirements can be met (Kishi and Noda 2000; Kishi et al 2002). Also the effect of different algorithms (White et al 2009; Bagheri et al 2010) or different patterns (Hallsteinsen et al 2006b) on time behavior and resource consumption has been captured. The role of hardware is sometimes discussed in conjunction with performance variability; varying hardware constrains the resource consumption and thus affects how software features can be selected (Botterweck et al 2008; Karatas et al 2010). For example, hardware features 1024kB and 2048kB represent the choice between different memory components in a system (Botterweck et al 2008).
Thereafter, explicit constraints relate the software and hardware features: for example, software feature DiagnosticsAccess excludes 1024kB (Botterweck et al 2008). Such explicit constraints may stem from known incompatibility issues, or from the externalized knowledge on reference configurations, as described by Sinnema et al (2006). The varying hardware can also be outside the scope of the product line, for example, when the application resource consumption is constrained by the mobile device capabilities (White et al 2007).
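The impact-based representation and the hardware resource constraints reviewed above can be combined in a small sketch: per-feature memory impacts are aggregated per variant and checked against a varying hardware memory limit by brute force. Feature names and numbers are illustrative (loosely echoing the 1024kB/2048kB memory choice from Botterweck et al 2008), not an implementation from any of the cited approaches.

```python
from itertools import combinations

# Hypothetical per-feature memory impacts in kB (cf. feature attributes,
# Benavides et al 2005). In practice such impacts may need to be measured
# per variant rather than per feature (Siegmund et al 2012b).
memory_kb = {"Core": 500, "DiagnosticsAccess": 700, "Encryption": 300}

# Varying hardware features constrain software selection (cf. Botterweck et al 2008):
# e.g., a choice between 1024 kB and 2048 kB memory components.
hardware_limits_kb = [1024, 2048]

def feasible_variants(limit_kb):
    """Brute-force all feature combinations that include Core and fit in memory.
    Exhaustive search scales exponentially with the number of features, which
    is why approximation algorithms are proposed for larger models
    (Guo et al 2011; White et al 2009)."""
    optional = [f for f in memory_kb if f != "Core"]
    result = []
    for r in range(len(optional) + 1):
        for combo in combinations(optional, r):
            variant = {"Core", *combo}
            if sum(memory_kb[f] for f in variant) <= limit_kb:
                result.append(variant)
    return result

for limit in hardware_limits_kb:
    print(limit, [sorted(v) for v in feasible_variants(limit)])
# With the 1024 kB component, DiagnosticsAccess never fits, mirroring an
# explicit "excludes" constraint; with 2048 kB, all four variants fit.
```

The derived infeasibility of DiagnosticsAccess under the smaller memory component shows how an explicit cross-cutting constraint can be externalized knowledge about resource consumption rather than an arbitrary modeling decision.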

2.4 Empirical Evidence on Quality Attribute Variability in Industrial Product Lines

In the following, we review the empirical evidence on quality attribute variability in industrial product lines. In particular, we are interested in the evidence on the existence, characteristics, and practices of quality attribute variability in industrial product lines. Typically, such evidence has been produced following the observational research path (Stol and Fitzgerald 2013), which may range from informal experience reports to rigorous case studies. Nevertheless, we are also interested in those methods or prescriptive theories that have been tested in an industrial context, for example, with experiments. The empirical evidence on quality attribute variability in industrial product lines is scarce (Myllärniemi et al 2012). There are case studies and reported industrial experience on product lines and variability, also in the telecommunication domain (Jaring and Bosch 2002), but the focus of the reported empirical evidence has not been on quality attribute variability. Due to the lack of studies, we review the empirical evidence on quality attribute variability in general, instead of focusing only on performance variability. Moreover, the rigor of the empirical evidence differs. As discussed by Runeson and Höst (2009), the term ”case study” is an overloaded term in software engineering research: the presented case studies range from ambitious and organized case studies to small toy examples, cf. (Yin 1994; Dubé and Paré 2003; Runeson and Höst 2009). In fact, it is very common in software engineering that a ”case study” is merely an example used to provide a proof of concept for a method or a construct, similarly as described by Hevner et al (2004). According to Yin (1994), a case study studies a phenomenon within its real-life context. For studies in software engineering, the context is typically a company.
Thus, a case study in software engineering should describe the case company and how the phenomenon of interest exhibits itself there. Since the phenomenon and its context are not always distinguishable from each other, data collection and data analysis strategies become an integral part of case studies (Yin 1994). Thus, a case study should explain the data collection and analysis procedures and establish a chain of evidence from the data to the results.

There are a few studies (Kishi et al 2002; Sinnema et al 2006; Niemel¨a et al 2004; Myll¨arniemi et al 2006b; Hallsteinsen et al 2006a) that can be characterized as case studies and that describe a product line company with quality attribute variability, that is, that state directly that quality attribute variability occurs within a real-life industrial context. In a similar fashion, there are studies that describe a specific open-source software system that is explicitly mentioned to have quality attribute variability (Siegmund et al 2012b; Sincero et al 2010). By contrast, there is evidence on industrial contexts in which variability in quality attributes was not needed (Galster and Avgeriou 2012).

However, in most of these studies, quality attribute variability is only a minor characteristic, and the main contribution lies elsewhere. Some studies merely mention quality attribute variability briefly (Niemel¨a et al 2004). In many studies, the varying quality attribute is mentioned to include performance (Sinnema et al 2006; Myll¨arniemi et al 2006b; Niemel¨a et al 2004; Siegmund et al 2012b; Sincero et al 2010). However, the motivation and the realization of quality attribute variability are not discussed in these studies, or are discussed only very briefly. Regarding the motivation, the refresh rate and memory consumption of a 3D mobile game are varied to maximize the game's attractiveness and playability on all devices with varying capabilities (Myll¨arniemi et al 2006b).
Regarding the realization, the variability of adaptability, availability, suitability and interoperability causes architectural variation in a product line, which is realized with varying patterns (Hallsteinsen et al 2006a). Also, performance and memory consumption variability may be the result of selecting various installation options in infrastructure-oriented software product lines (Siegmund et al 2012b; Sincero et al 2010). However, none of these studies, with the exception of (Myll¨arniemi et al 2006b; Galster and Avgeriou 2012), provides an adequately explicit description of the method by which the case study data was collected or analyzed. Therefore, it is hard to assess the level of rigor and the resulting construct validity in terms of correspondence to the real phenomenon.

There are also studies (Lee and Kang 2010; Jarzabek et al 2006; Tun et al 2009; Kuusela and Savolainen 2000) that propose a method or a construct, utilize an example of varying quality attributes, and mention or imply an industrial product line behind the example. As with case studies, the rigor of the empirical evidence differs (Shaw 2002; Fettke et al 2010). Consequently, it is not clear whether these studies are rigorous empirical studies about a contemporary phenomenon, or merely statements backed up by exemplary experience (Fettke et al 2010), or slices of real life or toy examples (Shaw 2002) influenced by industrial software product lines.

Finally, in addition to empirical evidence about quality attribute variability in industrial contexts, there are also other kinds of empirical studies: for example, a student experiment reported, somewhat surprisingly, that students were able to identify varying quality attributes better than varying functionality (Galster and Avgeriou 2011).

To summarize, there is little empirical research, and few case studies in particular, that describe quality attribute variability in its real-life context and allow assessment of the data collection and analysis procedures.

3 Research Method

3.1 Research Design

This research was carried out following the case study methodology (Yin 1994; Patton 1990; Runeson and H¨ost 2009). The case study was augmented with a systematic literature review (Wohlin and Prikladniki 2013; Wohlin 2014) (see Fig. 1). Additionally, the analysis and theory building utilized some guidelines from the grounded theory methodology (Urquhart et al 2010).

Qualitative methods in general permit researchers to study selected issues in depth and detail (Patton 1990). A case study is a suitable approach for situations in which the phenomenon of interest is complex and non-manipulable and the understanding of the topic is still lacking (Yin 1994): quality attribute variability in industrial software product lines fits all of these characteristics. Further, since the research questions are about "why" and "how", a case study seemed appropriate (Yin 1994).

According to Yin (1994), the main component in case study research is a theory, both as the starting point and as the end result. In empirical software engineering, two distinct research designs can be identified: theory building (the observational path) and theory testing (the hypothetical path) (Stol and Fitzgerald 2013). For example, a theoretical model of open-source software communities was built based on a case study on Apache; later on, in a follow-up study, the model was tested with Mozilla to close the theorizing cycle (Stol and Fitzgerald 2013). This research was designed to follow the observational path (Stol and Fitzgerald 2013), that is, to build theories from the empirical data and observations, and theory testing was

[Figure 1: a diagram of the three theorizing levels. At the data level, the case data (internal documents; informal discussions and notes; the second author's first-hand experience and notes; publicly available information about the domain; comments and answers from the chief architects; validated through two reviews of the case account by the chief architects) was utilized to build the case account of capacity variability in a base station product line, and a systematic literature review was utilized to identify 139 primary studies of quality attribute variability in software product lines. At the account level, the case account was utilized to build, and the examples and accounts of performance variability from the primary studies were utilized to augment and refine, the proposed theoretical models of performance variability in software product lines.]

Fig. 1 The theorizing levels in this study: from the data to the accounts, and from the accounts to the more general proposed theoretical models. Also the scope and focus of each level are indicated.

left as future work. To follow the observational path, the empirical data and observations were obtained through a case study with the unit of analysis (Yin 1994) covering capacity variability in a mobile network base station. Thereafter, the theory building expanded the unit of analysis to cover performance variability in software product lines in general (Fig. 1). The results were formulated into three descriptive and explanatory theoretical models, each consisting of theory constructs, relations and scope (Gregor 2006). Consequently, the theorizing consisted of three levels: the data level, the account level, and the theory level (Fig. 1); these levels also expanded the case-specific scope into a more generalized one.

3.2 Case Selection

The case company was Nokia, formerly Nokia Solutions and Networks, and the selected case was a base station in the 3G radio access network. The case selection utilized snowball and convenience sampling methods (Patton 1990). Snowball sampling selects cases by asking well-informed people about suitable information-rich cases: the first author asked the second author about his knowledge of cases that exhibit quality attribute variability in a product line. By contrast, convenience sampling selects easily accessible cases for the study.

Initially, three product lines from the case company portfolio were identified as potential cases, as they exhibited quality attribute variability. However, only one case was included in the final version of this study, mainly because its results could be validated and published without confidentiality issues. Additionally, the remaining case had rich information that was available and accessible, and it was possible to collect additional data from the people who had been developing the case product line.

The case study was performed post mortem: the studied product line was designed and even partially developed, but discontinued before the production stage; the reason is discussed in Section 4. This of course creates threats to the study validity, which are discussed in Section 6.1. However, the post mortem nature of this study made it possible to access confidential project documentation, making this single case an information-rich special case. Moreover, although this particular base station was discontinued, the case company and the


Table 1 The stages of the overall research process. The data collection, analysis and validation activities were iterated to build the theoretical models.

1. Data collection: Collecting internal documents and publicly available information; conducting informal discussions; recording first-hand experience.
2. Analysis: Light-weight coding; identification of the main concepts; formulation of the detailed research questions and initial findings.
3. Validation: A review of the findings by the chief architects.
4. Data collection: Comments and answers to a list of open questions from the chief architects.
5. Analysis: Identification of new concepts and findings; revision of the research questions.
6. Validation: A review of the findings by the chief architects.
7. Analysis: Constructing the first case account.
8. Reporting: Myll¨arniemi et al (2013).
9. Data collection: 139 primary studies of quality attribute variability in software product lines, selected through a systematic literature review (see Table 2).
10. Analysis: Identification of accounts and passages on performance variability from the primary studies; identification of example instantiations for existing categories; identification of new concepts and categories; theory building.

people who participated in the development continued to work with similar base stations. At the same time, this single case was also a typical case from the unit of analysis perspective, because similar findings seem to apply to other base stations in the case company portfolio.

3.3 Data Collection

The data collection took place iteratively, as illustrated in Table 1, and it consisted of collecting the case data and finding and selecting the primary studies (the lowest level in Fig. 1).

3.3.1 Collecting case data

The main data source for the case account consisted of various documents, including a product line software architecture document, a detailed subsystem architecture document, a product line architecture evaluation document, and a product specification document. In total, roughly 300 pages of technical documentation were included in the analysis. All these documents were originally intended for internal use within the company. Further understanding of the application domain was acquired from various sources, especially from an edited book (Holma and Toskala 2000) by employees of the case study company. In addition, open or unclear issues within the documents were discussed in informal meetings between the authors.

Secondly, the second author had participated in the architectural evaluation of the case product line, which had resulted in notes and observational first-hand experience. Hence, the second author acted as one data source. Consequently, data was also collected via informal discussions that took place between the first and the second author as well as with one employee at the case company who was familiar with the case. Written notes about these discussions were stored. These informal discussions gave background information as well as clarified unclear issues in the documents. The discussions also explicated implicit rationales and other contextual facts not covered in the technical documents.

Further, for the validation of the findings, which also provided an opportunity to collect additional data, the results were reviewed twice by a group of the chief architects of the case


product line. These chief architects were involved with the case project from the beginning to the end and thus had first-hand experience. In the review process, the first author provided a written list of questions to clarify open issues. Answers and comments were collected and refined via e-mails and phone discussions.

Finally, triangulation was used in several forms. In particular, we compared the experiences of the second author, the comments from the chief architects, and the original documents from the time the product line was designed with each other. Thus, there was triangulation between the respondents and triangulation between the respondents and the documents: this aimed at preventing the subjective interpretations of individuals from biasing the results. Additionally, investigator triangulation was applied: the first author's analysis was reviewed by the other authors.

When collecting data, a case study database (Yin 1994) was established. This included all documents, notes from the informal meetings, e-mails, and other observations. All data was produced in textual form.

3.3.2 Searching and selecting primary studies

Existing literature constituted another source of data (the lowest level in Fig. 1). For this purpose, we followed the systematic review guidelines (Wohlin and Prikladniki 2013; Wohlin 2014) that utilize snowballing as the primary search strategy. The data collection in our review protocol had a wider scope than the case study: primary studies were searched and selected to address quality attribute variability in software product lines in general (Fig. 1). During the analysis phase, the primary studies were analyzed only from the performance variability point of view.

In the following, we briefly outline the search and selection protocol. As discussed by Myll¨arniemi et al (2012), the protocol did not utilize any search strings or database searches. Thus, the search protocol was not dependent on any specific terms used to characterize quality attribute variability; such terms tend to be highly heterogeneous. Further, the search protocol did not exclude any primary studies based on metadata only: all potential publications were retrieved and their full content was read before the decision to exclude was made. Thus, studies that did not mention quality attribute variability in the title or abstract but nevertheless contributed to it were not excluded; this increased the completeness of the search.

The scope of the search strategy was set to cover purposeful quality attribute variability in software product lines. For this purpose, the following inclusion and exclusion criteria were utilized.

Inclusion criterion. The primary study says explicitly (or uses an example or case) that there is purposeful variability of quality attributes in a software product line, or that different products in a software product line have purposefully different quality attributes. Here, purposeful quality attribute variability refers to an intentional, managed ability to choose or derive products with different quality attributes.

Exclusion criterion 1. The primary study does not explicitly mention that the quality attribute variability takes place in a software product line or a software product family. For example, component-based software, service-oriented software and self-adaptive architectures without any link to the software product line paradigm are excluded.

Exclusion criterion 2. Quality attribute variability is not part of the study contribution; for example, it is mentioned only in the related work, discussion, or future work.

Exclusion criterion 3. The study is not a peer-reviewed publication: for example, books, book chapters, websites and technical reports are excluded. Likewise, the contribution is not assessable from the study: for example, studies not written in English are excluded, as well as tutorial and panel summaries.

Table 2 illustrates the search and selection iterations in the research protocol; the order of iterations followed the guidelines by Wohlin (2014). Firstly, the initial start set (Wohlin 2014) was identified and selected by reading through all full publications in the Software Product Line Conferences up until 2010; further details about this process are reported by Myll¨arniemi et al (2012). After applying the revised set of inclusion and exclusion criteria, 26 primary studies were selected for snowballing.

Table 2 The backward and forward snowballing iterations taken to select the 139 primary studies.

    Search action          Start set     Candidates for selection  Selected as new
    Manual reading         -             221                       26
      (Manual reading: 26 primary studies selected)
    Backward snowballing   26            92                        28
    Backward snowballing   28            74                        7
    Backward snowballing   7             17                        1
    Backward snowballing   1             -                         -
      (Backward iterations: 36 primary studies selected as new)
    Forward snowballing    62 (=26+36)   342                       54
    Forward snowballing    54            69                        9
    Forward snowballing    9             1                         -
      (Forward iterations: 63 primary studies selected as new)
    Backward snowballing   63            155                       13
    Backward snowballing   13            30                        1
    Backward snowballing   1             -                         -
      (Backward iterations: 14 primary studies selected as new)
    Forward snowballing    14            52                        1
    Forward snowballing    1             -                         -
      (Forward iterations: 1 primary study selected as new)
    Backward snowballing   1             -                         -
    In total: 140 primary studies selected; 1 primary study excluded in analysis

For backward snowballing, the primary studies in the start set were processed as follows. The reference list of each primary study was pruned based on the recommendations by Wohlin (2014): first by looking at the publication type, and thereafter by looking at the context of the actual reference in the primary study. If an item in the reference list passed both criteria, it was deemed a candidate for selection. After all reference lists in the start set had been examined, the candidates for selection were recorded, duplicates were removed, and new, previously unprocessed studies were retrieved. The inclusion and exclusion criteria were then applied, based on the full content, to all retrieved studies.

For forward snowballing, the primary studies in the start set were processed as follows. Two citation databases were used: ISI Web of Science and Scopus. The forward citations covered studies published up until February 2013. For each primary study in the start set, the studies that cited it in either database were recorded, duplicates were removed, and new, previously unprocessed studies were retrieved. The inclusion and exclusion criteria were then applied, based on the full content, to all retrieved studies.

As a result of the iterations in Table 2, 140 primary studies were selected; however, during the analysis phase, one primary study was excluded based on its contribution.
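To make the iteration structure concrete, the alternating backward and forward snowballing rounds can be sketched as a simple worklist procedure. This is an illustrative sketch only: the function names (`expand_backward`, `expand_forward`, `include`) are invented placeholders, and the actual review applied the pruning recommendations and full-text reading described above rather than an automated loop.

```python
def snowball_rounds(start_set, expand_backward, expand_forward, include):
    """Alternate backward (reference lists) and forward (citations)
    snowballing rounds until a round yields no new primary studies.

    expand_backward / expand_forward return the candidate studies for
    one study; include() applies the inclusion and exclusion criteria
    to a candidate's full content. All names are illustrative.
    """
    selected = set(start_set)      # primary studies selected so far
    new_in_round = set(start_set)  # studies feeding the next round
    expand = expand_backward       # backward snowballing comes first
    while new_in_round:
        frontier, new_in_round = new_in_round, set()
        # Repeat in the same direction until no new studies appear.
        while frontier:
            candidates = {c for study in frontier for c in expand(study)}
            # Apply the selection criteria to previously unseen candidates.
            frontier = {c for c in candidates - selected if include(c)}
            selected |= frontier
            new_in_round |= frontier
        # Switch direction; the next round starts from the new studies.
        expand = expand_forward if expand is expand_backward else expand_backward
    return selected
```

Note one simplification: in the actual protocol, the first forward round started from all studies selected so far (62 = 26 + 36), whereas this sketch always starts a round from the studies newly selected in the previous round.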


3.4 Data Analysis

The data analysis was iterated with the data collection and validation (Table 1). We adopted some of the analysis principles from grounded theory (Urquhart et al 2010; Strauss and Corbin 1998). The analysis included understanding and uncovering the phenomena of the case; constructing a descriptive and explanatory account; making conceptual generalizations; and constructing descriptions and explanations of the phenomena in more general terms (Lee and Baskerville 2003; Gregor 2006).

In the first analysis stage, the first author analyzed all the data from the case, which was in textual form, using light-weight coding of text passages. Through coding, initial concepts were identified and compared between different sources of data, thus following the constant comparison guideline (Urquhart et al 2010). For example, concepts that were identified in the informal discussions were also analyzed from the documents and the publicly available information. The low-level concepts were then generalized to understand the phenomenon of the case. Further analysis took place in the informal discussions where the case and the generalizations were discussed. Notes of the analysis were kept and recorded in the case study database. However, to minimize researcher bias stemming from close involvement with the case, the first author acted as the primary investigator in the analysis.

In the second analysis stage, new case data that emerged from the validation session was added to the analysis, thus serving as an additional slice of data (Urquhart et al 2010). Since this data collection was partly analytically driven, that is, driven by the open questions raised by the first analysis, it can be considered an instance of theoretical sampling (Urquhart et al 2010). The resulting analysis identified new low-level concepts, which were in turn compared and analyzed against other data, and generalized. During the analysis, issues with the emerging concepts were resolved through e-mail exchanges with the case chief architect, again recording all additional data in the case study database.

In the third analysis stage, the final validation comments were taken into account, and the case study account and findings were reported. These results have been reported in our earlier work (Myll¨arniemi et al 2013).

In the last analysis stage, the existing literature served as additional slices of data. For this purpose, the 139 primary studies of quality attribute variability were analyzed, and accounts and examples covering performance variability were coded in a light-weight manner in the publications. When building the theory, the primary studies were revisited to find instances of the already identified categories, thus serving as a way to saturate the categories (Urquhart et al 2010). The primary studies also served as a way to identify a few new categories as well as to state the theory boundaries (Gregor 2006). When analyzing the primary studies, there was an explicit aim to ensure neutrality and objectivity: in particular, we analyzed accounts from the literature that were both similar and contrasting to the findings of our case study.

In the resulting models, most of the explanations and characterizations were originally identified from the case account; however, additional examples were drawn from the literature to aim at theoretical saturation. The literature also served as a way to identify the scope of the explanations and characterizations, that is, the boundaries of the theory (Gregor 2006). Further, a few categories were identified solely from the primary studies through comparison with the case account. As a concrete example of the analysis interplay between the case account and the primary studies, the case account was first used to identify the downgrading realization (Section 5.2). Thereafter, this characterization was compared with the approaches often presented in the literature, which led to the identification and formulation of the impact management realization. Finally, the trade-off realization was identified from


[Figure 2: a network diagram. UEs (User Equipment, e.g. mobile phones) connect over the Uu interface to IP-BTS base stations; the base stations connect over IuB to RNCs (Radio Network Controllers), which interconnect over IuR; together they form the All-IP 3G RAN (Radio Access Network), which connects to the CN (Core Network) over the Iu-CS and Iu-PS interfaces. The scope of the case study product line covers the IP-BTS base station.]

Fig. 2 The products from the case product line, IP-BTS, were base stations (Node Bs) in All-IP 3G RAN.

the literature, and the taxonomy (Fig. 7) was identified based on the concept of a design tactic (Bass et al 2003).

As a result, we constructed three theoretical models for describing and explaining (Gregor 2006) the phenomenon. To ensure analytical generalization (Yin 1994), the constructs and relations in the models were described in domain-independent terms, aiming at raising the degree of conceptualization and widening the scope (Urquhart et al 2010). To mark the theory boundaries (Gregor 2006), that is, the settings in which the models can be applied, we identified the scope either analytically or by relying on existing literature. Thus, building theoretical models with domain-independent constructs and interpretations and an explicit scope aimed at generalizing the results beyond the domain of the case account.

4 Results as the Case Account

In the following, we give an overview of the case product line and describe the results for each research question; these are summarized in Table 3. The case account focuses on capacity variability in a base station product line, which means that the account is kept specific to the case study domain. Thereafter, Section 5 describes the results in more general terms by proposing theoretical models of performance variability in software product lines.


4.1 Overview of the Case Product Line

The case study company is Nokia, formerly Nokia Solutions and Networks. Nokia is one of the largest vendors in the domain of mobile telecommunication network products. This case study covers the product line named IP-BTS, which is a configurable base station in 3G (3rd generation) radio access networks. Products from the IP-BTS product line were base stations (Node Bs) that utilized 3G radio access technologies, such as WCDMA (Wideband Code Division Multiple Access). The domain and scope of the IP-BTS product line are illustrated in Fig. 2.

The IP-BTS base station took care of the connectivity between mobile phones and the rest of the core telephony network infrastructure. For this purpose, the IP-BTS had responsibilities in three different planes: the user plane carried speech and packet data; the control plane controlled the data, connections, cells and channels; and the management plane took care of network management and base station administration. A base station typically contains a cabinet, an antenna mast and the actual antenna. Additionally, a radio network controller (RNC) was responsible for controlling several base stations, for example handling the soft handover to another base station when a mobile phone moves out of the range of one base station. Together, the IP-BTS base stations and the RNCs formed a Radio Access Network (RAN), which was responsible for handling traffic and signaling between a mobile phone and the Core Network. The IP-BTS base station was designed to work in All-IP RAN, that is, in a radio access network based on IP (Internet Protocol).

From the design point of view, the IP-BTS base station was a complex telecommunication network element operating in a resource-constrained, embedded environment.
To characterize the size of the software: the design was divided into fewer than 20 system-level components, out of which some were described in more detail in separate architecture documents; overall, the software would have consisted of millions of lines of code. The customers of IP-BTS were mobile phone operators who invested in the 3G infrastructure.

The motivation that initiated the design of IP-BTS was the anticipated introduction of All-IP radio access networks (Bu et al 2006). IP-BTS was designed to support several radio access standards as well as both IP-based and traditional, point-to-point RAN: the idea was to be more flexible and thus replace the first-generation Node Bs that supported only one specific RAN and radio access technology. However, the IP-BTS project was discontinued before reaching the production stage: after the operators had made large investments in the first 3G network elements, the idea of yet another round of investments did not take off, despite the argued benefits of IP-based RAN (Bu et al 2006). When IP-BTS was discontinued, it was at the prototype stage; this took place approximately ten years ago. In total, the lifespan of the product line was approximately two years from the initial planning to the termination of the project. Despite the discontinuation, IP-BTS was designed to a detailed level, along with implemented prototypes. Since the architecture was designed and evaluated, technical documentation existed. Additionally, a similar kind of flexibility in one base station has later been utilized successfully in the case company.

In the following, we describe the main variability of the IP-BTS product line, focusing only on the aspects relevant for this case study (Fig. 3). One major source of variability in the IP-BTS base station was the need to support several radio access standards and both traditional and IP-based RAN; this is represented as the feature Radio access technology in Fig. 3.
This choice also affected other functionality in the base station. For example, the user plane functionality, that is, carrying the data, was specific to the radio access protocol. Similarly, managing the base station cells, channels and resources at the control plane was also partly dependent on the radio access

[Figure 3: a feature model of the IP-BTS base station. The root feature IP-BTS base station has three main branches. (1) HW configuration, bound at installation time: user plane processor units (with attributes number = {n, ..., m}, processor type and memory) and the number of channel elements UL/DL. (2) Radio access technology, bound at build time: the radio access standard (WCDMA, EDGE, HSDPA, Multimode, ...) and the RAN type (point-to-point RAN or IP-RAN). (3) Licenses, bound at runtime: Basic UL/DL or Licensed1..Licensedmax UL/DL, with the constraint {value of Licensedmax = maximum of HW resources} bound at start-up; operators use UL/DL (uplink/downlink) channel elements for calculating the capacity, where 1 voice channel (12.2 kbit/s) = 1 channel element. The software features comprise resource management, HW management, channel management and monitoring (with WCDMA, EDGE and HSDPA channel management bound at build time and constrained by the radio access standard, e.g. {WCDMA => WCDMA channel management} and {EDGE => EDGE channel management}), and channel element downgrading bound at runtime (CEbasic, CE1, ..., CEmax, constrained by the license: {Basic UL/DL => CEbasic}, {Licensed1 UL/DL => CE1}, ..., {Licensedmax UL/DL => CEmax}). The legend indicates optional, mandatory, alternative and or features, binding-time annotations, attributes and constraints.]

Fig. 3 An overview of the IP-BTS base station variability. The diagram is constructed merely to illustrate the case study results: only the features related to the results are shown; and some exact values have been obfuscated.
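The cross-tree constraints in Fig. 3, such as a radio access standard requiring its channel management component and each license level implying the corresponding channel element downgrading setting, can be illustrated as a small configuration check. The feature names below are transcribed from the (partly obfuscated) diagram, so this is an illustrative sketch rather than the actual product line model.

```python
# Illustrative encoding of a subset of the Fig. 3 requires-constraints:
# selecting the feature on the left requires the feature on the right.
REQUIRES = {
    "WCDMA": "WCDMA channel management",    # {WCDMA => WCDMA channel management}
    "EDGE": "EDGE channel management",      # {EDGE => EDGE channel management}
    "HSDPA": "HSDPA channel management",
    "Basic UL/DL": "CE_basic",              # {Basic UL/DL => CEbasic}
    "Licensed1 UL/DL": "CE_1",              # {Licensed1 UL/DL => CE1}
    "Licensedmax UL/DL": "CE_max",          # {Licensedmax UL/DL => CEmax}
}

def violated(selection):
    """Return the requires-constraints violated by a feature selection."""
    return [(feature, required) for feature, required in REQUIRES.items()
            if feature in selection and required not in selection]
```

For example, a selection containing WCDMA together with its channel management component and a matching license/downgrading pair yields no violations, whereas selecting EDGE without EDGE channel management is flagged.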

technology. The radio access variability was bound at build-time, and it was realized mostly through the composition of software components. To lessen the cross-cutting effect of the radio access variability, the software design separated the common, reusable parts of the base station software from the radio access specific parts.

In addition, the IP-BTS product line had hardware variability that was bound when the hardware was installed and the base station was started (see feature HW configuration in Fig. 3). The IP-BTS base station was designed to support a varying set of hardware configurations and even new kinds of hardware components. In particular, there were dedicated hardware processor units for user plane processing to handle high-speed data streams; this was because the user plane processing was tightly constrained by requirements on capacity, throughput, latency and jitter.

The drawback of hardware variability was the rebinding effort: hardware changes required physical on-site maintenance and a break in service. Typically, the on-site hardware upgrades of the base stations involve the laborious installation of new hardware units, start-up, testing, integration into the network, and bringing into use. Further, the number of base stations is typically high (hundreds or even thousands), they may be geographically very scattered, and their accessibility is sometimes poor. Because of this, hardware rebinding happens quite seldom: on average, the lifetime of the hardware in the case company's base stations is about eight years, although certain hot-spot areas may require more frequent hardware upgrades.

Performance Variability in Software Product Lines: Proposing Theories from a Case Study

19

Table 3 Summary of the case account.

RQ1 Which characteristic of performance is decided to be varying? Capacity, that is, the maximum number of phone calls one base station can serve at a time. Characterized for a certain base station configuration as the available uplink and downlink channel elements. Capacity was a key driver for the operators in making the investments in the 3G networks.

RQ2 Why is performance decided to be varied? Initial differences in the capacity needs due to different usage estimations. The importance of capacity to the operators, which enabled price differentiation. Operators were able to understand the characteristics related to capacity and network planning, and these characteristics were guaranteed by the vendor. The evolution of the usage and hence capacity needs for long-lived products. Starting with smaller investments and upgrading later supported price differentiation and brought flexibility to the operators.

RQ3 What is the strategy for realizing performance variability within the product line architecture? Both software and hardware were used to realize capacity variability. Software realization: downgrading by restricting the channel elements visible to the software components; mostly utilized at runtime variability binding. Hardware realization: different installed hardware in products, with software scalability achieved through resource abstraction and layers; mostly utilized when the base station was taken into use.

RQ4 Why is the realization strategy chosen? Motivation for the downgrading software realization: quick and efficient runtime rebinding, compared with the cost and difficulty of on-site maintenance for hardware upgrades; this better supported the operators in starting with smaller investments and upgrading capacity as needs evolved. The downgrading mechanism was architecturally focused and mostly independent of other software variability in the base station; this made testing easier. The runtime capacity variability was not introduced to resolve design trade-offs but to enable upgrades and price differentiation. Motivation for the hardware realization: a trade-off between capacity and production costs (an expensive Bill of Materials (BOM) for efficient hardware), and known practices for implementing the software scaling.

Finally, it was possible to vary the base station functionality at runtime through licenses (represented as feature Licenses in Fig. 3). This eased the rebinding task compared with hardware variability: the operators could remotely purchase and enable new functionality or capabilities in the base stations. When an operator wished to upgrade a base station through licenses, she entered her new license key into the network management system, which in turn connected to the base station; the new capabilities were then immediately available. This is called license key driven configuration (Linden et al 2003) and it is a known practice in the telecommunication domain (Jaring and Bosch 2002). Licenses and other runtime variability were mostly used to vary the management plane as well as certain aspects of the control plane, such as channel management. To realize the licenses and other runtime variability, the design utilized parameterization and default parameter values for start-up.
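As a minimal sketch of how such license key driven configuration can work, the following Python fragment models an operator entering a license key into a network management system, which pushes the granted options to a running base station. All class, key and parameter names are hypothetical illustrations, not the actual IP-BTS design.

```python
class BaseStation:
    """Holds runtime parameters; default values are used at start-up."""

    DEFAULTS = {"channel_elements_ul": 32, "channel_elements_dl": 32}

    def __init__(self):
        self.params = dict(self.DEFAULTS)

    def apply_license(self, granted):
        # Rebinding happens at runtime: the new capability becomes
        # available immediately, without a restart or a site visit.
        self.params.update(granted)


class NetworkManagementSystem:
    """Validates an operator's license key and pushes it to the station."""

    # A toy license database standing in for the vendor's license service.
    LICENSES = {
        "KEY-128": {"channel_elements_ul": 128, "channel_elements_dl": 128},
    }

    def enter_license_key(self, key, station):
        granted = self.LICENSES.get(key)
        if granted is None:
            raise ValueError("unknown license key")
        station.apply_license(granted)


nms = NetworkManagementSystem()
bts = BaseStation()
nms.enter_license_key("KEY-128", bts)
print(bts.params["channel_elements_ul"])  # prints 128
```

The essential property illustrated here is that the rebinding is a remote parameter update rather than a physical change, which is what made license-based variability so much cheaper to rebind than hardware variability.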

4.2 Varying Performance Characteristic (RQ1)

The varying performance characteristic in the case study was the base station capacity (Table 3). In the IP-BTS base station, capacity referred to the maximum number of voice calls a base station can serve. Capacity could also have been defined as the amount of packet data one base station can route in a given time, but at the time of IP-BTS design, 3G packet data transfer was not that common. In general, capacity variability was not specific to the case base station, but has been a common phenomenon in the case study domain. Capacity and coverage were and still are two key drivers when the operators are making investments in mobile telecommunication networks. Therefore, the operators planned capacity and coverage in several stages and at different levels of detail. For capacity, the network planning involved estimating the traffic density and subscriber growth forecasts; the planning result gave the needed number of base stations along with the required station configurations and dimensioning parameters, such as base station interference and power. Due to the complex calculations involved, the operators did the network and capacity planning with dedicated tools. After the hardware configurations and dimensioning parameters were known, the capacity of a base station was characterized as the number of channel elements in both uplink and downlink directions (see feature Number of channel elements UL/DL in Fig. 3). A channel element was an abstraction of the resources needed to provide capacity for one voice channel. One voice channel could carry dozens of voice calls, and thus a channel element directly related to the maximum number of phone calls that could be served. The number of channel elements that a base station configuration supported was calculated at the base station start-up. Thereafter, the operators could configure the base station capacity by purchasing a new license to increase the number of channel elements.
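The dimensioning calculations inside such planning tools are outside the scope of the case account, but as a generic illustration of the kind of computation involved, the standard Erlang B formula from teletraffic theory relates the offered voice traffic, the number of channels, and the call blocking probability. The sketch below is textbook teletraffic theory, not the case company's planning tool.

```python
def erlang_b(offered_traffic, channels):
    """Call blocking probability for the given offered traffic (in erlangs)
    on the given number of channels, via the standard Erlang B recursion:
    B(E, m) = E*B(E, m-1) / (m + E*B(E, m-1)), with B(E, 0) = 1."""
    b = 1.0
    for m in range(1, channels + 1):
        b = (offered_traffic * b) / (m + offered_traffic * b)
    return b


def channels_needed(offered_traffic, max_blocking=0.02):
    """Smallest number of channels keeping blocking at or below the target."""
    n = 1
    while erlang_b(offered_traffic, n) > max_blocking:
        n += 1
    return n


# Example: channels needed for 10 erlangs of offered voice traffic
# at a 2% blocking target.
print(channels_needed(10.0))
```

With forecasts of traffic density per cell, a planning tool can iterate calculations of this kind over the whole network to arrive at the number of base stations and their dimensioning parameters.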

4.3 Motivation to Vary Performance (RQ2)

The following explanations for the decision of varying capacity in the IP-BTS were identified (Table 3). First and foremost, the capacity needs for the base stations varied: different base stations had to cover different usage, that is, serve different numbers of phone calls. The usage varied both between the operators and between an operator's individual base stations. Second, the capacity variants could be differentiated in pricing: a base station with higher capacity could be more expensive. Capacity was a key driver for the operators when making investments in the networks. The operators could estimate the business value and the return on investment (ROI) of various capacity levels. Because a higher capacity could be justified financially, the customer was willing to pay more for better capacity, and price differentiation was possible. The operators were well versed in the matters related to base station capacity; this made it easier for the case company to distinguish the capacity variants, for example, through the number of channel elements available. Additionally, the capacity-related characteristics of the base stations were guaranteed. In general, a base station must deliver the quality that the operator pays for; the operator typically tests the products herself, possibly subjecting several competing products to a test bench before making the investment decision. Since the operators understood what the base station capacity meant and trusted that the promised capacity was delivered, it was easier to conduct price differentiation. Finally, the capacity needs evolved over time. The products in the telecommunication domain are very long-lived and have to be able to cope with evolving capacity needs after installation. Although the usage of 3G networks was modest in the beginning, the traffic exploded with the deployment of new 3G-enabled devices.
If the usage, that is, the number of phone calls made, exceeds the available capacity of a base station, users will experience unacceptable congestion. Therefore, the operators wanted to adjust the base stations to follow the evolving needs: as the usage of the networks grew, the operators could upgrade the base station capacity. This mode of upgrading also supported price differentiation: the ability to start with smaller initial investments and to easily purchase more capacity later better served the needs of the operators. This enabled the case company both to differentiate between its products and to differentiate from its competitors.

[Figure: base station variants shown as bars whose height indicates the number of voice channels available in each variant, labelled Basic UL/DL, Basic' UL/DL, Licensed'1 UL/DL, Licensedn UL/DL and Licensedmax UL/DL. Software realization for capacity variability: downgrading available channel elements, abstraction of resources; utilized particularly at runtime. Hardware realization for capacity variability: different hardware configurations, scalability of software by abstraction of resources; utilized particularly when taking the base station into use.]

Fig. 4 Both software and hardware were used to realize capacity variability in IP-BTS base stations.

Table 4 Software architecture elements that were responsible for the software realization of capacity variability.

BTS O&M: System component in the management plane that is responsible for capacity variability through runtime licenses as well as for managing the hardware configuration.

Option Manager, License Key Manager: Two logical components in the management plane that support setting runtime variability with licenses and corresponding options, and provide a database and corresponding operations for application-level parameters.

Resource Manager: A component in the control plane that is responsible for monitoring and restricting dedicated capacity resources, e.g., channel elements, in the user plane; a channel element corresponds to the capacity of one voice channel.

4.4 Performance Variability Realization (RQ3)

The IP-BTS base station utilized both software and hardware means to vary capacity (Table 3); this is illustrated in Fig. 4. Traditionally, different levels of capacity in the telecommunication domain have been achieved by having different hardware in the product variants. With the hardware realization, a base station with a more efficient hardware configuration yielded better capacity. To implement capacity variability through hardware, the product line architecture was designed to support varying hardware configurations, including different numbers of processing units, different amounts of memory, and different processor types. Since there was dedicated hardware for the user plane processing, that is, for handling voice call traffic, the capacity could be upgraded by adding more or better hardware units to the user plane. This is illustrated in Fig. 3 as the varying feature User plane processor units. When the base station was started, the maximum number of channel elements available was determined from the installed hardware resources; this maximum amount is represented as feature Licensedmax UL/DL in Fig. 3.


To enable the software to accommodate the varying hardware, that is, to make the software scalable, the software architecture of the IP-BTS utilized a layered architecture to limit the hardware visibility, and abstraction of the hardware and the available resources with property files. Only the system-level software component BTS O&M (Table 4) was aware of the actual hardware configuration; the rest of the system-level software components accessed the hardware capability through virtual devices and drivers. This is illustrated in Fig. 3 as features Resource management and HW management. The abstraction and management of actual hardware resources is an established practice in the telecommunication domain. However, the drawback of the hardware realization was that it did not support runtime rebinding. Therefore, software realization was used to vary the base station capacity at runtime (Fig. 4). For this purpose, the operators could buy licenses to enable different numbers of uplink and downlink channels (feature Number of channel elements UL/DL in Fig. 3). To implement capacity variability through software, the software architecture was designed to downgrade the capacity, that is, to downgrade the maximum number of uplink and downlink channel elements achievable with the current hardware configuration. At the time of designing the IP-BTS architecture, the exact downgrading mechanism was not decided, but later the decision was made to restrict the number of channel elements; this design has since been used in other base stations in the case company. The variability imposed by this realization strategy is illustrated in Fig. 3 as feature Channel element downgrading. The feature was implemented by a dedicated component, Resource Manager, in the IP-BTS software architecture (Table 4). Resource Manager monitored and limited the number of resources used by other software components, including the channel elements. Since a set of channel elements corresponded to a certain set of hardware resources, more channel elements could be added by enabling the corresponding, dedicated hardware resources in the user plane. Moreover, this could be done at runtime, in contrast to build-time variability. There were two important architectural aspects in realizing capacity variability with software. First, the software realization of capacity variability was not crosscutting: the aim was to implement the variability solution behind a handful of components (see Table 4). That the variability was not crosscutting of course required that the actual hardware resources were hidden from most software components; however, this was already realized through the software scaling. Second, the realization mechanism, that is, limiting the number of available channel elements, was mostly independent of other software variability in the base station. Also, the impact from other software variability on the capacity was kept to a minimum. Since dedicated channel resources were reserved for handling the user plane traffic, the variability of management and control plane functionality did not directly impact the user plane capacity. There is, however, one exception: as channel elements are part of the radio access standards, reducing the channel elements needed to be implemented differently for different radio access technologies. Thus, the runtime downgrading was dependent on the radio access technology used, that is, on a specific build-time variability choice. This is illustrated in Fig. 3 as feature Channel management and monitoring having separate variants for different radio access technologies.
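The core of the downgrading mechanism can be sketched as follows. The names and the interface are hypothetical simplifications of the Resource Manager responsibility described above: the channel elements visible to the rest of the software are capped at the licensed amount, and never exceed what the installed hardware supports.

```python
class ResourceManager:
    """Caps the channel elements visible to the rest of the software at
    min(hardware maximum, licensed amount) -- the downgrading strategy."""

    def __init__(self, hw_max_ce, licensed_ce):
        # The effective limit can never exceed the installed hardware.
        self.limit = min(hw_max_ce, licensed_ce)
        self.in_use = 0

    def allocate(self, n):
        # Other components must request channel elements here; requests
        # beyond the downgraded limit are rejected.
        if self.in_use + n > self.limit:
            return False
        self.in_use += n
        return True

    def upgrade_license(self, new_licensed_ce, hw_max_ce):
        # Runtime rebinding: a new license raises the cap immediately,
        # but still never beyond the installed hardware.
        self.limit = min(hw_max_ce, new_licensed_ce)


# Hardware supports 256 channel elements, but only 64 are licensed.
rm = ResourceManager(hw_max_ce=256, licensed_ce=64)
assert rm.allocate(64) and not rm.allocate(1)

# The operator buys a bigger license; more allocations succeed at runtime.
rm.upgrade_license(128, hw_max_ce=256)
assert rm.allocate(32)
```

Note how the sketch mirrors the two architectural aspects above: the cap is enforced in one place, and it is independent of what the requesting components do, so other software variability does not interact with it.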

[Figure: capacity (and price) plotted against usage for two base station evolutions. With the software realization, the operator starts with a smaller initial investment and purchases an upgrade to a new capacity variant whenever usage approaches the available capacity, so capacity stays above usage. With the hardware-only realization, the initial investment is larger and upgrades are less frequent, so usage may exceed capacity and congestion occurs.]

Fig. 5 Compared to the hardware realization (gray line), the capacity upgrades with the software realization (black line) could be made more often, and the operators could start with less expensive base stations and upgrade as needed. This ensured both customer satisfaction and better price differentiation.

4.5 Motivation for the Realization (RQ4)

The following explanations were identified for the selected realization strategies in the IP-BTS base station (Table 3). The main motivation for the software realization was to make capacity upgrades easy for the operators; this supported both price differentiation between products and differentiation from the competitors. Due to the remote location of base stations, an on-site hardware upgrade was costly and time-consuming, and it required that compatible hardware components be available even several years after installation. Because of this, the software realization brought the flexibility depicted in Fig. 5: capacity rebinding could be done more often, and operators could start with smaller initial investments and pay more as the needs evolved. Consequently, after the deployment of a base station, capacity rebinding was designed to be done mainly via software (see Fig. 4), and hardware upgrades were done only when the maximum license capacity was not enough. Moreover, the software realization was not introduced to resolve design trade-offs between capacity and other quality attributes; the realization was about enabling upgrades and differentiation of capacity. Therefore, the software realization could simply utilize the downgrading strategy. Although several trade-offs existed in the base station design, they were resolved during the design in a way that fulfilled the maximum capacity requirements (represented as feature Licensedmax UL/DL in Fig. 3) for a certain hardware configuration; this design was then explicitly downgraded to reach lower capacity levels. Moreover, although the number of channel elements was reduced in the base station, this did not change other capabilities, such as the design of the channels or the way the data was managed inside the base station. Similarly, the software realization was not about adjusting to differences in the production costs of capacity variants.
With the downgrading realization, the production costs for the different license-based capacity levels were the same; however, the price of the capacity licenses varied. For the hardware realization, the main motivation was to minimize the cost of the Bill of Materials (BOM) for the lower-priced, lower-capacity base stations, that is, to resolve the trade-off between hardware costs and capacity. This was because the hardware played a major role in the cost of the base stations. The hardware realization meant that the lower-capacity variants had a less efficient hardware configuration and a smaller BOM cost. By contrast, the software realization meant that all capacity variants had the same hardware cost and, due to price differentiation, the price-to-cost ratio was worse for the lower-capacity base stations. Another motivation for the hardware realization was its ease and efficiency: the capacity as the number of channel elements was directly affected by the user plane hardware (Fig. 3), and the domain had established practices for building software that scaled to different hardware configurations. The efficiency of the hardware is highlighted by the fact that even the software realization relied partly on hardware to alter capacity: adding more channel elements meant enabling related hardware resources in software. Additionally, the testing effort affected the decisions on the realization. The vendor must thoroughly test the base station to ensure that the capacity can be guaranteed to the operators. Since the software realization downgraded capacity without affecting other quality attributes, and the mechanism of downgrading channel elements was mostly independent of other variability in the base station, the testing effort was reduced by utilizing sample-based testing (cf. sample-based analysis by Thüm et al (2014)). That is, instead of testing all capacity variants against all software variability in the base station, it was sufficient to test only the maximum, minimum, and selected throttled-down variants per hardware configuration and radio technology. This was because the realization of the channel element management was dependent on the hardware configuration and the radio access technology (Fig. 3), but independent of other software variability in the base station.

Table 5 Summary of the proposed theoretical models. The models describe performance variability in software product lines, and are based on the case account and accounts from the literature.

Explaining the decision of varying performance purposefully. Addresses RQ2; described in Fig. 6, Table 6 and Table 7.

Characterizing the strategies for realizing performance variability. Addresses RQ3; described in Fig. 7, Table 8 and Table 9.

Explaining the strategies for realizing performance variability. Addresses RQ4; described in Fig. 8 and Table 10.

All theoretical models contain: explanations or characterizations that are defined through domain-independent concepts to enable generalization (Urquhart et al 2010); scope as the identified limits of generalization (Gregor 2006); example instantiations either from the case account or from the literature; a graphical model of the explanations and characterizations; and tables to describe the definitions, scope and example instantiations.
In the telecommunication domain in general, the hardware realization has been the traditional way of varying capacity, partly because of the ability to assign a lower BOM and lower production costs to the lower-capacity variants, and partly because it has been straightforward to design and test. In the case study, the specific reason to utilize the software realization was to enable flexible upgrades to match evolving needs and better price differentiation for the operators. Further, the testing and implementation complexity of the software realization was alleviated by simply downgrading the resources needed to deliver the required channels.
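The combinatorial effect of this sampling can be illustrated with a small sketch; the feature values and counts below are invented for illustration and are not taken from the IP-BTS.

```python
from itertools import product

# Hypothetical variability dimensions of a base station product line.
hw_configs = ["1-unit", "2-unit", "4-unit"]
radio_techs = ["WCDMA", "LTE"]
capacity_levels = [16, 32, 64, 128, 256]   # channel-element variants
other_sw_variability = 20                  # e.g. other feature combinations

# Exhaustive testing: every capacity level against everything else.
full = (len(hw_configs) * len(radio_techs)
        * len(capacity_levels) * other_sw_variability)


def sample(levels):
    """Minimum, maximum and one selected throttled-down variant."""
    ordered = sorted(levels)
    return [ordered[0], ordered[-1], ordered[len(ordered) // 2]]


# Sample-based testing: capacity is tested only against hardware
# configuration and radio access technology, on which the channel-element
# mechanism actually depends, not against the other software variability.
sampled = [(hw, rt, ce)
           for hw, rt in product(hw_configs, radio_techs)
           for ce in sample(capacity_levels)]

print(full, len(sampled))  # 600 full combinations vs 18 sampled
```

The reduction rests on the independence argument in the text: because the downgrading mechanism does not interact with the other software variability, the omitted combinations add no new behavior to test.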

5 Results as the Proposed Theoretical Models

The case account in Section 4 serves to characterize the real-world phenomenon in its context, that is, capacity variability in a base station product line. However, from the mere case account, it is difficult to see how the results can be generalized beyond this domain or to other performance attributes. To enable analytical generalization, the case study results can be built into theories (Yin 1994; Urquhart et al 2010). Besides enabling generalization, theories allow knowledge to be accumulated in a systematic manner; this accumulated knowledge enlightens both research and practice (Gregor 2006). To describe the results in more general terms, we propose three theoretical models to characterize and explain performance variability in software product lines (see Table 5). The models have been constructed by augmenting the analysis of the case study account with examples and accounts from the literature (see Fig. 1). Each theoretical model consists of a number of characterizations or explanations, that is, of theory constructs and relations between the constructs that fit the theory type (Gregor 2006). All proposed models address purposeful performance variability in software product lines in general, not capacity variability in base stations in particular. The following tactics have been used to enable generalization (Table 5). First, the explanations and characterizations are defined using domain-independent concepts (Urquhart et al 2010). Second, the scope is stated as boundaries showing the limits of generalization (Gregor 2006), e.g., by identifying settings in which the explanations and characterizations may not be applicable. Third, where appropriate, further validation has been drawn by having several example instantiations either from the case account or from the literature. The example instantiations also indicate the origin: if the case is not mentioned as an example, the explanation or characterization originated from the literature. Most of the explanations and characterizations in the models originated from the case account.

[Figure: the decision of purposeful performance variability in a software product line is linked to three classes of explanations. Explanations related to the customers: differences in the customer performance needs, caused for example by differences in the amount of events or data that need to be handled; differences in how customers are willing to pay for better performance; evolution of the customer performance needs over time, long-lived products; and the ability to distinguish the performance differences and guarantee the performance to the customer. Explanations related to product and design trade-offs: trade-off between performance and production costs; and trade-off between performance and other quality attributes. Explanations related to the operating environment constraints: differences in the resources available in the product operating environment that constrain performance. Each link is identified as one explanation; no predictive causality is implied.]

Fig. 6 Explaining the decision of purposefully varying performance in a software product line. The identified explanations can motivate the decision but do not necessarily imply causality. Details, scope and example instantiations are given in Table 6 and Table 7.

5.1 Motivation to Vary Performance (RQ2) When performance is varied purposefully, there is an explicit decision behind it. Based on the case account and previous studies, the decision of purposefully varying performance may

26

Varvana Myll¨arniemi et al.

Table 6 Explaining the decision of purposefully varying performance: the customers. Differences in the customer performance needs, caused for example by differences in the amount of events or data, explain the decision of varying performance. Description If there are differences in the customer performance needs, these differences can be satisfied with different performance variants. The customer performance needs are affected, for example, by the amount of events or data that the system must handle or store. Scope Differences in the explicitly stated customer needs is not always the reason to have performance variants (Myll¨arniemi et al 2006b). Differences in the customer performance needs do not always lead to different variants (Kishi et al 2001). Example The case study; enterprise software systems (Ishida 2007); information terminals (Kishi instantiation and Noda 2000). Differences in how customers are willing to pay for better performance explain the decision of varying performance. Description Differences in how customers are willing to pay for performance enable price differentiation. Price differentiation is a powerful way for vendors to improve profitability (Phillips 2005). Price differentiation can take place even without any differences in the production costs; this is called price discrimination (Belobaba et al 2009). Scope The inability of the customer to explicitly understand or justify the higher price, for example, by relating it to her own business value and revenue, may decrease the willingness to pay more. Example The case study; an electronic patient data exchange product line: some hospitals are willing instantiation to pay 15.000 Euro more for 0.3s smaller latency (Bartholdt et al 2009). The evolution of the customer performance needs over time, together with long-lived products, explains the decision of varying performance. 
Description If the performance needs increase over time, and needs exceed the capabilities of the product, it motivates to support future performance upgrades. By supporting flexible performance upgrades or ”pay-as-you-go” models, the vendor ensures customer satisfaction and continuity. Scope Is more relevant if the product is designed to operate many years and the cost or effort of changing to another vendor is considerable (all indicated by the example instantiations). If the products are inexpensive and short-lived, instead of rebinding to a better performance variant, the customer can just buy a new product. Example The case study; enterprise software systems (Ishida 2007). instantiation The ability to distinguish the performance differences and guarantee the performance to the customer explains the decision of varying performance. Description To make differentiation easier, the customers must understand the differences between the performance variants and trust that the performance is what they pay for. This is not always the case: in many consumer product domains, the notion of ”quality” is often described in imprecise and vague terms, and the quality of the products is not guaranteed. Scope If differentiation between the products is not used, the differences in performance do not need to be communicated to the customer (Myll¨arniemi et al 2006b). Example The case study. instantiation

be explained by the customer needs and characteristics; by product and design trade-offs; and by varying constraints stemming from the operating environment; these are illustrated in Fig. 6 and described in more detail in Table 6 and Table 7. Firstly, the decision of varying performance may be explained by the customer needs and characteristics (Fig. 6), that is, by explanations related to the problem domain. These explanations include different customer needs; evolution of the customer needs; the customer’s ability to understand and trust the performance differences; and the willingness of some customers to pay more for better performance. From this perspective, one overarching driver is the ability to conduct price differentiation, that is, charging a higher price for better quality (Phillips 2005), or even price discrimination (Belobaba et al 2009), that is, price

Performance Variability in Software Product Lines: Proposing Theories from a Case Study

27

Table 7 Explaining the decision of purposefully varying performance: trade-offs and operating environment constraints. A trade-off between performance and production costs explains the decision of varying performance. Description A decision on the product or design may create a trade-off between performance production costs, for example, through more expensive hardware; this trade-off can be resolved with variability by having separate variants to optimize performance and low production cost, respectively. Otherwise the resulting cost (and product price) may end up being prohibitively high for some customers (Hallsteinsen et al 2006a). This explanation leads to product differentiation, that is, having different price categories based on the production costs and product quality (Phillips 2005; Belobaba et al 2009). Scope Assumes the production cost differences are considerable and reflected in the pricing; and that some customer segments are willing to pay the higher price. Example The case study: more expensive base station hardware for better capacity. An electronic instantiation patient data exchange product line (Bartholdt et al 2009): an expensive license for a betterperforming external software component. A trade-off between performance and other quality attributes explains the decision of varying performance. Description A decision that enhance other quality attributes, such as security, reliability or modifiability, may impose a penalty on performance (Bass et al 2003; Barbacci et al 1995); this trade-off can be resolved with variability by having separate variants to optimize performance and other quality attributes, respectively. Scope Assumes the trade-off between quality attributes is considerable; and the customers should have different, conflicting needs, or different preferences over quality attributes. 
Example An electronic patient data exchange product line (Bartholdt et al 2009): the secure chaninstantiation nel increased the response time by 50 percent. Terminal application that varies the length of the encryption key (Myll¨arniemi et al 2006a). Having better graphics increased game attractiveness but decreased performance (Myll¨arniemi et al 2006b). Differences in the resources available in the product operating environment that constrain performance explain the decision of varying performance. Description Performance may be constrained by resources that are outside the control of the product line owner; examples include CPU, buses, memory, disk, and network connection. If these external resources vary, it may be necessary to adapt the product performance instead of providing a product that consumes the least amount of resources. Scope The resources have to constrain performance and be outside the product line scope (for example, hardware resources for software-only product lines). The single solution that consumes the least amount of resources must have otherwise unwanted consequences. Example The varying operating environment resources of mobile and embedded software products instantiation are often stated as the reason to vary: train ticket reservation service (White et al 2007); mobile games (Myll¨arniemi et al 2006b); database management systems (Siegmund et al 2012b); personal mobile assistants (Hallsteinsen et al 2006b).

differentiation without differences in the production costs. This was also evident in the case study. Another driver is the evolution of the customer performance needs: the amount of events or data that needs to be handled tends to grow in the long run. In fact, it has been suggested that the concept of "variation point" be renamed "evolution point" (Kozuka and Ishida 2011). Upgrading to better performance also supports price differentiation: the customer can start with an inexpensive but less efficient product and upgrade to a premium version when the needs evolve. Secondly, the decision of varying performance may be explained by trade-offs stemming from the products or design (Fig. 6), that is, by explanations related to the solution domain. Such trade-offs may be between performance and other quality attributes or between performance and production costs; in particular, the latter are caused by the higher cost of more efficient hardware.

28

Varvana Myllärniemi et al.

[Figure 7 shows a taxonomy (is-a relations between characterizations in the proposed model): a performance variability realization strategy is either a hardware realization or a software realization; a software realization is either an impact management realization or a design tactic realization; a design tactic realization is either a downgrading realization or a trade-off tactic realization.]

Fig. 7 Characterizing the strategies for realizing performance variability in a software product line. Definitions, scope and example instantiations are given in Table 8 and Table 9.

Thirdly, the decision of varying performance may be explained by varying resources in the product operating environment that constrain performance (Fig. 6). This explanation creates a non-negotiable constraint, which constitutes a good reason to adapt the product. Looking at the prevailing literature, most studies do not explicitly discuss the motivation to purposefully vary performance. When the motivation is discussed in the literature, the focus is typically on the solution domain: performance variability is motivated either by trade-offs or by the operating environment constraints. By contrast, the explanations related to the problem domain, that is, related to the customers, played a major role in the case account; this may be because capacity was a key selling point to the customers. The value of Fig. 6 is in highlighting the diversity of situations in which it makes sense to vary performance. For each product line and context, the relevance of the proposed explanations can be analyzed; thus the model in Fig. 6 helps to make more informed decisions regarding the product line variability.

5.2 Performance Variability Realization (RQ3)

In order to realize product variants with different performance, the product line architecture must be able to create differences in performance. Based on the case account and the literature review, a variety of strategies for realizing performance variability were identified; a taxonomy of the strategies is given in Fig. 7. Most importantly, performance variability can be realized by software and hardware means (Table 8). This is because performance is affected both by the software design and implementation and by the available hardware resources. Although it sounds quite obvious that performance variability can be realized through hardware differences, the literature does not really discuss this phenomenon. When hardware is discussed in conjunction with performance variability in the literature, it is not treated as a means of creating performance differences, but as a constraint on resource consumption (Section 2.3). In fact, hardware realization is only applicable to time behavior and capacity, since these are system properties (Table 8). Further, there are several different ways to realize performance variability with software (Table 9). In particular, the software realization in the case study was clearly different from the prevailing approaches in the literature: the realization can rely either on a specific design

Performance Variability in Software Product Lines: Proposing Theories from a Case Study

29

Table 8 Characterizing the realization: hardware and software. See the taxonomy in Fig. 7; Table 9 elaborates the software strategies.

Performance Variability Realization Strategy
Description: The explicit product line architecture design means, and the corresponding implementation means, of purposefully creating differences in performance between the product variants. A software product line can apply several strategies simultaneously.

Hardware Realization
Description: Differences in performance are achieved by having different installed hardware in the product variants.
Scope: Only applicable to those software product lines that include both software and hardware. Applicable to time behavior and capacity: they are system properties that are directly affected by the hardware resources and exhibit only at a system level. Not applicable to memory consumption, which is a property of the software that is constrained by the hardware: one cannot purposefully vary the software memory consumption by varying hardware.
Example instantiation: The case study. Also instantiated for reliability as hardware redundancy in weather station systems (Kuusela and Savolainen 2000).

Software Realization
Description: Differences in performance are achieved by varying software; all products have the same hardware installed.
Scope: When the product line scope consists of software only, software realization is the only choice (cf. Myllärniemi et al 2006b). Applicable to time behavior, capacity and memory consumption (see example instantiations).
Example instantiation: The case study; database management systems (Siegmund et al 2012b); mobile phone games (Myllärniemi et al 2006b); graph product line (Sincero et al 2009; Bagheri et al 2012).

tactic or on managing impacts from other variability. In the following, we describe these software realization strategies, focusing especially on those identified primarily from the literature. As discussed in Section 2.1, performance is an architectural quality attribute (Bass et al 2003). For time behavior, several entities in the architecture participate in the execution and thus contribute to the overall response time. For resource consumption, all code modules included in the product increase the overall binary footprint, and all memory allocations increase the overall heap or stack memory consumption. Consequently, it is possible to vary performance by varying any of the software parts that contribute to performance. In such a case, performance variability is an emergent "byproduct" of other variability, or results from the impact of other variability. Functional variability may indirectly cause variation in qualities (Niemelä and Immonen 2007), which may even be an unwanted consequence: managing quality attributes in a product line is difficult, since each functional feature influences, to some degree, all system quality attributes (Bartholdt et al 2009). However, if carefully managed, it is possible to use indirect variation to realize purposeful performance variability. In the impact management realization, differences in performance are realized by managing the indirect variation, that is, the impact from varying features or components (see Table 9). To make the impact management possible, one must be able to characterize or measure how each varying feature impacts the performance, and know how these impacts can be aggregated into the overall product performance (see also Section 2.3). For example, if the leaf features are characterized with their memory consumption, the memory consumption of a composite feature is the sum over its constituent features (Tun et al 2009).
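To make the additive aggregation concrete, the following sketch sums per-feature memory footprints into a product-level figure, in the spirit of Tun et al (2009); the feature names and kilobyte values are hypothetical, invented purely for illustration.

```python
# A minimal sketch of additive impact aggregation for memory footprint
# (impact management realization). All feature names and kilobyte
# figures are hypothetical illustrations, not data from the cited studies.

LEAF_FOOTPRINT_KB = {
    "Logging": 12,
    "Statistics": 48,
    "Encryption": 32,
}

# A composite feature's footprint is the sum over its constituent leaves.
COMPOSITE_FEATURES = {
    "Diagnostics": ["Logging", "Statistics"],
}

def footprint_kb(feature):
    """Footprint of a leaf feature, or the sum over a composite's constituents."""
    if feature in LEAF_FOOTPRINT_KB:
        return LEAF_FOOTPRINT_KB[feature]
    return sum(footprint_kb(f) for f in COMPOSITE_FEATURES[feature])

def product_footprint_kb(selected_features):
    """Aggregate the overall footprint of a product variant during derivation."""
    return sum(footprint_kb(f) for f in selected_features)

print(product_footprint_kb(["Diagnostics", "Encryption"]))  # 12 + 48 + 32 = 92
```

As the surrounding text notes, real product lines also exhibit feature interactions that such a purely additive scheme cannot capture.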
In addition, the impact management realization also needs to manage feature interactions (Siegmund et al 2012b, 2013): the features (and components) in a software product line are not independent of each other, but their combinations may have an unexpected effect


Table 9 Characterizing the software realization: impact management and design tactics. See the taxonomy in Fig. 7.

Impact Management Realization (is-a Software Realization)
Description: Differences in performance are achieved by indirect variation from software features or components: performance variability is an emergent byproduct of other software variability. This realization happens by characterizing or measuring the impact of each feature or component on performance; during derivation, individual impacts are aggregated into the overall product performance, taking into account feature or component interactions. A feature impact characterizes how a particular feature contributes to performance; a feature interaction occurs when the feature impacts depend on the presence of other features.
Scope: May be difficult for system-level performance attributes, such as time behavior or capacity. May be difficult without efficient derivation support.
Example instantiation: Database management systems (Siegmund et al 2012b); Web shop product line (Soltani et al 2012); intelligent traffic systems (Sinnema et al 2006).

Design Tactic Realization (is-a Software Realization)
Description: Creates differences in performance by one or more purposefully introduced, varying design tactics (Bass et al 2003) that affect performance; performance variability is managed through these tactics. Can be either about downgrading or about trading off tactic consequences (see below).
Scope: The selected tactics should affect performance considerably compared to indirect variability. A cross-cutting tactic may be difficult to develop and manage.

Downgrading Realization (is-a Design Tactic Realization)
Description: Varies design tactics with the purpose of decreasing performance without affecting other quality attributes, for example, by limiting the available resources with software. Can be done by limiting both hardware-based and software-based resources through the operating system or middleware, for example, as enabled hardware, or as the connections or processes serving incoming requests. Thus, can be both hardware neutral and hardware dependent, cf. Jaring et al (2004).
Scope: Downgrading the resources cannot be used to create differences in resource consumption (see Table 8). See also the scope of the design tactic realization.
Example instantiation: The downgrading of the channel elements in the case study.

Trade-off Tactic Realization (is-a Design Tactic Realization)
Description: Varies design tactics with the purpose of decreasing performance but increasing other quality attributes or lowering the production costs.
Scope: See the scope of the design tactic realization.
Example instantiation: Attractiveness and resource consumption in mobile phone games (Myllärniemi et al 2006b); memory consumption and resilience without connectivity for maintenance assistant applications (Hallsteinsen et al 2006b); patient data exchange system (Bartholdt et al 2009).

on performance compared with having them in isolation. For example, when both features Replication and Cryptography are selected, the overall memory footprint is 32KB higher than the sum of the footprints of the two features when used separately (Siegmund et al 2012b). Based on the literature, it seems that the impact management realization is relatively straightforward for resource consumption. Siegmund et al (2013) illustrate how one can measure the impact and interactions of individual features on footprint and main memory consumption. Thereafter, the aggregation is a matter of summing up the impacts (Tun et al 2009; White et al 2007). However, the impact management realization seems to be more challenging for system-level performance properties, such as time behavior and capacity. Although Soltani et al (2012) imply that response time can be measured for and assigned per feature, Siegmund et al (2012b) argue that time behavior is not meaningful at a feature level, but can be characterized only per product variant. Further, Soltani et al (2012) argue that summing up the impacts can also be applied to response time; however, this is not the case if the leaf


features do not map directly to tasks that are executed sequentially, or if there is contention for system resources. Another challenge of the impact management realization is that the derivation is difficult without dedicated tool support. It may be possible to manage the impacts manually by trying to codify the tacit knowledge into heuristics, or by comparing with predefined reference configurations (Sinnema et al 2006). However, when the company studied by Sinnema et al (2006) needed to create a high-performance variant, the manual impact management caused the product derivation to take up to several months instead of only a few hours. Even when supported with a tool that evaluated the performance of a given configuration, only a few experts were capable of performing a directed optimization towards the high-performance configuration (Sinnema et al 2006). Even with tool support, the algorithms behind the tools may be computationally expensive, as discussed in Section 2.3. In addition to being impacted by other variability, performance can also be altered through explicit architecture design. Several design tactics, such as decreasing resource demand or increasing resources, or patterns as their instantiations, can be used to improve performance (see Section 2.1). Further, some tactics and patterns improve other quality attributes, such as security or reliability, at the expense of performance. In a design tactic realization (see Table 9), varying architectural tactics or patterns are purposefully introduced in the design to create performance variability. Compared with the impact management realization, in which the performance differences emerge as an impact from the overall variability, a design tactic realization relies on a purposeful design mechanism through which performance can be altered. Varying architectural styles and patterns to vary quality attributes is also addressed elsewhere (Cavalcanti et al 2011; Matinlassi 2005; Hallsteinsen et al 2003).
Two different kinds of design tactic realizations can be identified (see Table 9). Firstly, in the downgrading realization, one or more varying design tactics are used in the design to decrease performance without trying to affect other capabilities; the downgrading in the case study involved disabling the available hardware resources programmatically. By contrast, the trade-off tactic realization varies design tactics that increase or decrease performance at the expense of other quality attributes. As an example, 3D mobile games utilized a number of tactics related to game graphics and game levels to decrease the resource demand at the expense of game attractiveness and playability (Myllärniemi et al 2006b). The applicability of the design tactic realization is limited as follows. Firstly, the selected design tactic must affect performance considerably: if the tactic has only a small impact on the overall product performance, indirect variability may outweigh any performance differences achieved through the tactic. Secondly, a cross-cutting tactic may cause architecture-wide variation (Hallsteinsen et al 2006a) and hence be difficult to develop and manage; it is therefore advisable to localize the tactic realization. In the case study, one component implemented the resource downgrading, and the actual resources were abstracted away from the other software components.
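As an illustrative sketch of the downgrading realization, the following hypothetical licensing component caps the channel elements enabled in software while the installed hardware stays identical across variants; all names and numbers are invented for illustration and do not reproduce the case system.

```python
# Sketch of a downgrading realization: every variant ships the same
# hardware, but a licensing component enables only the purchased
# capacity. Names and numbers are hypothetical.

INSTALLED_CHANNEL_ELEMENTS = 128  # identical hardware in all variants

class CapacityLicense:
    """Localizes the downgrading tactic in one component: the rest of
    the software sees only the enabled resources, not the installed ones."""

    def __init__(self, licensed_elements):
        self._check(licensed_elements)
        self.licensed_elements = licensed_elements

    @staticmethod
    def _check(elements):
        if not 0 < elements <= INSTALLED_CHANNEL_ELEMENTS:
            raise ValueError("license must fit the installed hardware")

    def enabled_elements(self):
        return self.licensed_elements

    def upgrade(self, new_elements):
        """Runtime rebinding: a license upgrade raises capacity
        without any hardware change."""
        self._check(new_elements)
        self.licensed_elements = new_elements

basic = CapacityLicense(32)       # inexpensive entry-level variant
print(basic.enabled_elements())   # 32
basic.upgrade(96)                 # customer pays for a capacity upgrade
print(basic.enabled_elements())   # 96
```

Because the tactic is encapsulated in a single component, variants differ only in the license value, matching the advice above to localize the tactic realization.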

5.3 Motivation for the Realization (RQ4)

There may be different kinds of explanations behind the decision to realize performance variability with a certain strategy described in Section 5.2. Based on the case account and accounts from the literature, the identified explanations behind the different realization strategies are illustrated in Fig. 8 and Table 10. However, the model is not exhaustive: other explanations may also be identified.

[Figure 8 depicts explanations (an arrow denotes "explains"; no predictive causality) pointing to realization decisions: a trade-off between performance and hardware production costs, together with the ability to develop scalable software, explains the hardware realization; performance variability motivated by trade-offs in the software design explains the software realization; the need for the customers to rebind performance easily at runtime without changing product functionality explains the design tactic realization; performance variability motivated by differentiation, not by trade-offs, explains the downgrading realization; and the need to vary or optimize several product characteristics, together with functionality-wise similar, rich or invisible variability, explains the impact management realization.]

Fig. 8 Explaining the decision of using a specific realization strategy. Details, scope and example instantiations are given in Table 10.

One overarching theme is that the reason to vary performance in the first place (RQ2) also affects the decision on the selected realization strategy (RQ4). If performance variability is motivated by a trade-off in the software, it is straightforward to vary that trade-off to realize performance differences. Similarly, if performance variability is motivated by hardware production cost differences, the realization should involve having different hardware in the products. Thus, when a trade-off in the solution domain motivates performance variability, it makes sense to vary performance through that trade-off. By contrast, when there are no trade-offs involved, but variability is motivated by price differentiation, downgrading is a straightforward way to alter performance without affecting any other capabilities. That is, it is not always necessary to try to maximize the performance obtainable from the design. If some customers are satisfied with lower performance and are willing to pay less, and there are no specific trade-offs involved, it may make sense to simply downgrade the premium version. Moreover, downgrading also supports the model of offering an inexpensive (or even completely free) version with less performance, and later letting the customers upgrade to a premium-priced version. Similar examples can be found elsewhere, for example, in the way Spotify reserves the better bitrate for its premium version in order to attract customers to pay for its services. Thus, price differentiation through downgrading may actually increase the competitive edge over the competitors, since it is possible to target a wider range of different customer needs. This was also evident in the case study.

6 Discussion

6.1 Validity and Reliability

Validity refers to whether the results correspond to reality. The validity of a case study can be examined from different perspectives: construct validity, internal validity, external validity, and reliability (Yin 1994). Construct validity is about establishing correct operational measures to be able to answer the research questions (Yin 1994). One threat to construct validity may be posed by the lack of interviews that provide rich qualitative data. That is, do the measures or the interpretations from the documents really correspond to the concepts? However, the lack of richness and the risk of incorrect interpretations were alleviated by having an involved participant as an author, as well as by asking the chief architects clarifying questions. Further, validation


Table 10 Explaining the decision of using a specific realization strategy.

A trade-off between performance and hardware production costs, along with the ability to develop scalable software, explains the hardware realization.
Description: With the hardware realization, variants with lower performance typically have less expensive hardware, which means either a better profit margin or a lower product price. (This is especially relevant when the hardware is expensive or when the products are mass-market products with tight profit margins.) The gain in the production costs should outweigh the effort of developing scalable software; therefore, it must be known how to implement software scaling, for example, through explicit resource management.
Scope: Not applicable if the product line scope consists of software only, or if the performance attribute is not a system property (see Table 8).
Example instantiation: The case study.

If performance variability is motivated by trade-offs in the software design, it explains the use of the software realization.
Description: If the decision of varying performance is motivated by a trade-off, and the trade-off concerns software design, it is straightforward to realize the variability through software by varying that particular design trade-off.
Scope: Assumes it is possible to vary the design characteristic that causes the trade-off.
Example instantiation: Game graphics that improve attractiveness but decrease performance (Myllärniemi et al 2006b). Messaging queue that improves performance and increases production costs (Bartholdt et al 2009).

The need for the customer to rebind performance easily at runtime without changing product functionality explains the use of the design tactic realization.
Description: If the customer needs to upgrade the performance, and the aim is to provide easy and quick rebinding, it motivates the use of software realization. Moreover, if the aim is to upgrade performance independently of other varying functionality, it is easier to use a design tactic realization. This is because the impact management realization changes performance by changing other features: for example, to decrease the resource consumption, one may have to drop out the feature Statistics (Siegmund et al 2012b).
Scope: The assumption that hardware upgrades cannot be automated may not hold for infrastructure-as-a-service. It may be difficult at runtime to rebind a design tactic that is not independent of other variability: the variants must be tested beforehand, and exhaustive variant-based testing (Siegmund et al 2012b) may not be possible.
Example instantiation: The case study: hardware upgrades were difficult, and the selected, mostly independent design tactic could be validated with sample-based testing beforehand.

If performance variability is motivated by differentiation, not by trade-offs, it explains the use of the downgrading realization.
Description: If performance variability is introduced to cater for price differentiation, and there are no specific trade-offs involved, performance can be varied by simply downgrading the best available performance.
Scope: Differentiation can also be realized with design trade-offs, for example, to conduct price differentiation that caters for different hardware production costs.
Example instantiation: The licensed capacity variability in the case study.

The need to vary or optimize several product characteristics, along with functionality-wise similar, rich or invisible variability, explains the use of the impact management realization.
Description: The impact management realization supports derivation that takes into account functionality and quality attributes all at once (Ognjanovic et al 2012; Soltani et al 2012) and enables multi-attribute optimization (Olaechea et al 2012). The impact management realization needs other variability to alter performance; this is not an issue when the product line has functionally similar features with different performance characteristics, rich variability (Olaechea et al 2012) or user-invisible variability (Sincero et al 2010).
Scope: —
Example instantiation: Database management systems (Siegmund et al 2012b); algorithm-oriented applications (Bagheri et al 2010, 2012).


with the key informants (Yin 1994) was used twice as a tactic to enhance construct validity. Triangulation and multiple sources of evidence were also used to address construct validity (Yin 1994). The threat of biased observations from the participating author, the threat of incorrect interpretation of the documents, and the threat of incorrect measures in the questions and answers with the chief architects were all mitigated by checking all sources of data against each other. Another threat to construct validity is the post mortem nature of this case study: the product line was discontinued before it was taken into use. Even if the measures were correct, do they properly represent concepts related to operation, such as future capacity upgrades? Also, do the measures on a discontinued base station correctly represent successful base stations? This threat was mitigated by the architects contrasting this specific base station with other operational base stations. Consequently, similar results seem to apply to base stations in general in the case company portfolio. Further, the units of analysis, that is, the performance variability and the related design decisions, were established similarly to successful base stations before the product line was discontinued. Finally, the unit of analysis and the conceptualizations made of it were not directly related to the reasons for discontinuing the project. Further, another threat to construct validity was that only architects and architecture evaluators were involved in the data collection, and the documents were only architectural in nature. Therefore, can the measures be used to operationalize concepts related to customers and their intentions, that is, to answer research question RQ2? However, when studying the motivation to purposefully vary, it is often not about the intentions of the customers themselves, but about how the product line owner interprets the intentions of the customers.
Further, software architects have to have a solid understanding of the stakeholders' needs and concerns (Bass et al 2003) in order to be able to make informed decisions. As a final threat to construct validity, the data analysis of both the case data and the existing literature utilized only light-weight coding that served to identify low-level concepts and relations. A large part of the case account analysis was conducted through writing and informal discussions, whereas the analysis of the literature took place mostly through comparison with the case account. Thus, are the operationalized high-level concepts and relations grounded in the data (Urquhart et al 2010)? However, during different stages of the analysis, newly identified concepts were checked against the existing data, and the case study account was validated. Additionally, when augmenting the theory with the literature, the original primary studies were revisited once again. Internal validity is about mitigating the threats of establishing incorrect causal relationships between the constructs (Yin 1994). Although this was an explanatory case study, the point was not to establish causality. The relations between the constructs in the explaining theories were not about causality, that is, "if X, then Y". Instead, they were about insufficient and unnecessary (Shadish et al 2002) but affecting factors that contributed to the motivation: "Y was motivated by X". The motivating factors were validated with the chief architects, which validates the inferences made to create the explanations. However, there may be other rival explanations (Yin 1994), that is, motivating factors that were not identified and which may also explain the phenomenon. External validity is about generalizing the findings of a case study (Yin 1994). Even the results of a single case study can be of value when generalizing analytically (Yin 1994).
This is because case studies are generalizable to theories and not to populations (Runeson and H¨ost 2009; Yin 1994). Therefore, instead of only describing the case account, the results were formulated into the proposed theoretical models (Section 5). To ensure generalization to other domains and settings, the theory constructs and relations were described


in a domain-independent way, and the scope of each characterization and explanation was described as a limit to generalization (Table 5). Where possible, several instantiations from the case and the literature were utilized to further validate the models. As a threat to external validity, it is possible that we identified the scope of our theories (Gregor 2006) incorrectly. That is, there may be other situations in which the characterizations and explanations do not hold. This is partly because we did not employ any literal or theoretical replication (Yin 1994), but only utilized a number of accounts and examples from the literature, and partly deduced the scope analytically. Therefore, future empirical evidence is needed to test the proposed theory scope: are there any specific situations in which the explanations and characterizations do not apply, and what is the reason for this? The post mortem nature of this case study may have implications for external validity: how can the results be generalized from a product line that was designed approximately ten years ago and then discontinued? However, the case unit of analysis is representative of the case company portfolio, and the role of capacity in mobile networks is even more crucial today. Further, since the proposed models address the phenomenon in more general terms, the generalizability of the results is about the generalizability of the models: how well do the characterizations and explanations apply to modern or market-wise successful software product lines? At least within the case study domain, the characterizations and explanations seem to apply to more current base stations as well. Further, the example instantiations from the literature also mitigate the threats to external validity. Finally, we could not identify any characterization or explanation that was related to the reason for discontinuing the case product line.
Reliability in a case study is about demonstrating that the protocol can be repeated with similar results (Yin 1994). For this purpose, the main tactics were to produce all data in written form and to establish a case study database in which all steps and actions were recorded. The validity of the results is also affected by the literature review, but its role is slightly different from that of a standalone systematic literature review. Although the selection process in the review protocol was conducted independently of the case study, the analysis and synthesis were carried out in conjunction with the analysis of the case account. The aim of our literature review was not to provide an analysis and synthesis of the literature as a standalone contribution; rather, the aim was to find both confirming and contrasting findings compared with the case account. Therefore, some proposed practices for stand-alone systematic literature reviews, for example, regarding the way individual studies should be described, may be excessive within the scope of this study. Nevertheless, we assess the quality of this systematic literature review with the questions used by Kitchenham et al (2009). Firstly, are the inclusion and exclusion criteria described and appropriate (Kitchenham et al 2009)? We believe this is a crucial aspect in a literature review that utilizes snowballing; therefore, we spent effort and several iterations on formulating the criteria. Secondly, is the literature search likely to have covered all relevant studies (Kitchenham et al 2009)? This is mostly determined by the capability of the snowballing protocol (Wohlin and Prikladniki 2013; Wohlin 2014). Some indication is given by the high number of selected primary studies (139) compared with, e.g., the number of selected studies (196) about any variability and not just in software product lines (Galster et al 2014).
We did not exclude any studies based on metadata alone, which meant more detailed scrutiny of the primary studies.

Thirdly, is the quality or validity of the primary studies assessed (Kitchenham et al 2009)? Within the scope of this study, a full quality assessment was not done; only the level of empirical evidence was evaluated (Section 2.4). However, studies of poor quality did not provide example accounts to be utilized.

Fourthly, are the individual studies and their data described (Kitchenham et al 2009)? To keep this study focused on the case, individual studies were not described; however, citations were used when appropriate to ground the results in the original primary studies.

6.2 Lessons Learned

In the following, we discuss the novel insights that can be learned from our contribution. What can the research community and industrial practice gain from the case account in Section 4 and, in particular, from the theoretical models proposed in Section 5?

Firstly, to argue the novelty of our contribution: to the best of our knowledge, the characterizations and explanations in Section 5 have not been explicated before. Moreover, although several example instantiations exist in the literature, and the literature was used as the main data source to identify some characterizations and explanations, our analysis and synthesis of them are novel: that is, the higher-level concepts and their relations in the proposed models have not been explicitly described before. Also, our aim is to gain fundamental understanding of performance variability in its real-life context, in contrast to studies that propose a method or technique and then validate it with an industry-based example. Even if some characterizations in the proposed models seem relatively obvious, like the hardware realization, they have remained more or less tacit knowledge. Besides explicating and synthesizing such common knowledge into a more general model, the value of our study lies in showing that such a phenomenon really occurs and is relevant in industrial product lines.

One important contribution is that the decision to vary performance may be motivated by the customer needs and characteristics, by trade-offs, or by varying constraints (Fig. 6). Consequently, the decision-making requires understanding the customer needs, customer value, pricing, technical constraints, production costs, and design trade-offs. This indicates that quality attribute variability is a challenging topic that requires careful analysis of both the problem and the solution domain.
By contrast, the current literature discusses little the reasons to purposefully vary performance and, in particular, does not discuss customer needs and characteristics. It often seems that performance variability is driven solely by the trade-offs or constraints that force variation, rather than by differences in customer needs and valuations. As an example, despite being an obvious explanation, it was somewhat difficult to find studies that explicitly state that performance variability is due to different customer needs (Table 6). There may be several reasons for this lack of attention. Firstly, trade-offs may very well be one important source of explanations in industrial product lines that vary performance: for example, performance variability in the case study on 3D mobile phone games was partly explained by a trade-off between performance and game attractiveness (Myllärniemi et al 2006b). After all, trade-offs indicate situations in which all customer needs cannot be satisfied with one product. Another reason may be the difficulty of realizing and managing performance variability, which focuses the research effort on technical matters. Further, some studies may simply assume that the customer always wants the best possible performance instead of what fulfills her needs, so that trade-offs become the only reason to vary performance. However, as Fig. 6 indicates, the customer wants what fulfills her varying needs, and the willingness to pay a certain price is tied to satisfying those needs. Finally, this gap may be due to the prevailing constructive research paradigm: it is difficult to study customer needs without empirical research conducted in a real industrial context.

Another important contribution is the identification of the variety of ways in which performance variability can be realized (Fig. 7). In the literature, the feature impact management realization is the prevailing, although implicitly stated, strategy. Therefore, it is interesting to see that the case company utilized a purposeful design tactic to downgrade performance: the aim was to keep the impact and interactions between capacity variability and other variability to a minimum, instead of utilizing impacts from other variability to realize capacity variability in an emergent fashion. There may be several explanations for this gap between the literature and the case study. Firstly, the focus on the impact management realization may be due to the dominance of feature modeling in the research community: the research has simply extended feature models with quality attributes, instead of starting from the characteristics of quality attribute variability. Consequently, impact management may be difficult to use to vary time behavior and capacity (Table 9). Secondly, the case study needed to support runtime rebinding and price differentiation: simply downgrading performance was a viable option when the aim was to let the customers start with an inexpensive product and later upgrade to a premium version (Table 10). Further, when the aim is to purposefully create and guarantee differences in performance, it may make sense to utilize a purposefully introduced design mechanism instead of relying on emergent variability. Because of this gap, it would be extremely valuable to report more industrial cases and to contrast their realization mechanisms with the proposed theoretical model in Section 5.2.

The proposed theoretical models also provide insight into the nature of trade-offs in performance variability.
Trade-offs in the solution domain may both explain the motivation (Section 5.1) and affect the selection of the realization strategy (Sections 5.2, 5.3). In general, many quality attributes are largely determined by the way the trade-offs are resolved in the design. The models indicate two alternative approaches to realizing performance variability with regard to these trade-offs. The first option is to have several different resolutions of the trade-offs and to switch between them, either as varying design tactics or as emergent variability. The second option is to design the architecture and resolve the trade-offs to fulfill the requirements in the full system configuration; thereafter, performance can be either downgraded with software or upgraded with better hardware. To better serve price differentiation and future upgrades, the case study took the latter approach.

Further, the proposed models in Section 5 indicate that different performance attributes, such as time behavior, capacity, and resource consumption, should be treated and analyzed separately. Some of the characterizations and explanations only apply to certain performance attributes; for example, the hardware realization is not applicable to main memory consumption (Table 8). Further, the impact management realization is more complicated for response time than for memory footprint (Table 9). Also, the fact that the case study was about capacity, that is, about maximum throughput, may explain the use of both the hardware realization and the hardware-dependent downgrading realization (Tables 8, 9). After all, the hardware configuration typically sets the maximum limits on throughput and response time, whereas the actual response time and throughput vary between executions.
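To make the downgrading realization strategy concrete, the following sketch illustrates the idea in code. It is purely hypothetical and not taken from the case system: it assumes a product designed for full capacity, in which a purchased capacity tier caps the admitted throughput in software, and a licence upgrade rebinds the variability at run time without redeploying software or changing hardware. The tier names and limits are invented for illustration.

```python
# Hypothetical sketch of the downgrading realization strategy:
# the software is built for full capacity, and a purchased
# capacity licence caps the admitted load at run time.

CAPACITY_TIERS = {"basic": 100, "standard": 250, "premium": 600}  # calls/s, illustrative


class CapacityLimiter:
    """Admits work only up to the licensed maximum throughput."""

    def __init__(self, tier: str):
        self.limit = CAPACITY_TIERS[tier]
        self.admitted = 0  # calls admitted in the current one-second window

    def upgrade(self, tier: str) -> None:
        # Runtime rebinding: a new licence raises the cap without
        # redeploying software or changing hardware.
        self.limit = CAPACITY_TIERS[tier]

    def try_admit(self) -> bool:
        # Reject load above the licensed capacity.
        if self.admitted < self.limit:
            self.admitted += 1
            return True
        return False

    def tick(self) -> None:
        self.admitted = 0  # start of a new one-second window


limiter = CapacityLimiter("basic")
print(sum(limiter.try_admit() for _ in range(300)))  # 100: only the licensed capacity is admitted

limiter.tick()
limiter.upgrade("standard")
print(sum(limiter.try_admit() for _ in range(300)))  # 250: the upgraded licence admits more
```

In this sketch the variability binding is a single data value, the licensed limit, which keeps the interactions with other variability minimal; this is the property the case company aimed for with its purposeful design tactic, as opposed to letting performance differences emerge from other feature selections.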
Therefore, we stress the need for researchers to carefully explicate the scope of their proposed methods and the assumptions they make about the nature of the varying quality attributes: even "performance" cannot be treated as one uniformly behaving quality attribute.

The literature about quality attribute variability does not much address hardware; moreover, hardware is mostly treated as a constraint on memory consumption. This may be because software is often used to implement value-adding features and to differentiate products in software product line engineering. However, the case account and the proposed models point out the importance of hardware in performance variability, both as a means of realization and as a driver in the decision-making. The aim of optimizing the trade-off between performance and hardware production costs motivates both the decision to vary in the first place (Table 7) and the selection of the hardware realization (Table 10). However, the often-stated assumption that "a better performance variant costs more" is not necessarily true, as the case study indicated: performance may be varied and priced differently even when the production costs are the same.

How should the proposed models be utilized, and what is their value? For practitioners, the explanations in Section 5.1 may help in analyzing the various drivers behind the decision to vary, and they also highlight the need to analyze customer needs and characteristics instead of focusing only on solution-domain trade-offs and constraints. Further, the models in Sections 5.2 and 5.3 may help in understanding the variety of realization strategies for performance variability. For researchers, the theoretical models help in positioning existing methods and techniques as well as reported accounts and case examples. The proposed models can also be used when designing future empirical research, for example, by following the hypothetical path (Stol and Fitzgerald 2013) and aiming at replication or negation: this is similar to conducting multiple experiments on the same topic. As a concrete example, the explanations about the motivation to vary performance (Section 5.1) could be used to construct survey or interview questions about quality attribute variability in general. Further empirical studies may identify completely new characterizations and explanations, or even refute the models in Section 5 by identifying situations in which they are not applicable.
This enables knowledge to accumulate and theories to be built incrementally (Stol and Fitzgerald 2013).

7 Conclusions

This paper studied the motivation for and the realization of purposeful performance variability in software product lines. The study was conducted as a descriptive and explanatory case study of capacity variability in a mobile network base station product line, studied post mortem. To build theories about performance variability in more general terms, the data analysis augmented the case study account with the existing literature, following the observational path to empirical software engineering research. As a result, we proposed theoretical models to explain the motivation to vary performance and to characterize and explain the realization of performance variability. The theoretical models were constructed to be applicable beyond this single case study and to performance in general: each characterization and explanation was defined in domain-independent concepts; the scope was described as the identified limits of generalization; and example instantiations were also drawn from the existing literature.

There are several lessons to be learned. Firstly, performance variability is not only motivated by trade-offs and constraints in the solution domain; customer needs and characteristics also need to be analyzed in the decision-making. In particular, price differentiation and the need to offer future upgrades, that is, the problem domain explanations, may be a good enough reason to vary. Thus, performance variability is not only about resolving trade-offs, and the better performance variant does not always cost more to develop or produce. However, trade-offs and constraints are important explanations as well, since they indicate situations in which all customer needs cannot be satisfied with one product. The trade-offs, which can be between performance and other quality attributes, or performance and production costs, provide both a reason to vary and a possible realization strategy in the form of a design tactic.

From the realization point of view, the prevailing yet implicitly stated way to realize performance variability in the literature is the impact management realization, that is, performance variability as the emergent result of other variability in the software product line. In the case study, by contrast, downgrading the available resources and having more efficient hardware proved to be a straightforward way to vary performance; this was because the performance variability was introduced to support price differentiation and future runtime upgrades, not to resolve software trade-offs. Thus, there are clear differences between the case account and the dominant approaches in the literature. Finally, the proposed theoretical models indicate that different quality attributes may need to be addressed separately: even performance cannot be treated as one uniformly behaving attribute.

As future work, further empirical research on quality attribute variability in industrial product lines is needed; the proposed models can be of value in this respect. Firstly, the characterizations and explanations can be used deductively, that is, to generate testable propositions that can be confirmed or refuted: refuting studies are particularly important, since they help in better defining the scope of the theories. Secondly, the proposed theoretical models can be used inductively, that is, to construct new characterizations and explanations about performance variability. Also, it would be interesting to see how well the proposed models can be applied to quality attributes other than performance. Some characterizations and explanations may be specific to performance: for example, the hardware realization may be less applicable to security or usability. However, the explanations about the motivation to vary performance may be more or less directly applicable to other quality attributes as well.
This calls for future work.

Acknowledgements Aki Nyyssönen, Jukka Peltola, Ari Evisalmi, and Juha Timonen from NSN are acknowledged for evaluating the validity of and commenting on our results. Anssi Karhinen and Juha Kuusela are acknowledged for ideas and comments.

References

Ahnassay A, Bagheri E, Gasevic D (2013) Empirical evaluation in software product line engineering. Tech. Rep. TR-LS3-130084R4T, Laboratory for Systems, Software and Semantics, Ryerson University
Bagheri E, Di Noia T, Ragone A, Gasevic D (2010) Configuring software product line feature models based on stakeholders' soft and hard requirements. In: Software Product Line Conference
Bagheri E, Noia TD, Gasevic D, Ragone A (2012) Formalizing interactive staged feature model configuration. J Softw Evolut Proc 24(4):375–400, DOI 10.1002/smr.534
Barbacci M, Longstaff T, Klein M, Weinstock C (1995) Quality attributes. Tech. Rep. CMU/SEI-95-TR-021, SEI
Bartholdt J, Medak M, Oberhauser R (2009) Integrating quality modeling with feature modeling in software product lines. In: International Conference on Software Engineering Advances (ICSEA), DOI 10.1109/ICSEA.2009.59
Bass L, Clements P, Kazman R (2003) Software Architecture in Practice, 2nd edn. Addison-Wesley
Belobaba P, Odoni A, Barnhart C (2009) The Global Airline Industry. John Wiley & Sons
Benavides D, Martín-Arroyo PT, Cortés AR (2005) Automated reasoning on feature models. In: International Conference on Advanced Information Systems Engineering (CAiSE), DOI 10.1007/11431855_34
Berntsson Svensson R, Gorschek T, Regnell B, Torkar R, Shahrokni A, Feldt R (2012) Quality requirements in industrial practice – an extended interview study at eleven companies. IEEE Trans Softw Eng 38(4):923–935, DOI 10.1109/TSE.2011.47
Boehm B, Brown J, Kasper H, Lipow M, Macleod G, Merrit M (1978) Characteristics of Software Quality. North-Holland Publishing Company
Bosch J (2000) Design and Use of Software Architectures: Adapting and Evolving a Product-Line Approach. Addison-Wesley
Botterweck G, Thiel S, Nestor D, bin Abid S, Cawley C (2008) Visual tool support for configuring and understanding software product lines. In: Software Product Line Conference, DOI 10.1109/SPLC.2008.32
Bu T, Chan MC, Ramjee R (2006) Connectivity, performance, and resiliency of IP-based CDMA radio access networks. IEEE Transactions on Mobile Computing 5(8), DOI 10.1109/TMC.2006.108
Cavalcanti RdO, de Almeida ES, Meira SR (2011) Extending the RiPLE-DE process with quality attribute variability realization. In: Joint Conference on Quality of Software Architectures and Architecting Critical Systems (QoSA-ISARCS), DOI 10.1145/2000259.2000286
Clements P, Northrop L (2001) Software Product Lines—Practices and Patterns. Addison-Wesley
Czarnecki K, Helsen S, Eisenecker UW (2005) Formalizing cardinality-based feature models and their specialization. Softw Proc Improv Pract 10(1):7–29, DOI 10.1002/spip.213
Dubé L, Paré G (2003) Rigor in information systems positivist case research: Current practices, trends, and recommendations. MIS Q 27(4):597–635
Etxeberria L, Sagardui G (2008) Variability driven quality evaluation in software product lines. In: Software Product Line Conference, DOI 10.1109/SPLC.2008.37
Etxeberria L, Sagardui G, Belategi L (2007) Modelling variation in quality attributes. In: VaMoS
Fettke P, Houy C, Loos P (2010) On the relevance of design knowledge for design-oriented business and information systems engineering – conceptual foundations, application example, and implications. Bus Inf Syst Eng 2(6):347–358, DOI 10.1007/s12599-010-0126-4
Galster M, Avgeriou P (2011) Handling variability in software architecture: Problems and implications. In: Working IEEE/IFIP Conference on Software Architecture (WICSA), DOI 10.1109/WICSA.2011.30
Galster M, Avgeriou P (2012) A variability viewpoint for enterprise software systems. In: Working IEEE/IFIP Conference on Software Architecture (WICSA) and European Conference on Software Architecture (ECSA), DOI 10.1109/WICSA-ECSA.212.43
Galster M, Weyns D, Tofan D, Michalik B, Avgeriou P (2014) Variability in software systems—a systematic literature review. IEEE Trans Softw Eng 40(3):282–306, DOI 10.1109/TSE.2013.56
Gimenes IMdS, Fantinato M, de Toledo MBF (2008) A product line for business process management. In: Software Product Line Conference, DOI 10.1109/SPLC.2008.10
González-Baixauli B, Laguna MA, do Prado Leite JCS (2007) Using goal-models to analyze variability. In: VaMoS
Gregor S (2006) The nature of theory in information systems. MIS Q 30(3):611–642
Guo J, White J, Wang G, Li J, Wang Y (2011) A genetic algorithm for optimized feature selection with resource constraints in software product lines. J Syst Softw 84(12), DOI 10.1016/j.jss.2011.06.026
van Gurp J, Bosch J, Svahnberg M (2001) On the notion of variability in software product lines. In: Working IEEE/IFIP Conference on Software Architecture (WICSA), DOI 10.1109/WICSA.2001.948406
Hallsteinsen S, Fægri TE, Syrstad M (2003) Patterns in product family architecture design. In: Software Product Family Engineering (PFE), DOI 10.1007/978-3-540-24667-1_19
Hallsteinsen S, Schouten G, Boot G, Fægri T (2006a) Dealing with architectural variation in product populations. In: Käkölä T, Dueñas JC (eds) Software Product Lines – Research Issues in Engineering and Management, Springer
Hallsteinsen S, Stav E, Solberg A, Floch J (2006b) Using product line techniques to build adaptive systems. In: Software Product Line Conference, DOI 10.1109/SPLINE.2006.1691586
Hevner AR, March ST, Park J, Ram S (2004) Design science in IS research. MIS Q 28(1):75–105
Holma H, Toskala A (eds) (2000) WCDMA for UMTS: radio access for third generation mobile communications. Wiley
IEEE Std 1061-1998 (1998) IEEE standard for a software quality metrics methodology
IEEE Std 610.12-1990 (1990) IEEE standard glossary of software engineering terminology
Ishida Y (2007) Software product lines approach in enterprise system development. In: Software Product Line Conference
ISO/IEC 25010 (2011) Systems and software engineering—systems and software quality requirements and evaluation (SQuaRE)—system and software quality models
ISO/IEC 9126-1 (2001) Software engineering—product quality—part 1: Quality model
Jaring M, Bosch J (2002) Representing variability in software product lines: A case study. In: Software Product Line Conference
Jaring M, Krikhaar RL, Bosch J (2004) Representing variability in a family of MRI scanners. Softw: Practice and Experience 34(1):69–100
Jarzabek S, Yang B, Yoeun S (2006) Addressing quality attributes in domain analysis for product lines. IEE Proc-Softw 153(2)
Kang K, Cohen S, Hess J, Novak W, Peterson A (1990) Feature-oriented domain analysis (FODA) feasibility study. Tech. Rep. CMU/SEI-90-TR-21, ADA 235785, Software Engineering Institute
Kang K, Lee J, Donohoe P (2002) Feature-oriented product line engineering. IEEE Softw 19(4)
Karatas AS, Oguztuzun H, Dogru A (2010) Mapping extended feature models to constraint logic programming over finite domains. In: Software Product Line Conference
Kishi T, Noda N (2000) Aspect-oriented analysis of product line architecture. In: Software Product Line Conference
Kishi T, Noda N, Katayama T (2001) Architectural design for evolution by analyzing requirements on quality attributes. In: Asia-Pacific Software Engineering Conference, DOI 10.1109/APSEC.2001.991466
Kishi T, Noda N, Katayama T (2002) A method for product line scoping based on a decision-making framework. In: Software Product Line Conference
Kitchenham B, Pearl Brereton O, Budgen D, Turner M, Bailey J, Linkman S (2009) Systematic literature reviews in software engineering—a systematic literature review. Inf Softw Technology 51(1):7–15, DOI 10.1016/j.infsof.2008.09.009
Kozuka N, Ishida Y (2011) Building a product line architecture for variant-rich enterprise applications using a data-oriented approach. In: Software Product Line Conference
Kuusela J, Savolainen J (2000) Requirements engineering for product families. In: International Conference on Software Engineering (ICSE)
Lee AS, Baskerville RL (2003) Generalizing generalizability in information systems research. Inf Systems Research 14(3):221–243, DOI 10.1287/isre.14.3.221.16560
Lee K, Kang KC (2010) Using context as key driver for feature selection. In: Software Product Line Conference
Linden F, Bosch J, Kamsties E, Känsälä K, Krzanik L, Obbink H (2003) Software product family evaluation. In: Software Product-Family Engineering (PFE)
Matinlassi M (2005) Quality-driven software architecture model transformation. In: Working IEEE/IFIP Conference on Software Architecture
McCall JA, Richards P, Walters G (1977) Factors in software quality. Tech. Rep. TR-77-369, RADC
Mellado D, Fernández-Medina E, Piattini M (2008) Towards security requirements management for software product lines: A security domain requirements engineering process. Comput Stand Interfaces 30(6):361–371
Myllärniemi V, Männistö T, Raatikainen M (2006a) Quality attribute variability within a software product family architecture. In: Quality of Software Architectures (QoSA), vol. 2
Myllärniemi V, Raatikainen M, Männistö T (2006b) Inter-organisational approach in rapid software product family development—a case study. In: International Conference on Software Reuse
Myllärniemi V, Raatikainen M, Männistö T (2012) A systematically conducted literature review: quality attribute variability in software product lines. In: Software Product Line Conference
Myllärniemi V, Savolainen J, Männistö T (2013) Performance variability in software product lines: A case study in the telecommunication domain. In: Software Product Line Conference
Mylopoulos J, Chung L, Nixon B (1992) Representing and using nonfunctional requirements: A process-oriented approach. IEEE Trans Softw Eng 18(6)
Mylopoulos J, Chung L, Liao S, Wang H, Yu E (2001) Exploring alternatives during requirements analysis. IEEE Softw 18(1):92–96
Niemelä E, Immonen A (2007) Capturing quality requirements of product family architecture. Inf and Softw Technology 49(11-12)
Niemelä E, Matinlassi M, Taulavuori A (2004) Practical evaluation of software product family architectures. In: Software Product Line Conference
Ognjanovic I, Mohabbati B, Gasevic D, Bagheri E, Boskovic M (2012) A metaheuristic approach for the configuration of business process families. In: International Conference on Services Computing (SCC)
Olaechea R, Stewart S, Czarnecki K, Rayside D (2012) Modelling and multi-objective optimization of quality attributes in variability-rich software. In: Fourth International Workshop on Nonfunctional System Properties in Domain Specific Modeling Languages
Patton MQ (1990) Qualitative Evaluation and Research Methods, 2nd edn. Sage Publications
Phillips R (2005) Pricing and Revenue Optimization. Stanford University Press
Regnell B, Berntsson-Svensson R, Olsson T (2008) Supporting roadmapping of quality requirements. IEEE Softw 25(2):42–47
Roos-Frantz F, Benavides D, Ruiz-Cortés A, Heuer A, Lauenroth K (2012) Quality-aware analysis in product line engineering with the orthogonal variability model. Softw Quality J 20(3-4):519–565, DOI 10.1007/s11219-011-9156-5
Runeson P, Höst M (2009) Guidelines for conducting and reporting case study research in software engineering. Empirical Softw Eng 14(2):131–164, DOI 10.1007/s10664-008-9102-8
Shadish WR, Cook TD, Campbell DT (2002) Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, Boston
Shaw M (2002) What makes good research in software engineering? Int J STTT 4(1):1–7, DOI 10.1007/s10009-002-0083-4
Siegmund N, Kolesnikov S, Kastner C, Apel S, Batory D, Rosenmuller M, Saake G (2012a) Predicting performance via automated feature-interaction detection. In: International Conference on Software Engineering, DOI 10.1109/ICSE.2012.6227196
Siegmund N, Rosenmüller M, Kuhlemann M, Kastner C, Apel S, Saake G (2012b) SPL Conqueror: Toward optimization of non-functional properties in software. Softw Quality J 20(3-4)
Siegmund N, Rosenmuller M, Kastner C, Giarrusso PG, Apel S, Kolesnikov SS (2013) Scalable prediction of non-functional properties in software product lines: Footprint and memory consumption. Inf and Softw Technology 55(3):491–507
Sincero J, Schroder-Preikschat W, Spinczyk O (2009) Towards tool support for the configuration of non-functional properties in SPLs. In: Hawaii International Conference on System Sciences (HICSS), DOI 10.1109/HICSS.2009.472
Sincero J, Schroder-Preikschat W, Spinczyk O (2010) Approaching non-functional properties of software product lines: Learning from products. In: Asia-Pacific Software Engineering Conference (APSEC), DOI 10.1109/APSEC.2010.26
Sinnema M, Deelstra S, Nijhuis J, Bosch J (2006) Modeling dependencies in product families with COVAMOF. In: Engineering of Computer Based Systems (ECBS)
Smith CU, Williams LG (2002) Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software. Addison-Wesley
Soltani S, Asadi M, Gasevic D, Hatala M, Bagheri E (2012) Automated planning for feature model configuration based on functional and non-functional requirements. In: Software Product Line Conference
Stol KJ, Fitzgerald B (2013) Uncovering theories in software engineering. In: SEMAT Workshop on General Theory of Software Engineering (GTSE 2013)
Strauss A, Corbin J (1998) Basics of Qualitative Research, 2nd edn. Sage
Svahnberg M, van Gurp J, Bosch J (2005) A taxonomy of variability realization techniques. Softw: Practice and Experience 35(8)
Thiel S, Hein A (2002) Modelling and using product line variability in automotive systems. IEEE Softw 19(4):66–72
Thum T, Apel S, Kastner C, Schaefer I, Saake G (2014) A classification and survey of analysis strategies for software product lines. ACM Comput Surv 47(1), to appear
Tun TT, Boucher Q, Classen A, Hubaux A, Heymans P (2009) Relating requirements and feature configurations: A systematic approach. In: Software Product Line Conference
Urquhart C, Lehmann H, Myers MD (2010) Putting the theory back into grounded theory: guidelines for grounded theory studies in information systems. Inf Syst J 20(4):357–381, DOI 10.1111/j.1365-2575.2009.00328.x
White J, Schmidt DC, Wuchner E, Nechypurenko A (2007) Automating product-line variant selection for mobile devices. In: Software Product Line Conference
White J, Dougherty B, Schmidt DC (2009) Selecting highly optimal architectural feature sets with filtered cartesian flattening. J Syst Softw 82(8)
Wohlin C (2014) Guidelines for snowballing in systematic literature studies and a replication in software engineering. In: Conference on Evaluation and Assessment in Software Engineering
Wohlin C, Prikladniki R (2013) Systematic literature reviews in software engineering. Inf and Softw Technology 55(6):919–920, DOI 10.1016/j.infsof.2013.02.002
Yin RK (1994) Case Study Research, 2nd edn. Sage: Thousand Oaks
Yu Y, do Prado Leite JCS, Lapouchnian A, Mylopoulos J (2008) Configuring features with stakeholder goals. In: SAC