Communications in Global Software Development - Springer Link

Communications in Global Software Development: An Empirical Study Using GTK+ OSS Repository

Liguo Yu¹, Srini Ramaswamy², Alok Mishra³, and Deepti Mishra³

¹ Computer Science and Informatics, Indiana University South Bend, South Bend, IN, USA, [email protected]
² Industrial Software Systems, ABB Corporate Research Center, Bangalore, India, [email protected]
³ Department of Computer & Software Engineering, Atilim University, Incek, Ankara, Turkey, {alok,deepti}@atilim.edu.tr

Abstract. Effective communication is an important issue for global software development. Due to geographical limitations and travel challenges, face-to-face meetings are expensive to schedule and run. Web-based communication methods are thus the primary means of communication in global software development efforts. In general, two types of web-based communication mechanisms exist: synchronous and asynchronous communications; each serves a unique role. In this paper, we present an empirical study of the communication mechanisms in GNOME GTK+, a small-sized open-source distributed software project, in which Internet Relay Chat (IRC) and mailing lists are used as synchronous and asynchronous communication methods, respectively. The objective of this study is to identify how real-time and asynchronous communication methods could be used and balanced across global software development projects.

Keywords: Global software development, communications, empirical study.

1 Introduction

Global software development (GSD) is an increasingly important trend in software systems development. Major software companies, such as Microsoft [1], IBM [2], and Oracle [3], have operated overseas research and development sites for years and are beginning to reap the benefits. For example, products such as Windows 7 are the result of globalized software development efforts [4]. Because of the many benefits of globalization, from idea generation driven by multiple ethnic and market perspectives to development cost structuring, medium and small-sized software companies are also beginning to establish worldwide development campuses and partners. Globalization has thus become an overwhelming phenomenon in the software industry and is rapidly defining the nature of software development in this first decade of the 21st century [5]. GSD, by its very nature, features distributed teams, in which software developers are dispersed around the world [6] [7].

Because of geographical limitations and workforce availability and mobility issues, face-to-face meetings are expensive, and sometimes inefficient if technology does not support the intervening periods between such meetings. Computer network (more specifically, web) based communications thus play a vital role in requirement clarification, bug reporting, issue resolution, and commit notifications. There are many web-based communication methods, such as mailing lists, web conferences, instant messages, and wikis.

Web-based communication methods can, in general, be divided into two main categories according to the reaction time between participants: synchronous and asynchronous. Synchronous, or real-time, communication includes web conferences, instant messages, tweets, chats, etc., where developers need to remain connected during the communication process: a message is sent, read instantly, and possibly acknowledged with no apparent delay. Asynchronous communication, on the other hand, includes mailing lists, message boards, etc., where developers do not need to be connected concurrently: a message is sent, read, and replied to with certain acceptable time delays.

The benefits of real-time communication include fast issue resolution and fast requirement clarification. Its disadvantages include: (i) difficulty in scheduling, especially for developers located in different countries across varying time zones; and (ii) suitability for short messages only, because long messages are impractical under the constraints imposed by the medium. The benefits of asynchronous communication include: (i) less urgency for responses, so the developer can consider the issue more carefully before replying; and (ii) the ability to explain complicated issues with more clarity and better articulated justification and reasoning. Its disadvantage is that a response might take an arbitrarily long time to arrive, or might never arrive if a developer chooses to ignore the requesting message.

Nevertheless, for GSD projects, both communication mechanisms have to be considered, and striking the right balance between them might not be a trivial undertaking. The balance should be based on the properties of the specific project, the preferences of the developers, differences in time zones, and other project and organizational issues.

In this paper, we present a case study of the GNOME GTK+ project and draw some conclusions that may be useful for such projects in general. We investigate both the real-time communication method (Internet Relay Chat) and the asynchronous communication method (mailing list) used throughout the project. The principal objective of this study is to find out how real-time and asynchronous communications are used in GSD, especially for small-sized projects. Specifically, we study the quantitative aspects of such communications from a content-agnostic perspective, i.e., the intent is not to examine the content of the messages to determine why each method is used, so that we can derive some general observations. It has been our observation and experience that suitable GSD projects with a larger scope can often be factored appropriately, so that the resulting partitions can be developed pseudo-independently by smaller, geographically distributed groups. This study is inspired, in part, by the work in [8].

The rest of this paper is organized as follows. Section 2 describes the GNOME GTK+ project and its archives of synchronous (Internet Relay Chat, IRC) and asynchronous (mailing list) communication. Section 3 describes the data mining process used on these archives. Section 4 presents our results and the analysis of the observations. Our conclusions on the case study are summarized in Section 5.

R. Meersman, T. Dillon, and P. Herrero (Eds.): OTM 2011 Workshops, LNCS 7046, pp. 218–227, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Communications in GNOME GTK+ Project

GNOME GTK+ is a small-sized open-source toolkit with rich features, cross-platform compatibility, and an easy-to-use API for creating graphical user interfaces. GTK+ is written in C and has bindings to many other popular programming languages, such as C++, Python, and C# [9]. The first stable version of GTK+ (v1.0) was released in April 1998. As of September 2010, 13 versions have been released; the latest (v2.20) was released in April 2010 [10]. Currently, the GTK+ project has nine core maintainers [11], and nearly one thousand developers have participated in its development.

Two major communication mechanisms are used by GTK+ developers: the developer mailing list and Internet Relay Chat (IRC). The GTK+ developer mailing list (gtk-devel-list) is an asynchronous communication method. Developers use it to discuss the design and implementation issues of the core GTK+ libraries. GTK+ application development and general GTK+ questions are handled by different mailing lists, and bug reports are entered and handled through Bugzilla. As regularly as possible, GTK+ team meetings take place in the IRC channel #gtk-devel on irc.gnome.org, which is a synchronous, real-time communication method. Everyone is welcome to join a meeting. However, the ground rule is that the channel is to be used only for GTK+ team meetings, and not for general questions about GTK+. Therefore, both the GTK+ developer mailing list (gtk-devel-list) and the IRC channel (#gtk-devel) are used to discuss the core development issues of GTK+.

The developer mailing list archive contains data dating back to 1997; IRC meetings are recorded, and the logs are available dating back to 2004. Figure 1 illustrates the mailing list structure: an email thread and a detailed message, where the highlighted element indicates the time stamp and the time zone information. Figure 2 illustrates the IRC meeting log structure, with the message sender email ID highlighted.

Fig. 1. An email thread and a detailed message

Fig. 2. Part of the IRC meeting log

3 Data Mining Process

Perl programs were written to mine the developer mailing list [12] and the IRC record [13] [14]. For each email message and each IRC meeting, the following information was retrieved and saved to a text file for further analysis.

Email Message:
• ID (every message is assigned a unique ID)
• Parent message ID (0 for a new message; for a reply, the ID of the original message)
• Author
• Subject
• Date and time
• Time zone

IRC Meeting:
• Date
• Participants
• Posters of each message

In the mailing list, the email message time stamps, which are recorded as local time, were converted to GMT (UTC) to enable analysis. Because IRC meeting records are available only from 2004 onward, we restricted the mailing list analysis to the same period, starting in 2004, to enable a comparative study.
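The time stamp normalization described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl programs used in the study; the function names to_utc and utc_offset_hours are hypothetical, and the sketch assumes that each archived message carries a standard email Date header with a numeric UTC offset.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_utc(date_header: str) -> datetime:
    """Parse an email Date header and return the same instant in UTC."""
    local = parsedate_to_datetime(date_header)
    if local.tzinfo is None:
        # Headers without a usable offset are parsed as naive datetimes;
        # treating them as UTC is an assumption of this sketch.
        local = local.replace(tzinfo=timezone.utc)
    return local.astimezone(timezone.utc)


def utc_offset_hours(date_header: str) -> float:
    """Return the sender's UTC offset in hours, e.g. -5.0 for "-0500"."""
    offset = parsedate_to_datetime(date_header).utcoffset()
    return 0.0 if offset is None else offset.total_seconds() / 3600.0


print(to_utc("Mon, 02 Feb 2004 09:30:00 -0500"))  # 2004-02-02 14:30:00+00:00
```

Keeping the original offset alongside the UTC time stamp preserves the time zone field listed in the record structure, which the later time zone analysis relies on.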

4 Analysis and Results

4.1 The Correlations between Synchronous and Asynchronous Communication

For the period from February 2004 to November 2009, the total number of email threads and the total number of email messages are listed in Table 1; the total number of IRC meetings and the total number of messages posted in these meetings are listed in Table 2. In the mailing list, an email thread contains one root message with parent ID 0, and zero or more replying messages with non-zero parent IDs. For example, Figure 1 illustrates one email thread and ten email messages. In contrast, Figure 2 illustrates one IRC meeting and seven messages (posts).

In IRC communications, each meeting is planned to discuss a number of issues determined a priori. Therefore, the number of meetings is correlated with the number of issues discussed, and the number of messages posted in these meetings represents the complexity of the issues. Similarly, in mailing lists, each email thread represents one issue. Therefore, the number of email threads represents the number of issues discussed in mailing lists, and the number of email messages in each thread represents the complexity of each issue.

To study whether synchronous (real-time) communication (IRC) and asynchronous communication (mailing list) were correlated in discussing issues, we study (i) the correlation between the number of IRC meetings and the number of email threads in the same period; and (ii) the correlation between the number of messages posted in IRC meetings and the number of email messages posted in the mailing list in the same period. Spearman's rank correlation tests are performed. The results of the tests on yearly aggregated data are shown in Table 3.
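The thread definition above (one root message with parent ID 0 plus its replies) can be turned into thread and message counts with a few lines of code. The following Python sketch is illustrative only; the record layout and the function name thread_sizes are assumptions, and the sketch assumes that parent links contain no cycles.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Each record is (message_id, parent_id); parent_id == 0 marks a thread root.
Record = Tuple[int, int]


def thread_sizes(records: Iterable[Record]) -> Dict[int, int]:
    """Map each root message ID to the number of messages in its thread."""
    parent_of = {msg_id: parent_id for msg_id, parent_id in records}

    def root_of(msg_id: int) -> int:
        # Follow parent links until a root is reached. A reply whose parent
        # is missing from the archive is treated as its own root (assumption).
        while True:
            parent = parent_of[msg_id]
            if parent == 0 or parent not in parent_of:
                return msg_id
            msg_id = parent

    return dict(Counter(root_of(msg_id) for msg_id in parent_of))


sizes = thread_sizes([(1, 0), (2, 1), (3, 2), (4, 0), (5, 1)])
print(len(sizes), sum(sizes.values()))  # 2 threads, 5 messages
```

Under this reading, the number of keys is the number of issues discussed, and each value is the message count that the text uses as a proxy for the complexity of an issue.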


Table 1. Number of email threads and messages

Year   Number of threads   Number of messages
2004   308                 783
2005   330                 980
2006   374                 1166
2007   372                 1425
2008   316                 1067
2009   313                 983

Table 2. Number of IRC meetings and messages

Year   Number of meetings   Number of posts in meetings
2004   35                   5132
2005   37                   5524
2006   13                   1258
2007   9                    1945
2008   16                   3088
2009   8                    2005

Table 3. Spearman's test on yearly-based data

Variable x                        Variable y                 Data sets   Correlation co.   Significance (p)
Number of IRC meetings            Number of email threads    6           -0.142            0.802
Number of posts in IRC meetings   Number of email messages   6           -0.828            0.058
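As a cross-check, the yearly correlation coefficients can be recomputed from the counts in Tables 1 and 2. The Python sketch below uses the textbook tie-free formula for Spearman's coefficient, 1 - 6*sum(d^2) / (n*(n^2 - 1)), where d is the difference between the two ranks of a year; the function names are hypothetical, and the significance values are not recomputed here.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation coefficient for tie-free samples."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Yearly totals for 2004-2009, copied from Tables 1 and 2.
threads = [308, 330, 374, 372, 316, 313]
messages = [783, 980, 1166, 1425, 1067, 983]
meetings = [35, 37, 13, 9, 16, 8]
posts = [5132, 5524, 1258, 1945, 3088, 2005]

print(round(spearman_rho(meetings, threads), 3))  # -0.143
print(round(spearman_rho(posts, messages), 3))    # -0.829
```

The computed values, -0.1429 and -0.8286, agree with the coefficients reported in Table 3 when those are read as truncated to three decimals.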

Table 3 shows negative correlations on the yearly data (1) between the number of IRC meetings and the number of email threads; and (2) between the number of posts in IRC meetings and the number of email messages. However, neither correlation is significant at the 0.05 level. To study these correlations further, the data were reorganized on a monthly basis. From February 2004 to November 2009, data for a total of 46 months were retrieved (months without IRC meetings were ignored). Spearman's rank correlation tests were performed on the monthly data, and the results are summarized in Table 4. Again, negative correlations are found (1) between the number of IRC meetings and the number of email threads; and (2) between the number of posts in IRC meetings and the number of email messages, and again neither is significant at the 0.05 level. The scatter plots in Figure 3 illustrate these relationships graphically.

Table 4. Spearman's test on monthly-based data

Variable x                        Variable y                 Data sets   Correlation co.   Significance (p)
Number of IRC meetings            Number of email threads    46          -0.128            0.394
Number of posts in IRC meetings   Number of email messages   46          -0.083            0.583
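The monthly aggregation step, including the rule that months without IRC meetings are ignored, might look like the following Python sketch. The function name monthly_pairs and its inputs (one date per meeting and one date per thread root) are assumptions made for illustration; the per-month data themselves are not reproduced in this paper.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple


def monthly_pairs(meeting_dates: Iterable[date],
                  thread_dates: Iterable[date]) -> List[Tuple[int, int]]:
    """Pair monthly meeting and thread counts, skipping months with no meeting."""
    meetings = Counter((d.year, d.month) for d in meeting_dates)
    threads = Counter((d.year, d.month) for d in thread_dates)
    # Only months present in the meeting counter are kept; a month with
    # meetings but no threads contributes a thread count of 0.
    return [(meetings[month], threads[month]) for month in sorted(meetings)]


pairs = monthly_pairs(
    [date(2004, 2, 3), date(2004, 2, 17), date(2004, 4, 6)],
    [date(2004, 2, 1), date(2004, 3, 9), date(2004, 4, 2), date(2004, 4, 20)],
)
print(pairs)  # [(2, 1), (1, 2)]
```

In the toy input, March 2004 has a thread but no meeting, so it is dropped, which mirrors how 46 monthly data points remain for the tests in Table 4.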

Fig. 3. The scatter plots of monthly data of (a) number of IRC meetings versus number of email threads; and (b) number of posts in IRC meetings versus number of email messages

Because negative correlations are found between (1) the number of meetings and the number of email threads, and (2) the number of posts in IRC meetings and the number of email messages, it seems that the real-time communication method (IRC) and the asynchronous communication method (email) are used complementarily, i.e., greater use of one method corresponds to less use of the other. However, because none of these correlations is significant at the 0.05 level, this interpretation is inconclusive and remains speculative.

4.2 Exploring Communication Participants

Table 5 shows the number of participants in GTK+ IRC meetings and in mailing list communications for each year. A/R is defined as the ratio of the number of asynchronous communication participants (mailing list) to the number of real-time communication participants (IRC meetings). This ratio ranges from 2.6 to 6.5 and averages about 3.9, i.e., roughly four times as many developers participated in asynchronous communication (mailing list) as in real-time communication (IRC meetings).

Table 5. GTK+ project communication participants

Year   IRC meetings   Mailing list   A/R
2004   49             158            3.2
2005   71             184            2.6
2006   32             209            6.5
2007   55             214            3.9
2008   73             190            2.6
2009   44             190            4.3

Exploring further, we also studied the activeness of GTK+ communication participants. We observed that in both synchronous and asynchronous communication, the majority of the messages are contributed by a small number of participants. For both communication mechanisms, active contributors are defined as the top posters who together contributed over 80% of all the messages (email messages or IRC posts) in each year. Table 6 shows the number of active contributors and total contributors in the mailing list, and Table 7 shows the same numbers for IRC meetings, where A/T is defined as the percentage of total contributors who are active contributors. Comparing the A/T percentages in IRC meetings (average 26%) with those in the mailing list (average 33%), it can be seen that, in general, regular developers are more active in asynchronous communication. It also means that in real-time communication (IRC meetings), the discussions are led by a smaller share of active posters.

Table 6. GTK+ project mailing list participants

Year   Active   Total   A/T
2004   55       158     35%
2005   62       184     34%
2006   68       209     33%
2007   64       214     30%
2008   65       190     34%
2009   64       190     34%

Table 7. GTK+ project IRC meeting participants

Year   Active   Total   A/T
2004   9        49      18%
2005   13       71      18%
2006   8        32      25%
2007   19       55      35%
2008   17       73      23%
2009   15       44      34%
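One way to read the definition of active contributors is as the smallest group of top posters whose messages add up to 80% of the yearly total. The Python sketch below follows that reading; the function name active_contributors is hypothetical, and the choice to stop as soon as the cumulative share reaches the threshold is an assumption, because the paper does not spell out how the boundary case is handled.

```python
from collections import Counter
from typing import Iterable, List


def active_contributors(posters: Iterable[str], share: float = 0.8) -> List[str]:
    """Return the top posters needed to cover `share` of all messages."""
    counts = Counter(posters)
    total = sum(counts.values())
    active: List[str] = []
    covered = 0
    # Walk through posters from most to least active and stop once the
    # messages already covered reach the requested share of the total.
    for name, count in counts.most_common():
        if covered >= share * total:
            break
        active.append(name)
        covered += count
    return active


posts = ["ann"] * 6 + ["bob"] * 3 + ["cy"]
active = active_contributors(posts)
print(active, f"A/T = {len(active)}/{len(set(posts))}")  # ['ann', 'bob'] A/T = 2/3
```

The A/T column in Tables 6 and 7 is then the size of this list divided by the number of distinct posters in the same year.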

To see how often a developer uses both communication methods, we studied the top 20 most active IRC meeting posters and the top 20 most active email posters in each year. The number of developers who belong to both top-20 lists is illustrated in Figure 4.

Fig. 4. The number of developers who are ranked as top 20 contributors to both synchronous and asynchronous communications
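The overlap counted in Figure 4 amounts to intersecting two top-20 lists per year. A minimal Python sketch is given below; the function names are hypothetical, and the sketch assumes that the same developer is identified by the same name in the IRC logs and in the mailing list, which in practice requires matching IRC nicknames to email authors.

```python
from collections import Counter
from typing import Iterable, Set


def top_posters(posters: Iterable[str], n: int = 20) -> Set[str]:
    """Names of the n most frequent posters in one channel for one year."""
    return {name for name, _ in Counter(posters).most_common(n)}


def overlap(irc_posters: Iterable[str], email_posters: Iterable[str],
            n: int = 20) -> Set[str]:
    """Developers who appear in the top-n list of both channels."""
    return top_posters(irc_posters, n) & top_posters(email_posters, n)


irc = ["ann", "ann", "bob", "bob", "cy"]
email = ["bob", "bob", "dee", "dee", "ann"]
print(overlap(irc, email, n=2))  # {'bob'}
```

Running overlap once per year, with n left at 20, would give one value per bar in a chart such as Figure 4.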

Figure 4 can be interpreted as follows: if a developer is an active communicator using one method (real-time or asynchronous), s/he is very likely to be active using the other method as well. The GTK+ web site lists nine core developers (Table 8). To see whether these core project members actively use both communication methods (real-time and asynchronous), we extracted the participants who were on both top-20 lists in at least one year (2004-2009); their names are shown in Table 9.

Table 8. Current core members of GTK+ project

Name                     Affiliation
Tim Janik                Lanedo GmbH
Matthias Clasen          Red Hat
Behdad Esfahbod          Red Hat
Federico Mena Quintero   Novell
Alexander Larsson        Red Hat
Tor Lillqvist            Novell
Kristian Rietveld        Lanedo GmbH
Michael Natterer         Lanedo GmbH
Emmanuele Bassi          Intel

Table 9. The most active participants and number of years on both top-20 lists

Name                     No. of years
Matthias Clasen          6
Alexander Larsson        5
Tim Janik                5
Emmanuele Bassi          4
Owen Taylor              3
Behdad Esfahbod          2
Federico Mena Quintero   2
Kristian Rietveld        2
Tristan Van Berkom       2
David Zeuthen            1
James M. Cape            1
Johan Dahlin             1
John Ehresman            1
Jonathan Blandford       1
Maciej Katafiasz         1
Sven Neumann             1

Comparing Table 8 with Table 9, it can be seen that 7 of the 9 current core members actively use both communication methods (real-time and asynchronous). This observation also points to an often ignored, yet intrinsic, part of many successful projects, perhaps irrespective of whether the development activity is localized or distributed: the emergence of a relative degree of stability and continuity among the core team participants.

4.3 Email Response Time Delay

Unlike real-time communication using IRC, which has no significant time delay between messages, asynchronous communication using email usually involves delays ranging from minutes to hours to days. In this section, we study the time span between the original message and the replying message. The mailing list data of the GTK+ project from February 2004 to November 2009 are analyzed. Figure 5(a) shows the frequency of time delays of at most 48 hours between the replying message and the original message. Within this range, most replies arrive within 24 hours. Figure 5(b) shows the frequency of time delays for all messages. Over the full range, most replies arrive within 2 days.

Next, we study the effect of time zone differences on email response delay. We assume that time spans over 48 hours are not affected by time zone difference. In other words, time zone difference may have an effect only if the time stamps of the replying message and the original message differ by at most 48 hours; if they differ by more than 48 hours, other factors have a more direct effect than the time zone. Therefore, we studied only the relation between time zone difference and response delays within 48 hours. The result is shown in Figure 6.
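The response delay studied here is the difference between the UTC time stamp of a reply and that of the message it answers. The following Python sketch shows one way to derive these delays and to apply the 48-hour window; the data layout and the function names are assumptions made for illustration, not the authors' implementation.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# message_id -> (parent_id, UTC time stamp); parent_id 0 marks a thread root.
Messages = Dict[int, Tuple[int, datetime]]


def reply_delays(messages: Messages) -> List[timedelta]:
    """Delay between every reply and the message it answers."""
    delays = []
    for parent_id, sent in messages.values():
        # Roots (parent ID 0) and replies whose parent is missing are skipped.
        if parent_id in messages:
            delays.append(sent - messages[parent_id][1])
    return delays


def within(delays: List[timedelta], hours: float = 48) -> List[timedelta]:
    """Keep only delays of at most `hours`, the window used for Figure 5(a)."""
    limit = timedelta(hours=hours)
    return [d for d in delays if d <= limit]


archive = {
    1: (0, datetime(2004, 2, 2, 10, 0)),
    2: (1, datetime(2004, 2, 2, 13, 0)),
    3: (1, datetime(2004, 2, 5, 10, 0)),
}
print(within(reply_delays(archive)))  # [datetime.timedelta(seconds=10800)]
```

In the toy archive, the first reply arrives after 3 hours and is kept, while the second arrives after 72 hours and falls outside the 48-hour window.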

Fig. 5. The frequency of time delay between the replying message and the original message: (a) time span is less than or equal to 48 hours; and (b) all data

Fig. 6. The relation between time zone difference and the average email response delay within 48 hours

In Figure 6, the time zone differences are calculated as absolute values, which means the largest time zone difference is 12 hours. We can see that on average, time zone difference has no apparent effect on response delays.
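The statement that the largest time zone difference is 12 hours suggests that the difference is measured the short way around the clock. The Python sketch below makes that assumption explicit; the wrap-around rule, the function names, and the input layout are all illustrative assumptions rather than details given in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tz_difference(offset_a: float, offset_b: float) -> float:
    """Absolute difference between two UTC offsets in hours, at most 12."""
    diff = abs(offset_a - offset_b) % 24
    # Assumption: a 17-hour gap is treated as 7 hours the other way around.
    return min(diff, 24 - diff)


def mean_delay_by_tz_difference(
    replies: Iterable[Tuple[float, float, float]]
) -> Dict[float, float]:
    """Average delay in hours for each time zone difference.

    Each reply is (original_offset, reply_offset, delay_hours).
    """
    buckets = defaultdict(list)
    for original_offset, reply_offset, delay_hours in replies:
        buckets[tz_difference(original_offset, reply_offset)].append(delay_hours)
    return {diff: sum(d) / len(d) for diff, d in sorted(buckets.items())}


print(tz_difference(-8, 9))  # 7
print(mean_delay_by_tz_difference([(0, 1, 2.0), (1, 0, 4.0), (0, 0, 1.0)]))
# {0: 1.0, 1: 3.0}
```

Plotting the resulting averages against the time zone difference would give a chart of the kind described for Figure 6.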

5 Conclusions

In this paper, we presented an empirical study of synchronous and asynchronous communication mechanisms in GNOME GTK+, an open-source distributed software development project whose developers use Internet Relay Chat (IRC) and mailing lists as their communication mechanisms. We mined the communication history of the GNOME GTK+ developer network. The major observations of our study are: (1) negative but statistically insignificant correlations exist between the amount of real-time communication (IRC) and the amount of asynchronous communication (mailing list); (2) core developers actively use both mechanisms to communicate with their group; (3) most emails receive a reply within 2 days; and (4) time zone difference has no apparent effect on email response delay.

The main threat to the validity of our study is that the research was performed on a single small-sized open-source project. To further validate these results, more studies need to be performed on other distributed software projects. Moreover, these observations serve as a useful starting point for more in-depth research, including the study of related issues such as variability in language use and cultural effects.

References

1. Microsoft Worldwide, http://www.microsoft.com/worldwide/
2. IBM Research, http://www.research.ibm.com/
3. Oracle Worldwide Site, http://www.oracle.com/global/index.html
4. Windows International Team. Engineering Windows 7, http://blogs.msdn.com/e7/archive/2009/07/07/engineering-windows-7-for-a-global-market.aspx
5. AeA's Board of Directors. Offshore Outsourcing in an Increasingly Competitive and Rapidly Changing World. AeA (American Electronic Association) (March 2004), http://www.techamerica.org/content/wp-content/uploads/2009/07/aea_offshore_outsourcing.pdf
6. Al-asmari, K.R., Yu, L.: Experiences in distributed software development with wiki. In: Proceedings of 2006 International Conference on Software Engineering Research and Practice, Las Vegas, Nevada, June 26-29, pp. 389–393 (2006)
7. Al-asmari, K.R., Batzinger, R.P., Yu, L.: Experience distributed and centralized software development in IPDNS project. In: Proceedings of 2007 International Conference on Software Engineering Research and Practice, Las Vegas, Nevada, pp. 46–51 (June 2007)
8. Shihab, E., Jiang, Z.M., Hassan, A.E.: On the use of Internet Relay Chat (IRC) meetings by developers of the GNOME GTK+ project. In: Proceedings of the 6th International Working Conference on Mining Software Repositories, Vancouver, BC, Canada, May 16-17, pp. 107–110 (2009)
9. GNOME GTK+ project, http://www.gtk.org/
10. Wikipedia: GTK+, http://en.wikipedia.org/wiki/GTK%2B
11. GTK+ development, http://www.gtk.org/development.html
12. GTK+ developer mailing archive, http://mail.gnome.org/archives/gtk-devel-list/
13. GTK+ meeting space, http://live.gnome.org/GTK+/Meetings
14. GTK+ team meetings, http://www.gtk.org/plan/meeting