Recovering Design Rationale from Email Repositories

Andrea De Lucia¹, Fausto Fasano², Claudia Grieco¹, and Genoveffa Tortora¹

¹ Dipartimento di Matematica e Informatica, University of Salerno, Italy
² Dipartimento di Scienze e Tecnologie per l’Ambiente e il Territorio, University of Molise, Italy

Abstract

Rationale is the justification behind decisions taken during the software development process. The usefulness of rationale pervades the entire software lifecycle. However, it is during maintenance that the benefits of rationale management are most evident, as it provides insight into the motivations and reasoning behind decisions taken during the original design and implementation. One of the strongest limitations to capturing rationale information during development is its time-consuming and disruptive nature, which causes many organizations to consider rationale management costs excessive. A possible solution is to extract and capture rationale information when it is needed. This can be done by analyzing documents shared or exchanged among software engineers during the development process. In this paper, we propose to support the software engineer during rationale capture by automatically identifying candidate rationale information extracted from email repositories. We also support the designer during rationale retrieval by identifying possible rationale information within a document repository, starting from a query represented by a source document.

1. Introduction

Software engineering is full of wicked [24] or ill-structured problems. These problems usually do not have a unique, easy, or stable solution, and the resulting decision is often the outcome of a heated discussion that may include the evaluation of several alternatives, criteria, and trade-offs. The reasoning and justification behind the final decisions is called rationale. Rationale is an essential source of information during maintenance to comprehend the design of the implemented solution, detect the source of a failure, and evaluate possible alternatives that have already been considered. Moreover, rationale provides important information for identifying where improvements can be made, by looking for decisions made under assumptions that have since changed.

Although rationale has been acknowledged as one of the most important kinds of information a software project produces, support for rationale management is still inadequate. One of the main obstacles to the diffusion of rationale management is the intrusiveness of the proposed approaches in the design process [10]. In fact, most proposed approaches to design rationale require software engineers to use specific design tools or constrain the way rationale is captured during design, usually dictating the design rationale scheme to adopt. In many cases, these approaches also disrupt the design process, because of the need to capture and structure a huge amount of information.

To avoid this intrusiveness, an alternative is to not capture rationale during the discussion and to try to reconstruct it when necessary. Unfortunately, the possibility that this information will no longer be available in the future is unacceptable for many software companies. On the other hand, in many cases the debate behind an issue resolution leaves traces in the project documentation archive. Examples of this kind of trace are email repositories, forums, chat logs, wiki pages, and draft documents stored in the history of document versioning management systems. In this paper we propose to extract rationale information from one of these sources, namely email repositories, using Information Retrieval (IR) techniques to query the email repository and Natural Language Processing (NLP) techniques to extract the most significant terms from emails. It is worth noting that the presented approach can be easily adapted to manage other types of structured or unstructured documents written in natural language.
Besides being useful for capturing the rationale stored in the repository, the approach can be used during the retrieval of rationale information. In particular, given a query represented by one of the available emails or by a new incoming message, it is possible to identify the most relevant elements in the repository that are related to it.

The rest of the paper is organized as follows. Section 2 discusses related work on rationale capturing and retrieval. Section 3 presents an overview of the Rationale Extractor (REx) process. Finally, Section 4 concludes the paper and gives some indications for future work.

Proc. ICSM 2009, Edmonton, Canada. 978-1-4244-4828-9/09/$25.00 2009 IEEE

2. Related Work

The rationale management problem can be divided into three main subproblems: representing and structuring the information, capturing the rationale during or after the discussion, and retrieving related rationale information from a rationale repository. This paper mainly addresses the latter two problems, so we focus on them. Surveys of design rationale approaches and representation schemata can be found in [21, 10].

Automatic rationale capture is a relatively new field of study. Most of the efforts in design rationale have been devoted to the definition of languages and tools to structure the rationale capture process or to speed up rationale retrieval. These tools still require substantial involvement by the designer and influence the workflow of the project members.

One direction for addressing the rationale capturing problem is to reduce its intrusiveness. For example, MIKROPOLIS [17] and gIBIS [6] provide extensive support for browsing, changing, and retrieving design rationale, but only limited support for capturing it. Ramesh [21] supports design capture by providing a user interface that makes rationale creation easier; however, the cognitive overhead of capture is high. To reduce this overhead, the strategy of differential description [10] has been proposed. For example, using PHI [15], rationale commonly used in projects in a given domain (issues, positions, and arguments) can be reused by designers in other domains, who add the missing information and make their own decisions. Differential description can be realized using rationale-annotated cases of similar projects, as in ARCHIE [13], as well as design patterns annotated with rationale. Some researchers consider the use of a representation schema during capturing too labor intensive and criticize the fact that it may cognitively interfere with the capture process. Thus, approaches that do not rely on a representation schema have also been proposed [19, 20, 26].

Several tools have been proposed that try to support the designer during the rationale capturing process. Phidias [16] performs rationale capture in two phases: first all the documents, structured or unstructured, are inserted in a graph, then links are used to connect the information. SAAMPad [22] automatically captures and organizes the rationale generated during SAAM sessions.


SAAM (Software Architectural Analysis Method) is a complex analysis method for capturing architectural rationale shared during meetings. All the participants write on an electronic board, and the program records all the changes in the drawing and the instant they happened. The verbal discussions are recorded too, and the voice recording is later synchronized with the illustrations according to the time they happened. The IDIMS project [12] tries to assist the rationale classification work by finding unresolved issues, grouping emails by thread, and allowing searches by thread and by keyword. However, IDIMS requires every document to be accompanied by rationale annotations. Finally, the Rationale Construction Framework (RCF) [19] tries to integrate existing Computer Assisted Design (CAD) tools and rationale by monitoring the designer’s actions with a CAD tool and creating a design event log. This log is then structured and interpreted using design metaphors (i.e., recognizing patterns of events).

The rationale retrieval problem can be seen as a specialization of the problem of recovering knowledge-based traceability links between software artifacts [4]. In this field, several authors have applied IR methods [2, 8] to the problem [1, 5, 7, 9, 11, 14]. Other authors [18, 25] use regular expressions to exploit naming conventions and map low-level artifacts onto high-level artifacts. Recently, the use of ontologies has also been proposed to address a similar problem [23].

3. Rationale Extraction Process

In this section we present a brief description of the Rationale Extractor (REx) process. The goal is to support the designer during the rationale capturing, retrieval, and representation process. In particular, we aim to automatically extract design rationale information from unstructured or free-text communications. Given a new element, we provide the designer with a likelihood value for each kind of rationale item included in the rationale schema.

REx can be classified as a Supervised Machine Learning System. In fact, we use a collaborative approach between data analysis and human interaction: the user is provided with the results of the rationale retrieval functionality, which provides the discussion context, and of the rationale classifier, which provides the probability associated with each type of rationale item. However, the responsibility for selecting the correct type of rationale is left to the software engineer. The classified sentences are in turn stored and used to improve the precision of future analyses. Obviously, the classifier's performance is influenced by the number and correctness of the already classified elements in the repository.

The extracted rationale is maintained in an issue-base and cross-referenced to the originating documents. We adopt an argumentative model derived from the model proposed by Brügge and Dutoit [3], which represents a good compromise between expressiveness and simplicity.

3.1. Process Description

The rationale extraction process has been divided into three phases. The first phase is the preprocessing of the input email. During this phase, the email is analyzed to extract the metadata contained in the header (e.g., author, subject, sending date) and to separate quotes from the rest of the message. The message body is split into smaller sentences to allow a more precise identification of rationale elements. The way the message body is split into sentences can be parametrized (e.g., dot-separated sentences, line-separated sentences, paragraphs, or full body text). The minimum number of terms a single sentence must contain can also be specified.

The second phase consists of the identification of all the related rationale items stored in the issue-base, which contains previously processed messages. Indeed, an essential piece of information used by the designer during the reconstruction and capturing of design rationale is the underlying discussion a specific rationale item refers to. This information helps to link the element to an ongoing (or past) issue resolution discussion and to define the context of the rationale item. This problem is usually referred to as the design rationale retrieval problem. We decided to use an IR technique, namely Latent Semantic Indexing (LSI) [8], to retrieve all the emails in the repository that are related to the input message. The output of this phase is a list of related messages ranked by relevance to the input message. For each message, the type of rationale it contains is also shown.

The third phase of the process consists of the classification of the input message within the adopted rationale schema. During this phase, the designer is provided with a likelihood value for each kind of rationale item included in the rationale schema. This value is used by the designer to classify the new message. A high score for a specific rationale element type means a high probability that the sentence contains that kind of rationale item. If none of the considered rationale element types can be applied, the element is classified as non-rationale (e.g., a spam message or an element not related to any reasoning within the project).
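The preprocessing phase described above could be sketched as follows. This is a minimal illustration, not the REx implementation: it assumes single-part plain-text messages, treats lines starting with ">" as quoted text, and the names preprocess, mode, and min_terms are hypothetical and do not come from the paper.

```python
import re
from email import message_from_string


def preprocess(raw_email, mode="dot", min_terms=3):
    """Return (metadata, quoted_lines, sentences) for a raw email string."""
    msg = message_from_string(raw_email)

    # Header metadata: author, subject, sending date.
    metadata = {
        "author": msg.get("From"),
        "subject": msg.get("Subject"),
        "date": msg.get("Date"),
    }

    # Assumption: the message is single-part plain text, so the payload
    # is a string containing the whole body.
    body = msg.get_payload()

    # Assumption: quoted text is marked with a leading ">".
    quoted, own = [], []
    for line in body.splitlines():
        (quoted if line.lstrip().startswith(">") else own).append(line)
    text = "\n".join(own)

    # Parametrized splitting of the non-quoted body.
    if mode == "dot":            # sentence-ending punctuation
        units = re.split(r"(?<=[.!?])\s+", text)
    elif mode == "line":         # one unit per line
        units = text.splitlines()
    elif mode == "paragraph":    # blank-line separated blocks
        units = re.split(r"\n\s*\n", text)
    else:                        # full body text as a single unit
        units = [text]

    # Normalize whitespace and drop units with too few terms.
    sentences = [" ".join(u.split()) for u in units]
    sentences = [s for s in sentences if len(s.split()) >= min_terms]
    return metadata, quoted, sentences
```

With min_terms=3, a one-word reply such as "Agreed." is discarded, while longer sentences are kept as candidates for classification.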

3.2. Rationale Score Details

The rationale scores are calculated as follows: (i) the subject of the email is analyzed to identify the occurrences of common words for a specific type of rationale; (ii) rationale items already classified and stored in the issue base are clustered on the basis of their textual similarity, and the distance of the new element with respect to each cluster is computed; (iii) each sentence of the input is used to perform an NLP search in the set of already classified sentences. The results of these three searches are combined and, for each type of rationale element, a likelihood value is provided. To convert the score (which is a cardinal number) into a likelihood value (which ranges between 0 and 1), each type of rationale is associated with the maximum and minimum scores obtained for that type of rationale in previous classifications.

During the first step, a list of commonly used words and phrases (white-list) is used for each type of rationale. Each element in the white-list is associated with a weight that corresponds to how much the element characterizes the specific type of rationale. Moreover, the same term can be included in different white-lists with different associated weights.

The second step clusters the already classified elements in the repository and computes the distance between the new element and each of these clusters. In this step, we use the text similarity distance.

During the last step, a natural language search is executed. This kind of search interprets the query string as a phrase in human natural language. Some terms and literals (included in a stop-words list) and words occurring in more than 50% of the emails are ignored, as they do not provide additional information. NLP is used to analyze the input text and extract the most important parts of speech: we use the Unlexicalized Probabilistic Context-Free Grammar (PCFG) Parser created by the Stanford NLP Group¹ to obtain the syntactic structure of the text. Instead of using the entire text as search input, we extract the Noun Phrases (NP) from the syntactic tree and use them as keywords. This choice produced more precise results than a full text search.
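Some of the scoring ingredients described above can be illustrated with a short sketch. It rests on several assumptions: the paper does not give the exact formulas, so the conversion to a likelihood is shown here as a plain min-max rescaling against the historical bounds, white-list matching is simplified to case-insensitive substring matching, and all function names and example weights are hypothetical. The clustering step, the noun-phrase extraction, and the way the three results are combined are not covered.

```python
from collections import Counter


def whitelist_score(subject, whitelist):
    """Sum the weights of the white-list phrases found in the subject.

    `whitelist` maps a lowercase word or phrase to its weight for one
    rationale type. Matching is a simple substring test (an assumption).
    """
    text = subject.lower()
    return sum(weight for phrase, weight in whitelist.items() if phrase in text)


def to_likelihood(score, min_score, max_score):
    """Rescale a raw score into [0, 1] using the minimum and maximum
    scores seen so far for the same rationale type (assumed min-max form).
    """
    if max_score <= min_score:
        return 0.0  # no usable history yet
    scaled = (score - min_score) / (max_score - min_score)
    return max(0.0, min(1.0, scaled))  # clamp scores outside past bounds


def frequent_terms(emails, threshold=0.5):
    """Return terms occurring in more than `threshold` of the emails,
    which the search step would ignore alongside the stop-words.
    """
    doc_freq = Counter()
    for body in emails:
        doc_freq.update(set(body.lower().split()))  # count each email once
    return {t for t, n in doc_freq.items() if n / len(emails) > threshold}
```

For instance, with a hypothetical white-list {"proposal": 2.0} for one rationale type, the subject "Proposal: cache layer" scores 2.0, and a raw score of 7 with historical bounds 2 and 12 maps to a likelihood of 0.5.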

4. Conclusion and Future Work

In this paper we have presented an approach for capturing design rationale from communication traces. In particular, we focused on email archives, a widely used communication medium. The approach has been implemented in a tool that aims at providing support during maintenance by capturing and structuring design rationale. We have also conducted a preliminary evaluation of the effectiveness of the related design rationale retrieval functionality. In particular, we analyzed a set of emails taken from MarkMail², a message archive that stores mailing lists from technical projects. We chose emails from mailing lists related to two different open source projects: KDE³, a modern desktop system for Linux and UNIX platforms, and Apache HTTP Server⁴, an open-source HTTP server. The achieved results are encouraging. Indeed, given an email as input, the tool is able to identify most of the related rationale information (nearly 90%) with a high precision (about 76%).

Besides retrieving related rationale from the repository, we also support the designer in selecting the correct type of rationale contained in the input email. This is accomplished by providing a likelihood value representing the probability that the email contains a specific kind of rationale. Future work will be devoted to the evaluation of the effectiveness of the classification functionality. Moreover, a case study with several student projects will be conducted to evaluate the effectiveness of the tool in different settings and with different sources of rationale information.

¹ http://nlp.stanford.edu/software/lex-parser.shtml
² www.markmail.org

³ http://www.kde.org/
⁴ http://httpd.apache.org/

References

[1] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE TSE, 28(10):970–983, 2002.
[2] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[3] B. Brügge and A. Dutoit. Object-Oriented Software Engineering, 2nd edition. Prentice Hall, 2003.
[4] J. Cleland-Huang, A. Dekhtyar, J. Hayes, G. Antoniol, B. Berenbach, A. Egyed, S. Ferguson, J. Maletic, and A. Zisman. Grand challenges in traceability. Technical Report COET-GCT-06-01-0.9, Center of Excellence for Traceability, September 2006.
[5] J. Cleland-Huang, R. Settimi, C. Duan, and X. Zou. Utilizing supporting evidence to improve dynamic requirements traceability. In Proc. 13th Int’nl Req. Eng. Conf., pages 135–144. IEEE CS Press, 2005.
[6] J. Conklin and M. L. Begeman. gIBIS: a hypertext tool for exploratory policy discussion. ACM Trans. Inf. Syst., 6(4):303–331, 1988.
[7] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Recovering traceability links in software artifact management systems using information retrieval methods. ACM Transactions on Software Engineering and Methodology, 16(4), 2007.
[8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. of the American Society for Inf. Science, 41(6):391–407, 1990.
[9] C. Duan and J. Cleland-Huang. Clustering support for automated tracing. In Proc. 22nd Int’nl ASE Conf., pages 244–253. ACM Press, 2007.
[10] A. H. Dutoit, R. McCall, I. Mistrik, and B. Paech. Rationale Management in Software Engineering. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[11] J. H. Hayes, A. Dekhtyar, and S. K. Sundaram. Advancing candidate link generation for requirements tracing: The study of methods. IEEE TSE, 32(1):4–19, 2006.
[12] Y. Kato, K. Taketa, and K. Hori. Capturing design rationale by annotating e-mails. In Proceedings of the 6th World Multiconference on Systemics, Cybernetics and Informatics, 2002.
[13] J. L. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993.
[14] A. Marcus and J. Maletic. Recovering documentation-to-source-code traceability links using latent semantic indexing. In Proceedings of 25th International Conference on Software Engineering, pages 125–135, Portland, Oregon, USA, 2003.
[15] R. McCall. PHI: A conceptual foundation for design hypermedia. Design Studies, 12(1):30–41, 1991.
[16] R. McCall, P. Bennett, P. d’Oronzio, J. Ostwald, F. Shipman, and N. Wallace. PHIDIAS: A PHI-based design environment integrating CAD graphics into dynamic hypertext. In Proceedings of the European Conference on Hypertext, 1990.
[17] R. McCall, I. Mistrik, and W. Schuler. An integrated information and communication system for problem solving. In Proceedings of the 7th International CODATA Conference, pages 107–115, 1981.
[18] G. Murphy, D. Notkin, and K. Sullivan. Software reflexion models: Bridging the gap between design and implementation. IEEE Transactions on Software Engineering, 27(4):364–380, 2001.
[19] K. L. Myers, N. B. Zumel, and P. Garcia. Automated capture of rationale for the detailed design process. In Proceedings of the 6th National Conference on Artificial Intelligence and the 11th Innovative Applications of Artificial Intelligence Conference, pages 876–883, Menlo Park, CA, USA, 1999. American Association for Artificial Intelligence.
[20] B. Reeves and F. Shipman. Supporting communication between designers with artifact-centered evolving information spaces. In CSCW ’92: Proceedings of the 1992 ACM conference on Computer-supported cooperative work, pages 394–401, New York, NY, USA, 1992. ACM.
[21] W. C. Regli, X. Hu, M. Atwood, and W. Sun. A survey of design rationale systems: Approaches, representation, capture and retrieval. Engineering with Computers, 16(3–4):209–235, 2000.
[22] H. Richter, P. Schuchhard, and G. D. Abowd. Automated capture and retrieval of architectural rationale. In Proceedings of the 1st Working IFIP Conference on Software Architecture, pages 22–24. Kluwer Academic Publishers, 1999.
[23] J. Rilling, R. Witte, and Y. Zhang. Automatic traceability recovery: An ontological approach. In Proc. Int’nl Symp. on Grand Challenges in Traceability, pages 66–75. ACM Press, 2007.
[24] H. Rittel and M. Webber. Dilemmas in a general theory of planning. Policy Sciences, 4:155–169, 1973.
[25] M. Sefika, A. Sane, and R. Campbell. Monitoring compliance of a software system with its high-level design models. In Proc. 16th ICSE, pages 387–396. IEEE CS Press, 1996.
[26] F. M. Shipman, III and C. C. Marshall. Spatial hypertext: an alternative to navigational and semantic links. ACM Comput. Surv., page 14, 1999.