DOI 10.4010/2016.786 ISSN 2321 3361 © 2016 IJESC

Research Article

Volume 6 Issue No. 4

QBSCS: Query Based Source Code Summarization

Chittibabu K1, Dr. Sethukarasi T2
Assistant Professor1, Professor2
Department of CSE, RMKCET, Chennai, India

Abstract: In the software evolution process a developer must analyze the source code in order to understand the entities in it. In general this analysis is done manually, which is time-consuming and tedious. The other option is to use automated source code summarization techniques. In most cases the developer does not analyze the entire code; he or she needs to analyze only one kind of entity, such as packages, classes, methods or other entities in the source code. However, all the existing techniques provide a summary of the entire code. Large software applications contain a huge number of packages, classes, methods and other program entities, so summarizing the entire code is a time-consuming and tedious process. In this paper we propose a novel source code summarization technique called QBSCS, which is based on the developer's query. Based on the query, the summary is generated only for the entities in which the developer is interested.

Keywords: Source code entities; Source code summarization; Textual clues; Source code comprehension

I. INTRODUCTION

During software maintenance, developers often cannot read and understand the entire source code of a system and rely on partial comprehension, focusing on the parts strictly related to their task at hand. When creating software libraries, most developers use source code summarization. This technique makes it possible to convey information about functionality to other software developers using the library. Source code summarization has potential for valuable applications in many software engineering tasks, such as:

(a) Understanding new code bases. Often developers need to quickly familiarize themselves with the core parts of a large code base. This can happen when a developer is joining an existing open source project, or when a developer is evaluating whether to use a new software library.
(b) Code reviews. Reviewers need to quickly understand the key changes before reviewing the details.
(c) Locating relevant code segments. During program maintenance, developers often skim code, reading only a couple of lines at a time, while searching for a code region of interest.

Recent studies have shown that developers spend more time reading and navigating the code than writing it [1-2]. During these activities developers often only skim the source code [3] (e.g., read only the header of a method and maybe the leading comments when available). When the code is well documented internally (e.g., good preceding comments, meaningful names and parameters), this is often sufficient to determine whether it is relevant or not. The two activities (i.e., skimming and reading the whole implementation) are two extreme tactics; the former is very quick yet can lead to misunderstanding, while the latter is time-consuming. An obvious option is something in the middle, i.e., offering developers a description of the source code that can be read quickly and leads to better understanding.

International Journal of Engineering Science and Computing, April 2016

With such descriptions (or summaries), developers could look over software entities quickly and make more informed decisions on which parts of the source code they need to analyze in detail. Two main challenges emerge in this context: determining what should be included in these summaries and how to generate them automatically. When developers are searching for code with a specific functionality, textual clues can help them determine which parts of the code they need to investigate. In this paper we propose to summarize only those entities in which the developer or analyst is interested, rather than all the entities in the source code. This improves the performance of the system, because most software applications involve a large amount of code.

II. AUTOMATIC TEXT SUMMARIZATION FOR SOURCE CODE

Automatic text summarization is concerned with the production of a brief but accurate representation, called a summary, of one or more source documents, with the help of a computer program. Summaries need to be significantly shorter than the original document, while preserving the most important information in the document. The summary of a source code entity that we investigate is a term-based summary, which contains the most relevant terms for the entity found in the code. Summaries can be divided into two main categories: extractive and abstractive. Extractive summaries are obtained from the contents of a document by selecting the most important information in that document. Abstractive summaries, on the other hand, are meant to present important information about the document in a new way, at a higher level of abstraction, and usually include information which is not explicitly present in the original document. In this paper, we focus on extractive summaries.


http://ijesc.org/

III. EXISTING TECHNIQUES

A wide variety of techniques have been proposed for producing summaries in text summarization. Some of the most successful ones include techniques based on the position of words or sentences in the source document, and techniques based on text retrieval (TR). Among the techniques based on the position of terms, lead summaries are the most frequently used and most successful, and are often selected as a baseline for assessing new techniques. Lead summaries are based on the idea that the first terms that appear in a document (i.e., the leading terms) are the most relevant to that document. Statistics-based TR techniques have also been successfully used for text summarization [4-6].

Different studies of program comprehension show that programmers rely on good software documentation [9, 12, 14]. Unfortunately, manually written documentation is notorious for being incomplete, either because it is very time-consuming to create [7, 11], or because it must constantly be updated [8, 10, 13]. One result has been the invention of the documentation generator. A documentation generator is a programming tool that creates documentation for software by analyzing the statements and comments in the software's source code. The key advantage is that such tools relieve programmers of many tedious tasks while writing documentation. They offer a valuable opportunity to improve and standardize the quality of documentation. Still, a majority of documentation generators are manual.

JSummarizer is an Eclipse plug-in for automatically generating natural language summaries of Java classes [15]. The tool uses a set of predefined heuristics to determine what information will be reflected in the summary, and it uses natural language processing and generation techniques to form the summary. The generated summaries can be used to re-document the code and to help developers understand large and complex classes more easily. The main drawback of this tool is that it works only for Java and generates the summary based on the stereotype. It works only for object-oriented source code.

IV. PROPOSED WORK

In this paper we propose a novel summarization technique called QBSCS, which uses a simple mechanism based on the entities in the code and the semantics of the source code, as shown in Fig-1.

A. Extraction Phase

In this phase the details of the entity, including comment lines, are extracted from the source code. The extraction of comments involves identifying the tokens // and /*. Classes are extracted by identifying the keyword class. Methods are extracted based on the syntactic structure of a method: the first line of a method definition has the form "return type method name (argument list)".

B. Summary Generation Phase

In this phase a text document is created from the semantic content extracted in the previous phase. This text file is used as the input to this phase. This phase involves two main steps:


1. Convert the extracted text into a corpus.
2. Determine the most relevant terms for documents in the corpus and include them in the summary.

Fig-1: Sample QBSCS System (selection form with "Select the Language": C, C++, Java; "Select the Entity": LOC, Packages, Classes, Methods; and a SUBMIT button)

1. Source code corpus creation

Source code contains a lot of text, yet it is not entirely natural text, so we need to convert it into a document collection. What is a document? For OO software, the obvious choices are methods and classes, though one can think of files and packages as well. In each case, the identifiers and comments in the source code entities are extracted. The next step is to use a stop-word list to filter out terms which do not carry specific meaning. Such terms are conjunctions, prepositions, articles, common verbs, pronouns, etc. (e.g., "should", "may", "any") and programming language keywords (e.g., if, else, for, while, float, char, void).
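The filtering step described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name build_document is hypothetical, lowercasing the terms is an added assumption, a single regular expression is used to pull words from both code and comments, and the stop-word and keyword lists contain the examples named in the text plus a few extra entries added for the sketch.

```python
import re

# Illustrative lists only. "should", "may", "any" and the keywords
# if/else/for/while/float/char/void come from the text; every other
# entry is an assumption added for this sketch.
STOP_WORDS = {"should", "may", "any", "the", "a", "an", "and", "of", "to"}
KEYWORDS = {"if", "else", "for", "while", "float", "char", "void",
            "int", "return", "class", "public", "static"}

# Matches identifiers in code as well as plain words inside comments.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_document(source: str) -> list[str]:
    """Return the filtered terms (identifiers and comment words) of one entity."""
    terms = []
    for word in IDENTIFIER.findall(source):
        term = word.lower()  # assumption: terms are compared case-insensitively
        if term not in STOP_WORDS and term not in KEYWORDS:
            terms.append(term)
    return terms
```

Calling build_document on the text of each method or class yields one document per entity; the collection of these documents forms the corpus.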

2. Determine the most relevant terms

The generated document contains many terms; from these, the most relevant terms are extracted and added to the summary. The most relevant terms are identified using a special matrix called the relevancy matrix (RM), which has one row per term and two columns: the first column holds the frequency count and the second column holds the relevant ratio. A threshold value is fixed for the relevant ratio, and terms are added to the summary based on it. For example, if the threshold value for the relevant ratio is 0.03, then all the terms whose relevant ratio is less than 0.03 are not added to the summary. The relevant ratio of each term is calculated using the formula below:

$$\text{Relevant ratio}(i) = \frac{\text{Frequency count}(i)}{\text{LOC} - \sum_{j=1}^{N} \text{Frequency count}(j)}$$

where LOC stands for Lines of Code, N indicates the total number of terms and i indicates the number of the term for which we want to find the relevant ratio.
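The relevancy matrix and the threshold test can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names relevancy_matrix and select_terms are hypothetical, the denominator is read as LOC minus the sum of all frequency counts, and the guard against a non-positive denominator is an added assumption that the paper does not discuss.

```python
from collections import Counter


def relevancy_matrix(terms: list[str], loc: int) -> dict[str, tuple[int, float]]:
    """Map each term to its (frequency count, relevant ratio) row."""
    counts = Counter(terms)
    denominator = loc - sum(counts.values())
    if denominator <= 0:
        # Assumption: the formula is only meaningful when LOC exceeds
        # the total frequency count; the paper does not cover this case.
        raise ValueError("LOC must exceed the total frequency count")
    return {term: (count, count / denominator) for term, count in counts.items()}


def select_terms(matrix: dict[str, tuple[int, float]],
                 threshold: float = 0.03) -> list[str]:
    """Keep the terms whose relevant ratio is not below the threshold."""
    return [term for term, (_, ratio) in matrix.items() if ratio >= threshold]
```

With LOC = 320 and frequency counts that sum to 180, a term occurring 10 times receives a ratio of 10 / 140, roughly 0.07, so it is kept at a threshold of 0.03.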


For example, suppose LOC is 320, the summation of all frequencies is 180 and the threshold value is 0.03, and we want to find the relevant ratio for the third term in the document, whose frequency count is 10. Then relevant ratio (3) = 10 / (320 - 180) = 0.07. Since 0.07 > 0.03, the term is added to the summary.

For example, consider three words: find, sum and two. The contents of the RM matrix will look as follows:

Term    Frequency count    Relevant ratio
Find    10                 0.02
Sum     30                 0.08
two     5                  0.01

The whole process involved in the QBSCS technique is depicted in Fig-2. QBSCS involves three main steps. In the first step, entities such as classes, methods and control statements, together with their semantics, are extracted. In the second step, the LOC is identified for each entity. In the last step, the semantic content is stored in a text file, the corpus is created and the most relevant terms are identified. Based on these most relevant terms the summary is generated. The summary generated by QBSCS will be very useful for developers to identify the entities in the program and to understand the logical aspect of the program. It makes the developer's task easier by giving the line numbers of each entity and a formal description of the entities.

Fig-2: The whole process involved in QBSCS
Source Code
Get the entity from user
  1. Input the entity
Extract the details about the entity
  1. Identify the entity in the source code
  2. Identify the line no for the entity
  3. Find the LOC of entity
Generate the summary
  1. Store the semantic content in a text file
  2. Read the text from the text document
  3. Create the corpus
  4. Determine the most relevant terms using RM matrix

In general, the summary generated by QBSCS will look like Fig-3. In the first line the total LOC of the source code is displayed; the first table describes the classes in the code, the second table lists the methods in the code and the third table gives an idea about the loops in the source code. Finally, the summary is displayed in the form of two to three lines.

Fig-3: Summary format of QBSCS
Summary for the Query item: LOC
  LOC, Total LOC
Summary for the Query item: Packages
  Package, Source file name, Purpose, No of classes
Summary for the Query item: Classes
  Class, Line No, No of Methods, CLOC, Purpose, No of Identifiers
Summary for the Query item: Methods
  Method, Line No, MLOC, Purpose

Here CLOC stands for Class Lines of Code and MLOC indicates Method Lines of Code. This parameter helps to identify the significance of the methods. Every entity is displayed in the form of a table giving its line number and its purpose. This makes the developer's task of analyzing the code easier: the developer need not search the whole code for a particular method or loop, but can go directly to the relevant code using the line number, which saves time. To evaluate the quality of the automatically produced source code summaries and to measure their capability to capture the developers' understanding of the code, we performed a study in which developers judged the quality of a large set of summaries for methods and classes. The study was particularly aimed at assessing the impact of the various factors affecting the generation of code summaries on the summary quality.

V. CONCLUSION AND FUTURE WORK

In this paper we have presented a novel approach for source code summarization which is useful for software developers and testers. This approach is based on the entities in the code in which the developer is interested. We found that our summaries were superior in quality and that the generated summaries provide contextual information. We also plan to investigate multi-document approaches for the summarization of source code packages and classes. We plan to study how the generated summaries impact program comprehension by running studies where developers make use of such summaries during their daily tasks.

REFERENCES

[1] T. D. LaToza, G. Venolia, and R. DeLine, "Maintaining mental models: a study of developer work habits," in 28th IEEE International Conference on Software Engineering, 2006.
[2] A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung, "An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks," IEEE Transactions on Software Engineering, vol. 32, pp. 971-987, 2006.
[3] J. Starke, C. Luce, and J. Sillito, "Searching and skimming: An exploratory study," in 25th IEEE International Conference on Software Maintenance, Edmonton, Alberta, Canada, 2009, pp. 157-166.
[4] Y. Gong and X. Liu, "Generic text summarization using relevance measure and latent semantic analysis," in 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States, 2001, pp. 19-25.
[5] J. Steinberger and K. Ježek, "Update Summarization Based on Latent Semantic Analysis," in Text, Speech and Dialogue, ed: Springer Berlin / Heidelberg, 2009.
[6] G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer: Addison-Wesley, 1989.
[7] S. C. B. de Souza, N. Anquetil, and K. M. de Oliveira. A study of the documentation essential to software maintenance. In Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information, SIGDOC '05, pages 68–75, New York, NY, USA, 2005. ACM.
[8] B. Fluri, M. Wursch, and H. C. Gall. Do code and comments co-evolve? On the relation between source code and comment changes. In Proceedings of the 14th Working Conference on Reverse Engineering, WCRE '07, pages 70–79, Washington, DC, USA, 2007. IEEE Computer Society.
[9] A. Forward and T. C. Lethbridge. The relevance of software documentation, tools and technologies: a survey. In Proceedings of the 2002 ACM symposium on Document engineering, DocEng '02, pages 26–33, New York, NY, USA, 2002. ACM.
[10] . M. Ibrahim, N. Bettenburg, B. Adams, and A. E. Hassan. Controversy corner: On the relationship between comment update practices and software bugs. J. Syst. Softw., 85(10):2293–2304, Oct. 2012.
[11] M. Kajko-Mattsson. A survey of documentation practice within corrective maintenance. Empirical Softw. Engg., 10(1):31–55, Jan. 2005.
[12] A. J. Ko, B. A. Myers, and H. H. Aung. Six learning barriers in end-user programming systems. In Proceedings of the 2004 IEEE Symposium on Visual Languages - Human Centric Computing, VLHCC '04, pages 199–206, Washington, DC, USA, 2004. IEEE Computer Society.
[13] the joint European conferences on theory and practice of software, FASE'11/ETAPS'11, pages 416–431, Berlin, Heidelberg, 2011. Springer-Verlag.
[14] A. A. Takang, P. A. Grubb, and R. D. Macredie. The Effects of Comments and Identifier Names on Program Comprehensibility: An Experimental Study. Journal of Programming Languages, 4(3):143–167, 1996.
[15] Laura Moreno, Andrian Marcus, Lori Pollock, K. Vijay-Shanker, "JSummarizer: An Automatic Generator of Natural Language Summaries for Java Classes", IEEE 2013.
