
International Journal of Computer Trends and Technology (IJCTT) – volume 23 Number 2 – May 2015

A Novel Technique to Detect Semantic Clones

Nirpjeet Kaur#1, Sumeet Kaur Sehra*2, Diana Nagpal#3
IT Dept., Guru Nanak Dev Engineering College, Ludhiana, India

Abstract— Existing research shows that code clones make up 5-10% or more [1] of a normal program. They are often a source of bugs and are very difficult to detect without several test cases. Although several automated refactoring tools exist, they fail to tell the number of places where refactoring is necessary. Many code clone detection techniques have been employed to remove syntactic clones, but less research is geared towards finding semantic clones, mostly because of the lower payoff, even though the errors that creep in because of semantic clones are almost invisible even to the most experienced programmer. This paper proposes a technique that may be employed to automatically detect semantic clones.

Keywords— Code Clone, Syntactic Clone, Semantic Clone, Abstract Syntax Tree, Graph.

I. INTRODUCTION

Systems have evolved at a tremendous pace. Earlier, most systems were manually operated, but now most of them are automated and network connected. A couple of decades ago there was barely any software in industries and business houses, but today any industry has ten million lines of software; its systems are rolling compute farms outfitted with tens, dozens, even hundreds of processors that talk not only to each other but also to the outside world. As companies strive to deliver higher quality products, they struggle with the complexity of these systems and with how to put them together. Upwards of forty percent [20][23] of their engineering effort is spent on software. Industry is taking on the challenges that organizations face from this rising complexity. Most business houses use system and software engineering solutions to simplify the complex design process, break it down into smaller manageable pieces, and distribute these pieces to engineering teams located around the world. Part of this is the real-time capability to audit and trace information through these complex systems, in order to design smarter products with a road map that helps avoid the challenges of the complexity faced today.

ISSN: 2231-2803

One of the major problems faced in software engineering is duplicate code, also known as code clones. Code clones [3],[4],[5],[6] are small snippets of code that are repeated at different locations of a module. They commonly arise when a software application has been under augmentation for some time, or as a result of sloppy design and poorly structured code. Code clones create difficulty when modifying the application, because finding issues and modifying code has to be done at several locations. This raises maintenance costs and may cause system crashes and untraceable bugs, leading to project failure and loss of reputation. There are several reasons [9][10][11] for code clones to exist:

1. Copying and pasting of code – Usually occurs when a fragment of code works properly and similar functionality is required in another part of the program.
2. Plagiarism – Where a piece of code is re-engineered or used without permission.
3. Generated code – RAD tools generate fragments of code that are used to reduce project development time.

Code smell is a known problem and is as old as programming itself. There are various types of code smell, but the focus of this paper is on one special type, namely code duplication. Code duplication is an unwanted effect that usually arises when different parts of the code in the same project are similar. Refactoring is considered one of the effective ways to counter code duplication. However, there are several problems associated with code refactoring:

http://www.ijcttjournal.org


1. It is hard to find the areas where refactoring is required.
2. Refactoring may break existing working code.
3. Bugs introduced while refactoring are difficult to find.
4. It is only applicable when test cases are ready.

Because refactoring may break existing code, it should be considered only when some functionality has to be added to the existing code and that functionality is hard to implement or modify without restructuring; only then should one refactor. Extreme care is needed when no test cases are available, since without test cases it is hard to know whether the code has been broken by the refactoring.

II. CODE CATEGORIZATION

The various kinds of code clones that arise during the software life cycle can be categorized as follows.

A. Syntactic Clone

Syntactic code clones [8][12][13] are defined by the closeness and similarity of the text of different pieces of code. They are partitioned into three types:

Type I - The fragments in a clone group are identical character by character, except for differences in whitespace and comments.
Type II - Builds on Type I; in addition, type names or variable names may differ.
Type III - Builds on Type II; in addition, some statements may have been added to or removed from the original.

B. Semantic Clone

Semantic clones [8,12,13,21,22] deal with the meaning of the code and not its syntax.

III. BACKGROUND

A. Text Based Techniques

These techniques consider a program to be a sequence of characters and compare raw text without regard to any specific language constructs. [12] proposed a technique in which each character of a bounded segment of code is compared with the remaining pieces of code using a simple loop, without taking spaces and comments into account. The advantage of these techniques is that they are easy to implement and require little to no code transformation. [16] used the differential file comparison algorithm, also known as the DIFF algorithm. This algorithm requires the source code to be normalized by removing whitespace and comments. It then determines the equivalence classes of the lines of one file with respect to the lines of the other, finds the longest common subsequence, and finally weeds out spurious matches, also called jackpots.

TABLE 1 COMPARISON OF RESULTS OF DIFF AND OTHER TOOLS

Technique   Comparison    Total No. of Lines   Total No. of Clones Detected
DIFF        Source        26                   14
DIFF        Destination   26                   14
Other       Source        26                   24
Other       Destination   26                   24

Table 1 shows the superiority of DIFF over the other clone detection tool: only 14 clones were present in the test cases, which is the number DIFF reports, whereas the other tool reports 24 and is therefore inaccurate.

B. Token Based Techniques

These techniques work much like a lexical analyser: the program is broken up into tokens, and a sliding window is then moved across the token stream looking for similar sequences. [18] proposed a program slicing method that first finds all global variables, then searches for variables in the various program sections such as functions and loops, tags each of these tokens, and finally slices every section based on its local and global constructs. They repeated this process until every token had been tagged and sliced, then merged the result sets and marked the clones present in their test files. They claim that this technique is not only faster than the techniques used in various text based tools but also performs the classification with better accuracy.
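The sliding-window idea behind token based detection can be sketched as follows. This is only a minimal illustration and not the program slicing method of [18]: the regular expression, the small keyword set, the placeholder tokens `ID` and `NUM`, and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny lexer: identifiers, integers, single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float"}


def tokenize(source):
    """Split source text into tokens, abstracting identifiers and numbers."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)      # keywords are kept as they are
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            tokens.append("ID")     # any identifier becomes the same placeholder
        elif tok.isdigit():
            tokens.append("NUM")    # any integer literal becomes the same placeholder
        else:
            tokens.append(tok)      # operators and punctuation
    return tokens


def find_token_clones(source, window=6):
    """Return groups of start positions whose `window`-token sequences match."""
    tokens = tokenize(source)
    seen = defaultdict(list)
    for i in range(len(tokens) - window + 1):
        seen[tuple(tokens[i:i + window])].append(i)
    return [positions for positions in seen.values() if len(positions) > 1]


print(find_token_clones("a = b + 1 ; x = y + 2 ;"))  # [[0, 6]]
```

Because identifiers and literals are replaced by placeholders before the window is moved over the stream, the two statements in the usage line are reported as a clone pair even though their variable names and constants differ, which corresponds to a Type II clone.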


A block diagram of the token based clone detection software of [18] is given in Figure 1.

Figure 1: Token-Clone Detection using Program Slicing

[17] proposed a scalable technique for token based clone detection that applies the cosine similarity function to a count matrix. They first construct a matrix by partitioning the tokens present in the program based on their vicinity, and then use the cosine similarity to reconstruct the matrix. With this technique, clones are grouped together. For two count vectors A and B, the cosine similarity function is:

\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

The advantage of this technique over text based techniques is that it can safely ignore text present in the string literals of any code segment, but it is slower than text based techniques.

C. Syntax Based Techniques

These techniques work like a parser. The source is first parsed, a tree representation is built from it, and sub-trees are then compared against each other. The advantage is that we no longer get weird clones that include the end of one method and the first two lines of another method. Such clones break across a method boundary and do not make much sense in terms of the program's actual structural meaning. The disadvantage of these techniques is that they are much slower than token based techniques. [2][7] produced abstract syntax suffix trees to detect the clones present in test cases. Although their technique works on all syntactic types, it is mainly geared towards finding syntax based clones or Type III clones. Their algorithm comprises the following steps:

1) Parse the code in the test cases
2) Generate abstract suffix trees
3) Serialize the abstract suffix trees
4) Apply suffix tree detection
5) Finally, decompose the tokens into syntactic units

Figure 2: Example of Suffix trees

A further class of techniques builds a program dependence graph [14][15][19], which is a typical compiler-internal representation. Much of the program structure is abstracted away while the meaning is preserved. It is therefore harder to detect similarity in structure, but because the meaning is maintained these techniques have a better chance of detecting fragments that have similar meaning but different syntax. The disadvantage is that they are the slowest of all the techniques, because sub-graphs are matched instead of sub-trees and graphs are computationally more expensive than tree structures.

IV. METHODOLOGY

The complexity of semantic clones is many times higher than that of any syntactic clone in the code. Semantic clones are difficult to detect and remove and are often ignored, although they reflect sloppy design and are a major source of internal inconsistency in interrelated systems.
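As a concrete illustration of what a semantic clone looks like, consider the following hypothetical pair of functions. They are not taken from the test cases of this work, and the helper `behaves_alike` is an assumption added only to make the point checkable. The two functions share almost no text or syntax tree structure, yet they compute the same value.

```python
def sum_loop(n):
    """Add the integers 1..n one at a time."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_formula(n):
    """Compute the same sum with the closed form n(n + 1) / 2."""
    return n * (n + 1) // 2


def behaves_alike(f, g, inputs):
    """Return True when f and g agree on every sampled input."""
    return all(f(x) == g(x) for x in inputs)


print(behaves_alike(sum_loop, sum_formula, range(100)))  # True
```

A text, token, or tree comparison sees a loop in one fragment and a single arithmetic expression in the other, so the pair is missed. Agreement on sampled inputs is only evidence of equivalence and not a proof, which is one reason why semantic clones are harder to detect than syntactic ones.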


This research first takes programs written in two languages and generates graphs from their program code. Each graph is then divided into n different sub-graphs, and the sub-graphs are compared against each other. Based on their degree of similarity, a conclusion is drawn about the presence of semantic code clones.

Graph_Compare(s, G):
1. Initialise(Q = Empty, V = subGraph(starts_with('#')));
2. Foreach v ∈ V
      state[v] := "Unvisited"; d[v] := ∞;
   End foreach
   d[s] := 0;
3. Enqueue(Q, s);
4. While Q ≠ Empty
      u := Dequeue(Q);
5.    Foreach v ∈ Adj[u]
         If state[v] = "Unvisited" Then
            state[v] := "Visited"; π[v] := u; d[v] := d[u] + 1; Enqueue(Q, v);
         End if
      End for
   End while
6. For each x ∈ V
      Ti = (V, π[x], x)
   End for
7. °Sim = Compare_Nodes_SubGraph(Vi, Vj)
8. Return (V, T, °Sim)

V. CONCLUSIONS

The clone detection was applied to two semantically different languages, and a few test cases were generated to demonstrate the capability of the algorithm. Three randomly selected test cases were compared, and the results are represented in Figure 3. In the first test case, which was very rudimentary and had 7 actual clones, all 7 clones were detected correctly. Even for more complex programs, the error percentage remained below 30.

Figure 3: Clone detection

Figure 4: Error Percentage
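The comparison step that produces the degree of similarity behind these results is not spelled out in the text. The following is a minimal sketch under a stated assumption: each sub-graph is reduced to its set of labelled edges, and the Jaccard index of two such sets stands in for the similarity measure. The function names, the node labels, and the choice of the Jaccard index are all hypothetical and are not the measure used in this work.

```python
def edge_set(adjacency, labels):
    """Return the set of (label_u, label_v) pairs for every edge u -> v."""
    return {
        (labels[u], labels[v])
        for u, neighbours in adjacency.items()
        for v in neighbours
    }


def degree_of_similarity(edges_a, edges_b):
    """Jaccard index of two edge sets, a value between 0.0 and 1.0."""
    if not edges_a and not edges_b:
        return 1.0  # two empty sub-graphs are treated as identical
    return len(edges_a & edges_b) / len(edges_a | edges_b)


# Two hypothetical three-node sub-graphs with the same shape.
shape = {0: [1], 1: [2], 2: []}
first = edge_set(shape, {0: "assign", 1: "loop", 2: "return"})
second = edge_set(shape, {0: "assign", 1: "loop", 2: "print"})

print(degree_of_similarity(first, first))   # 1.0
print(degree_of_similarity(first, second))  # one shared edge out of three
```

A pair of sub-graphs would then be reported as a clone candidate when the score exceeds a chosen threshold. The threshold itself is a tuning parameter that this sketch leaves open.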

VI. FUTURE SCOPE

The time and space complexity of the algorithm used is of the order:

Time Complexity: $C_1 n + C_2 \sum_{u \in V} \deg(u) = O(n + m)$, where $n = n(V)$ and $m = n(E)$

Space Complexity: $O(n + m)$

This research work used an adjacency matrix as the data structure for storing the graph, in order to reduce program complexity. An adjacency list can be used in place of the adjacency matrix to improve space complexity. To improve time complexity or performance, other data structures can be investigated.
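The adjacency list suggestion can be sketched with a breadth-first traversal, which is what the visible fragments of Graph_Compare (the queue Q, the distances d, and the predecessor π) appear to describe. This is an illustrative sketch and not the implementation used in this work: the function name `bfs_levels`, the dictionary-based adjacency list, and the vertex names are assumptions.

```python
from collections import deque


def bfs_levels(adjacency, source):
    """Breadth-first traversal over an adjacency list.

    Returns (distance, parent) dictionaries covering every vertex that is
    reachable from `source`.
    """
    distance = {source: 0}
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()               # each vertex is dequeued once
        for v in adjacency.get(u, ()):    # each edge list is scanned once
            if v not in distance:         # v has not been visited yet
                distance[v] = distance[u] + 1
                parent[v] = u
                queue.append(v)
    return distance, parent


# Hypothetical sub-graph whose root name starts with '#'.
graph = {"#main": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
distance, parent = bfs_levels(graph, "#main")
print(distance)  # {'#main': 0, 'a': 1, 'b': 1, 'c': 2}
```

Each vertex is dequeued once and each neighbour list is scanned once, which is where the n and the sum of deg(u) terms in the time bound come from. The adjacency list itself stores n keys and m neighbour entries, whereas an adjacency matrix needs n × n cells regardless of how many edges exist.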


ACKNOWLEDGMENT

I express my gratitude to Prof. Sumeet Kaur Sehra for their inspiring guidance, invaluable constructive criticism and advice during this research work. I am also grateful to Prof. Diana Nagpal for sharing her technical views on a number of issues related to this research.

REFERENCES

[1] Yingnong Dang, Song Ge, Ray Huang and Dongmei Zhang, "Code Clone Detection Experience at Microsoft", Honolulu: ACM, 23 May 2011, Microsoft Research Asia.
[2] R. V. Patil, Madhuri Lole, Ruchira Kudale, Rajani Konde, S. D. Joshi and V. Khanna, "Code Clone Detection Technique Using Weighted Graph and CFG", Pune: s.n., 16 March 2014, Proceedings of 4th IRF International Conference. ISBN 978-93-82702-66-5.
[3] Ritesh V. Patil, S. D. Joshi and V. Khanna, "Code Clone Detection using Decentralized Architecture and Parallel Processing-Latest Short Review", Chennai: s.n., September 2014, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 4, No. 9. ISSN 2277-128X.
[4] Kodhai. E, Perumal. A and Kanmani. S, "Clone Detection using Textual and Metric Analysis to figure out all Types of Clones", Puducherry: IJCCIS, July 2010, International Journal of Computer Communication and Information System, Vol. 2, No. 1. ISSN 0976-1349.
[5] Jiyong Jang, Maverick Woo and David Brumley, "ReDeBug: Finding Unpatched Code Clones in Entire OS Distributions", San Francisco: IEEE Xplore, May 2012, Institute of Electrical and Electronics Engineers, pp. 48-62. ISBN 978-0-7695-4681-0.
[6] Hosam AlHakami, Feng Chen and Helge Janicke, "An Extended Stable Marriage Problem Algorithm for Clone Detection", Leicester: IJSEA, July 2014, International Journal of Software Engineering & Applications, Vol. 5, No. 4.
[7] Rainer Koschke, Raimar Falke and Pierre Frenzel, "Clone Detection Using Abstract Syntax Suffix Trees", Benevento: IEEE, October 2006, Institute of Electrical and Electronics Engineers, pp. 253-262. ISBN 0-7695-2719-1.
[8] Randy Smith and Susan Horwitz, "Detecting and Measuring Similarity in Code Clones", June 2009.
[9] Steven Raemaekers, "Testing Semantic Clone Detection Candidates", Amsterdam: Digital Academic Repository of the University of Amsterdam, 1995.
[10] Lindsay Anne Neubauer, "Kamino: Dynamic Approach to Semantic Code Clone Detection", s.l.: Columbia University, 2014, Columbia University Computer Science Technical Reports. CUCS-022-14.
[11] Yue Jia, David Binkley, Mark Harman, Jens Krinke and Makoto Matsushita, "KClone: A Proposed Approach to Fast Precise Code Clone Detection", London: The UCL Department of Computer Science, August 2010.
[12] Chanchal K. Roy, James R. Cordy and Rainer Koschke, "Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach", Canada: Science Direct, May 2009, Science of Computer Programming, Vol. 74, No. 7, pp. 470-495.
[13] Heejung Kim, Yungbum Jung, Sunghun Kim and Kwankeun Yi, "MeCC: Memory Comparison-based Clone Detector", Honolulu: IEEE, 21 May 2011, Institute of Electrical and Electronics Engineers, pp. 301-310. ISBN 978-1-4503-0445-0.
[14] Chanchal K. Roy and James R. Cordy, "Scenario-Based Comparison of Clone Detection Techniques", Amsterdam: IEEE, 10-13 June 2008, The 16th IEEE International Conference on Program Comprehension (ICPC 2008), pp. 153-162. ISBN 978-0-7695-3176-2.
[15] Alessandro Orso and Gregg Rothermel, "Software Testing: A Research Travelogue (2000-2014)", Atlanta: ACM Digital Library, 2014, Proceedings of the 36th IEEE and ACM SIGSOFT International Conference on Software Engineering (ICSE 2014), pp. 117-132. ISBN 978-1-4503-2865-4.
[16] Rowyda Mohammed Abd El-Aziz, Amal Elsayed Aboutabl and Mostafa-Sami Mostafa, "Clone Detection Using DIFF Algorithm for Aspect Mining", Cairo: IJACSA, 2012, International Journal of Advanced Computer Science and Applications, Vol. 3, No. 8.
[17] Yang Yuan and Yao Guo, "Boreas: An Accurate and Scalable Token-Based Approach to Code Clone Detection", Essen: IEEE, 3-7 September 2012, Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012), pp. 286-289. ISBN 978-1-4503-1204-2.
[18] Rajnish Kumar and Prof. Shilpa, "Token based clone detection using program slicing", Chandigarh: IJCTA, International Journal of Computer Technology & Applications, Vol. 5, No. 4. ISSN 2229-6093.
[19] "JSCTracker: A Tool and Algorithm for Semantic Method Clone Detection Using Method IOE-Behavior", Florida: University of Central Florida, 15 October 2012, Technical Article, p. 16.
[20] Kanika Raheja and Rajkumar Tekchandani, "An Emerging Approach towards Code Clone Detection: Metric Based Approach on Byte Code", Patiala: IJARCSSE, May 2013, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, No. 5. ISSN 2277-128X.
[21] Girija Gupta and Indu Singh, "A Novel Approach towards Code Clone Detection and Redesigning", Sirsa: IJARCSSE, September 2013, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, No. 9. ISSN 2277-128X.
[22] Bakr Al-Batran, Bernhard Schätz and Benjamin Hummel, "Semantic Clone Detection for Model-Based Development of Embedded Systems", Wellington: Springer Berlin Heidelberg, 16 October 2011, 14th International Conference, MODELS 2011, Wellington, New Zealand, October 16-21, 2011, Proceedings, Vol. 6981, pp. 258-272. ISSN 0302-9743.
[23] "Highly Configurable and Extensible Code Clone Detection", Beverly, MA: IEEE, 13-16 October 2010, 17th Working Conference on Reverse Engineering (WCRE 2010), Institute of Electrical and Electronics Engineers, pp. 237-421. ISBN 978-1-4244-8911-4.
