A NEW METHOD FOR REGULAR EXPRESSION MATCHING ON PREPROCESSED TEXT

A Thesis

by

FATMA ABU HAWAS

Submitted to the Office of Graduate Studies of Texas A&M University-Commerce in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

December 2015

ProQuest Number: 1606027

All rights reserved

INFORMATION TO ALL USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Published by ProQuest LLC (2015). Copyright of the Dissertation is held by the Author. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code Microform Edition © ProQuest LLC. ProQuest LLC. 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, MI 48106 - 1346

Approved by:

Advisor: Abdullah Arslan

Committee: Mutlu Mete, Unal Sakoglu

Head of Department: Sang C. Suh

Dean of the College: Brent Donham

Dean of Graduate Studies: Arlene Horne

ABSTRACT

A NEW METHOD FOR REGULAR EXPRESSION MATCHING ON PREPROCESSED TEXT

Fatma Abu Hawas, MS
Texas A&M University-Commerce, 2015

Advisor: Abdullah Arslan, PhD

Text editing and information retrieval are common applications that use text matching (pattern matching). The problem is to search for a specific string within a text file and to find all the locations of this string in the text. In our work, we used regular expressions to search for strings. The traditional approach to regular expression searching translates the regular expression (E) into a finite automaton (FA) that recognizes the strings matching E, and then runs the automaton on the text as input. The disadvantage of this method is that searching a large file takes a significant amount of time, since the entire text has to be scanned. In this thesis, we introduce a new method for solving the classical regular expression string searching problem. We aimed to speed up regular expression matching in texts such as Google Docs, Integrated Development Environments (IDEs), and biological sequence databases, where regular expression searches are numerous and frequent. Ultimately, the developed method and results can be used on internet searches.

ACKNOWLEDGEMENTS

Sincere gratitude is hereby extended to the following, who never ceased in helping to complete this thesis:

• Dr. Abdullah Arslan, the thesis advisor, for his patience, efforts, motivation, and unwavering guidance.

• The members of the defense committee, Dr. Mutlu Mete and Dr. Unal Sakoglu, for sharing their precious time and for their insightful comments.

• The Office of Graduate Studies at Texas A&M University-Commerce for supporting this study.

Special thanks to my great, loving husband (my present and my future) for his continuous encouragement during the period of my thesis work. To all my precious family members who in one way or another shared their support, either emotionally or physically, I say thank you. Above all, to the great Almighty God, the author of knowledge and wisdom, for His countless mercy, thank you ALLAH.

Writing this thesis is truly a memorial to me forever and always.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER
1. INTRODUCTION
    Statement of the Problem
    Purpose of the Study
    Research Questions
    Significance of the Study
    Method of Procedure
    Definitions of Terms
    Overview of Regular Expressions
    Limitations
2. LITERATURE REVIEW
    AstroGrep Tool
3. METHOD
    Building The Search Table
    Steps of the Method
    Algorithms of the Method
    The JAVA package
4. EXPERIMENTAL RESULTS AND DISCUSSION
    Comparing with AstroGrep Utility
    Comparison of the cases of our method
    Comparison between the file search-only method and our method
5. CONCLUSION AND FUTURE RESEARCH
    Future Study
REFERENCES
VITA

LIST OF TABLES

TABLE
1. Some of meta characters used in Regular Expressions.
2. Examples of equivalent regular expressions.
3. Classes used from "dk.brics.automaton" package.
4. Methods used from "dk.brics.automaton" package.
5. The results of the experiment.
6. Search time of AstroGrep and our method comparison (1).
7. Search time of AstroGrep and our method comparison (2).
8. Performance of our method when compared with AstroGrep.
9. Sorted information of Table 5.
10. Number of values of each case of our method.
11. Percentage of values of each case of our method.
12. Time and speed-up of our method comparing to the file search-only method.
13. Our method cases Speed-up with respect to File search-only method.
14. Summary of Min and Max speed-up with respect to File search-only method.

LIST OF FIGURES

FIGURE
1. Illustrative diagram of the method steps in general.
2. Flowchart for the Equivalence Case.
3. Flowchart for the Subset Case.
4. Flowchart for Searching the Text File.
5. Cases distribution according to their impact on the speed-up.
6. Percentage of the methods cases participated in the experiment.
7. Speed-up of our method with respect to the file search-only method.

Chapter 1

INTRODUCTION

Statement of the Problem

With our methods, we aimed to speed up regular expression matching and structure search in texts such as Google Docs, Integrated Development Environments (IDEs), and biological sequence databases, where regular expression searches are frequently used. Ultimately, the developed methods and results can be used on internet searches.

Problem (RES): Regular expression searching in preprocessed text (T).

Input: A regular expression (E).

Output: All matches to (E) in preprocessed text (T).

Description of the Work: Regular expressions (E) and finite automata (FA) are fundamental grammars and machineries in theoretical computer science. L(E) denotes the language described by E.

RES is a classical pattern matching problem that uses regular expressions (E), or equivalently finite automata (FA), to search text files (T). The user defines the search pattern by a regular expression.
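As a minimal illustration of the RES input and output, the sketch below reports every match of an expression E in a text T. It is only a sketch under stated assumptions: the function name `find_all_matches` and the sample text are hypothetical, and Python's `re` module is a backtracking matcher used here as a stand-in for the finite-automaton scan described in this thesis.

```python
import re
from typing import List, Tuple


def find_all_matches(expression: str, text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for each non-overlapping match.

    Hypothetical helper: it scans the whole text on every call, which is
    the baseline behaviour the thesis aims to avoid for repeated searches.
    """
    compiled = re.compile(expression)
    return [(m.start(), m.end(), m.group(0)) for m in compiled.finditer(text)]


if __name__ == "__main__":
    sample_text = "ab abb a abbb"  # assumed sample text T
    for start, end, matched in find_all_matches(r"ab+", sample_text):
        print(start, end, matched)
```

Each call rescans the entire text, so the cost grows with the file size regardless of how often the same expression has been searched before.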

Purpose of the Study

Google search corrects misspellings in string patterns and suggests other nearby search patterns. To the best of our knowledge, a similar set of functions (e.g., suggestions for nearby expressions) does not exist for regular expression search (at least, such functions are not public knowledge). In this thesis, we aimed to enhance the flexibility of using regular expression operations, albeit at the cost of greater complexity in the string matching process. We made use of the closure properties of regular expressions, such as the complement and intersection operations, and also used equality and emptiness tests for regular expressions. The UNIX Grep utility, which supports regular expression pattern matching, has been used by UNIX users for many years. This validates the usefulness of regular expression matching. For regular expression matching, it is desirable to automatically correct the user's search (or make suggestions), to rank regular expression search patterns, and to provide immediate answers if a new search can be answered by previous search results.

Usually, a regular expression is translated into a finite automaton (FA) to recognize strings that match the regular expression. The general idea of this work was to construct an automaton (a finite-state machine) from the input regular expression and use it to scan the text file. We considered very large text files in our experiment. Searching large text files requires an extensive amount of time. To avoid that, we developed a method that makes use of previous results by building a table that stores all the regular expressions previously searched by users together with their results; this accelerated the search. An input regular expression (E) is compared with the regular expressions stored in the index table. We used the method to identify which previous regular expressions are related to the new regular expression the user has input and to generate a solution from the results found earlier for these selected regular expressions.
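The idea of storing earlier searches with their results can be sketched as a small lookup table. This is a simplified illustration, not the thesis implementation: the class name `SearchTable`, its fields, and the `file_scans` counter are hypothetical, and the sketch reuses a result only when the new expression is textually identical to a stored one, whereas the method described here relates expressions by the languages they describe.

```python
import re
from typing import Dict, List, Tuple

Match = Tuple[int, int]  # (start, end) offsets of one match


class SearchTable:
    """Hypothetical history table: expression -> previously found matches."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.history: Dict[str, List[Match]] = {}
        self.file_scans = 0  # counts how often the text was actually scanned

    def _scan_text(self, expression: str) -> List[Match]:
        # Full scan of the text; this is the expensive step for large files.
        self.file_scans += 1
        return [(m.start(), m.end()) for m in re.finditer(expression, self.text)]

    def search(self, expression: str) -> List[Match]:
        # Reuse a stored result when the same expression was searched before.
        if expression in self.history:
            return self.history[expression]
        matches = self._scan_text(expression)
        self.history[expression] = matches
        return matches


if __name__ == "__main__":
    table = SearchTable("abc abd abe")
    print(table.search(r"ab[cd]"), table.file_scans)
    print(table.search(r"ab[cd]"), table.file_scans)  # answered from the table
```

Keying the table on the expression string keeps the lookup cheap, but it misses expressions that are written differently and still describe the same language; handling those requires comparing the languages themselves.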

Research Questions

To achieve our goal, we built a table that includes a history of all the results of the previous regular expressions. The following research questions were examined:

1. Can we efficiently answer a new regular expression search by maintaining a history of regular expression searches and their results?

2. Can we efficiently avoid accessing the text file for every regular expression search?

3. What is an efficient way to check the input regular expression against the contents of the search table?
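One standard way to compare a new expression with a stored one is to test language containment on their automata: L(A) is a subset of L(B) exactly when the intersection of L(A) with the complement of L(B) is empty, and two automata are equivalent when containment holds in both directions. The sketch below is an assumption-laden illustration of that test, not the procedure used in this thesis: the `DFA` type, the function names, and the hand-written automata for `ab+` and `ab*` are all hypothetical, and each DFA is assumed to be complete over the given alphabet.

```python
from collections import deque
from typing import Dict, FrozenSet, NamedTuple, Tuple


class DFA(NamedTuple):
    start: int
    accepting: FrozenSet[int]
    delta: Dict[Tuple[int, str], int]  # total transition function


def is_subset(a: DFA, b: DFA, alphabet: str) -> bool:
    """True if L(a) is contained in L(b).

    Explores the product automaton and looks for a reachable state pair
    that a accepts and b rejects, i.e. a witness that
    L(a) intersected with the complement of L(b) is non-empty.
    """
    seen = {(a.start, b.start)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in a.accepting and q not in b.accepting:
            return False
        for symbol in alphabet:
            nxt = (a.delta[(p, symbol)], b.delta[(q, symbol)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def is_equivalent(a: DFA, b: DFA, alphabet: str) -> bool:
    """True if both automata accept exactly the same language."""
    return is_subset(a, b, alphabet) and is_subset(b, a, alphabet)


# Hand-written example automata over the alphabet {a, b} (assumed inputs).
AB_PLUS = DFA(0, frozenset({2}), {  # language of ab+
    (0, "a"): 1, (0, "b"): 3, (1, "a"): 3, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 2, (3, "a"): 3, (3, "b"): 3,
})
AB_STAR = DFA(0, frozenset({1}), {  # language of ab*
    (0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (1, "b"): 1,
    (2, "a"): 2, (2, "b"): 2,
})
AB_PLUS_UNMINIMIZED = DFA(0, frozenset({2, 4}), {  # ab+ with a redundant state
    (0, "a"): 1, (0, "b"): 3, (1, "a"): 3, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 4, (4, "a"): 3, (4, "b"): 2,
    (3, "a"): 3, (3, "b"): 3,
})

if __name__ == "__main__":
    print(is_subset(AB_PLUS, AB_STAR, "ab"))
    print(is_equivalent(AB_PLUS, AB_PLUS_UNMINIMIZED, "ab"))
```

The search visits at most one pair of states per combination, so its cost is bounded by the product of the two state counts times the alphabet size and does not depend on the size of the text file.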