
A Comparative Study Of Different Approaches For Improving Search Engine Performance

Surabhi Lingwal, Bhumika Gupta

Dept. of Computer Science & Engineering, G.B.P.E.C Pauri, Uttarakhand, India

Abstract:

The enormous growth and the diverse, dynamic and unstructured nature of the Web make it extremely difficult to search for and retrieve relevant information and to present query results. To address this problem, many researchers are turning to web mining. Noise on web pages is irrelevant to the main content being mined and includes advertisements, navigation bars and copyright notices. The presence of near-duplicate web pages also degrades performance while integrating data from heterogeneous sources, as it increases the index storage space and thereby the serving cost. Classifying and mining noise-free web pages and removing redundant web pages will improve the accuracy of search results as well as search speed. This paper presents a comparative study of different approaches to improve search engine performance and speed. The results show that the system readily provides relevance ranking and delivers dominant text extraction, supporting users in efficiently examining and making the most of available web data sources. Experimental results reveal that the mathematical approach is better than the statistical and signed approaches.

Keywords: web content mining, outliers, redundant web pages, relevant, precision

1. INTRODUCTION
Due to the presence of a large amount of web data, the Web has become a prevalent tool for most e-activities such as e-commerce, e-learning, e-government and e-science, and its use has pervaded everyday work. The Web is an enormous, widely scattered, global source of information services, hyperlink information, access and usage information, and website contents and organization. With the rapid development of the Web, it is imperative to provide users with tools for efficient and effective resource and knowledge discovery. Search engines have assumed a central role in the World Wide Web's infrastructure as its scale and impact have escalated [13]. This useful knowledge discovery is provided by web mining. The web mining process is given in Figure 1.

Figure 1. Web mining process


Web mining is categorized as:
Web Structure Mining: It is the technique of analyzing and explaining the links between different web pages and web sites [1]. It works on hyperlinks and mines the topology of their arrangement, trying to discover useful knowledge from the structure and the hyperlinks. The goal of web structure mining is to generate a structured summary about websites and web pages. It uses tree-like structures to analyze and describe HTML or XML.
Web Content Mining: It focuses on extracting knowledge from the contents of web pages or their descriptions. It involves techniques for summarizing, classifying and clustering web contents, and it can provide useful and interesting patterns about user needs and contribution behaviour [1]. It is related to text mining because much of the web content is text based; however, text mining focuses on unstructured texts, whereas web content mining deals with the semi-structured nature of the web. Technologies used in web content mining are NLP and IR.
Web Usage Mining: It focuses on digging out the usage of web contents from the logs maintained on web servers, cookie logs, application server logs etc. [1]. Web usage mining is the process by which we identify browsing patterns by analyzing the navigational behaviour of users. It focuses on techniques that can predict user behaviour while the user interacts with the web, and it uses secondary data on the web. This activity involves the automatic discovery of user access patterns from one or more web servers, and it consists of three phases: pre-processing, pattern discovery and pattern analysis. Web servers, proxies and client applications can quite easily capture data about web usage.

Figure 2. Structure of Web mining

1.1 Web Content Mining
Web content mining is the process of extracting useful information from the contents of web documents. The web documents may consist of text, images, audio, video or structured records like tables and lists. Mining can be applied to web documents as well as to the result pages produced by a search engine. There are two types of approach in content mining, the agent-based approach and the database-based approach [3][16]. The agent-based approach concentrates on searching for relevant information using the characteristics of a particular domain to interpret and organize the collected information. The database approach is used for retrieving semi-structured data from the web. Two groups of web content mining specified in [11] are those that directly mine the content of documents and those that improve on the content search of other tools like search engines. Web content mining is involved in:
Structured Data Extraction: Data extraction is the act or process of retrieving data from data sources for further data processing or data storage.
Unstructured Text Extraction: Typical unstructured data sources include web pages, email, documents, PDFs, scanned text, mainframe reports, spool files etc.
Web Information Integration and Schema Matching: Although the web contains a huge amount of data, each web site represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications.
Building Concept Hierarchies: Concept hierarchies are important in many generalized data mining applications, such as multiple-level association rule mining.
Segmentation and Noise Detection: In many web applications, one only wants the main content of the web page without advertisements, navigation links and copyright notices. Automatically segmenting a web page to extract its main content is an interesting problem (a small sketch follows this subsection).
Opinion Extraction: Mining opinions is of great importance for marketing intelligence and product benchmarking.
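As an illustration of the segmentation and noise-detection task listed above, the following sketch strips common template elements (scripts, styles, navigation bars, headers, footers) from an HTML page and keeps only its visible text. It is a minimal example, assuming the BeautifulSoup library is available; the tag names treated as noise are an illustrative choice, not part of any of the surveyed approaches.

```python
from bs4 import BeautifulSoup

# Tag names commonly used for template/noise regions (illustrative choice).
NOISE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]

def extract_main_text(html: str) -> str:
    """Return the visible text of a page with common template regions removed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(NOISE_TAGS):          # find every noise tag in the document
        tag.decompose()                   # remove it and its children from the tree
    return soup.get_text(separator=" ", strip=True)

if __name__ == "__main__":
    page = ("<html><body><nav>Home | About</nav><p>Main article text.</p>"
            "<footer>Copyright notice</footer></body></html>")
    print(extract_main_text(page))        # -> "Main article text."
```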

1.2 Outlier Detection
Pages on the Web carry additional template information (we call it noise) that does not add value to the actual content of the page. Even worse, it can harm the effectiveness of web mining techniques; these templates can be eliminated by preprocessing. Templates form one popular type of noise on the Internet [2]. Web content outlier mining focuses on detecting an irrelevant web page among the rest of the web pages under the same category. Web outlier mining algorithms are applicable to varying types of data such as text, hypertext, video, audio, images and HTML tags [18]. There are two groups of web content outlier mining strategies: those that directly mine the content outliers of documents to discover information about the outliers, and those that reject outliers to improve the search content of other tools like search engines. In many web applications, one only wants the main content of the web page without advertisements, navigation links and copyright notices. Outliers are observations that deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism, or data objects that are inconsistent with the rest of the data objects [9]. Web content outliers are web documents that show significantly different characteristics from other web documents taken from the same category. Outliers identified in web data are referred to as web outliers, and the mining of such outliers is called web content outlier mining. Outlier detection methods [5] broadly fall into the following categories:
Distribution-based methods come from the statistics community. They assume some known distribution model and detect as outliers the points that deviate from the model.
Depth-based algorithms organize objects in convex hull layers in data space according to peeling depth; outliers are expected to have shallow depth values.
Deviation-based techniques detect outliers by checking the characteristics of objects and identifying as an outlier any object that deviates from these features.
Distance-based algorithms rank all points using the distance of each point from its k-th nearest neighbor and order the points by this rank. The top n points in the ranked list are identified as outliers. Alternative approaches compute the outlier factor as the sum of distances from the k nearest neighbors (a small sketch follows this list).
Density-based methods rely on the local outlier factor (LOF) of each point, which depends on the local density of its neighborhood. Points with a high factor are indicated as outliers.
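To make the distance-based category concrete, the sketch below ranks points by their distance to the k-th nearest neighbour and reports the top n as outliers. It is a minimal illustration of the generic technique, not the specific algorithm of any surveyed paper; the sample data and parameter values are hypothetical.

```python
import math

def knn_distance(point, points, k):
    """Distance from `point` to its k-th nearest neighbour among `points`."""
    dists = sorted(math.dist(point, other) for other in points if other is not point)
    return dists[k - 1]

def distance_based_outliers(points, k=2, n=1):
    """Rank points by k-NN distance and return the top n as outliers."""
    ranked = sorted(points, key=lambda p: knn_distance(p, points, k), reverse=True)
    return ranked[:n]

if __name__ == "__main__":
    data = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (8.0, 8.5)]  # hypothetical points
    print(distance_based_outliers(data, k=2, n=1))   # the isolated point (8.0, 8.5)
```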

1.3 Redundant Web Pages
The performance and reliability of web search engines face huge problems due to the presence of an extraordinarily large amount of web data. The voluminous number of web documents has created problems for search engines, with the result that search results are of less relevance to the user. In addition, the presence of duplicate and near-duplicate web documents creates an additional overhead for search engines, critically affecting their performance [15]. The demand for integrating data from heterogeneous sources leads to the problem of near-duplicate web pages. Near-duplicate data bear high similarity to each other, yet they are not bitwise identical [17][12]. "Near duplicates" are documents (web pages) that differ only slightly in content. The difference between such documents can be due to elements that are included in but not inherent to the main content of the page. For example, advertisements on web pages, or timestamps of when a page was updated, are both information that is not important to a user searching for the page, and thus not informative for the search engine when crawling and indexing the page. Near-duplicate web pages arise from exact replicas of the original site, mirrored sites, versioned sites, multiple representations of the same physical object and plagiarized documents [5].
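As background for the near-duplicate problem, the sketch below compares two documents by the Jaccard similarity of their word shingles and flags them as near duplicates above a threshold. This is a generic illustration, not the signed or set-theoretic methods discussed in the related work; the shingle size and threshold are arbitrary assumptions.

```python
def shingles(text, size=3):
    """Set of contiguous word n-grams (shingles) of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def near_duplicates(doc1, doc2, threshold=0.8):
    """True if the two documents share most of their shingles."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold

if __name__ == "__main__":
    page_a = "breaking news the market rose sharply today amid strong earnings"
    page_b = "breaking news the market rose sharply today amid strong earnings (updated 10:42)"
    print(near_duplicates(page_a, page_b))   # True: only the timestamp differs
```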

2. RELATED WORK
An algorithm is proposed for mining web content [6] using a clustering technique and mathematical set formulae such as subset, union and intersection for detecting outliers. The outlying data is then removed from the original web content to obtain the content required by the user; removing the outliers also improves the quality of the results on the search page. Another paper [9] proposed two statistical approaches, based on proportions (Z-test hypothesis) and the chi-square test (T-test), for mining this outlying content, and presented a comparative study of the two methods. Eliminating this outlying content during a search further improves the quality of search engines. A further paper proposed a mathematical approach [4] based on a signed and rectangular representation to detect and remove redundancy between unstructured web documents as well. This method optimizes the indexing of web documents and improves the quality of search engines. In this approach web documents are extracted and preprocessed, and an n x m matrix is generated for each extracted document. Each page is mined individually to detect redundant content by computing the similarity of a word taken from all the 4-tuples of the n x m matrix; the redundancy between two documents is then found using the signed approach. Another paper proposes a new algorithm for mining web content by detecting redundant links [5] in web documents using set theory (classical mathematics), i.e. subset, union, intersection etc. The redundant links are then removed from the original web content to obtain the information required by the user. The obtained web document D is divided into 'n' web pages based on the links. All the pages are preprocessed, and each page is mined individually to detect redundant links using set-theoretic concepts. Initially, the contents of the first page are taken and compared with the contents of the second page, and this process is repeated up to the nth page. In general, if any redundant link is noted, the corresponding web page itself is removed from the web document. Finally, a modified web document is obtained which contains the required information catering to the user's needs.
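The set-theoretic idea in [5] can be pictured with ordinary set operations: if the links of one page are (almost) a subset of another page's links, that page contributes nothing new. The sketch below is a loose illustration under that assumption; the page data and the simple subset test used here are not taken from the cited paper.

```python
def redundant_pages(pages):
    """Return indices of pages whose link sets are contained in an earlier page's links.

    `pages` is a list of sets, each holding the hyperlinks found on one page.
    A page is flagged redundant if its links are a subset of (or equal to)
    the links of a page that appears before it.
    """
    redundant = []
    for j in range(1, len(pages)):
        for i in range(j):
            if pages[j] <= pages[i]:          # subset test: page j offers no new links
                redundant.append(j)
                break
    return redundant

if __name__ == "__main__":
    # Hypothetical link sets extracted from three pages of one web document.
    pages = [
        {"a.html", "b.html", "c.html"},
        {"b.html", "c.html"},                 # contained in page 0 -> redundant
        {"d.html", "e.html"},
    ]
    print(redundant_pages(pages))             # [1]
```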


3. ARCHITECTURAL DESIGN
The proposed system can be divided into five modules: 1) user input, 2) pre-processing, 3) term frequency, 4) comparison of the term frequencies of similar words between the two documents, and 5) relevance computation. In the first module the user gives the input query; based on that query, documents are retrieved from the search engine. The documents retrieved from the search engine may or may not be relevant to the user query [9]. The second module is pre-processing. The steps involved in pre-processing are stemming, stop-word removal and tokenization. Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. Stop words are common words that carry less important meaning than keywords; search engines usually remove stop words from a keyword phrase to return the most relevant results. Tokenization is the process of breaking a stream of text up into words, phrases, symbols or other meaningful elements called tokens; the list of tokens becomes the input for further processing. The third module is the term frequency calculation: the words present in the document are compared with the words present in the domain dictionary, and the words that match the dictionary are taken for the term frequency calculation. The fourth module compares the term frequencies of the matching words between the two documents Di and Dj. The fifth module is the relevance calculation, and the comparison is based on the precision calculation (a pre-processing sketch follows this section).
3.1 Precision
It is the ratio between the number of relevant documents returned originally and the total number of retrieved documents returned after eliminating irrelevant documents [8]. Here the relevant documents are the required documents that satisfy the user's needs.
Precision = Relevant / (Retrieved after refinement)
3.2 Recall
It is the ratio between the number of relevant documents returned originally and the total number of relevant documents returned after eliminating irrelevant documents [8].
Recall = Relevant / (Relevant after refinement)
3.3 Time Taken
The time taken by the entire process is the sum of the initial time taken by the general-purpose search engine and the time taken by the refinement algorithm to process the results [8].
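The pre-processing and term-frequency modules above can be pictured with a small self-contained sketch. It uses a toy stop-word list, a crude suffix-stripping stemmer and a hypothetical domain dictionary; none of these stand for the actual resources used in the surveyed systems.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}   # toy stop-word list

def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, remove stop words and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def term_frequencies(text, domain_dictionary):
    """Frequencies of the pre-processed terms that appear in the domain dictionary."""
    terms = preprocess(text)
    return Counter(t for t in terms if t in domain_dictionary)

if __name__ == "__main__":
    dictionary = {"search", "engine", "index", "crawl"}            # hypothetical domain dictionary
    doc = "The search engine is crawling and indexing pages of the searched site."
    print(term_frequencies(doc, dictionary))  # e.g. Counter({'search': 2, 'engine': 1, ...})
```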


4. STATISTICAL APPROACH
Each document is mined to retrieve relevant web documents through a test of hypotheses using proportions. When the value of Z is equal to or less than 1.645, the documents are considered relevant [10]. Finally, a mined web document is obtained which contains the required information catering to the user's needs. In this algorithm, web documents are extracted based on the user query. The extracted documents are pre-processed to make the remaining processing simpler. Following this, the term frequency of the words present in the document against the domain dictionary is computed for the ith and jth (i.e. (i+1)th) documents. Then, similar words from these documents, along with their term frequencies, are retrieved for computing the test statistic (Z) using proportions. Finally, the Z value is compared with the critical value at the 95% level of confidence, which is obtained from the table. If the calculated value is equal to or less than 1.645 then both are considered relevant documents; otherwise they are considered irrelevant. The above process is repeated for all the remaining documents to compute relevance.
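The paper does not spell out the exact form of the test statistic, so the sketch below assumes the standard two-proportion z-test over the frequency of a shared term in two documents; the pooled-proportion formula is a textbook choice, not necessarily the one used in [9] or [10].

```python
import math

def two_proportion_z(count_i, total_i, count_j, total_j):
    """Standard two-proportion z statistic (pooled), assumed form of the test.

    count_* : occurrences of a shared term in document i / j
    total_* : total number of dictionary terms in document i / j
    """
    p_i, p_j = count_i / total_i, count_j / total_j
    pooled = (count_i + count_j) / (total_i + total_j)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_i + 1 / total_j))
    return (p_i - p_j) / std_err if std_err else 0.0

def relevant_pair(count_i, total_i, count_j, total_j, critical=1.645):
    """Relevant when |Z| <= the 95% critical value.

    The paper states the rule as Z <= 1.645; the absolute value is used here
    for a two-sided reading, which is an interpretation on our part.
    """
    return abs(two_proportion_z(count_i, total_i, count_j, total_j)) <= critical

if __name__ == "__main__":
    # Hypothetical counts: the term appears 12 times out of 200 terms in Di, 10/180 in Dj.
    print(relevant_pair(12, 200, 10, 180))   # True: proportions do not differ significantly
```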

Step 9: Compare the computed Z value with Z at the 95% level of confidence, where Z is the critical value.
Step 10: If the Z value is less than the critical value then Di and Dj are relevant documents; else Di and Dj are irrelevant.
Step 11: Increment j, and repeat from step 5 to step 9 until j
Step 12: Increment i, and repeat from step 4 to step 10 until i
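Putting the steps above together, the loop below mirrors steps 9-12 for a list of retrieved documents, reusing the two-proportion test sketched earlier. The loop bounds are assumptions, since the termination conditions in steps 11 and 12 are truncated in the source.

```python
import math

def relevant_pair(ci, ti, cj, tj, critical=1.645):
    """Two-proportion z-test from the previous sketch, repeated for self-containment."""
    pooled = (ci + cj) / (ti + tj)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ti + 1 / tj))
    return abs(ci / ti - cj / tj) / se <= critical if se else True

def pairwise_relevance(doc_term_counts, critical=1.645):
    """Mark each document pair (Di, Dj) as relevant or not, mirroring steps 9-12.

    doc_term_counts: list of (shared-term count, total dictionary terms) per document,
    a simplified stand-in for the per-pair term-frequency comparison.
    """
    results = {}
    for i in range(len(doc_term_counts)):                # step 12: advance i
        for j in range(i + 1, len(doc_term_counts)):     # step 11: advance j
            (ci, ti), (cj, tj) = doc_term_counts[i], doc_term_counts[j]
            results[(i, j)] = relevant_pair(ci, ti, cj, tj, critical)  # steps 9-10
    return results

if __name__ == "__main__":
    # Hypothetical (shared-term count, total dictionary terms) per retrieved document.
    print(pairwise_relevance([(12, 200), (10, 180), (40, 150)]))
```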