Proceedings of National Conference on Challenges & Opportunities in Information Technology (COIT-2007) RIMT-IET, Mandi Gobindgarh. March 23, 2007.

Fusion Based Metasearch: An improved approach towards efficient Web searching

Harmunish Taneja1, Karan Madan2
1 Department of Computer Science & Applications, Kurukshetra University Kurukshetra Seth Jai Parkash Mukand Lal Inst. of Engg. & Tech.- Radaur, Haryana, [email protected]
2 Department of Computer Science & Applications, Kurukshetra University Kurukshetra Seth Jai Parkash Mukand Lal Inst. of Engg. & Tech.- Radaur, Haryana, [email protected]

Abstract

With the immense growth of information on the Web, it is difficult for users to find information relevant to their area of interest. This expansion, driven by the growing number of search engine databases, Web directories, digital libraries and other information repositories on the World Wide Web, has created a need for easy, effective and efficient access to relevant information from multiple information resources. To meet this need, search engines have been developed to help users retrieve information of interest from the Web in real time. However, the growth of the Web compels us to look beyond conventional search engines, towards metasearch engines. Certain limitations of conventional search engines, namely limited coverage of the Web and low retrieval effectiveness, have led to the development of fusion based search engines, commercially termed metasearch engines. Metasearch engines give clients automatic, unified and fused access to multiple independent search engines. This paper describes the issues involved in building a metasearch engine and other related issues.

KEYWORDS: Information Retrieval, Metasearch, Metasearch Engine, Search Engines, World Wide Web

1. INTRODUCTION

The Web can be viewed as a huge, almost complete but unstructured database distributed across the world. This nature of the Web naturally leads users to search it. In this paper, we are interested in the case where a user searches the Web for data that fulfils an information need. Many organizations have their own search engines. There is reason to believe that all these special-purpose and general search engines combined can provide better coverage of the Web than any single conventional search engine.
An alternative approach to providing search capability for the entire Web is to combine many search engines into what is known as a metasearch engine. A metasearch engine is a system that supports unified access to multiple search engines. It does not maintain its own index of web pages, but it often maintains information about each underlying local search engine in order to provide better service to the user. When a metasearch engine receives a user query, it first passes the query to the appropriate local search engines and then collects the results from them. In addition to increased search coverage of the Web, another advantage of using such a metasearch engine over a general-purpose search engine is that it is easier to keep index data up to date, as each local search engine covers only a small portion of the Web. Moreover, running a metasearch engine requires a much smaller investment in hardware than running a large general search engine, which uses thousands of computers.

Web searching is usually carried out in three ways. The first uses search engines that index a portion of the Web as text databases. The second uses Web directories, which categorize selected Web documents. Both are established and popular ways of searching the Web. The third, not yet fully recognized, uses the hyperlink structure. This paper begins by outlining the basic issues and the scope of search engines within Information Retrieval (IR) and then turns to a relatively new trend in Web searching, i.e. metasearch engines (MSEs). A search engine crawls the Web, downloading and indexing pages in order to allow full-text searching. Besides millions of specialty search engines, there are many large-scale general-purpose search engines; unfortunately, none of them comes close to indexing the entire Web [16]. MSEs are information retrieval systems based on a fusion approach, capable of automatically and simultaneously querying several autonomous search engines, and of interpreting and merging the result sets retrieved from them into a single unified result set. This unified result set is reorganized and re-ranked, after which it is displayed to the user in an appropriate format.
MSEs allow web users to overcome the limitations of individual search engines by giving them automatic and unified access to multiple independent search engines, harnessing the strengths of each. This data fusion approach acts as a global interface between the user and a set of search engines. It is an appropriate way to take advantage of the benefits of individual search engines by using them in combination.
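The dispatch, collect and merge flow described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `Engine` type, the toy engines `engine_a` and `engine_b`, and the round-robin merge rule are all hypothetical choices made here for illustration and are not taken from the paper, which does not prescribe a particular merging rule at this point.

```python
from typing import Callable, Dict, List

# Hypothetical model of a component search engine: a function that maps a
# query string to a ranked list of URLs (best first).
Engine = Callable[[str], List[str]]


def round_robin_merge(result_lists: List[List[str]]) -> List[str]:
    """Interleave ranked lists rank by rank, keeping the first copy of each URL."""
    merged: List[str] = []
    seen = set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


def metasearch(query: str, engines: Dict[str, Engine],
               selected: List[str], top_k: int = 10) -> List[str]:
    """Dispatch the query to the selected engines, merge, and truncate."""
    result_lists = [engines[name](query) for name in selected]
    return round_robin_merge(result_lists)[:top_k]


# Toy stand-ins for real engines; the result identifiers are made up.
def engine_a(query: str) -> List[str]:
    return ["u1", "u2", "u3"]


def engine_b(query: str) -> List[str]:
    return ["u2", "u4"]


ENGINES = {"A": engine_a, "B": engine_b}
print(metasearch("fusion", ENGINES, selected=["A", "B"]))
# prints ['u1', 'u2', 'u4', 'u3']
```

Round-robin interleaving ignores the engines' scores entirely, which is why score-based fusion rules are usually preferred when scores are available; it is used here only because it needs no assumptions about how each engine scores documents.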

2. INFORMATION RETRIEVAL ON THE WEB

Information Retrieval is a branch of Computer Science that helps the user to access and retrieve relevant information from the Web quickly and accurately. To meet the goals of efficient and effective retrieval, several information retrieval systems have been developed that process the documents contained in indexed databases and return the relevant documents in response to a user query. The retrieval of particular documents depends on the similarity between the documents and the query [4]. IR systems rank documents by their relevance to a given query and present the user with a ranked list of documents in response to the information request. These systems are based on various approaches to modelling an IR problem, such as set-theoretic, algebraic and probabilistic approaches. The Boolean model, the Vector Space model and the Probabilistic model are the classical IR models. Apart from the classical models [1], new models have been developed, such as the Inference network model [2], the Latent Semantic Indexing model [13], the Neural network model [11], the Genetic algorithm based model [4] and the Fuzzy retrieval model [11]. Incorporating retrieval utilities such as clustering, relevance feedback, semantic networks and regression analysis has considerably improved retrieval accuracy.

General-purpose search engines are capable of indexing larger portions of the Web. Google (www.google.com), Altavista (www.altavista.com), Excite (www.excite.com), Lycos (www.lycos.com), Hotbot (www.hotbot.com), MSN (www.msn.com), Yahoo (www.yahoo.com) and AOL (www.aol.com) are a few well-known ones. Special-purpose search engines, on the other hand, have a smaller indexed database confined to a specific domain, such as the database of an organization or a specific subject area.
The Cora search engine (www.cora.whizbang.com) focuses on computer science research papers, and Medical World Search (www.mwsearch.com) is a specialty search engine for medical information.

3. METASEARCH

Various search engines have been developed in past years, both in academia and in industry, to identify and retrieve documents relevant to a given user query. Though all these systems cater to user needs, none is superior to the others in all situations. Each search engine has strengths in terms of scale and varied sources of evidence, but has certain limitations in terms of coverage and retrieval effectiveness. One way to exploit the benefits of these search engines is to use them in a combined fashion, i.e. by combining the result sets for a given user query from multiple, different and independent search engines into a single result set, which is then presented to the user. This approach is termed data fusion or collection fusion, depending on whether the query is run on identical document collections or on disjoint collections of documents, respectively. The WWW is a cross between data fusion and collection fusion, in which each system's index has some overlap with the collections indexed by other systems. This cross is popularly known as the Metasearch approach in IR. The fusion involves querying selected search engines, merging their result sets and presenting the fused result to the user. The metasearch engine acts as a global interface between the user and a set of search engines. The user query is dispatched to the selected set of search engines, and the relevance judgments from them are collected and combined into a single, better judgment.

Just as search engines were developed in response to the rapid growth of information on the Web, MSEs are being developed in response to the increasing availability of conventional search engines. The following are some of the reasons that motivated the development of MSEs:

1. Larger coverage- The relative coverage of any individual search engine is decreasing as the Web grows rapidly. The growth rate of the Web is much higher than the indexing capability of any search engine [6]. MSEs combine the coverage of multiple component search engines so as to cover almost the entire Web.
2. Heterogeneity of data on Web- Different search engines have been designed to deal with different formats of data for IR tasks. No individual search engine is capable of coping with the diversity of data on the Web. Selecting the right search engines in Metasearch can alleviate the heterogeneity problem on the Web.
3. Scalability of search- An MSE is more scalable than any single search engine (SE) in searching the Web, as it covers a large pool of databases, each containing millions of documents.
4. Enhanced retrieval effectiveness- Despite steady improvement in retrieval accuracy, experiments have shown that average precision has not even reached 40% [17]. This situation motivates the application of Metasearch, which has consistently improved retrieval accuracy beyond 40%.

The work in [11] rationalizes the Metasearch approach with the help of the following effects, any of which can be leveraged by a fusion technique [14].

• Skimming Effect- A metasearch engine scans the top-ranked documents from the result sets of the different component search engines to push non-relevant documents down and relevant ones up in the final ranked list.

• Chorus Effect- When several search engines retrieve a particular document as an appropriate one, this tends to be stronger evidence of its relevance than a single search engine doing so.
• Dark Horse Effect- A search engine may produce unusually accurate or inaccurate estimates of relevance for at least some documents, relative to other search engines.

Metasearch meets various needs and addresses the problems of conventional search engines; it is valuable in IR for widely accepted reasons such as extended coverage, scalability, a convenient user interface, modular architecture and consistent improvement in retrieval quality. The architecture of an MSE is modular or hierarchical in nature and connects a selected pool of search engines to the user via a global interface. A graphic view of the Metasearch Engine architecture depicting its functionality is shown in Figure 1.

Figure 1: Architecture of Metasearch Engine

The Metasearch approach can be summarized in the following phases:

1. Selection of Search Engine- The query posed by the user to the MSE is dispatched to selected search engines. The MSE decides the appropriate set of search engines to be used for the decision-making process.
2. Merging Result Sets- This refers to combining, reorganizing and reordering the documents retrieved from the selected search engines to arrive at a single ranked list of documents related to the user query.
3. Presentation to User- The merged ranked list of documents is displayed to the user in an appropriate format.

4. RELATED WORK

Fusion based information retrieval has been an area of wide interest for researchers concerned with sophisticated web searching. A combination of two Boolean searches was used to improve retrieval performance [15]. A Bayesian model took the prior performance of component search engines into account by assigning a variable weight to each ranked list [9], but was not fully implemented. The TREC-1 work [14] used data and collection fusion data sets, and this work was extended in [15] using scores with ranks on TREC-2 data sets. Among the merging algorithms considered, namely Max, Min, CombSum, CombANZ and CombMNZ, CombSum and CombMNZ performed best. A numerical optimization technique was employed to determine optimal scalars for a linear combination of scores [18]. It was empirically observed that this technique gave good results for a relatively small collection. Metasearch via averaging the log-odds of relevance for the routing problem was discussed in [20]; it yielded an improvement over more complicated regression and weighting techniques. [12] demonstrates significant performance improvement using CombSum and CombMNZ and concludes that the overlap of the result sets is an important factor for fusion. It was observed that fusion is effective in the case of high relevance overlap and low non-relevance overlap. This work was later supported in [19] by combining systems that return different non-relevant documents. Different linear combinations of several results from TREC-5 data sets, using a linear regression model for fusion, were investigated in [11, 14, 15]. The following four characteristics were considered desirable for effective fusion:

1. At least one result has high precision or recall.
2. High overlap of relevance and low overlap of non-relevance.
3. Similar distribution of relevance scores.
4. Different ranking by each retrieval system.

A probabilistic model for Metasearch is introduced in [3]; it claims to achieve an 8.1-14.7% improvement over the best input system and 1.5% over the CombMNZ algorithm.
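The score-combination rules named above can be illustrated with a short Python sketch. It follows the commonly cited definitions (CombSum adds a document's scores across engines, CombMNZ multiplies that sum by the number of engines that returned the document, CombANZ divides by it). The min-max normalization step, the function names, and the choice to let only the engines that actually retrieved a document contribute to its score are assumptions made here for illustration, not details given in the paper.

```python
from typing import Callable, Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one engine's scores to [0, 1] so engines become comparable.

    Min-max scaling is an assumed choice; other normalizations exist.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


# Each rule maps the list of scores a document received to one fused score.
RULES: Dict[str, Callable[[List[float]], float]] = {
    "CombMIN": min,
    "CombMAX": max,
    "CombSUM": sum,
    "CombANZ": lambda xs: sum(xs) / len(xs),
    "CombMNZ": lambda xs: sum(xs) * len(xs),
}


def combine(runs: List[Dict[str, float]], method: str = "CombMNZ") -> Dict[str, float]:
    """Fuse per-engine {doc: score} maps with the chosen rule.

    Simplification: only engines that retrieved a document contribute,
    so len(xs) plays the role of the "number of non-zero scores".
    """
    per_doc: Dict[str, List[float]] = {}
    for run in runs:
        for doc, score in run.items():
            per_doc.setdefault(doc, []).append(score)
    rule = RULES[method]
    return {doc: rule(xs) for doc, xs in per_doc.items()}


def rank(fused: Dict[str, float]) -> List[str]:
    """Order documents by descending fused score, ties broken by id."""
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


run_a = {"d1": 0.9, "d2": 0.6}  # made-up scores from a first engine
run_b = {"d2": 0.8, "d3": 0.5}  # made-up scores from a second engine
print(rank(combine([run_a, run_b], "CombMNZ")))
# prints ['d2', 'd1', 'd3']
```

In this toy input, `d2` is returned by both engines and is promoted to the top, which is the behaviour that rewards agreement between engines.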
A novel fusion technique incorporating a clustering approach was introduced in [17] to achieve guaranteed improvement in retrieval effectiveness. [20] proposed a highly scalable and effective MSE. A survey discussing various issues in building an effective and efficient MSE is reported in [8]. This survey investigates the work on database selection approaches, document selection methods and merging algorithms. Proprietary databases, intranets and Web search engines use personalization and clustering techniques to perform intelligent Metasearch. Work in [7] proposed a probabilistic approach to data fusion with two variants, namely ProFuseAll and ProFuse Judged. This approach yielded a 10% to 50% improvement over the popular CombMNZ algorithm.

5. PERFORMANCE EVALUATION

The performance of a search engine is evaluated prior to its final implementation. In a system designed for information retrieval, metrics based on time, space and accuracy of the retrieved results are of interest. The performance of a search engine is characterized by its effectiveness and efficiency. Effectiveness is a measure of the accuracy of retrieval, whereas efficiency quantifies resource utilization. In practice, the performance of an IR system is mainly evaluated by its effectiveness rather than its efficiency. Effectiveness can be quantified by two measures, namely precision and recall, defined as follows:

$$\text{Precision} = \frac{\text{Number of Retrieved Relevant Documents}}{\text{Number of Retrieved Documents}}$$

$$\text{Recall} = \frac{\text{Number of Retrieved Relevant Documents}}{\text{Number of Relevant Documents}}$$

Precision measures the accuracy of retrieval, whereas recall is concerned with the coverage of the retrieval strategy. By plotting precision at various levels of recall, a precision-recall curve is obtained that yields a combined measure of performance, i.e. average precision [4,18], which can be computed as follows:

$$P(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}$$

where $P(r)$ is the average precision at recall level $r$, $N_q$ is the number of queries used, and $P_i(r)$ is the precision at recall level $r$ for the $i$-th query, obtained from its precision versus recall curve. An MSE is inherently a global information retrieval system based on the fusion of multiple local search engines, and therefore its performance evaluation differs little from that of an individual search engine.
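The measures defined in this section can be sketched in Python as follows. The function names are hypothetical, each ranking is assumed to contain no duplicate documents, and the rule used to read the precision of one query at a given recall level (the best precision at any rank whose recall is at least that level, a common interpolation convention) is an assumption, since the text does not specify how the value is taken from the curve.

```python
from typing import List, Set


def precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Retrieved relevant documents / retrieved documents."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Retrieved relevant documents / relevant documents."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(relevant)


def precision_at_recall(ranking: List[str], relevant: Set[str], r: float) -> float:
    """Precision of one query at recall level r (assumed interpolation:
    best precision at any cut-off whose recall is at least r)."""
    best, hits = 0.0, 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        if relevant and hits / len(relevant) >= r:
            best = max(best, hits / k)
    return best


def average_precision_at_recall(rankings: List[List[str]],
                                relevants: List[Set[str]], r: float) -> float:
    """Average over queries of the per-query precision at recall level r."""
    n_q = len(rankings)
    if n_q == 0:
        return 0.0
    total = sum(precision_at_recall(rk, rel, r)
                for rk, rel in zip(rankings, relevants))
    return total / n_q


# Two made-up queries with their ranked results and relevant sets.
rankings = [["d1", "d2", "d3", "d4"], ["d5", "d6"]]
relevants = [{"d1", "d3"}, {"d6"}]
print(precision(rankings[0], relevants[0]))                  # prints 0.5
print(recall(rankings[0], relevants[0]))                     # prints 1.0
print(average_precision_at_recall(rankings, relevants, 0.5)) # prints 0.75
```

The same functions apply unchanged to the merged list produced by an MSE, since the measures only need a ranked list and a set of relevant documents.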
Average precision is considered an equally important performance measure for an MSE and is computed in a similar fashion.

6. PRESENT SCENARIO IN METASEARCH

The present scenario and anticipated future of Metasearch are broad in scope and appealing, in view of its significant contribution to retrieval effectiveness and the relatively new and more advanced features it offers the user. Several trends have been perceived so far in Metasearch, each distinct and wide in scope, and each adds a new and important dimension to metasearch (MS) research. The following are some challenges for IR:

Multi-Media Retrieval- The major focus of IR research has been confined to textual databases, even though there is also a need to search for non-textual information objects. Image retrieval and the integration of text with multimedia objects is one of the ongoing research interests [2]. MetaSEEk is an MSE that deals with image retrieval. Indexing and ranking techniques used for textual data are not adequate to support multimedia retrieval.

Invisible or deep Web- It is estimated that the size of the Invisible Web, which is hidden from the search results of any search engine for many reasons, is two to three or more times that of the traditional, searchable Web [7]. The problem of searching dynamic pages may also contribute to the failure of conventional search engines. MSEs are affected by these issues as well, since they rely on the search engines. There is still considerable scope for further work on the invisible Web.

Scalable metasearch engines- Though MSEs give more scalable search than any large general-purpose search engine, metasearch solutions should nonetheless be devised to scale in two orthogonal dimensions: data and access [5]. A highly scalable MSE must scale to thousands of different databases, many of them containing millions of documents. None of the existing and proposed MS solutions has been evaluated under such conditions.

Co-operative and collaborative search environment- Efficient and effective MS requires more knowledge about the component search engines, such as more detailed database representatives, the underlying similarity functions or ranking algorithms, term weighting schemes and indexing methods. This leads to the idea of co-operating search engines collaborating for MS. Unfortunately, commercial rivalry and reasons such as copyrights and patent rights have imposed constraints on this, and no such co-operation is visible in practice. This situation calls for a standard protocol to create such an environment and boost effective metasearch. A standard protocol like STARTS [9] may help establish co-operation among component search engines and better co-ordination between the MS interface and its components.

Selection problems- Success in retrieving more accurate results via MS is mainly governed by the selection of search engines and then of documents within them. Does the selection method perform consistently well from query to query? Is the method scalable and able to cope with the heterogeneity and volatility of data on the Web? Can the selection method intelligently separate potentially useful databases from useless ones? Such realistic concerns make the selection issue critical and therefore require further investigation. Some MSEs do not rely on the conventional approach of using the document ranking of their component search engines, as it varies from search engine to search engine and causes high communication cost and more effort. Various approaches to document selection, such as user determination, weighted allocation, learning based approaches and guaranteed retrieval, have been proposed [11] and still need to be further refined and evaluated.

Results merging problem- The merging algorithms that have been experimentally validated were proposed in the context of data fusion. MS is a typical blend of data and collection fusion. In fact, collection fusion is the broader base and the truer picture of the MS environment, where the databases of the component search engines are rarely identical. Therefore, new merging algorithms that keep this characteristic of MS in view should be designed and investigated.

Utility-based metasearch- Experiments have strongly endorsed the incorporation of utilities such as clustering, relevance feedback, thesauri, semantic networks and regression analysis in order to improve retrieval effectiveness. Though researchers have begun exploiting the potential of these utilities for better metasearch, a lot more work can still be carried out in future. Vivisimo and Clusty are utility based MSEs employing clustering techniques.

User preferences and learning- Whether the research is academic or industrial, the focus is always on the user's ease of search and preferences. The user interface needs to be intelligent, interactive and adaptable to boost the use of search tools. MSEs share these needs and address them by offering more advanced features that enable even a naive user to make a precise query without spending additional time. Vivisimo and Kartoo are MSEs that provide considerably good user interfaces with visualization and learning features.

Distributed architecture- An MSE is inherently a global distributed IR system. New distributed schemes to traverse and search the entire Web must be devised to cope with the rapid growth of the Web. An explicit impact on existing crawling techniques for the Web is anticipated. Server capacity or network bandwidth will become a bottleneck in the near future [18].

The true research and development question for future MS research is how to achieve synergy among all the factors influencing Metasearch efficacy, both those already perceived and those not yet widely perceived but appealing and critical.

7. MSE: THE COMMERCIAL STATUS

Despite the successful ongoing commercial exploitation of the search engine paradigm, a lot more is expected to happen in step with the swift growth of the Web. Competition powered by academic research and development, together with the heightened aspirations of the global giants of the Web search industry, has improved the quality and quantity of current search engines. Nonetheless, relatively novel techniques such as MSEs are being promoted and extensively adopted as a major thrust in the web search domain. A few metasearch engines, along with their URLs, are listed below: Dogpile (http://www.dogpile.com), Vivisimo (http://vivisimo.com), Kartoo (http://kartoo.com), Search.com (http://www.search.com), Mamma (http://www.mamma.com), Clusty (http://clusty.com), ProFusion (http://profusion.com), Turbo10 (http://turbo10.com), GoFish (http://gofish.com), Ixquick (http://ixquick.com), surfWax (http://surfwax.com).

Dogpile is a popular MSE, launched in 1996 and owned by InfoSpace. It sends a user query to a customizable list of search engines, directories and specialty search sites and then displays the results from each individual search engine separately. A computer science researcher at Carnegie Mellon University founded Vivisimo in June 2000. It includes the six most popular search engines and other information sources such as CNN™, BBC and eBay™. It uses Vivisimo Clustering Engine™ technology. Kartoo, which was launched in April 2002, provides a visualization of web results, showing the results along with sites interconnected by keywords. Results are shown graphically using Flash, with an HTML interface for non-Flash users.
Kartoo includes big search engines like All The Web, AltaVista™, Yahoo!® and MSN®. Mamma, which is one of the oldest MSEs, founded in 1996, searches against a variety of major crawlers, directories and specialty search sites. The service also provides paid listing options for advertisers and Mamma classifieds. Search.com is operated by CNET. It offers both web-wide search and a wide variety of specialty search options. It was formerly known as SavvySearch. iXquick combines 12 major search engines and offers 'power search techniques'. ProFusion brings back listings from several major search engines, as well as 'Invisible Web' resources. Formerly based at the University of Kansas, the site was purchased by the search company Intelliseek in April 2000. Turbo10 also accesses both traditional and invisible web databases with a speedy interface.

An MSE is commercially only as good as its component search engines. In fact, there are many MSEs that include only or mostly pay-per-click search engines, which rank sites and their contents based on payment and hence give biased results that are not useful for genuine web users. MSEs such as ProFusion and Turbo10 partly cover the Invisible Web. Many MSEs, for instance Search.com, let the user select search engines from the available pool. In a nutshell, metasearch is gaining popularity in web searching, but it still needs to be more widely practiced across the world; unfortunately, outside the developed countries of the US and Europe, MSEs are hardly known among common users of the Web. Wide use across the globe requires well-planned marketing and advertising.

CONCLUSIONS

Conventional search engines fail to provide ideal search in many ways. They cover a relatively small proportion of the Web and can fail to provide sophisticated search when the user has a specialized category or topic in mind. Metasearch engines address these problems in various ways. One way to take advantage of the benefits of these search engines is to use them in a combined fashion. This can be done by combining the result sets for the user query from multiple different search engines into a single result set, which is then offered to the user.
This data fusion approach acts as a global interface between the user and a set of search engines. Various aspects of Metasearch support the claim of its high profile and practical utility. Different search engines have been designed to handle different formats of data. No individual search engine is competent enough to cope with the diversity of data on the Web. Selecting the appropriate search engines in metasearch can alleviate the heterogeneity problem on the Web. Ultimately, the future of metasearch engines will be driven by technical and economic necessity.

References

[1]. Allan J. “Report on a workshop at CIIR”, University of Massachusetts, Amherst, September, 2002.
[2]. Alshuth P., Hermes T., Herzog O. and Voigt L. “On Video Retrieval: Content Analysis by Imageminer” in the proceeding of IS & T SPIE Symposium on Electronic Imaging 98, multimedia processing and application, storage and retrieval for image and video databases, vol. 33 (12), pp. 236-247, 1998.
[3]. Aslam J.A. and Montague M., “Bayes optimal metasearch: a probabilistic model for combining the results of multiple retrieval systems” in the proceeding of SIGIR’00, ACM Press, New York USA, 2000.
[4]. Baeza-Yates R. and Ribeiro-Neto B., “Modern Information Retrieval”, ACM Press, NY, 1999.
[5]. Bartell B.T., Cottrell G.W. and Belew R.K., “Automatic combination of multiple ranked retrieval systems” in the proceeding of SIGIR’94, New York USA, 1994.
[6]. Bartell B.T. “Optimizing Ranking Functions: A Connectionist approach to Adaptive Information Retrieval” in the Thesis of Research submitted to the Department of CSE, Uni. of California, San Diego, 1994.
[7]. Bergman M., “The Deep Web: Surfacing the Hidden Value” www.completeplanet.com/tutorials/deepWeb/index.asp, Bright Planet, 2000.
[8]. Bernes T., Lee, Cailliau R., Lcuotonen A., Nueksen J. F. and Secret A., Diamond T., “Information retrieval using dynamic evidence combination” in the Ph.D. dissertation proposal, 1996.
[9]. Gravano L., Chang C.-C. K. and Garcia-Molina H. “STARTS: Stanford proposal for Internet Meta-searching” in the Proceeding of ACM SIGMOD Int. Conference on Management of Data, pp. 207-218, May, 1997.
[10]. Belew R. “Adaptive Information Retrieval” in the proceeding of the 12th annual int. ACM SIGIR Conference on research and development in IR, pp. 11-20, 1989.
[11]. Chen H. “Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning and Genetic Algorithms” in the Journal of the American Society for Information Science, Vol. 46(3), pp. 194-216, 1995.
[12]. Dreilinger D. and Howe A.E. “Experiences with Selecting Search Engines using Metasearch” in ACM Transactions on Information Systems, Vol. 15(3), pp. 195-222, July 1997.
[13]. Dumais S.T. “Latent semantic indexing (LSI): TREC-3 Report” in the proceeding of Third TREC, pp. 219-230, 1994.
[14]. Fox E.A., Shaw J.A., Koushik M.P., Modlin R. and Rao D., “Combining evidences from multiple searches” in the report of TREC-1, pp. 319-328, 1993.
[15]. Fox E.A. and Shaw J.A., “Combination of multiple searches” in the report of TREC-2, pp. 243-249, 1994.
[16]. Grossman D.A. and Frieder O. “Information Retrieval: Algorithms and Heuristics”, Kluwer Academic Publishers, Norwell, Massachusetts, 1998.
[17]. Han E., Karypis G., Mewhort D. and Hatchard K. “Intelligent Metasearch engine for Knowledge Management” in the proceeding of CIKM’03, ACM, Nov. 3-8, 2003.
[18]. Hull D.A., Pedersen J.O. and Schutze H., “Method combination for document filtering” in the proceeding of SIGIR’96, ACM Press, August 1996.
[19]. Lawrence S. and Giles C.L. “Accessibility of Information on the Web” in Nature 400, pp. 107-109, July, 1999.
[20]. Wu Z., Raghavan V., Qian H., Vuyyuru R.K., Meng W., He H. and Yu C. “Towards Automatic Incorporation of Search engines into a Large-scale Metasearch engine” in the proceeding of IEEE/WIC International Conference on Web Intelligence (WI’03), 2003.
