Inf Retrieval (2009) 12:352–379 DOI 10.1007/s10791-008-9084-6

Non-english web search: an evaluation of indexing and searching the Greek web Efthimis N. Efthimiadis · Nicos Malevris · Apostolos Kousaridas · Alexandra Lepeniotou · Nikos Loutas

Received: 28 May 2008 / Accepted: 8 December 2008 / Published online: 16 January 2009 © Springer Science+Business Media, LLC 2009

Abstract  The study reports on a longitudinal and comparative evaluation of Greek language searching on the web. Ten engines, five global (A9, AltaVista, Google, MSN Search, and Yahoo!) and five Greek (Anazitisi, Ano-Kato, Phantis, Trinity, and Visto), were evaluated (a) using navigational queries in 2004 and 2006, and (b) by measuring the freshness of the search engine indices in 2005 and 2006. Homepage-finding queries for known Greek organizations were created and searched. Queries included the name of the organization in its Greek form and in its non-Greek (English or transliterated) equivalent form. The organizations represented ten categories: government departments, universities, colleges, travel agencies, museums, media (TV, radio, newspapers), transportation, and banks. The freshness of the indices was evaluated by examining the status of the returned URLs (live versus dead) from the navigational queries, and by identifying whether the engines had indexed 32480 active (live) Greek domain URLs. Effectiveness measures included (a) qualitative assessment of how engines handle the Greek language; (b) precision at 10 documents (P@10); (c) mean reciprocal rank (MRR); (d) Navigational Query Discounted Cumulative Gain (NQ-DCG), a new heuristic evaluation measure; (e) response time; (f) the ratio of dead URL links returned; and (g) the presence or absence of URLs and the decay observed over the period of the study. The results report which of the global and Greek search engines perform best, and whether the performance achieved is good enough from a user's perspective.

Keywords  Non-English web search · Greek web · Greek queries · Search engines · Evaluation · Navigational queries · Homepage finding

E. N. Efthimiadis (corresponding author), Information School, University of Washington, Seattle, WA, USA. e-mail: [email protected]
N. Malevris · A. Kousaridas · A. Lepeniotou · N. Loutas, Department of Informatics, Athens University of Economics and Business, Athens, Greece. e-mail: [email protected]
Present address: A. Kousaridas, Department of Informatics and Telecommunications, University of Athens, Athens, Greece. e-mail: [email protected]; [email protected]
Present address: A. Lepeniotou, Technological Educational Institution, TEI Larisa, Greece. e-mail: [email protected]; [email protected]
Present address: N. Loutas, Information Systems Lab, University of Macedonia, Thessaloniki, Greece. e-mail: [email protected]; [email protected]

1 Introduction

The web continues to expand, and the dominant search engines, Google and Yahoo!, claim to have indexed more than 20 billion pages (Mayer 2005). Recent statistics on Internet usage by language show that 31.2% is English and 68.8% is non-English (Internet World Statistics 2007b). As non-English web usage increases, search engines must handle a growing number of non-English queries. The goals of this research are to: (a) evaluate how well search engines respond to Greek language queries; (b) assess whether the Greek or the global search engines are more effective in satisfying user requests; and (c) evaluate the extent of coverage of the Greek web by the ten search engines. Preliminary results of the present study, as they pertain to (a) and part of (b) above, appeared in Efthimiadis et al. (2008). To achieve these goals the study was conducted as follows:

1. a set of queries was searched in 10 search engines (5 Greek, 5 global) and the results were evaluated to see if the correct answer was returned;
2. all the URLs found in the result sets were retrieved to identify the percentage that were live (active) or dead (non-active) links;
3. a sample of 32480 active URLs from the Greek web was used to evaluate whether the search engines had indexed them.

The organization of the paper is as follows: Sect. 2 reviews related work, Sect. 3 gives a brief overview of the Greek language, Sect. 4 presents the methodology, Sect. 5 discusses the results, and the conclusions are given in Sect. 6.

2 Related work

Bar-Ilan and Gutman (2005) explored how search engines respond to queries in four non-English languages: Russian, French, Hungarian, and Hebrew. For each of the languages they searched three global search engines, AltaVista, FAST and Google, and two or three local engines. The local engines were the Russian Yandex, Rambler, and Aport; the French Voila, AOL France, and La Toile de Quebec; the Hungarian Origo-vizsla, Startlap, and Heureka; and the Hebrew Morfix and Walla. For each of the four languages the authors developed queries that emphasized specific linguistic characteristics of that language. The first ten results of each search were evaluated not for relevance, but for whether the exact



word form or a morphological variant of the query was retrieved. They found that the search engines ignored the special language characteristics and did not handle diacritics well. Moukdad (2004) studied how three global search engines, AltaVista, AllTheWeb, and Google, handle Arabic queries compared to three Arabic engines, Al bahhar, Ayna, and Morfix. He employed the same methodology used by Bar-Ilan and Gutman (2005). A set of eight Arabic search terms was selected and run in the six search engines. He found that the global search engines had shortcomings in handling Arabic. Sroka (2000) evaluated Polish versions of English language search engines and Polish search engines. The evaluation focused on search capability and retrieval performance. Precision was based on relevance judgments for the first 10 matches from each search engine. The overlap of retrieved documents and the response time for each search engine were recorded. Of the five search engines that were evaluated, Polski Infoseek and Onet.pl had the best precision scores, and Polski Infoseek turned out to be the fastest Web search engine. Kelly-Holmes (2006) conducted a study searching with Irish Gaelic words on the Irish language version of Google. Five words from ‘typical’ and ‘non-typical’ domains for Irish were used, and the results were analyzed in terms of the “authenticity” of the search process and results, the language usage in the sites found through the search process, and the domains represented by the results. The study identified a number of problems encountered when searching using the Irish Gaelic language. Bitirim et al. (2002) investigated the performance of four Turkish search engines with respect to precision, normalized recall, coverage and novelty ratios. They used seventeen queries and searched Arabul, Arama, Netbul and Superonline. 
The queries were carefully selected to assess the capability of a search engine in handling broad or narrow topics, excluding particular information, identifying and indexing Turkish characters, retrieving authoritative pages, stemming Turkish words, and correctly interpreting Boolean operators. Arama appears to be the best Turkish search engine in terms of average precision, normalized recall ratios, and coverage of Turkish sites. The handling of Turkish characters and stemming still causes problems for the Turkish search engines. Superonline and Netbul make use of the indexing information in meta-tag fields to improve retrieval results. Griesbaum (2004) investigated the retrieval effectiveness of three popular German Web search services: AltaVista.de, Google.de and Lycos.de. Fifty queries were used, both in German and in their English translation. The top twenty results were evaluated for precision. The findings indicated that Google performed significantly better than AltaVista, but there was no significant difference between Google and Lycos. Lycos also achieved better values than AltaVista, but the differences were not statistically significant. When comparing the 2004 results to a similar study by the author in 2002, the results were similar, but the gaps between the engines had narrowed. The overall conclusion of the study was that the retrieval performance of the engines was very close. Lazarinis (2007) evaluated the performance of eleven search engines, seven global (AlltheWeb, AltaVista, AOL, ASK, Google, MSN, Yahoo) and four Greek (Anazitisis, In.gr, Pathfinder, Robby), using six Greek language queries. He employed thirty-one users, divided into six groups, each of which searched one query. Each group member retrieved twenty results. The results retrieved by all group members were evaluated for relevance collectively by the members of each group.
Lazarinis reports that the precision of all engines is very similar. Based on the six queries the study further investigated how engines handle upper and lower case input, diacritics, stemming, and stop words. The study noted that there were variations in the handling of Greek.



Moukdad and Cui (2005) investigated how Chinese language queries are handled by Google and AlltheWeb, as well as by Sohu and Baidu, the Chinese search engines. They created ten queries by selecting terms from a Chinese-English dictionary. The terms emphasized certain linguistic characteristics of Chinese. The queries were searched in the Simplified Chinese script, which is in use in mainland China. The results were evaluated based on the number of retrieved documents, word segmentation, and correct display of Chinese characters. Moukdad and Cui found that the global search engines did not use any linguistic processing and thus could not process the Chinese queries satisfactorily, which introduced unexpected results.

3 The Greek language

Greek is a rich, highly inflectional language that dates to the 9th century BC. The Greek language uses a different script from that of Latin-based languages. The Greek alphabet has twenty-four upper case letters, twenty-five lower case letters, and a number of diacritics or accent marks depending on the form used (see Fig. 1). The most commonly known forms of the Greek language are ancient or classical Greek, Katharevousa, and Demotic Greek (Dhimotiki) (Babiniotis 1998). Depending on the system of accents used, Greek is either polytonic or monotonic. The polytonic orthography system uses three accents, two breathings, iota subscripts, and the diaeresis. The polytonic system had been used since ancient times and was simplified into the monotonic system in 1982. The monotonic system uses one accent and the diaeresis, which signifies that two adjacent vowels are pronounced separately and not as a diphthong. Transliteration of Greek into Latin letters is common but adds to the complexity of processing Greek because of the different transliteration standards. Furthermore, individuals often ignore the standards and apply their own phonetic interpretation. The widespread

Fig. 1 The Greek language alphabet



use of computers and the Internet, coupled with the slow progress in adopting non-Latin-based scripts, has given rise to Greeklish, a form of transliteration used to exchange email messages and post to discussion forums (Karakos 2003; Tzekou et al. 2007). Alevizos et al. (1988) discuss the challenges faced by search systems in handling Greek. Kalamboukis (1995) introduces the inflectional aspects of Greek and presents a stemming approach.

4 Methodology

The methodology used in carrying out this research is presented in this section. A user need scenario is first introduced. The search engines and the search process are then presented, followed by the subject categories selected for the navigational queries. A discussion of the evaluation criteria used for each part of the study concludes the section.

4.1 User needs

The use of the Internet by Greeks saw a threefold increase between 2000 and 2006, jumping from 9.1% to 33.5% (Internet World Statistics 2007a). Similarly, the Greek web has proliferated, with an increasing presence of governmental and commercial entities. In 2004 and again in 2006, most Greek web pages (63.5% and 63.4%, respectively) were in the Greek language (Efthimiadis and Castillo 2004). Most Greeks learn a second language to some degree of proficiency; however, it is reasonable to assume that Greeks would search in Greek to find information on the Greek web. Following the Broder (2002) classification of web queries, we selected the "navigational" class as the basis of the user task definition: we assume that a user searches to find the specific site of an organization. In that respect our methodology relates to that of Hawking et al. (2001).

4.2 Search engines and the search process

Ten search engines were used in this study. These were divided into two groups: five global or international in scope, and five Greek search engines.
The global search engines are: A9, AltaVista, Google, MSN Search (not Live Search, which was introduced after the study concluded), and Yahoo!. The Greek engines are: Anazitisis, Ano-Kato, Phantis, Trinity, and Visto. The Appendix lists the engines and the corresponding URLs used to send the search requests. A Java program was developed to submit queries to each search engine automatically. The returned results were downloaded and stored in a MySQL database for further processing and analysis. The process is depicted in Fig. 2 and discussed throughout the methodology section.

4.3 Subject categories and queries

Ten broad subject categories were identified using professional and business directories. The categories are: government departments, universities, colleges, travel agencies, museums, media (TV, radio, newspapers), transportation, and banks. Two hundred and seventeen (217) organizations that had a web presence were selected for searching. For each organization we established the formal name in Greek, its non-Greek equivalent if



Fig. 2 The search process

available (usually in English or another Latin-based language), and the URL(s) of the web site. The URLs available for these organizations were used to download the corresponding webpages and verify that they were active. In addition, the robots.txt file was checked for every URL to establish whether there were any indexing restrictions on the page. At that time none of the organizations placed any crawling or indexing restrictions on search engines; consequently, all search engines should have had access to their pages. Queries were generated from the Greek and non-Greek (English or transliterated) versions of the names of the selected businesses or organizations. Table 1 lists the subject categories and the number of Greek organizations in each category.

Table 1 Queries by subject category and language searched

Subject category (in English) | (in Greek) | Organizations in Greek | Organizations in English
Government departments | Υπουργεία | 18 | 14
Universities | Πανεπιστήμια | 21 | 20
Colleges | ΤΕΙ | 14 | 8
Travel agencies | Ταξιδιωτικά Γραφεία | 39 | 4
Museums | Μουσεία | 19 | 0
Transportation & communication services | Μέσα Μεταφοράς, Επικοινωνίες | 12 | 7
Banks | Τράπεζες | 28 | 13
Newspapers | Εφημερίδες | 17 | 16
Television stations | Τηλεόραση | 12 | 3
Radio stations | Ραδιόφωνο | 37 | 7
Total/Σύνολο | | 217 | 92



There were a total of 217 organizations, of which 92 had a corresponding English or other non-Greek equivalent name, resulting in 309 queries. Searches were submitted automatically to the engines in October 2004 and in August 2006. The exact same queries were used on both occasions. Examples of the queries are given in Table 2, which lists queries and their corresponding subject categories. Both the Greek form and the English or transliterated form of the name are given, together with the target URL. Where appropriate, there is an indication of whether the non-Greek version of an organization's name is a direct translation into English, a transliteration, a combined form of translation and transliteration, or whether initials were used, new words were added, or part of the name was dropped. To simulate the input of a non-expert searcher, the queries were submitted in the typical lay-searcher format: the keywords were typed out separated by spaces. Advanced search operators and techniques were not used. Since these were "known item" searches, the ideal retrieval would be for the target URL of the organization to be ranked first in the result set.

4.4 Evaluation criteria

This section presents the criteria on which the evaluation was based. As this study aimed at evaluating both the effectiveness and the coverage of search engines in searching the Greek web, the evaluation criteria are organized accordingly. The criteria used for evaluating the effectiveness of the search engines are: (a) qualitative assessment of how the engines handle the Greek language; (b) precision at 10 documents (P@10); (c) mean reciprocal rank (MRR); (d) Navigational Query Discounted Cumulative Gain (NQ-DCG), a new heuristic evaluation measure developed for the study; (e) response time; and (f) the ratio of dead URL links returned.
The coverage and freshness of the search engine indices are measured by the presence or absence of a large sample of Greek domain URLs and by the decay observed over the period of the study.

4.4.1 Evaluating search engine effectiveness in searching the Greek web

4.4.1.1 Greek language processing. Greek uses a different script from Latin, is highly inflectional, and has variable forms and orthography. To evaluate how engines handle Greek, a set of queries was used that included keywords with and without accents. The results were qualitatively assessed to establish whether the engines take these into account.

4.4.1.2 Precision at 10 documents (P@10). For each of the 309 queries searched, the top ten results were retrieved and evaluated. The methodology used for the evaluation includes the rank distribution of the successful results, failure rates, and precision at 10 documents (P@10). Precision at k documents is a well-established evaluation measure; however, it treats all top k answers equally.

4.4.1.3 Mean reciprocal rank (MRR). Each search engine is scored using the mean reciprocal rank (MRR) of the target URL (Hawking and Craswell 2002; Voorhees 1999). The reciprocal rank is the inverse of the rank assigned to the correct target URL, averaged across all queries. Zero is assigned if no correct target URL is found in the top 10 results.

4.4.1.4 Navigational query discounted cumulative gain (NQ-DCG). The Navigational Query Discounted Cumulative Gain (NQ-DCG) is a new heuristic evaluation measure developed for the study. For every search the top ten results were downloaded and their rank order was recorded. These were then evaluated as to whether the target URL or its


Table 2 Examples of queries used in the evaluation

Category | Query in Greek | Query equivalent in English or transliterated form | Type | Target URL
Government departments | Υπουργείο Εθνικής Άμυνας | Ministry of National Defense | Translation | www.mod.gr
Government departments | Υπουργείο Υγείας και Κοινωνικής Αλληλεγγύης | Ministry of Health and Welfare | Translation | www.mohaw.gr
Banks | Εθνικη Τραπεζα της Ελλαδος | National Bank of Greece | Translation | www.nbg.gr
Banks | Γενικη Τραπεζα της Ελλαδος | Geniki Bank | Transliteration (dropped part of name) | www.geniki.gr, www.genikibank.gr (multiple domains)
Universities | Εθνικό Μετσόβιο Πολυτεχνείο | National Technical University of Athens | Translation (dropped part of name) | www.emp.gr, www.ntua.gr (multiple domains)
Universities | Ανωτάτη Σχολή Καλών Τεχνών | Athens School of Fine Arts | Translation | www.asfa.gr
Universities | Πανεπιστήμιο Μακεδονίας Οικονομικών & Κοινωνικών Επιστημών | University of Macedonia | Translation (dropped part of name) | www.uom.gr
Colleges | TEI Θεσσαλονίκης | Technological Education Institute of Thessaloniki | Translation | www.teithe.gr
Travel agencies | Αργοναύτης | Argonaut Travel | Translation | www.argonautravel.gr
Newspapers | Έθνος | Ethnos | Transliteration | www.ethnos.gr
Newspapers | Kέρδος | Kerdos | Transliteration | www.kerdos.gr
TV | Ελληνική Τηλεόραση 3 | ΕΤ3 | Initials only | www.ert3.gr
Radio | Εκκλησία της Ελλάδος | Ecclesia | Transliteration (dropped part of name) | www.ecclesia.gr
Transportation | Εθνική Επιτροπή Τηλεπικοινωνιών και Ταχυδρομείων (ΕΕΤΤ) | National Telecommunications And Post Commission (EETT) | Translation | www.eett.gr
Transportation | Ηλεκτρικός Σιδηρόδρομος Αθηνών-Πειραιώς (ΗΣΑΠ) | I.S.A.P | Initials only | www.isap.gr



variants were found in the results set. For exact or partial matches the rank position was recorded. The measure includes two components: the rank position, and the depth of the page as indicated in the URL. The latter gives some credit for partial matches, assuming that the searcher will be able to identify that the returned result is related to the desired result. In this way the search engine is penalized for the additional navigational effort that will be required of the user. This heuristic evaluation measure relates to the discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen 2000, 2002). A more formal description of the measure is given below.

If m is the number of search engines examined, then each search engine j (where j = 1, 2, …, m) is allocated a score based on the first k results for each query. If the position of the returned result is i (where i = 1, 2, …, k) and V_ji is the value of the returned result at position i for engine j, then

    V_ji = k − i + 1

The contributed score W_ji of the result at position i to search engine j is calculated as

    W_ji = (k − n) · V_ji   if n < k
    W_ji = 0                if n ≥ k                                (1)

where j = 1, 2, …, m and i = 1, 2, …, k, and n is the number of subdomains in the returned URL. Hence, if no subdomains exist in the returned URL then n = 0. Finally, the total score NQ-DCG_j (where j = 1, 2, …, m) for each search engine is calculated as:

    NQ-DCG_j = Σ_{i=1}^{k} W_ji                                     (2)

For the purposes of this study only the first 10 returned results are considered (k = 10), and the number of engines evaluated is 10 (m = 10). In the examples below the total score assigned to a search engine is calculated using the NQ-DCG heuristic evaluation measure.

Example: Let http://www.ypepth.gr be the target URL for the Ministry of Education (Υπουργείο Εθνικής Παιδείας και Θρησκευμάτων).

(a) If the result returned by a search engine, say engine 7, is found in the third place (i = 3) and contains only the main page (http://www.ypepth.gr), then n = 0 and, following the notation above, V_73 = k − i + 1 = 10 − 3 + 1 = 8, and the contributed score is W_73 = (k − n) · V_73 = (10 − 0) · 8 = 80.

(b) If the result returned by the same search engine is found in the second place (i = 2) and contains one subdomain (http://www.ypepth.gr/el_ec_category1806.htm), hence n = 1, then V_72 = k − i + 1 = 10 − 2 + 1 = 9, and the contributed score is W_72 = (k − n) · V_72 = (10 − 1) · 9 = 81.

(c) For the URL returned in the eighth position (i = 8) with n = 2, as it contains two subdomains (http://www.ypepth.gr/docs/aitisi_ipotrofion_klirodotimatvn.doc), the contributed score is V_78 = k − i + 1 = 10 − 8 + 1 = 3, yielding a weight of W_78 = (k − n) · V_78 = (10 − 2) · 3 = 24.



Without any loss of generality, assume that the remaining seven results contributed no weight at all (this could happen, for example, if all contained 10 subdomains).

(d) Hence the total weight for search engine number 7, from Eq. 2, is calculated as

    NQ-DCG_7 = Σ_{i=1}^{k} W_7i = 81 + 80 + 24 = 185
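The scoring above can be sketched in code. This is an illustrative sketch, not the study's original Java program; the function names and the helper `url_depth` are our own, and the subdomain count n is taken to be the number of path segments in the URL, consistent with the worked examples. An MRR helper, as defined in Sect. 4.4.1.3, is included for comparison.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Number of path segments ("subdomains" n in the paper's sense):
    0 for a main page, 1 for /page.htm, 2 for /docs/file.doc, etc."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])

def nq_dcg(hits, k=10):
    """NQ-DCG for one query and one engine.
    hits: list of (rank position i, subdomain count n) for results that
    exactly or partially match the target URL; non-matching results
    contribute nothing.  V_ji = k - i + 1; W_ji = (k - n) * V_ji if n < k."""
    score = 0
    for i, n in hits:
        v = k - i + 1                     # rank-based value
        w = (k - n) * v if n < k else 0   # discounted by URL depth
        score += w
    return score

def mean_reciprocal_rank(ranks):
    """MRR over all queries; ranks holds the rank of the target URL per
    query, or None when it is absent from the top 10 (scored as zero)."""
    rr = [1.0 / r if r is not None else 0.0 for r in ranks]
    return sum(rr) / len(rr)
```

With the three hits of the worked example, `nq_dcg([(2, 1), (3, 0), (8, 2)])` reproduces the total of 185.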

It must be noted here that the coefficient (k − n) in (1) plays the role of a diminishing or discounting factor, penalizing results that only partially match the target URL and contain subdomains. This can be implemented in different ways: for example, as it is implemented here, or by introducing a factor such as 1/(1 + n) based on the number of subdomains only. Both approaches follow the same principle of penalizing the presence of subdomains, which in practice resembles DCG with more emphasis on subdomains. The proposed NQ-DCG approach has been adopted here because it reflects the presence of subdomains more directly; evaluation of these alternatives is beyond the scope of the present paper. The proposed NQ-DCG evaluation measure follows an approach similar to NDCG. For example, NDCG uses a decay factor and measures the gain of each contribution depending on the level of relevance of each returned result. It accumulates the gains by calculating their sum, and it discounts the gain of a result that is ranked low, so that a highly ranked result contributes more toward the gain. All these steps are encapsulated in the NQ-DCG evaluation measure. NQ-DCG measures the gain for each result by discounting its merit according to how low it is ranked; this is reflected in the calculation of V_ji. Thus, a high score is good and a low score is poor in terms of retrieval performance. The cumulative gain for each search engine is calculated as NQ-DCG_j, reflecting the similarities between the two measures. The major difference between the two schemes, however, is that the proposed measure considers the results in terms of the number of subdomains the returned URL contains.
This is calculated by W_ji, which takes account of the number of subdomains in the URL in an automated fashion by discounting the relevance of the URL based on its distance from the target. Therefore, NQ-DCG is in principle similar to NDCG, and both model a person's judgment of a search engine better than measures such as precision at 10 or MRR.

4.4.1.5 Evaluating the returned results: response time. Response time, that is, the time from query submission to receipt of the result set for each search engine, was recorded using the computer's clock. Times were collected for all data collection periods and provide a measure for comparing search speed across the ten search engines. In 2004 the searches were sent from the Athens University of Economics and Business (AUEB) computer lab. The 2006 data collection was conducted at the University of Athens and at the University of Macedonia computer labs. The computers used were desktops with Intel Pentium processors running Windows XP with similar configurations. The network infrastructure was the same, since all the universities use the GRNET/EDET network. Therefore, conditions were very similar within each year.
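Response-time measurement of this kind can be sketched as timing each request with a monotonic clock; this is our own minimal sketch, not the study's program, and the callable passed in stands in for the actual query submission.

```python
import time

def timed_call(fn, *args):
    """Run fn(*args) and return (elapsed_seconds, result).  A monotonic
    clock (perf_counter) is used so that system clock adjustments during
    a long data-collection run cannot skew the recorded times."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed, result
```

Usage would look like `elapsed, page = timed_call(submit_query, engine_url, query)`, where `submit_query` is a hypothetical function submitting one query to one engine.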



4.4.1.6 Evaluating the returned results: live versus dead links. The top ten results of each search were recorded and the URLs extracted. Each URL was then called and its status recorded in binary mode as an active (live) or non-active (dead) link. No further attempts were made to retrieve inaccessible links, since the average user usually would not persist once a "404 not found" error message is received. The search engines were therefore penalized for returning non-active links (Hawking et al. 2001). This provides an indication of the freshness of each search engine's index and contributes to users' cost, because it is associated with user frustration, wasted time, and overall dissatisfaction with the quality of the results and the search engine itself.

4.4.2 Evaluating search engine coverage of the Greek web

To further measure the extent of coverage of the Greek web (.gr) and the freshness of the search engine indices, we used a sample of 32480 top-level domain URLs that were crawled from the Greek web (.gr) (Efthimiadis and Castillo 2004). These URLs were all active at the time of the first data collection in May 2005. Some were inaccessible, either permanently or temporarily, during the second data collection in October 2006. These were treated as dead links and excluded from the evaluation in order to avoid penalizing the search engines for not returning them (Hawking et al. 2001). The 32480 URLs were submitted automatically to the search engines as queries through the Java program developed for the study. The pseudo code of the algorithm is given in Table 3. The query syntax was tailored to each search engine. A similar methodology was used in the evaluation of Google, AltaVista, and AllTheWeb (Vaughan and Thelwall 2004) and of Google, Yahoo and Live (Vaughan and Zhang 2007).

Table 3 Pseudo code of algorithm for searching the URLs

Let U = {u1, u2, …, un}, n ∈ ℕ, be a set of URLs
Let SE = {s1, s2, …, sk}, k ∈ ℕ, be a set of search engines
for each ui in U do
    form a URL_query
    for each sj in SE do
        set the corresponding HTTP connection properties
        emulate a Web browser
        establish the HTTP connection with sj
        submit the URL_query
        if a Web site of the same domain as ui is returned in the HTML code then
            ui is indexed by sj
        else
            ui is not indexed by sj
            submit an HTTP request for the not-indexed URL directly
            if an HTTP 404 response is received then
                the URL is dead
            else
                the URL is alive
            endif
        endif
    endfor
endfor

For example, for AltaVista, A9, Google, MSN, and Yahoo a URL could be searched using "site:www.aueb.gr", which returns a list of pages indexed by the search engine from that particular domain. Although the Greek search engines did not support the site:, link:, or url: types of searches, it was possible to search for the URL string and receive results that contained it. The results were then examined to determine whether the target URL was present. If the URL was not found in a search engine's index it was subsequently called with an HTTP request and the response was noted. If "HTTP 404: file not found" was returned, the URL was treated as a dead link; otherwise, the URL was considered to exist but not to be indexed by the search engine. Table 3 summarizes this process.
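The decision logic of the Table 3 pseudo code can be sketched as follows. This is a schematic rendering, not the study's Java program: `fetch_status` is a stand-in for the direct HTTP request to the URL, and testing whether the domain string appears in the engine's result HTML is a simplification of the real domain check.

```python
def coverage_status(result_html, domain, fetch_status):
    """Classify one URL against one engine, per the Table 3 logic.
    result_html:  HTML the engine returned for the site:/URL-string query.
    domain:       the queried domain, e.g. "www.aueb.gr".
    fetch_status: callable returning the HTTP status code of a direct
                  request to the URL (only consulted when not indexed)."""
    if domain in result_html:
        return "indexed"
    # Not in the engine's index: is the page itself still alive?
    if fetch_status() == 404:
        return "dead"
    return "alive, not indexed"
```

A URL found in the result page counts as indexed; otherwise a 404 on the direct request marks it dead, and any other response marks it alive but unindexed.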

5 Results

The study results are presented in this section. An overview of the issues surrounding Greek language processing and its effects on searching is given first. The evaluation of the 309 navigational queries, which were searched in the 10 search engines (5 Greek, 5 global), follows. The freshness of a search engine's index is measured by the percentage of returned URLs that were live (active) or dead (non-active) links; the extent of the coverage of the Greek web is evaluated by determining whether a sample of active URLs from the Greek web appears in the engines' indices.

5.1 Greek language processing by search engines and effects on searching

The way the search engines handled the Greek language is presented in Table 4. The table shows whether the engines handled articles, prepositions, pronouns, etc. It also reports whether the results of Greek language queries submitted with or without accent marks are the same. For example, a searcher using either the keyword "χωριο" or "χωριό" (village) as a query would expect to get the same results, because the accent mark does not change the meaning of the word. However, this is not the case, as reported in Table 4.

Table 4 How search engines process Greek language input

Search engine   Greek with or without       Handling of articles, prepositions, etc.
                accent marks produce        Greek     English
Anazitisis      Different results           No        No
Ano-Kato        Same results                No        Yes
Phantis         Same results                Yes       Yes
Trinity         Same results                Yes       Yes
Visto           Same results                Yes       Yes
A9              Different results           No        Yes
AltaVista       Different results           No        Yes
Google          Different results           No        Yes
MSN             Different results           No        Yes
Yahoo           Different results           No        No
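Engines that return the same results with and without accent marks presumably normalize queries by stripping the tonos before indexing and matching. A minimal sketch of such accent-insensitive normalization (an illustration of ours, not any engine's actual code), using Unicode decomposition:

```python
import unicodedata

def strip_greek_accents(text):
    # Decompose precomposed characters (e.g. "ό" -> "ο" + combining acute),
    # drop the combining marks, then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)

print(strip_greek_accents("χωριό"))  # → χωριο
```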

The five global search engines and one Greek engine (Anazitisis) returned different results for accented and unaccented queries. The differences observed in the top ten results vary from totally different result sets to a small overlap with differences in rank order. It appears that Google, MSN, and Yahoo handle Greek in a very similar way, which amounts to the following. They do not use any Greek processing software, so they do not perform any special segmentation or stemming for Greek. The default algorithm for Greek, or any non-English language, appears to be simple white-space delimiting to find words, and indexing of these words minus a universal stop word list. As a minimum, these search engines seem to recognize at least two encodings, Unicode and Windows.

5.2 Search results by rank order and P@10

The 309 navigational queries, 217 in Greek and 92 in English, were submitted to each of the 10 search engines, for a total of 3090 searches. Table 5 presents the rank distribution of the results for both the Greek and English queries by search engine for 2004 and 2006. The table also lists the number of organizations missed by each engine and their success rate measured as precision at 10 (P@10). Of the organizations found, most appeared in the first three ranks.
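Because each navigational query has exactly one correct homepage, P@10 here reduces to the share of queries whose target appears anywhere in the top ten. A minimal sketch (our formulation, not the authors' code):

```python
def p_at_10(target_ranks, total_queries):
    """target_ranks: rank (1-10) of the target URL for each query,
    or None when the engine missed it entirely."""
    found = sum(1 for r in target_ranks if r is not None and r <= 10)
    return round(found / total_queries, 4)

# Google 2004: 212 of the 309 targets appeared in the top ten (Table 5).
ranks = [1] * 212 + [None] * 97
print(p_at_10(ranks, 309))  # → 0.6861
```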

Table 5 Rank position of the top ten search results for the 309 queries by search engine

Greek/   Search        Rank                                           Missed  Total  P@10
global   engine         1    2   3  4  5  6  7  8  9  10                      found

2004
Global   Google        171   14  12  6  4  1  1  1  0   2      97    212   0.6861
Global   AltaVista     155   20   5  4  4  2  1  2  2   3     111    198   0.6408
Global   A9            152   17  15  3  4  1  2  1  1   2     111    198   0.6408
Greek    Trinity       160    3   8  3  4  1  1  1  0   0     128    181   0.5858
Global   Yahoo         127   29   8  3  4  2  1  0  2   1     132    177   0.5728
Global   MSN           107   25  11  7  2  5  4  3  0   3     142    167   0.5404
Greek    Visto          60   20   9  0  4  0  2  0  1   0     213     96   0.3107
Greek    Ano-Kato       59   12   4  4  0  1  1  0  1   0     227     82   0.2654
Greek    Anazitisis     45   20   8  4  0  1  0  0  0   0     231     78   0.2524
Greek    Phantis        17    7   2  1  1  0  2  0  0   1     278     31   0.1003

2006
Global   Google        199   11   8  1  2  3  1  1  1   1      81    228   0.7379
Global   AltaVista     166   30  10  2  2  2  4  2  0   3      88    221   0.7152
Global   Yahoo         133   30  13  7  3  3  3  1  2   0     114    195   0.6311
Greek    Trinity       142   10   5  3  0  1  1  0  0   0     147    162   0.5243
Global   A9            106   17  11  3  4  5  3  2  1   0     157    152   0.4919
Global   MSN           101   18  12  6  5  3  3  1  1   0     159    150   0.4854
Greek    Visto          78   20   6  4  4  2  1  1  1   0     192    117   0.3786
Greek    Ano-Kato       53   16   5  2  2  2  2  0  1   0     226     83   0.2686
Greek    Anazitisis     17    7   4  0  2  1  1  2  0   1     274     35   0.1133
Greek    Phantis        23    5   2  1  1  0  0  0  0   1     276     33   0.1068

The global search engines have higher success rates than the Greek engines in both comparison years. In 2004 the performance of the global engines ranges from 54.04% to 68.61%, and in 2006 from 48.54% to 73.79%. The Greek engines range from 10.03% to 58.58% in 2004 and from 10.68% to 52.43% in 2006. Google is the best performing global engine and Trinity the best Greek engine in both years; however, Trinity is ranked fourth overall in both 2004 and 2006. Figure 3 shows the overall success rate of the ten search engines, ranked in descending order of their 2004 performance. What is also remarkable here is that the Greek search engines, with the exception of Visto, scored the same as or slightly worse than in 2004. For Anazitisis in particular, the success rate, as can be seen in Fig. 3, was more than halved. Among the global engines, three did better, whereas for two of them, A9 and MSN, the scores differ significantly between 2004 and 2006, especially for A9, where the success rate dropped by almost a quarter. Furthermore, A9 dropped from third position in 2004 to fifth in 2006, swapping places with Yahoo. The rest of the engines maintained their positions. Table 6 shows the percentage change in relevant results retrieved between 2004 and 2006. Table 7 shows the percentage change in relevant results retrieved at the first rank, and the data are graphed in Fig. 4. The range in percentage change is wide, from −55.13% for Anazitisis to 21.88% for Visto. Four engines show a negative overall change (Table 6). The percentage change is more pronounced in the first-rank results (Table 7, Fig. 4): five engines show negative change, ranging from −5.61% (MSN) to −62.22% for Anazitisis. The above results give an overall performance rate for the search engines but do not show how the engines respond to Greek versus non-Greek queries.
Tables 8 and 9 present the rank distributions of the results by language, Greek and English respectively. Table 8 shows that in 2006 AltaVista handled Greek queries better than all the other engines, with a success rate of 72.81%, while Google follows closely with 70.96%; MSN and A9 are fourth and fifth with 50.69% and 50.23% respectively. The best performance among the Greek engines was recorded by Trinity with 49.30%. The rank distribution of the results from the queries in either English or transliterated form is given in Table 9.

Fig. 3 Search engine success rate over all queries, 2004–2006

Table 6 Percentage change on overall results

Search engine   2004   2006   Percent change
Anazitisis        78     35   −55.13
A9               198    152   −23.23
Trinity          181    162   −10.50
MSN              167    150   −10.18
Ano-Kato          82     83     1.22
Phantis           31     33     6.45
Google           212    228     7.55
Yahoo            177    195    10.17
AltaVista        198    221    11.62
Visto             96    117    21.88

Table 7 Percentage change on first rank results

Search engine   2004   2006   Percent change
Anazitisis        45     17   −62.22
A9               152    106   −30.26
Trinity          160    142   −11.25
Ano-Kato          59     53   −10.17
MSN              107    101    −5.61
Yahoo            127    133     4.72
AltaVista        155    166     7.10
Google           171    199    16.37
Visto             60     78    30.00
Phantis           17     23    35.29

Fig. 4 Percentage change on first rank, 2004–2006
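The percentage changes in Tables 6 and 7 are plain relative changes on the counts of targets found; for instance (our computation):

```python
def percent_change(old, new):
    """Relative change, in percent, between two counts."""
    return round(100.0 * (new - old) / old, 2)

print(percent_change(78, 35))   # Anazitisis, overall: -55.13
print(percent_change(96, 117))  # Visto, overall: 21.88
```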

These show mixed results, as we observe variations in performance for almost all the search engines. When compared to the 2006 results for the Greek queries (Table 8), Google increased its performance (80.43%), Yahoo!'s performance remained about the same (63.04%), whereas MSN, AltaVista, and A9 decreased theirs. Of the Greek search engines, Trinity's performance increased to 59.78%, whereas the performance of all the other engines decreased.


Table 8 Rank distribution of results for Greek queries, 2004–2006

Greek/   Search        Rank                                           Total  P@10
global   engine         1    2   3  4  5  6  7  8  9  10              found

2004
Global   Google        116   10   9  4  3  0  1  0  0  1       144   0.6636
Global   A9            108   10   9  3  2  1  1  1  0  2       137   0.6313
Global   AltaVista     104   13   3  3  3  1  1  2  1  2       133   0.6129
Greek    Trinity       109    1   7  3  3  1  1  1  0  0       126   0.5806
Global   Yahoo          96   13   2  3  2  1  1  0  2  1       121   0.5576
Global   MSN            78   13   8  3  0  4  4  3  0  3       116   0.5346
Greek    Visto          45   15   7  0  3  0  1  0  1  0        72   0.3318
Greek    Ano-Kato       37    9   1  3  0  1  1  0  1  0        53   0.2442
Greek    Anazitisis     26   16   5  2  0  0  0  0  0  0        49   0.2258
Greek    Phantis        11    4   2  0  1  0  2  0  0  1        21   0.0968

2006
Global   AltaVista     118   23   8  1  2  0  3  2  0  1       158   0.7281
Global   Google        131   10   5  1  2  2  1  1  0  1       154   0.7096
Global   Yahoo         104   12   8  5  3  1  2  0  2  0       137   0.6313
Global   MSN            79    9   8  6  3  2  1  1  1  0       110   0.5069
Global   A9             82    7   9  3  3  1  2  1  1  0       109   0.5023
Greek    Trinity        94    6   3  2  0  1  1  0  0  0       107   0.4930
Greek    Visto          63   16   4  3  4  0  1  1  1  0        93   0.4286
Greek    Ano-Kato       32   11   3  1  2  2  1  0  1  0        53   0.2442
Greek    Anazitisis     12    5   3  0  2  1  1  2  0  1        27   0.1242
Greek    Phantis        18    1   1  1  1  0  0  0  0  1        23   0.1059

Total queries: 217

A closer look at Table 8, which depicts how the 10 engines handled Greek queries, shows the following. Based on the success rates for both years, the last four places are clearly occupied by four Greek engines with feeble performance: Visto, Ano-Kato, Anazitisis, and Phantis, the worst. Among the remaining six engines, Trinity dropped from fourth place to sixth in 2006 and A9 from second to fifth, while Yahoo climbed from fifth to third and MSN from sixth to fourth. AltaVista finished first overall, leaving Google in second place, whereas Google had been the clear winner in 2004. However, and this must also be highlighted, if the analysis were based solely on how the engines scored at the first rank in both years, the ordering would change slightly. More specifically, Google would outperform AltaVista, Trinity would place two positions higher in both years, Yahoo would maintain its overall positions, and MSN would remain in sixth place for both years. A9 would drop one place in 2004, only a single result behind second-placed AltaVista, but would drop considerably, to fifth place, in 2006. For the remaining four Greek engines, the ordering would be unaltered in both years. These findings suggest that the global engines scored better than their Greek rivals, not only overall, but also at the highest rank.


Table 9 Rank distribution of results for English queries, 2004–2006

Greek/   Search        Rank                                           Total  P@10
global   engine         1    2   3  4  5  6  7  8  9  10              found

2004
Global   Google         55    4   3  2  1  1  0  1  0  1        68   0.7391
Global   AltaVista      51    7   2  1  1  1  0  0  1  1        65   0.7065
Global   A9             44    7   6  0  2  0  1  0  1  0        61   0.6630
Global   Yahoo          31   16   6  0  2  1  0  0  0  0        56   0.6087
Greek    Trinity        51    2   1  0  1  0  0  0  0  0        55   0.5978
Global   MSN            29   12   3  4  2  1  0  0  0  0        51   0.5543
Greek    Ano-Kato       22    3   3  1  0  0  0  0  0  0        29   0.3152
Greek    Anazitisis     19    4   3  2  0  1  0  0  0  0        29   0.3152
Greek    Visto          15    5   2  0  1  0  1  0  0  0        24   0.2608
Greek    Phantis         6    3   0  1  0  0  0  0  0  0        10   0.1087

2006
Global   Google         68    1   3  0  0  1  0  0  1  0        74   0.8043
Global   AltaVista      48    7   2  1  0  2  1  0  0  2        63   0.6848
Global   Yahoo          29   18   5  2  0  2  1  1  0  0        58   0.6304
Greek    Trinity        48    4   2  1  0  0  0  0  0  0        55   0.5978
Global   A9             24   10   2  0  1  4  1  1  0  0        43   0.4673
Global   MSN            22    9   4  0  2  1  2  0  0  0        40   0.4348
Greek    Ano-Kato       21    5   2  1  0  0  1  0  0  0        30   0.3261
Greek    Visto          15    4   2  1  0  2  0  0  0  0        24   0.2609
Greek    Phantis         5    4   1  0  0  0  0  0  0  0        10   0.1087
Greek    Anazitisis      5    2   1  0  0  0  0  0  0  0         8   0.0869

Total queries: 92

A similar analysis of the results in Table 9, which records how the 10 engines handled English queries, shows the following. The ordering is more stable than in Table 8, with the discrepancies now occurring at the lower end of the scores, among the Greek engines, and towards the middle, among the global ones. More specifically, Anazitisis dropped from eighth position to tenth and A9 from third to fifth. For the rest of the engines the ordering remained unaltered, with the global engines again scoring better than the Greek ones. The only Greek engine that managed to climb slightly higher was Trinity. The clear winner in both years was again Google. However, if the analysis were again based on the first-rank ordering, Trinity would tie for second place with AltaVista in both 2004 and 2006; the remaining engines would maintain their positions based on the overall success rate. A further analysis based on the second, third, etc., ranks is pointless, as the counts are too small to be statistically sound. Nevertheless, these findings do suggest that Google performed better than the other engines.

To substantiate the claims made above, an analysis of variance (ANOVA) of the results obtained from the searches was performed, for the 2004 results as well as for those of 2006, using the statistical package SPSS. This showed that there is a significant difference at the 100% level in


the mean performance of all 10 search engines when the entire sample of all queries, both Greek and English, is taken into account. This is true for both groups of search engines, Greek and global. When the Greek queries were evaluated separately, a 100% significant difference between the means of the Greek search engines was also found; however, the same could not be said for the global search engines when handling Greek queries. Conversely, when the English queries were analyzed, the global engines showed a 97% significant difference in their mean performance, whereas this could not be substantiated with confidence for the Greek engines. The situation when all queries are considered together is similar to the latter finding. The main reason for this is attributed to the excess of zero entries in ranks four through ten scored by the Greek search engines when handling English queries. Tests run on the paired differences show that in some cases it can be argued with certainty that some engines performed worse than the others: for example, Anazitisis for all search results and for Greek queries, A9 and MSN for all search results, and MSN for English queries. The engines that performed better within their groups are Google and Trinity for all search results; Trinity, AltaVista, and Google for Greek queries; and, finally, Google and Trinity for English queries. It must be mentioned here that the ANOVA tests run on the 2004 and 2006 data sets showed the same behavior for both groups, suggesting that the engines behaved in a similar manner during both periods of the study. Moreover, after running Student's t-tests for paired comparisons on the 2004 and 2006 results by rank and category (i.e., 2004 Greek engines and Greek queries against 2006 Greek engines and Greek queries), it can be confidently argued that the samples are statistically similar, showing that the means of the samples are the same.
This holds for all searches, for both Greek and English queries, and for both groups of engines. Hence it can be said that the different runs in 2004 and 2006 showed an overall similar behavior in the search ability of the engines.

5.3 Mean reciprocal rank

The mean reciprocal rank (MRR) for all the searches presented in Sect. 5.2 above was calculated and is presented in Table 10. The data in the table are sorted by the MRR results for all queries in 2006.

Table 10 Mean reciprocal rank for Greek and English/Latin queries in 2004 and 2006

Greek/   Search        GR Q   GR Q   EN Q   EN Q   All Q   All Q
global   engine        2004   2006   2004   2006   2004    2006
Global   Google        0.58   0.64   0.64   0.76   0.60    0.68
Global   AltaVista     0.52   0.62   0.61   0.58   0.55    0.60
Global   Yahoo         0.48   0.53   0.45   0.44   0.47    0.50
Greek    Trinity       0.52   0.46   0.57   0.55   0.54    0.48
Global   A9            0.54   0.42   0.55   0.33   0.54    0.39
Global   MSN           0.41   0.41   0.41   0.31   0.41    0.38
Greek    Visto         0.26   0.34   0.20   0.20   0.24    0.30
Greek    Ano-Kato      0.20   0.18   0.27   0.27   0.22    0.21
Greek    Phantis       0.07   0.09   0.08   0.08   0.07    0.09
Greek    Anazitisis    0.17   0.08   0.25   0.07   0.19    0.07
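For these navigational queries the reciprocal rank of a query is 1/r when the target homepage is found at rank r, and 0 for a miss; the MRR is the average over all queries. A minimal sketch (our naming, not the authors' code):

```python
def mean_reciprocal_rank(target_ranks):
    """target_ranks: rank of the target homepage per query,
    or None when the engine missed it (contributes 0)."""
    reciprocal = [1.0 / r if r else 0.0 for r in target_ranks]
    return sum(reciprocal) / len(reciprocal)

print(mean_reciprocal_rank([1, 2, None, 10]))  # (1 + 0.5 + 0 + 0.1) / 4 = 0.4
```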


The analysis of the data showed that for all queries (both Greek and English) Google, AltaVista, and Yahoo increased their MRR performance from 2004 to 2006. However, Google was the only search engine that increased its MRR for both Greek and English queries between 2004 and 2006; all the other engines either remained the same or performed worse. Yahoo, AltaVista, and Visto showed some increase in MRR performance on the Greek queries, but a drop on the English queries over the same period.

5.4 Search results by subject category and NQ-DCG

Using the Navigational Query Discounted Cumulative Gain (NQ-DCG) method discussed in the evaluation criteria (Sect. 4.4.1.4), all queries were scored and then grouped by category. This enables a finer evaluation of the performance of the search engines in the study. Table 11 shows the results of this evaluation grouped by language and by subject category for 2004 and 2006; under this scoring, the larger the number, the better the retrieval performance of a search engine. Google among the global engines and Trinity among the Greek engines outperformed the other engines in their respective groups. This is not to say, however, that Trinity's performance is good; on the contrary, when the Greek and global engines are compared, the Greek engines failed miserably. Based on the aggregate results for all search engines per category, the coverage of the categories for Greek queries is, in rank order: travel agencies, universities, banks, government departments, newspapers, colleges (TEI), radio stations, museums, transportation & communication services, TV stations. Similarly, the aggregate results for all search engines for English queries show that the rank order of the coverage of the categories is: universities, newspapers, banks, government departments, colleges, transportation & communication services, travel agents, radio stations, and TV stations.
The travel agencies category has the most variation in rank between Greek and English queries, at positions 1 and 7 respectively. Newspapers also ranged from rank 5 for Greek queries to rank 2 for English queries. The statistical analysis of variance (ANOVA) of the results by subject category for the Greek queries (Table 11) shows a 100% significant difference in the mean performance of all engines, whereas for the English queries the difference is at the 95% level. Here again, the ANOVA tests carried out on both the 2004 and 2006 samples showed an overall similar behavior in the search ability of the engines when the subject category classification was considered. The only discrepancy is found in the results of two engines: Anazitisis, for both Greek and English queries, and A9, for English queries only. At the per-category classification of the searches (Table 11), both Anazitisis and A9 failed the comparative tests, although they passed the tests with respect to the rank classification. The reason, as can be seen from the results (Table 11 and also Fig. 3), is that the 2006 results are much worse for Anazitisis and A9; that is, these two engines performed better in 2004 than in 2006.

5.5 Live versus dead links

For each of the ten search engines the 309 queries could generate up to 3090 results. These results were further evaluated by measuring the percentage of live (active) versus dead (non-active) links. This evaluation measures the freshness of a search engine's index and is an indicator of the level of frustration the searcher would have to undergo.

Table 11 Sum of the scores of the top ten results by subject category, language, and search engine

[Table 11, which spans two pages, lists the summed NQ-DCG scores of the top ten results for every search engine, broken down by subject category (government departments, newspapers, universities, banks, colleges, radio stations, travel agents, TV stations, museums, transportation & communication services), by query language (Greek, English), and by year (2004, 2006).]

This is an additional way of evaluating the precision of the search and the cost to the user should they follow the dead links. The results presented in Table 12 show the aggregate number of returned URLs for all 309 queries submitted to each search engine, divided into those that were active (live) and those that were non-active (dead). In 2006, among the global search engines A9 had the highest percentage of active links (96.78%), whereas Yahoo had the highest percentage of dead links (8.65%). Among the Greek search engines, Trinity had the highest percentage of active links (94.49%), whereas Ano-Kato had the highest percentage of dead links (27%). These results, despite being a good indication of the freshness of the index, should not be considered in isolation. When compared to Table 5, which shows the overall success rate of the search engines, it can be seen that, for example, A9 has poor precision, successfully retrieving only 49.19% of the correct answers, while the dead links in its results amount to 3.22%. Google, on the other hand, retrieves 73.79% of the correct results, while its non-active links are 4%. From a user's point of view, getting higher precision in the top ten results is probably more valuable. The performance of the Greek search engines with respect to active links in their result sets is disappointing: the dead links for all engines but Trinity (5.51%) range from 13.05% to 27%. Trinity is the best performing Greek search engine. Figure 5 presents the cumulative results for the active (live) and dead links found in the result sets of the Greek and English queries. From 2004 to 2006 the live links in the results for Greek queries increased by 4.88%, whereas the dead links increased by 25.11%. For the same period, the active links in the results for the English queries dropped by 2.66% and the dead links increased dramatically (44.6%).
Table 12 Active versus dead links for all queries

                 2004                              2006                              2004–06 % change
Search engine    Active   %      Non-active  %     Active   %      Non-active  %     Active   Non-active
A9               2712     94.83    148       5.17  2917     96.78     97      3.22     7.56   −34.46
Google           2807     96.99     87       3.01  2927     96.00    122      4.00     4.28    40.23
AltaVista        2790     95.25    139       4.75  2910     95.57    135      4.43     4.30    −2.88
MSN              2717     96.97     85       3.03  2862     95.27    142      4.73     5.34    67.06
Trinity          2795     96.15    112       3.85  2572     94.49    150      5.51    −7.98    33.93
Yahoo            2670     94.95    142       5.05  2651     91.35    251      8.65    −0.71    76.76
Anazitisis       2598     88.76    329      11.24  2538     86.95    381     13.05    −2.31    15.81
Phantis          2611     93.69    176       6.31  2519     86.41    396     13.59    −3.52   125.00
Visto            1317     71.69    520      28.31  1745     74.80    588     25.20    32.50    13.08
Ano-Kato          645     79.73    164      20.27   611     73.00    226     27.00    −5.27    37.80

Fig. 5 Live versus dead links, 2004–2006

5.6 Response time

The response time data in Table 13 and Fig. 6 show an increase in speed from 2004 to 2006. Most search engines improved their response time, with percentage changes ranging from −0.40 to −0.95. The largest increase in speed is seen for MSN, which was the slowest engine in 2004. Three engines, two Greek (Phantis and Visto) and one global (A9), increased their response times, with percent changes of 0.59 to 0.99. The overall faster response times could be attributed to hardware upgrades implemented at the Greek universities during that period, for example better computers at the university labs and the GRNET network upgrade both in the universities and in the backbone infrastructure (1–2.5 Gbps). Similarly, hardware upgrades at the search engines could also have contributed to the faster response times.

Table 13 Average response time over all searches per SE in seconds

SE           2004    2006    Percent change
MSN          6.067   0.324   −0.95
Trinity      2.439   0.364   −0.85
AltaVista    3.589   0.634   −0.82
Yahoo        3.491   0.637   −0.82
Ano-Kato     2.635   0.492   −0.81
Google       2.278   0.586   −0.74
Anazitisis   2.681   1.602   −0.40
Visto        0.760   1.205    0.59
Phantis      2.075   3.418    0.65
A9           0.537   1.070    0.99

Fig. 6 Average response time by engine, 2004–2006
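Response times were measured per query and averaged per engine; per-request timing can be sketched as follows (a stand-in harness of ours, not the instrumentation actually used in the study):

```python
import time

def timed(fetch):
    """Wall-clock one search round trip; `fetch` is any callable
    performing the query (here a placeholder for the HTTP call)."""
    start = time.perf_counter()
    fetch()
    return time.perf_counter() - start

elapsed = timed(lambda: time.sleep(0.05))  # simulated request
print(f"{elapsed:.3f}s")
```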


5.7 Search engine coverage of the Greek web

To measure the extent of coverage of the Greek web (.gr) and the freshness of the indices of the search engines, a sample of 32480 top-level domain URLs crawled from the Greek web was used (see Sect. 4.4.2). This sample was estimated to be about 40% of the registered .gr domains in 2004 (Efthimiadis and Castillo 2004). Table 14 and Fig. 7 present the results by year and search engine, with actual numbers and percentages of indexed and not-indexed URLs, as well as the URLs that were dead in 2006. Table 15 shows the percentage change between 2005 and 2006 for the indexed and not-indexed URLs. In 2005, Google, with 98.04%, has almost all the URLs indexed. Yahoo follows with 84.06%, A9 with 77.41%, and MSN with 66.63%. Anazitisis, with 61.72%, is at the top of the Greek search engines, while Phantis, having indexed a dismal 4.87%, is at the bottom of the list. Visto is not included in 2005 because its data was corrupted. For 2006, we observe a drop in the indexed URLs across all search engines but Trinity and Phantis. In Table 14 the data for 2006 take into consideration the decay of the URLs searched in 2005. Although the initial sample contained 32480 URLs, the "URLs checked" column of the table reports fewer URLs per engine; this is due to network problems during searching. It was therefore decided to include only the successful returns, because this gives a more accurate picture. The "URLs not indexed by SE and dead" were subtracted from the total number of "URLs checked" in order to avoid penalizing search engines for not indexing them. Since only URLs that were not found in a search engine's index were checked to verify whether they were live, a number of dead URLs that were still in the search engines' indices were accepted without penalizing the search engines, per Hawking (2001).
In 2006, Google maintained its lead over all the other search engines; however, its coverage dropped to 91.37% of the sampled URLs. A9 also dropped, to 72.75%, while Yahoo and AltaVista, which use the same index, increased their coverage to 86.2%, and MSN to 71.61%. The bar charts in Fig. 7 provide a visual representation of the indexed versus the not-indexed results from the search engines. The bars are stacked, showing the 2005 results left of the middle gridline and the 2006 results right of it. In both 2005 and 2006 Google has the best coverage of Greek URLs, followed by Yahoo, AltaVista, A9, and MSN. The Greek search engines have a rather different order in the two years. In 2005 the best coverage was provided by Anazitisis, followed by Ano-Kato, Trinity, and Phantis. In 2006 Trinity had the best coverage and was followed by Ano-Kato, Anazitisis, Visto, and Phantis. A closer examination of the results, especially the percentage change between 2005 and 2006 of the indexed and not-indexed URLs (Table 15), reveals that Google, A9, and Anazitisis had the biggest losses in coverage. Google dropped from its index about four times the URLs it did not have in 2005, but still has the best coverage of all the engines. Yahoo, AltaVista, and MSN improved their coverage at about the same rate. The most noticeable improvements in coverage came from Trinity and Ano-Kato, albeit still performing below the global engines.
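The "net" coverage figures in Table 14 exclude dead, never-indexed URLs from the denominator, so an engine is not penalized for URLs that no longer exist. The computation amounts to the following (our sketch):

```python
def coverage_pct(indexed, checked, dead_not_indexed):
    """Share of the live URL sample an engine has indexed; URLs that
    were both dead and not indexed do not count against the engine."""
    net = checked - dead_not_indexed
    return round(100.0 * indexed / net, 2)

print(coverage_pct(27121, 32466, 2783))  # Google 2006 → 91.37
```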

Table 14 Indexed versus not-indexed URLs, 2005–2006

2005
Search engine   URLs checked   Indexed by SE   %       Not indexed but live   %       Not indexed and dead
Google          32407          31773           98.04   634                    1.96    0
Yahoo           32371          27195           84.01   5176                   15.99   0
AltaVista       32397          27232           84.06   5165                   15.94   0
A9              32459          25126           77.41   7333                   22.59   0
MSN             32442          21616           66.63   10826                  33.37   0
Anazitisis      32479          20046           61.72   12433                  38.28   0
Ano-Kato        32479          14313           44.07   18166                  55.93   0
Trinity         32479          9957            30.66   22522                  69.34   0
Phantis         32479          1581            4.87    30898                  95.13   0
Visto           (data corrupted; not included for 2005)

2006
Search engine   URLs checked   Not indexed and dead   Net URLs for 2006   Indexed by SE   % (of net)   Not indexed but live   % (of net)
Google          32466          2783                   29683               27121           91.37        2562                   8.63
Yahoo           32474          2886                   29588               25515           86.23        4073                   13.77
AltaVista       32479          2892                   29587               25500           86.19        4087                   13.81
A9              32473          3802                   28671               20859           72.75        7812                   27.25
MSN             32474          3358                   29116               20851           71.60        8265                   28.39
Anazitisis      29851          3707                   26144               6878            26.31        19266                  73.69
Ano-Kato        32401          7390                   25011               14141           56.54        10870                  43.46
Trinity         32467          4345                   28122               19040           67.70        9082                   32.30
Phantis         32478          4488                   27990               3528            12.60        24462                  87.40
Visto           32407          4206                   28201               6212            22.03        21989                  77.97

[Fig. 7 Indexed versus not-indexed URLs, 2005–2006: stacked bar chart per engine (Phantis, Visto, Anazitisis, Ano-Kato, Trinity, MSN, A9, AltaVista, Yahoo, Google), with the 2005 counts left of the middle gridline and the 2006 counts right of it; axis range 0–64960 URLs]

Table 15 Percent change for indexed and not-indexed URLs, 2005–2006

                Indexed URLs                    Not indexed URLs
Search engine   2005    2006    % change       2005    2006    % change
Google          31773   27121   −15            634     2562    304
Yahoo           27195   25515   −6             5176    4073    −21
AltaVista       27232   25500   −6             5165    4087    −21
A9              25126   20859   −17            7333    7812    7
MSN             21616   20851   −4             10826   8265    −24
Anazitisis      20046   6878    −66            12433   19266   55
Ano-Kato        14313   14141   −1             18166   10870   −40
Trinity         9957    19040   91             22522   9082    −60
Phantis         1581    3528    123            30898   24462   −21
Visto           —       6212    —              —       21989   —

6 Conclusions

This study aimed at evaluating how search engines handle Greek language queries, at assessing whether the Greek or the global search engines are more effective in satisfying user requests for navigational queries, and at evaluating the extent of coverage of the Greek web and the freshness of the engines' indices. The study evaluated ten search engines, five Greek and five global. Our results corroborate and extend the findings of Lazarinis (2007).

The analysis shows that the global search engines ignore the characteristics of the Greek language and hence treat Greek queries differently. Despite this, the global search engines outperformed the Greek engines in both years of the evaluation, 2004 and 2006. A set of 309 navigational queries was used in the evaluation. The rank distribution of all search results indicates that, on average, the search engines retrieved the relevant target URL in the first three rank positions. However, the rate of success leaves much to be desired, as the most successful engine, Google, was able to find


the correct answer to only 73.91% of the English and 60.37% of the Greek queries. The global engines seem to have good coverage of the Greek web relative to the sample of 32480 URLs tested, but the results returned by the engines differ depending on how the searcher has typed the Greek query, e.g., with or without accents. The implications for Greek users are therefore many, as they need to be aware of these nuances when searching in Greek.

The study was conducted during different periods of time: in 2004 and 2006 for the navigational queries, and in 2005 and 2006 for the indexing coverage of the sample of 32480 URLs. The coverage of the URLs for 2006 ranged from as low as 12.60% for Phantis to as high as 91.37% for Google. Although Google's coverage seems high, it has dropped since 2005. The results obtained were statistically analyzed to substantiate that the sample means of the search outcomes per engine were different. It was also statistically justified that the behavior of the engines was similar across the different years.

A possible explanation for the poor performance of the Greek search engines might be the lack of the sophisticated crawling, searching, and ranking algorithms found in the global search engines. The Greek search engines have a very low coverage of the Greek web (see Table 14), ranging from 4.87% to 61.72% in 2005. Although the global search engines outperformed the Greek engines, there is much room for improving their performance in both retrieval effectiveness and coverage. Given the better performance of the global engines reported in this study, it could be expected that if the global search engines were to take Greek language characteristics into account, their performance would improve further.

Acknowledgements The authors gratefully acknowledge the anonymous reviewers and Kalervo Järvelin for their helpful comments and suggestions.

Appendix: List of search engines used in the study

Global search engines
1. A9 (http://www.a9.com)
2. Google (http://www.google.gr)
3. Yahoo (http://www.yahoo.com)
4. AltaVista (http://www.altavista.com)
5. MSN (http://search.msn.com)

Greek search engines
1. Anazitisis (http://www.anazitisis.gr)
2. Ano-Kato (http://www.ano-kato.com)
3. Phantis (http://www.phantis.gr)
4. Trinity (http://www.trinity.gr)
5. Visto (http://www.visto.gr)


References

Alevizos, T., Galiotou, E., & Skourlas, C. (1988). Information retrieval and Greek-Latin text. Paper presented at Online Information 88, 12th International Online Information Meeting.
Babiniotis, G. (1998). Short history of the Greek language. Athens.
Bar-Ilan, J., & Gutman, T. (2005). How do search engines respond to some non-English queries? Journal of Information Science, 31(1), 13–28.
Bitirim, Y., Tonta, Y., & Sever, H. (2002). Information retrieval effectiveness of Turkish search engines. Paper presented at Advances in Information Systems, Second International Conference, ADVIS 2002, Proceedings (Lecture Notes in Computer Science Vol. 2457), Izmir, Turkey.
Broder, A. (2002). A taxonomy of web search. SIGIR Forum, 36(2), 3–10.
Efthimiadis, E. N., & Castillo, C. (2004, November 13–18). Charting the Greek web. Paper presented at the Proceedings of the American Society for Information Science and Technology (ASIST) Annual Conference, Providence, RI.
Efthimiadis, E. N., Malevris, N., Kousaridas, A., Lepeniotou, A., & Loutas, N. (2008). An evaluation of how search engines respond to Greek language queries. Paper presented at the Proceedings of the 41st Annual Hawaii International Conference on System Sciences.
Griesbaum, J. (2004). Evaluation of three German search engines: Altavista.de, Google.de and Lycos.de. Information Research-an International Electronic Journal, 9(4).
Hawking, D., & Craswell, N. (2002). Overview of the TREC-2001 Web Track. Paper presented at the Tenth Text REtrieval Conference (TREC 2001).
Hawking, D., Craswell, N., Bailey, P., & Griffiths, K. (2001). Measuring search engine quality. Information Retrieval, 4(1), 33–59.
Hawking, D., Craswell, N., & Griffiths, K. (2001). Which search engine is best at finding airline site home pages? CSIRO Mathematical and Information Sciences.
Internet World Statistics. (2007a). Greece: Internet usage and marketing report. Retrieved May 15, 2007, from http://www.internetworldstats.com/eu/gr.htm.
Internet World Statistics. (2007b). Internet world users by language (as of June 30, 2007). Retrieved September 5, 2007, from http://www.internetworldstats.com/stats7.htm.
Järvelin, K., & Kekäläinen, J. (2000). IR evaluation methods for retrieving highly relevant documents. Paper presented at the Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.
Kalamboukis, T. Z. (1995). Suffix stripping with modern Greek. Program, 29(3), 313–321.
Karakos, A. (2003). Greeklish: An experimental interface for automatic transliteration. Journal of the American Society for Information Science and Technology, 54(11), 1069–1074.
Kelly-Holmes, H. (2006). Irish on the World Wide Web: Searches and sites. Journal of Language and Politics, 5(2), 217–238.
Lazarinis, F. (2007). Web retrieval systems and the Greek language: Do they have an understanding? Journal of Information Science, 33(5), 622–636.
Mayer, T. (2005, August 8). Our blog is growing up, and so has our index. Yahoo! Search Blog. Retrieved May 15, 2007, from http://www.ysearchblog.com/archives/000172.html.
Moukdad, H. (2004). Lost in cyberspace: How do search engines handle Arabic queries? Paper presented at Access to Information: Technologies, Skills, and Socio-Political Context. www.cais-acsi.ca/proceedings/2004/moukdad_2004.pdf.
Moukdad, H., & Cui, H. (2005). How do search engines handle Chinese queries? Webology, 2(3).
Sroka, M. (2000). Web search engines for Polish information retrieval: Questions of search capabilities and retrieval performance. The International Information & Library Review, 32(2), 87–98.
Tzekou, P., Stamou, S., Zotos, N., & Kozanidis, L. (2007). Querying the Greek web in Greeklish. Paper presented at the Improving Non-English Web Searching (iNEWS07) SIGIR07 Workshop.
Vaughan, L. W., & Thelwall, M. (2004). Search engine coverage bias: Evidence and possible causes. Information Processing and Management, 40(4), 693–707.
Vaughan, L. W., & Zhang, Y. J. (2007). Equal representation by search engines? A comparison of websites across countries and domains. Journal of Computer-Mediated Communication, 12(3).
Voorhees, E. M. (1999). The TREC-8 question answering track report. Paper presented at the Eighth Text REtrieval Conference (TREC-8).
