Improving Web Site Search Using Web Server Logs

Jin Zhou 1, Chen Ding 2, Dimitrios Androutsos 3

1,3 Department of Electrical and Computer Engineering, Ryerson University
2 Department of Computer Science, Ryerson University
350 Victoria St., Toronto, ON, Canada M5B 2K3
{j3zhou, cding, dimitri}@ryerson.ca

Abstract

Despite the success of global search engines, web site search engines still suffer from poor performance. Since a web site differs from the whole web in link structure, access pattern and data scale, methods that improve the performance of web search are not always successful when applied to web site search. In this paper, we propose a novel algorithm that improves retrieval performance by using web server logs. Web server logs are grouped into sessions, and the relationships of the web pages within a session are analyzed based on their similarities; a new web page representation is then generated. Anchor text is used to create another representation. Both are combined with the original text-based representation in web site search. Two kinds of combination methods are investigated and tested: combination of document representations and combination of ranking scores. Our experimental results show that our algorithm improves retrieval accuracy for the four retrieval models we tested: the Inference Network Model, the Okapi Model, the Cosine Similarity Model and the TFIDF Model. The highest performance increase from web log analysis comes from the TFIDF model, and overall, the inference network model with web log information achieves the best result.

1 Introduction

The World Wide Web has permeated our daily life and has changed the way we think, work and live. In recent years, search engines such as Google [2], Yahoo! [6] and MSN [5] have been a great help for users retrieving their desired information from the growing web. After people view the results returned by a search engine, they usually click some URLs to visit those web sites, and once at a site, they might find the information they want immediately, or they might browse through hyperlinks or conduct a web site search to further their information seeking tasks. For example, a student might search for course-related information on a departmental web site; a potential customer of Amazon might search for a specific book or DVD on the Amazon web site. Although web site search is frequently used, users often suffer from its performance deficiency: when a query is submitted, many irrelevant results are returned, or the most relevant pages are listed outside the top 10 results. In the Forrester survey [20], the search facilities of 50 web sites were tested, and none of them returned satisfactory results for the test queries. For example, the most relevant pages were rarely placed on the first page of results, the best matched web pages were not retrieved, and many irrelevant results were returned.

Many approaches, such as link analysis [12, 21] and click-through based ranking [3], which are very successful in web search engines, do not work well in web site search. First, the core of link-based algorithms such as PageRank [12] or HITS [21] is to compute the ranking score of each page from the hyperlinks between web pages: if a page gets more in-links, its ranking score is higher, and the page is considered more important. As analyzed in [34], within a web site, hyperlinks are usually created to organize the whole site into a hierarchical or linear structure. When these link-based algorithms are applied to a single web site, pages such as homepages get higher ranking scores, but usually these pages are not what the web user is searching for. Secondly, click-through based algorithms such as DirectHit [3] have their own limitations. After a query is submitted, DirectHit looks at previous log data for the same query and assigns higher ranking scores to pages with more click-throughs. One problem with DirectHit is that it assumes a page is relevant to a query whenever a web user visits it; if a user accesses a page unintentionally, it is not right to rank this page higher just because it was visited.

In this paper, we propose a new approach that improves web site search accuracy by using web server logs. Web server logs record the details of a web user's browsing and searching behaviors; thus, they contain more complete relevance judgment information than click-through data. When users navigate through a web site, the pages they choose to browse reflect their judgment on how related the page content is to their information tasks. We apply our algorithm, the Web Log Analyzer (WLA), to web server logs to generate a new document representation for each page appearing in the logs. The algorithm performs session detection and term propagation within each session. When propagating terms in a session, we consider the access time of the pages visited, the duration of each visit and the similarity between two pages. Since anchor text has proven able to improve web site search, we also extract anchor text from web pages to create another document representation that supplements the original text-based representation.

The major contributions of this paper are: 1) a novel algorithm that improves web site search by using web server logs; 2) an investigation of combining different sources of evidence for web site search; 3) experiments on several retrieval models, namely Cosine Similarity, Okapi, TFIDF and Inference Networks. From our experimental results, using WLA and the combination methods improves retrieval accuracy, in terms of precision, by 7.9% for the Inference Network retrieval model, 10.6% for the Okapi retrieval model, 22.9% for the Cosine Similarity retrieval model and 45.4% for the TFIDF retrieval model.

The paper is organized as follows: Section 2 reviews the related work in the area. In Section 3, we present our novel WLA algorithm, which generates a new web page representation from web server logs. Section 4 introduces the combination of the three representations and of the ranking scores. The experimental results are discussed in Section 5, and we draw our conclusion and present future directions in Section 6.

Copyright © 2006 Jin Zhou, Chen Ding, Dimitrios Androutsos and Ryerson University. Permission to copy is hereby granted provided the original copyright notice is reproduced in copies made.

2 Related Work

Because web site search has become increasingly important in recent years, much research and product development has been devoted to this area. Most techniques use traditional search technologies, which retrieve a number of documents containing the same keywords as the query and rank them according to the similarity between each web page and the query.

Xue et al. [33] proposed a log mining model to improve the performance of site search. They utilized web server logs and taxonomies to extract a set of generalized association rules which reflect associations not only between leaf nodes but also at different abstraction levels. When a user submits a query, a full-text search engine finds all the web pages that match the query, and these pages are then re-ranked by the association rules. Their algorithm improved the precision of web site retrieval. When they constructed the taxonomies, however, they assumed that a web site is hierarchically organized based on page content. In fact, this assumption is not accurate: there is no exact correspondence between the site hierarchy and a taxonomy based on site semantics.

Cui and Dekhtyar [15] proposed the LPageRank algorithm for web site search. They first analyzed web logs and constructed a probabilistic graph of the web site; LPageRank was then performed on the graph. From their experimental results, which were from a preliminary stage, it can be seen that LPageRank combined with TFIDF is not good enough to outperform Google's web site search. Different from these approaches, we use web server logs to generate page representations, which is a more traditional information retrieval technique.

There is much research on query expansion; however, it is not directly related to our work, and the interested reader is directed to the references [9, 17, 22, 25, 30]. Billerbeck and Zobel [10] explored the use of document expansion as an alternative to query expansion and proposed two new methods. The first method treats each document as a query and then augments it with expansion terms. The second treats each term in the collection as a query and then adds the original query term to the top-ranked documents. Although the query-time costs are low, the index-time costs are considerable, and corpus-based document expansion cannot significantly improve effectiveness. Another issue is that the topic of the expanded document can change significantly from the original topic, which could lead to poor performance.

Many relevance propagation methods have recently been proposed to enhance relevance weighting schemes [12, 29, 31]. Qin et al. [28] provided a comparison study on the effectiveness and efficiency of various propagation models and also proposed a generic relevance propagation framework. Through their experiments, they found that relevance propagation can boost the performance of web information retrieval.

Combination of multiple sources of evidence [11, 24, 32] is very effective in web search. Ogilvie and Callan [27] investigated the combination of document representations, although their focus was on known-item search, which is different from site search. They proposed a mixture-based retrieval model and experimented with the use of citations, link text and additional structural information on the web [13, 14, 23]; in their results, they found that using link text is beneficial. They also examined many different meta-search algorithms. In [26], Myaeng et al. used Bayesian inference networks to combine terms found in different document representations. Fagin et al. [19] proposed a novel architecture that combines various ranking heuristics through a rank aggregation algorithm; their aggregation algorithms take multiple ranked lists from the various heuristics as input and produce a combined ordering of pages. Our approach is different from these methods: our system takes web server logs as a new source of evidence.

3 Web Log Analyzer (WLA)

Web server logs are an important source for improving web site search because they reflect web users' judgments on page content. In this section, we introduce how WLA generates a document representation from web server logs. Figure 1 illustrates the workflow of WLA: it takes web server logs as input and outputs a new document representation. Several steps are involved, and we discuss them in detail below.

Figure 1: The workflow of WLA (Web Logs → Log Purge → Session Detection → Term Propagation → Page Merging → New Representation).

3.1 Web Log Purge

Server logs record all requests from web users. Each log entry contains the following fields: client IP, user name and password (normally empty and represented by "-"), access time, HTTP request method, URL, protocol, status code, number of bytes transmitted, and referrer. An example log entry is shown here (with the IP address replaced by a user ID):

120 - - [01/Sep/2005:01:15:57 -0400] "GET /~mth108/fractals/index.html HTTP/1.1" 200 6006 http://scs.ryerson.ca/~mth108

Here, the user with ID 120 requested the web page /~mth108/fractals/index.html on September 1st, 2005 at 01:15:57; the size of the transmitted page is 6006 bytes, and the client was referred from http://scs.ryerson.ca/~mth108. Since not every log entry is useful, the web logs are purged by filtering out all image, video and audio requests, as well as failed page requests.
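A log purge step of this kind can be sketched as follows. This is an illustration, not the authors' code; the regular expression and the media-extension list are our assumptions:

```python
import re

# One entry in the format above: user - - [time] "METHOD url PROTO" status bytes referrer
LOG_PATTERN = re.compile(
    r'(?P<user>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d+) (?P<bytes>\S+)(?: (?P<referrer>\S+))?'
)
MEDIA_EXTS = ('.gif', '.jpg', '.jpeg', '.png', '.mp3', '.avi', '.mpg')  # assumed list

def purge(lines):
    """Keep only successful page requests (drop media requests and failed requests)."""
    kept = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue
        url, status = m.group('url').lower(), int(m.group('status'))
        if status >= 400:                 # failed page request
            continue
        if url.endswith(MEDIA_EXTS):      # image/video/audio request
            continue
        kept.append(m.groupdict())
    return kept
```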


3.2 Session Detection and Representation

A user session is defined as a set of clicks over a limited time by one user. In this paper, a user is assumed to be uniquely identified by the IP address recorded in the HTTP request. The whole web server log can be represented as a set of sessions S = (S_1, S_2, ..., S_n), where n is the number of sessions. Each session S_j consists of a number of fields, such as session ID, user ID, entry URL, entry referrer, URLs, referrers, etc. The entry URL is the first web page visited in a session, and the entry referrer is the referrer of the entry URL. If the entry referrer is "-", the user typed the URL directly into a browser window; that is to say, there is no referrer for the entry URL. URLs are the set of URLs the user visited in the session, defined as (url_0, url_1, ..., url_m), and referrers are the corresponding set of referrers, defined as (ref_0, ref_1, ..., ref_m), where m+1 is the number of web pages the user visited in session S_j.

Duration is the time spent on each web page; the corresponding set of durations is represented as (urld_0, urld_1, ..., urld_m). The duration of each page except the last one is calculated by:

urld_i = itime_{i+1} - itime_i    (3-1)

where i is an integer between 0 and m, and itime_i is the time when the user accesses page p_i. Since the time a user exits a session is not recorded, we approximate the duration of the last web page in a session with the following formula; the underlying assumption is that this duration usually does not deviate too much from the average duration:

urld_m = (1/m) Σ_{i=0}^{m-1} urld_i    (3-2)

The duration of a session is calculated once each web page's duration is known:

SD_j = Σ_{i=0}^{m} urld_i    (3-3)

where SD_j is the duration of session S_j.

Figure 2 illustrates the session detection algorithm. In Figure 2, Max_Duration_Page is the maximum time spent on one web page, which we set to 20 minutes, and Max_Duration_Session is the maximum duration of one session, which we set to 2 hours. These thresholds were chosen based on our observation of users' access patterns and on related research in data mining [16]. To "update SD_j" means to recompute the duration of the current session using formulas (3-1), (3-2) and (3-3); to "new a session" means to create a new session whose entry URL is the URL of log_i and whose entry referrer is the referrer of log_i; to "add log_i into the session" means to append the URL of log_i to the session's URL list and the referrer of log_i to its referrer list.

1. for each log_i from the same user
2.   if urld_i > Max_Duration_Page or
3.      SD_j > Max_Duration_Session or
4.      ref_i == "-" or ref_i ∉ (url_0, url_1, ..., url_{i-1})
5.     new a session S_{j+1}
6.   else
7.     add log_i into the current session S_j and update SD_j
8.   end if
9. end for

Figure 2: Session detection algorithm in WLA.

3.3 Term Propagation

To get more terms for each page, propagation is applied within each session. There are two steps in the propagation process: (1) extracting entry terms, and (2) propagating terms along the access path. In a session, the terms for the entry URL are extracted from the anchor text in the entry referrer which links to the entry URL, or from the query terms if the entry referrer is a search engine result page.
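The session detection procedure of Figure 2 can be sketched in Python as follows. This is a simplified illustration under our reading of the algorithm; the entry layout (dicts with 'url', 'referrer', 'time' in seconds) is an assumption:

```python
MAX_DURATION_PAGE = 20 * 60          # 20 minutes, in seconds
MAX_DURATION_SESSION = 2 * 60 * 60   # 2 hours, in seconds

def detect_sessions(logs):
    """Split one user's purged log entries into sessions.

    A new session S_{j+1} starts when the page or session duration exceeds
    its threshold, or when the referrer is empty ("-") or points outside
    the pages already seen in the current session S_j.
    """
    sessions, current, session_start = [], [], None
    for log in logs:
        if not current:
            current, session_start = [log], log['time']
            continue
        page_duration = log['time'] - current[-1]['time']
        session_duration = log['time'] - session_start
        if (page_duration > MAX_DURATION_PAGE
                or session_duration > MAX_DURATION_SESSION
                or log['referrer'] == '-'
                or log['referrer'] not in [e['url'] for e in current]):
            sessions.append(current)                      # close S_j
            current, session_start = [log], log['time']   # new session S_{j+1}
        else:
            current.append(log)                           # add log_i into S_j
    if current:
        sessions.append(current)
    return sessions
```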


For other pages in the session, terms are extracted from two parts: one is the anchor text in their referrer pages, and the other is propagated from previous pages. The process is shown in Figure 3. During the propagation, the term weight for each page in the session is calculated by the following formula:

TW_i = α × S(p_0, p_i) × R(p_0, p_i) × TW_0 + β × Σ_{j=1}^{i-1} S(p_j, p_i) × R(p_j, p_i) × TW_j + γ × TW_{i-1}    (3-4)

α + β + γ = 1    (3-5)

where TW_i is the weight for the terms in page p_i, TW_0 is the term weight for the entry page of the session, and S(p_j, p_i) represents the similarity between two pages p_j and p_i. The similarity score is computed by the cosine method [8]:

S(p_j, p_i) = (p_j^T · p_i) / (||p_j|| · ||p_i||)    (3-6)

The similarity function only applies to the first two parts of (3-4), without affecting the third part. The reasoning behind this is that terms extracted from an immediate ancestor on an access path represent the content of the linked page with higher confidence, while terms from farther ancestors might deviate from the original content, and are thus reliable only when the similarity scores between the two pages are high. The function R computes the distance between two pages, given by:

R(p_j, p_i) = 1 / (i - j + 1)    (3-7)

So for each web page p_i in the session, its terms and corresponding weights are represented as:

V_url_i = {(TV_url_0, W_url_0), ..., (TV_url_i, W_url_i)}

where TV_url_i represents the anchor window text on the link from page p_{i-1} to page p_i, and W_url_i is the weight of the terms in TV_url_i. In a session, the URLs' vector is then represented by:

Vector_URL = {V_url_0, V_url_1, ..., V_url_m}

where (m + 1) is the total number of web pages in the session. If the similarity S(p_j, p_i) between two pages p_j and p_i is less than a threshold, propagation is stopped; in that case, (m + 1) is the number of pages propagated.

Figure 3: The propagation process in one session. (The figure shows a session path url_0 → url_1 → url_2 → url_3: the entry URL url_0 receives anchor/query terms TV_url_0 from the entry referrer, and each later link carries its own anchor terms TV_url_i. The accumulated term vectors are url_0: TV_url_0; url_1: TV_url_0 + TV_url_1; url_2: TV_url_0 + TV_url_1 + TV_url_2; url_3: TV_url_0 + TV_url_1 + TV_url_2 + TV_url_3.)

3.4 Page Merging

In one session, a web page may appear more than once, and in the whole log, a page may appear in different sessions. Because each web page in one session is represented by a term vector and a corresponding weight vector, when merging the pages from different sessions it is necessary to merge these into a single term vector and a single weight vector. The page is therefore finally represented by one term vector and one corresponding weight vector:

Term_Vector = {Term_1, Term_2, ..., Term_l}
Weight_Vector = {Weight_1, Weight_2, ..., Weight_l}

where l is the number of terms in the vector. When we merge the pages from different sessions, the session weight is included in the calculation. The session weight considers two factors: how long the session lasts, measured by the session duration, and how recently the session happened, measured by the session recency. The session weight is expressed as:

SW_j = log(SD_j × SR_j × factor)    (3-8)
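The propagation formulas (3-4) to (3-7) can be sketched as follows. This is an illustrative reading: the bag-of-words page vectors, the example α, β, γ values and the cutoff threshold value are all our assumptions, and the cutoff is assumed to be checked between consecutive pages:

```python
import math

def cosine(p, q):
    """Cosine similarity S(p_j, p_i) between two term-frequency dicts, per (3-6)."""
    dot = sum(p[t] * q.get(t, 0) for t in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def distance(j, i):
    """R(p_j, p_i) = 1 / (i - j + 1), per (3-7)."""
    return 1.0 / (i - j + 1)

def propagate_weights(pages, tw0, alpha=0.4, beta=0.3, gamma=0.3, threshold=0.1):
    """Compute TW_i for each page p_0..p_m along an access path, per (3-4).

    `pages` is a list of term-frequency dicts; propagation stops when the
    similarity to the previous page falls below `threshold`.
    """
    assert abs(alpha + beta + gamma - 1.0) < 1e-9       # constraint (3-5)
    tw = [tw0]
    for i in range(1, len(pages)):
        if cosine(pages[i - 1], pages[i]) < threshold:  # cut off propagation
            break
        middle = sum(cosine(pages[j], pages[i]) * distance(j, i) * tw[j]
                     for j in range(1, i))
        tw.append(alpha * cosine(pages[0], pages[i]) * distance(0, i) * tw[0]
                  + beta * middle
                  + gamma * tw[i - 1])
    return tw
```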


where SD_j is the session duration, computed by (3-3), factor is for scaling purposes (set to 100), and SR_j is the session recency, defined as:

SR_j = 1 / log(Tnow - Titime_j)    (3-9)

where Tnow is the current system time, and Titime_j is the start time of visiting the entry URL. The earlier the web user visited the page, the lower the recency.

Figure 4 illustrates the algorithm for merging the term vectors and weight vectors of a page p_i from different sessions. For each session containing page p_i, the weight of each term in the vector is calculated as the current session weight multiplied by the current term weight. The reasoning is that a higher session weight means the user was more serious about the information seeking task in that session, so the weights from that session should contribute more to the final page vector. Eventually, we merge the two vectors Term_Vector and Weight_Vector as follows:

TW_Vector = {(Term_1, Weight_1), ..., (Term_l, Weight_l)}

1.  for each session s ∈ (s_1, s_2, ..., s_n)
2.    for each url u ∈ (url_1, ..., url_m) in session s
3.      if u == p_i
4.        for each term vector tv ∈ {TV_url_0, ..., TV_url_k} in u
5.          for each term t ∈ {Term_0, ..., Term_j} in tv
6.            if term t ∉ Term_Vector
7.              append term t to the Term_Vector of p_i
8.              w = W_url(u) × SW(s)
9.            else
10.             w = w + W_url(u) × SW(s)
11.           end if
12.         end for
13.       end for
14.     end if
15.   end for
16. end for

Figure 4: The algorithm for merging a page's vectors from different sessions.

4 Combination of Multiple Sources of Evidence

In [27], Ogilvie and Callan found that using anchor text is helpful for site search. In this work, we also analyze anchor text to create a third document representation, in addition to the original full text representation and the web server log representation. It is different from the anchor text used in the term propagation process: here it is extracted from the static link graph of the web site, while the anchor text in the propagation process is considered only when a user actually follows the access path within a session. Anchor text is defined as the "highlighted clickable text" displayed for a hyperlink in an HTML page [18]. To build the anchor text database, two steps are taken: first, parse each web page in the web site and extract all hyperlinks with their corresponding anchor texts; second, merge the anchor text for each URL. In this work, we use anchor window text instead of plain anchor text: anchor window text is the text before and after the anchor text within a window, plus the anchor text itself. We set the window size to 80.

After obtaining the three document representations, we investigate two combination methods. The first is the combination of ranking scores, which applies a given retrieval model to each document representation and combines the results into a single ranked list. The second is the combination of document representations, which first merges the different document representations into a single one and then applies a retrieval model to it to produce the result list.

4.1 Combination of Ranking Scores

The similarity score between a query q and a web page p is computed by the following formula:

score(p, q) = α_1 × sim_content(p, q) + β_1 × sim_anchor(p, q) + γ_1 × sim_log(p, q),  with  α_1 + β_1 + γ_1 = 1    (4-1)

where sim_content, sim_anchor and sim_log are the similarity scores based on the original web page, the anchor window text and the web logs, respectively. Aslam and Montague [7] summarized combination by stating: "The systems being combined should (1) have compatible output (e.g., on the same scale), (2) each produce accurate estimates of relevance, and (3) be independent of each other." Because we apply the same retrieval model to the different document representations, requirements (1), (2) and (3) are met for the full text representation and the anchor window text representation. But for the web log representation, since the term weights are calculated differently, we need to consider requirement (1), compatible output. In our experiments, we use the Cosine Similarity, TFIDF and Okapi models [8] to compute the similarity scores between queries and web pages, so we calculate the scores from the web log representation separately to keep them on the same scale as those from the other two document representations.

a. Cosine Similarity model. Since the two scores from the original web page database and the anchor window text database lie in the range 0 to 1 in the cosine similarity model, the score from the web log database should be kept on the same scale, so it is normalized. The formula for computing the score from the web log database is:

sim_log(p, q) = Σ_{i=1}^{m} ( w_{i,p} / Σ_{j=1}^{t} w_{j,p} )    (4-2)

where w_{i,p} is the weight of term i in page p in the web log database, m is the number of terms in the query q, and t is the total number of terms in page p.

b. TFIDF model. The following formula computes the score for the web log database:

sim_log(p, q) = Σ_{i=1}^{m} tf(q_i) × idf_i² × w_{i,p}    (4-3)

where tf(q_i) is the term frequency in the query q, idf_i is the inverse document frequency in the database, w_{i,p} is the weight of term i in page p in the web log database, and m is the total number of terms in query q.

c. Okapi model. Since the scores calculated from the web log database are on the same scale as the other two, they do not need additional processing.

d. Inference Network model. We use the Indri [1] implementation of the inference network model, which also incorporates language modeling. Combination of beliefs in the Indri model is different from the above three models. We first combine the three document representations into a single document representation. When combining them, tags are added as field markers to differentiate terms coming from different document representations. For example, if the terms come from the anchor text database, the tags <anchortext> and </anchortext> are added to the combined document representation; in the same way, the tags <logtext> and </logtext> are added for terms extracted from the web logs. We then set the parameters α_1, β_1 and γ_1 in the structured queries to weight the terms differently.

4.2 Merging of Document Representations

The goal of this combination approach is to merge the document representations into one by merging their terms into one document, and then apply the different retrieval models. Thus, no weighting parameters for scores are calculated on the different representations. This approach is, to a certain extent, similar to document expansion. Since no weighting scheme is involved in any database when combining the three representations, the term weights in the web log representation are ignored. For the Indri model, in the structured query we ignore the difference between terms from different fields and retrieve results based on the whole combined document representation.
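As an illustration of the ranking-score combination, the normalized web-log score (4-2) and the linear combination (4-1) can be sketched as follows; the function interfaces and example weights are our assumptions:

```python
def sim_log_cosine(log_weights, query_terms):
    """Normalized web-log score for one page, per (4-2).

    `log_weights` maps each term of page p to its web-log weight w_{i,p}.
    """
    total = sum(log_weights.values())      # Σ_j w_{j,p} over all t terms of p
    if total == 0:
        return 0.0
    return sum(log_weights.get(t, 0.0) for t in query_terms) / total

def combined_score(sim_content, sim_anchor, sim_log, alpha1=0.5, beta1=0.2, gamma1=0.3):
    """score(p, q) per (4-1); the three weights must sum to 1."""
    assert abs(alpha1 + beta1 + gamma1 - 1.0) < 1e-9
    return alpha1 * sim_content + beta1 * sim_anchor + gamma1 * sim_log
```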


5 Experiment and Analysis

5.1 Data Set

We conducted our experiment on the Department of Computer Science web site at Ryerson University. There were 9,739 web pages in total, not counting audio, image and video files; these comprised the original web page representation database. There were 5,780 web pages in the anchor window text database.

We downloaded the web logs from January 1, 2005 to December 31, 2005 from the web server. There were 582,799 records after the logs were purged; for example, we kept only pages with file extensions such as .html, .shtml, .htm, etc. Two databases were obtained from the web logs by two different methods in the term propagation phase. In the first method, we did not perform propagation and considered only the entry page of each session. In the second method, we set the propagation depth to 3, dealing with a maximum of 4 pages within one session. The fixed depth threshold was chosen only for simplicity of implementation; a better way would be to cut off the propagation whenever the similarity between two pages falls below a threshold.

We chose 24 queries, either extracted from the web logs or composed of terms randomly extracted from the log and anchor window text databases. They are listed in Figure 5.

lab facilities policies
coop prospective description
artificial intelligence course introduction
human computer interaction course outline
cps109 assignment submission instruction
cps125 review problem final exam
cps607 lecture notes
java programming
admission requirement
computer science course introduction
data communications topic
unix account management
web related research
unix tutorial
cps125 extra practice question
image animation design
textbook reference
pliant installation
winter course
eng203 lab schedule
computer graphics assignment
research project
algorithm analysis
operating system course grade

Figure 5: Queries in the experiment.

5.2 Implementation

Our system was implemented using the Lemur Toolkit [4]. We applied the Cosine Similarity, TFIDF, Okapi and Indri models to the two combination methods. For the linear combination of ranking scores, we chose the parameter settings shown in Table 1.

ID  α    β    γ      ID  α    β    γ
1   0.8  0.0  0.2    14  0.5  0.1  0.4
2   0.8  0.1  0.1    15  0.5  0.2  0.3
3   0.8  0.2  0.0    16  0.5  0.3  0.2
4   0.7  0.0  0.3    17  0.5  0.4  0.1
5   0.7  0.1  0.2    18  0.5  0.5  0.0
6   0.7  0.2  0.1    19  0.4  0.6  0.0
7   0.7  0.3  0.0    20  0.4  0.5  0.1
8   0.6  0.0  0.4    21  0.4  0.4  0.2
9   0.6  0.1  0.3    22  0.4  0.3  0.3
10  0.6  0.2  0.2    23  0.4  0.2  0.4
11  0.6  0.3  0.1    24  0.4  0.1  0.5
12  0.6  0.4  0.0    25  0.4  0.0  0.6
13  0.5  0.0  0.5    26  0.3  0.3  0.4

Table 1: Parameter settings for the experiment.

For the Indri model, we constructed the query as illustrated in the following example to implement the combination of ranking scores. In this example, the weight for terms from the original web pages is 0.5, the weight for anchor window text is 0.2, and the weight for web log text is 0.3.

#combine( #wsum( 0.5 cps109.(html) 0.2 cps109.(anchortext) 0.3 cps109.(logtext) )
          #wsum( 0.5 exam.(html) 0.2 exam.(anchortext) 0.3 exam.(logtext) )
          #wsum( 0.5 bank.(html) 0.2 bank.(anchortext) 0.3 bank.(logtext) ) )

We then used the following query to implement the combination of document representations:

#combine( #wsum( 1.0 cps109 ) #wsum( 1.0 exam ) #wsum( 1.0 bank ) )

5.3 Evaluation Methods

Twenty volunteer students were recruited to evaluate the relevance of the results according to their understanding and knowledge. Each student evaluated 8 or more queries, so that each query was evaluated by at least 8 students. The final relevance judgment for each page was decided by majority vote. We evaluated our algorithm by top 10 precision: for each method, we check how many of the top 10 results are relevant to the query. If there are 3 relevant pages in the top 10 list retrieved by a method, its precision is 3/10 = 0.3.

5.4 Results and Analysis

In the results, the precision values from the original database (i.e., "ori" and "indri_ori") are taken as the baselines.

Figure 6 (a) shows the performance of the Cosine Similarity model. From this figure, we can see that both the combination of document representations and the combination of ranking scores perform better than the original method. The combination of three document representations (original web pages, web server logs and anchor window text) is better than the combination of two (original web pages and anchor window text only): the former improves the top 10 precision by 10.3%, the latter by 8.5%. This shows that even without considering the weighting scheme in the logs, the logs provide useful evidence for improving retrieval accuracy. We also find that the combination of three ranking scores outperforms the combination of two ranking scores (original web pages and web logs). Increasing the weight on the original web pages while decreasing the weight on the web logs slightly reduces performance. When we set α_1 = 0.4, β_1 = 0.2, γ_1 = 0.4, the improvement reaches 22.9%.

In the Okapi retrieval model, shown in Figure 6 (b), both the representation combination and the ranking score combination perform better than the original method. The combination of three document representations is better than the combination of two (original web page content and anchor text): the former improves precision by 10.4%, the latter by 9.1%. When combining three document representations, combining web logs with propagation of depth 3 (10.6%) is better than without propagation (10.1%). It is interesting that combining just two ranking scores (original web pages and web logs) gives no improvement, while combining three document representations improves performance by 10.6%, which happens to be the highest among all methods.

From Figure 6 (c), we notice that combination is also the best way to improve the TFIDF retrieval model. The combination of three document representations is better than the combination of two (original web pages and anchor text): the former improves precision by 11.1%, the latter by 6.6%. When combining three document representations, combining the web logs with propagation of depth 3 (12.3%) is better than without propagation (9.8%). In this model, the combination of ranking scores is better than the combination of document representations: if we set α_1 = 0.3, β_1 = 0.3, γ_1 = 0.4, the retrieval improvement is 45.4%.

For the Indri retrieval model, the precisions are shown in Figure 6 (d). With the weighting scheme, only the combination with anchor window text helps to improve performance, but overall, the combination of three document representations is the best way to improve retrieval accuracy among these methods, by as much as 7.9%.

Figure 7 shows the top 10 precision results of the four retrieval models for the 24 queries. Two methods are chosen for each model. Specifically, when α_1 = 0.4, β_1 = 0.2, γ_1 = 0.4 for the Cosine Similarity

9

Okapi Retrieval Model

all_811

all_802

all_721

all_703

all_631

all_604

all_541

Method

Method

(a)

(b)

TFIDF Retrieval Model

Indri Retrieval Model 0.5

0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

0.45 0.35 0.3 0.25 0.2 0.15 0.1

all_721

0.05

in or i dr i_ in lo g dr i_ in anc dr i_ in 334 dr i_ in 460 dr i_ in 523 dr i_ in 550 dr i_ in 640 dr i_ in 721 dr i_ in 730 dr i_ in 802 dr i_ in 811 dr i_ 82 in 0 dr i_ al l

0

in d

ri _

all_703

all_631

all_604

all_541

all_505

all_451

all_406

ori

all_334

log_3

anc +ori+log_3

anc +ori+log_0

anc

To p 10 Precision

0.4

anc +ori

Precisio n

all_505

ori

all_406

anc+ori+log_3

anc+ori

anc+ori+log_0

0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

all_811

all_802

all_721

all_703

all_631

all_604

all_532

all_505

all_424

ori

all_406

anc +ori+log_3

anc +ori

Precision

0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 anc +ori+log_0

Precisio n

Cosine Sim ilarity Retrieval Model

Method

M ethod

(c)

(d)

Figure 6: Precisions of different retrieval models and methods.

larity model, the precisions of most queries are increased. It is more obvious in the TFIDF model when we set α 1 = 0.3, β 1 = 0.3, γ 1 = 0.4 . Figure 7 illustrates that more than 2/3 queries get higher precisions using the combination method than those not using any combination method.
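The two steps evaluated above, linearly combining per-representation ranking scores with weights α1, β1, γ1 and then measuring top 10 precision, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names (`combine_scores`, `precision_at_10`) and the toy score dictionaries are our own assumptions.

```python
def combine_scores(content, logs, anchor, alpha=0.4, beta=0.2, gamma=0.4):
    """Linearly combine per-page ranking scores from the three
    representations: original page content, the web-log representation,
    and anchor (window) text. Weights correspond to alpha1/beta1/gamma1."""
    pages = set(content) | set(logs) | set(anchor)
    return {
        p: alpha * content.get(p, 0.0)
           + beta * logs.get(p, 0.0)
           + gamma * anchor.get(p, 0.0)
        for p in pages
    }

def precision_at_10(ranked_pages, relevant):
    """Top 10 precision: relevant pages among the first 10 results / 10."""
    top10 = ranked_pages[:10]
    return sum(1 for p in top10 if p in relevant) / 10.0

# Toy usage: three score dictionaries for one query.
content = {"a.html": 0.9, "b.html": 0.2}
logs = {"a.html": 0.5, "c.html": 0.8}
anchor = {"b.html": 0.7}
combined = combine_scores(content, logs, anchor)
ranking = sorted(combined, key=combined.get, reverse=True)
print(precision_at_10(ranking, relevant={"a.html"}))  # 0.1
```

Combining document representations instead would merge the term vectors before retrieval rather than the scores after it; only the score-level variant is shown here.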

Figure 7: Precisions of different retrieval models and methods based on queries. Panels: (a) Cosine Similarity, (b) Okapi, (c) TFIDF, (d) Indri; each panel plots per-query top 10 precision for the baseline (ori or indri_ori) and the selected combination methods.

6 Conclusion

In this paper, we proposed a novel algorithm, WLA, which makes use of web server logs to improve the performance of web site search. In order to cover the whole web site, we also used the anchor window text and the original full text to build additional document representations, and we investigated combination algorithms to find the best combination methods for different retrieval models. The key of WLA is to propagate terms along the access path within each session of the web logs. It involves two steps: extracting terms to represent each document and calculating their weights at the same time. When calculating term weights, we take into account the similarity between pages, which reduces the risk of introducing noise into each page, and the term recency, which makes more recently accessed pages more important than those accessed earlier.

In our research, we investigated two kinds of combination, combination of document representations and combination of ranking scores, by testing different parameter settings, and we tested different retrieval models to verify our algorithms. From our experiments, we find that both types of combination based on WLA improve the retrieval performance of web site search for the Cosine Similarity, Okapi, TFIDF and Indri retrieval models. In particular, in terms of top 10 precision, the improvement reaches 22.9% for the Cosine Similarity model, 10.6% for the Okapi model, 45.4% for the TFIDF model, and 7.9% for the Indri model. In addition, we find that for the Okapi and Indri models the combination of document representations performs better than the combination of ranking scores, while for the Cosine Similarity and TFIDF models the combination of ranking scores performs better than the combination of document representations.
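The propagation step summarized above can be sketched as follows, under our own assumptions: terms flow from a page to the pages visited earlier in the same session, weighted by page similarity and attenuated with distance, so that the terms of more recently accessed pages carry more weight. The name `propagate_terms`, the backward direction, and the exponential decay are illustrative choices, not the paper's exact scheme.

```python
from collections import Counter

def propagate_terms(session, page_terms, similarity, depth=3, decay=0.5):
    """Propagate terms along a session's access path: each page's terms
    are passed back to the pages visited up to `depth` steps before it,
    weighted by the content similarity of the two pages and attenuated
    with distance. `session` is the ordered list of visited URLs,
    `page_terms` maps a URL to a term->weight mapping, and
    `similarity(u, v)` returns a value in [0, 1]. Returns augmented
    per-page term weights."""
    augmented = {url: Counter(page_terms.get(url, {})) for url in session}
    for i, src in enumerate(session):
        for step in range(1, depth + 1):
            j = i - step
            if j < 0:
                break
            dst = session[j]
            # Similarity damps noisy transfers; decay damps distant ones.
            w = similarity(src, dst) * (decay ** step)
            for term, weight in page_terms.get(src, {}).items():
                augmented[dst][term] += w * weight
    return augmented

# Toy session: terms of the last-visited page flow back to earlier pages.
session = ["/a", "/b", "/c"]
page_terms = {"/a": {"fractals": 1.0}, "/c": {"exam": 2.0}}
aug = propagate_terms(session, page_terms, lambda u, v: 0.5)
print(aug["/b"]["exam"])  # 0.5
```

Setting a similarity threshold below which `w` is zeroed would implement the depth-selection idea raised in the future work.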

There are several directions for future work. Firstly, when we do the propagation, we could determine the propagation depth within each session by setting a threshold on the similarity between two pages. Secondly, in order to include more document representations, we could build a representation of each web page from structural information such as the title, headings, or image alternate text. Thirdly, we would like to test the performance of WLA and the combination methods on a different web site, e.g., a commercial web site or a larger web site than the one we used.

Acknowledgements

We thank Romulo Velasquez and Vincent Wu for their programming work, and Dr. Eric Harley for helping us with the experimental setup. This work was sponsored by the Natural Sciences and Engineering Research Council of Canada (grant 299021-04).

About the Authors

Jin Zhou earned his Master's degree in the Department of Electrical and Computer Engineering at Ryerson University, where this work was conducted. His research focused on web site search. He is currently working at IBM.

Dr. Chen Ding is an assistant professor in the Department of Computer Science at Ryerson University. She received her PhD from the National University of Singapore. Her main research area is web information retrieval.

Dr. Dimitrios Androutsos is an associate professor in the Department of Electrical and Computer Engineering at Ryerson University. He received his PhD from the University of Toronto. His main research areas are image processing, image retrieval and multimedia processing.

References

[1] Indri Retrieval Model, http://ciir.cs.umass.edu/~metzler/indriretmodel.html.

[2] Google, http://google.com.

[3] DirectHit, http://www.directhit.com.

[4] Lemur Project, http://www.lemurproject.org/.

[5] MSN Search, http://search.msn.com/.

[6] Yahoo!, http://www.yahoo.com.

[7] J.A. Aslam and M. Montague, Models for Metasearch, In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.

[8] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.

[9] B. Billerbeck, F. Scholer, H.E. Williams, and J. Zobel, Query Expansion Using Associated Queries, In Proceedings of the 12th International Conference on Information and Knowledge Management, 2003.

[10] B. Billerbeck and J. Zobel, Document Expansion versus Query Expansion for Ad-hoc Retrieval, In Proceedings of the 10th Australasian Document Computing Symposium, 2005.

[11] B.T. Bartell, G.W. Cottrell, and R.K. Belew, Automatic Combination of Multiple Ranked Retrieval Systems, In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.

[12] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, In Proceedings of the 7th International Conference on WWW, 1998.

[13] K. Collins-Thompson, P. Ogilvie, Y. Zhang, and J. Callan, Information Filtering, Novelty Detection, and Named-Page Finding, In Proceedings of the 11th Text REtrieval Conference (TREC-11), 2002.

[14] N. Craswell, D. Hawking, and S. Robertson, Effective Site Finding Using Link Anchor Information, In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.

[15] Q. Cui and A. Dekhtyar, On Improving Local Website Search Using Web Server Traffic Logs: A Preliminary Report, In Proceedings of ACM WIDM, 2005.

[16] R. Cooley, B. Mobasher, and J. Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, Knowledge and Information Systems, 1(1):5-32, 1999.

[17] S. Cronen-Townsend, Y. Zhou, and W.B. Croft, A Framework for Selective Query Expansion, In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, 2004.

[18] N. Eiron and K.S. McCurley, Analysis of Anchor Text for Web Search, In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.

[19] R. Fagin, R. Kumar, K.S. McCurley, J. Novak, D. Sivakumar, J.A. Tomlin, and D.P. Williamson, Searching the Workplace Web, In Proceedings of the 12th World Wide Web Conference, 2003.

[20] P. Hagen, H. Manning, and Y. Paul, Must Search Stink? The Forrester Report, Forrester, 2000.

[21] J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.

[22] K.L. Kwok, TREC2004 Robust Track Experiments Using PIRCS, In Proceedings of the 13th Text REtrieval Conference, 2004.

[23] W. Kraaij, T. Westerveld, and D. Hiemstra, The Importance of Prior Probabilities for Entry Page Search, In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.

[24] J.H. Lee, Analysis of Multiple Evidence Combination, In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1997.

[25] S. Liu, F. Liu, C. Yu, and W. Meng, An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases, In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.

[26] S.H. Myaeng, D.H. Jang, M.S. Kim, and Z.C. Zhoo, A Flexible Model for Retrieval of SGML Documents, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.

[27] P. Ogilvie and J. Callan, Combining Document Representations for Known-Item Search, In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.

[28] T. Qin, T. Liu, X. Zhang, Z. Chen, and W. Ma, A Study of Relevance Propagation for Web Search, In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005.

[29] A. Shakery and C.X. Zhai, Relevance Propagation for Topic Distillation: UIUC TREC 2003 Web Track Experiments, In Proceedings of the 12th TREC, 2003.

[30] F. Scholer, H.E. Williams, and A. Turpin, Query Association Surrogates for Web Search, Journal of the American Society for Information Science and Technology, pages 637-650, 2004.

[31] R. Song, J.R. Wen, S.M. Shi, G.M. Xin, T.Y. Liu, T. Qin, X. Zheng, J.Y. Zhang, G.R. Xue, and W.Y. Ma, Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004, In Proceedings of the 13th TREC, 2004.

[32] C.C. Vogt and G.W. Cottrell, Predicting the Performance of Linearly Combined IR Systems, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.

[33] G. Xue, H. Zeng, Z. Chen, W. Ma, and C. Lu, Log Mining to Improve the Performance of Site Search, In Proceedings of the 3rd International Conference on Web Information Systems Engineering, 2002.

[34] G. Xue, H. Zeng, Z. Chen, W. Ma, H. Zhang, and C. Lu, Implicit Link Analysis for Small Web Search, In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.