
Applied Soft Computing 8 (2008) 1326–1334


Mining search engine query logs for social filtering-based query recommendation§

Zhiyong Zhang, Olfa Nasraoui*

Knowledge Discovery and Web Mining Lab, Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, USA

ARTICLE INFO

Article history: Received 8 October 2007; Accepted 8 November 2007; Available online 27 March 2008

Keywords: Query log; Social filtering; Web mining; Recommendation

ABSTRACT

This paper presents a simple and intuitive method for mining search engine query logs to provide fast social filtering, where searchers receive dynamic query recommendations on a large-scale industrial-strength search engine. We adopt a dynamic approach that is able to absorb new and recent trends in search behavior on search engines, while forgetting outdated trends, thus adapting to dynamic changes in web users' interests. In order to get well-rounded recommendations, we combine two methods. First, we model search engine users' sequential search behavior, and interpret this consecutive search behavior as client-side query refinement that should form the basis of the search engine's own query refinement process; this refinement process is exploited to learn useful information that helps generate related queries. Second, we combine this method with a traditional text or content-based similarity method to compensate for the shortness of query sessions and the sparsity of real query log data.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Providing related queries to search engine users can help them find the desired content more quickly. For this reason, many search engines have started displaying related search keywords at the bottom of the result page. Examples include Yahoo's "Also try" feature; Ask Jeeves's "Narrow Your Search", "Expand Your Search", and "Related Names" features; and Amazon's "Related Searches". The main purpose is to give search engine users a comprehensive recommendation when they search specific topics. Our statistics1 show that the hit rate of the related search keywords is quite high (more than 10%). Recommending the most relevant search keywords not only enhances the search engine's hit rate, but also helps users find the desired information more quickly. Behind these query recommendation interfaces, query log study plays an important role: we can obtain query recommendations by mining search engine query logs, which contain abundant information on past queries. In this paper, we present a simple yet well-rounded solution that has already been implemented with success as part of an industrial-strength search engine on one of the largest portals (www.Sina.com) in China.

§ Part of this work was done while Z. Zhang was with the Search Engine R&D Group, SINA Corporation, Beijing, China.
* Corresponding author. E-mail addresses: [email protected] (Z. Zhang), [email protected] (O. Nasraoui).
1 http://www.iask.com.

1568-4946/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2007.11.004

We further combine this association- or correlation-type information with the textual content of the queries in order to (i) improve coverage and (ii) compensate for the sparsity of the query terms. Our association-type information between queries is computed in a simple and efficient incremental manner from query logs. A soft relation matrix is built to store the relation between consecutive queries that occur within the same session (i.e., submitted by a user in one visit). Queries that are submitted immediately one after the other receive a maximal relation value, while this value is dampened for queries that are farther apart within the same search session. Moreover, the relation values obtained from past sessions are accumulated incrementally with each new session to obtain a global and scalable model. As will be discussed in Section 2, much of the related work relies on click-through data for query log clustering or for generating query recommendations, but no work has considered a user's consecutive query behavior alone, without click-through data. In fact, a user's consecutive query behavior alone can provide abundant information that can be exploited for generating query recommendations. Based on common sense, two consecutive queries submitted by the same user generally reflect that user's own query refinement or learning process. Also, using the query log without click-through data can discover query groups that cannot be identified from click-through data. Finally, no previous work combines query content similarity with user query-behavior-based similarity to generate well-rounded query recommendations.


In this paper: (i) we present a method that utilizes a user's consecutive query behavior, combined with a content-based similarity measurement, for generating related queries; and (ii) for the content-based similarity analysis, since we do not record the clicked documents, we cannot obtain the term frequency of a query term for assigning the weight factor as in the TF·IDF method. Hence, we use the Search Frequency, SF (defined as the frequency of occurrence of a term in all queries in the query log) instead. SF offers a better alternative to term frequency (TF) since it reflects the search engine users' cumulative behavior (or popularity). For the inverse document frequency (IDF), however, we can rely on the entire indexed document collection, which is available for use. We then use what we term an SF·IDF method for weighting each term when calculating the content-based similarity.

To evaluate our method, we used three months' worth of query logs from SINA's search engine for off-line query mining. Based on the evaluations of independent editors on a query test set, our method was found to be effective for finding related queries, despite its simplicity. In addition to the subjective editors' ratings, we also performed tests based on actual anonymous user search sessions.

We organize the paper as follows: in Section 2, we review related work; in Section 3, we introduce our method for detecting related queries; in Section 4, we discuss our implementation details; in Section 5, we show our experimental results; and in Section 6, we present our conclusions.

2. Related work

The study of query logs is not a new topic. In Refs. [10,20], the query logs of Excite and Altavista were studied and analyzed, respectively, and some search patterns were reported. The analysis of the user sessions and query correlations encouraged further exploitation of these statistics. More recently, in Ref. [3], through analyzing the query logs of a web site's internal search engine, the authors describe characteristics of a government web site's search engine that differ from general-purpose search engines. While there are some differences in user search patterns between a general-purpose search engine and a web site's internal search engine, the correlation between user queries within user sessions may remain the same, which is not addressed in the latter work. In Ref. [14], Lau and Horvitz model users' search behavior with a set of predefined classifications. By assigning query refinement classes and informational goals to each user's consecutive search behaviors, and by studying users' inter-query intervals and adjacent refinement actions, they construct probabilistic user models to represent the potential relationships between them. Similarly, in Ref. [18], Rieh and Xie pursue the same line of investigation further by categorizing users' query reformulation behavior by content, format, and resources. In Ref. [15], the authors generate query templates from query logs, which include query keywords and clicked articles. Although there are some specific differences in the methods they use, these works all suppose that there are predefined classes to which consecutive query reformulation behaviors should belong. When a user behavior cannot be classified into one specific group, they either resort to some general or obscure class, or simply mark it as undefined.
In contrast to the above classification techniques, another group of query log clustering techniques emerged. Among them, Beeferman and Berger [2] use a "content-ignorant" approach and a graph-based iterative clustering method to cluster both URLs and queries. In their method, the users' query and click-through data are recorded and utilized; 500,000 click-through records from Lycos are used for their experiments. Later, in Refs. [22,23], Wen et al. proposed a hybrid solution for query log clustering by


combining content-based clustering techniques and cross-reference-based clustering techniques. One month's user logs from the Encarta web site were used for their evaluations. In their technique as well, the user's click-through data play an important role. In these works, the clicked documents were not only used for the query content similarity calculation (term frequency), but also formed the basis for the cross-reference similarity calculation. Putting emphasis on the users' clicking behavior after searching one query is reasonable, since such behavior reflects the users' relevance judgements. These unsupervised clustering techniques have some merits over the predefined classification techniques, especially for dealing with the fast-changing appetites of search engine users. However, relying heavily on click-through data has some drawbacks. First, as mentioned in Ref. [11], the relevance feedback collected through clicked documents constitutes relative relevance feedback, not absolute relevance feedback. Users are more likely to click the top-ranked links presented by the search engine; hence, these methods may be biased by the search engine's original ranking of related links. Second, this approach does not fully consider the real motive behind a specific user's query reformulation or focus shifts. When a user retypes another query that is different from the original query, she may either consider the original query not accurate enough, or may simply want to shift her focus from one topic to another related topic. In the latter case, the user may never click on the same documents as those clicked during the search with the original query, yet the two queries may have good reason to be related to each other. In such a case, the click-through data method would not detect such correlations.

Another very similar topic is query expansion or query refinement. In Refs. [5,6], Cui et al. developed a query expansion technique based on mining search engine query logs. First they confirmed the mismatch between the query space and the document space; then they developed a query expansion method based largely on the user's click-through data. They share a similar intuitive motivation with Wen et al.'s work [22,23]. In their conception, a session contains one query and the set of documents that the user clicked on. Again, they based their experiments on the Encarta web site query logs. Another study, by Ohura et al. [17], used an enhanced K-means method for clustering the query logs of internet yellow page services. In Refs. [8,13], support vector machine (SVM) classification and anchor text mining are used, respectively, for query modification and refinement.

One of the benefits of mining and clustering the search engine query log is to provide query recommendations. In Ref. [1], Baeza-Yates et al. present methods to obtain query recommendations by utilizing click-through data, in which their definition of a user session is similar to Cui's and Wen's; in Ref. [7], Davison et al. discover queries that are related to the queries typically leading to a specific website being highly ranked. However, the latter study is entirely website-focused rather than user-focused; that is, it is more focused on improving the ranking of a specific website (also known as search engine optimization) than on improving the search experience for the user. Another novel approach is that of Ref. [4], in which Chien and Immorlica studied the semantic similarity between search engine queries using temporal correlation. The central idea is to infer that two queries are semantically related if they are temporally correlated. For this purpose, they define a series of time units, and the frequency of a query q over a particular time unit i as the ratio of the number of occurrences of q in i to the total number of queries in i. Their measure of the temporal correlation between two queries p and q over a span of many time units is then defined as the standard correlation coefficient of the frequencies of p and q. Their method may work well for identifying semantically related queries that are


Table 1. Query log study and applications

Study by | Methods | Application
Jansen et al. [10]; Silverstein et al. [20] | Query log analysis | Give search patterns
Beeferman and Berger [2] | Content-ignorant; use click-through data; agglomerative clustering | Query log clustering
Wen and Cui [22] | Content plus cross-reference; use click-through data | Query log clustering; query expansion
Baeza-Yates et al. [1] | Use click-through data | Query recommendation
Fonseca et al. [9] | Use association rules | Related queries
Our approach | Content plus consecutive query behavior; no click-through data | Query recommendation

"event driven" (those queries whose frequencies vary greatly near a specific event), but may not work so well for discovering non-event-driven queries. In Ref. [9], Fonseca et al. use association rules for generating related queries. The idea of mining association rules to discover search engine related queries is interesting and is, in a way, the most similar to our approach, though our approach is much simpler and takes into account the order of the queries in a session.

Within the soft computing community, several works have addressed the problem of query recommendation. Chen and coworkers [12] used fuzzy rules to adapt user queries in information retrieval. In that work, fuzzy rules were extracted from fuzzy clusters describing a group of textual documents of interest to a user, discovered by the Fuzzy C-Means clustering algorithm [19]. Since these rules can characterize the semantic connections between keywords in a set of documents, they can be used to improve user queries for better retrieval performance. The work presented in this paper differs from that approach since no clustering is used, and the rules are not based on the text documents; they are instead based only on the query stream itself and the notion of query sessions. This makes our approach fast and able to adapt continuously and dynamically to changing query inputs. Tajima and Takagi [21] proposed a search engine which conceptually matches input keywords and text data. The conceptual matching is realized by context-dependent keyword expansion using conceptual fuzzy sets, which are implemented using Hopfield networks. That technique is more in line with a search engine that can perform conceptual matching while handling context-dependent word ambiguity, and the conceptual fuzzy sets are extracted from external sources that may have to be entered manually. In contrast, the method proposed in this paper automatically learns fuzzy associative memories at the "query" level, directly from the query log. Miyata et al. [16] proposed a query expansion method based on fuzzy abductive inference. However, that system relies on a rigid representation of domain knowledge, similar to WordNet, for all the assumed concepts. In the work proposed in this paper, we make no such assumption about pre-assumed concepts or domain knowledge. Instead, all of the fuzzy associative memory correlations are learned directly from the data (the query log), which means that the proposed approach can adapt to any search engine regardless, for example, of language. For a summary of the comparison between different approaches for query log mining and query recommendation, see Table 1.

3. Mining related queries from query sessions

In our methods, we used SINA's search engine query logs. SINA is one of the largest Internet portals in China, providing web search, directory services, and other portal services. We used the query logs generated from its web search services (about 30 GB of query logs in total for our evaluation). Each request contains a CookieID identifying the user, the IP address of the client machine, a time stamp, and the query. In case we cannot get the CookieID, we use the user's IP address combined with a timeout for identifying one user. In our query session definition, in order to get

as much information from a single user as we can, we define the query session as a sequence of queries input by one specific search engine user, as long as the time interval between two consecutive queries is below a threshold t:

QuerySession = (query1, query2, query3, ...),

where each query is a set of keywords. After extracting query sessions, we make the assumption that all the queries in the same query session bear some conceptual similarity. This is the same assumption made by methods that use association rules [9]. However, association rules cannot distinguish between the strength of association of two queries that have been submitted immediately one after the other in the same session, and that of two queries separated by other queries in between. Below, we illustrate how we calculate the recommendation score from this association.

3.1. Calculating consecutive-relation scores from consecutive queries

Suppose that $q_k$ and $q_j$ are the kth and jth queries (such that $j > k$) in the same (mth) session. The consecutive-relation score between $q_j$ and $q_k$ from this session is given by the elementary relation:

$$R_m(q_k, q_j) = R^m_{k,j} = d^{\,j-k} \qquad (1)$$

where $d \in (0, 1)$ is a constant damping factor to account for the temporal distance between the queries (we set $d = 1/2$ in our experiments). If we add up all the sessions that contain $q_k$ and $q_j$, then we can obtain the cumulative consecutive-relation between $q_j$ and $q_k$ using the following sum:

$$\mu_k(q_j) = \mu(q_k, q_j) = \sum_{m=1}^{s} R^m_{k,j} \qquad (2)$$

where s is the number of sessions that contain both $q_k$ and $q_j$.

We show an example in Fig. 1, where we have four consecutive queries (after pruning identical queries) in one query session; the vertices in the graph represent the queries, and the arcs between them represent their elementary consecutive relations. For two consecutive queries in the same search session, the relation score is set to the damping factor d, while for queries that are not immediate neighbors in the same session, the relation value is calculated by multiplying the values of the arcs that join them. Hence, in Fig. 1 (left), the consecutive-relation score between query1 and query3 would be $R(1, 3) = d \cdot d = d^2$: the longer the distance, the smaller the relation. After obtaining the relation values of each query pair in each session, we add them up to get the cumulative relation between any two queries. This process is also shown in Fig. 1 (right).

Fig. 1. Relation graph for one session (left) and two sessions (right).

For example, in our experiments, when we input "River Flows Like Blood", a novel by the Chinese writer Haiyan, we obtained not only the writer himself among the top-ranked related queries, but also his other popular novels among the top-ranked recommendations.

The relation calculation is based on the following search intuition. Suppose that a user inputs a query A to get the desired



information about a specific topic. After looking at the results, the user may find that query A was not the exact query for this search, and may then try query B, which may either be synonymous with A or may more accurately describe the information the user was seeking. Thus, query A and query B can be considered related based on this search behavior. However, this relation is not strong if based on a single user; rather, it is the cumulative occurrence of the pair over many search sessions that indicates relatedness. Also, the related search queries can change with the users' needs over time. For these reasons, we generate related search keywords dynamically.
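To make Eqs. (1) and (2) concrete, a minimal Python sketch of the scoring and accumulation steps is given below. The function names and the in-memory dictionary are illustrative assumptions on our part (the production system persisted these scores in a database); the damping value d = 1/2 follows the paper.

```python
from collections import defaultdict
from itertools import combinations

DAMPING = 0.5  # d in Eq. (1); the paper uses d = 1/2

def session_relations(session):
    """Elementary scores R_m(q_k, q_j) = d^(j-k) for one session (Eq. (1)).

    `session` is the time-ordered list of distinct queries from one user visit.
    """
    scores = {}
    for k, j in combinations(range(len(session)), 2):  # all pairs with j > k
        scores[(session[k], session[j])] = DAMPING ** (j - k)
    return scores

def accumulate_relations(sessions, table=None):
    """Cumulative consecutive-relation mu(q_k, q_j) over many sessions (Eq. (2)).

    The sum is incremental, so new sessions can be folded in as they arrive.
    """
    table = table if table is not None else defaultdict(float)
    for session in sessions:
        for pair, score in session_relations(session).items():
            table[pair] += score
    return table

# Two sessions sharing the pair (A, B):
mu = accumulate_relations([["A", "B", "C"], ["A", "B"]])
print(mu[("A", "B")])  # 0.5 + 0.5 = 1.0 (immediate neighbors in both sessions)
print(mu[("A", "C")])  # 0.25 = d^2 (two steps apart in the first session)
```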

3.2. Content-based similarity calculation

For the content-based relation, we use the cosine similarity between two queries p (with m terms) and q (with n terms), as follows:

$$\mathrm{cosine}(p, q) = \frac{\sum_{t_k \in p \cap q} w_k^2}{\sqrt{\sum_{t_i \in p} w_i^2}\,\sqrt{\sum_{t_j \in q} w_j^2}} \qquad (3)$$

where $w_k$ is the weight of the kth common term in queries p and q, and $w_i$ and $w_j$ are the weights of the ith term in query p and the jth term in query q, respectively. For weighting the query terms, instead of using TF·IDF (term frequency multiplied by inverse document frequency), we use SF·IDF, which is the Search Frequency (SF) multiplied by the IDF, where

$$SF_i = \text{number of occurrences of } Term_i \text{ over all queries in the query log} \qquad (4)$$

and

$$IDF_i = \log_2 \frac{N}{n} \qquad (5)$$

where N is the number of documents indexed in the searchable collection, and n is the number of documents in which term $T_i$ occurs at least once. The reason why we use SF instead of TF is that the sparsity of the query terms in a query makes the TF of all terms in that query equal to 1. On the other hand, SF relies on a dynamic stream of search queries, thus measuring popularity, while IDF is acquired from the whole document set, thus capturing only language properties.

3.3. Combination of the consecutive query relation and the content-based relation

Using the above two methods, we can get two different types of query sets. The two methods have their own advantages and can therefore complement each other. Using the first (consecutive-relation-based) method, we get recommendations that reflect the users' consecutive search behavior, while using the second (content-based) method, we group together queries that have similar textual composition. We combine the two groups of related queries in the following way. For one specific query, we get two groups of related queries, each sorted according to its cumulative relation level. We then merge the consecutive-based recommendations with the content-based similarity to calculate the recommendation score (Fig. 2):

$$\mathrm{Rec}(q_j) = \alpha\, \mu_k^{\mathrm{norm}}(q_j) + \beta\, \mathrm{cosine}(q_k, q_j) \qquad (6)$$

$$\mu_k^{\mathrm{norm}}(q_j) = \frac{\mu_k(q_j)}{\mu_{\max}} \qquad (7)$$

where α and β are the coefficients, or weights, assigned to the consecutive-query-based relation $\mu_k(q_j)$ and the content-based relation $\mathrm{cosine}(q_k, q_j)$, respectively. Note that the consecutive-based and content-based relations are both normalized in [0, 1]. In our experiments, we gave α and β the same value, which was determined by trial and error (after testing several combinations during preliminary tests and choosing the best results). Fig. 3 illustrates the entire process.

Fig. 2. Combining consecutive query ranking and content-based similarity to get each recommendation.

Fig. 3. Process diagram showing the combination of query recommendations with input query $q_k$.
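Before turning to the implementation, a small sketch of Eqs. (3) through (7) may help. The `sf` and `idf` lookup tables are assumed inputs here (in the paper, SF comes from the query log and IDF from the indexed document collection), and the function names are illustrative, not part of the deployed system.

```python
import math

def weight(term, sf, idf):
    """SF x IDF weight of a query term (Eqs. (4) and (5)).

    `sf[term]` is the term's frequency over all logged queries;
    `idf[term]` is log2(N / n) precomputed over the indexed collection.
    """
    return sf.get(term, 0.0) * idf.get(term, 0.0)

def cosine(p_terms, q_terms, sf, idf):
    """Content-based similarity between two queries as term sets (Eq. (3))."""
    w = lambda t: weight(t, sf, idf)
    num = sum(w(t) ** 2 for t in set(p_terms) & set(q_terms))
    den = (math.sqrt(sum(w(t) ** 2 for t in p_terms))
           * math.sqrt(sum(w(t) ** 2 for t in q_terms)))
    return num / den if den else 0.0

def rec_score(mu_kj, mu_max, content_sim, alpha=0.5, beta=0.5):
    """Merged recommendation score (Eqs. (6) and (7)).

    The consecutive relation is normalized by mu_max; the paper sets
    alpha = beta, chosen by trial and error.
    """
    return alpha * (mu_kj / mu_max) + beta * content_sim
```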

4. Implementation

In this section, we discuss our implementation details, which include the six-step process for obtaining recommendation scores, the online recommendation policy, and the dynamic adaptation and update methods.

4.1. Six steps for generating related queries

In our implementation, shown in Fig. 4, we follow the steps discussed below.

4.1.1. Step 1: extracting query sessions

We isolate the user sessions through the CookieID. According to our statistics, most people accept cookies, so from almost every


query log entry we can get the CookieID. Also in this step, we remove queries which are too long, blank, or illegal. By illegal, we mean queries that cannot be decoded into normal Big5, GB2312 Chinese, or English queries.

Fig. 4. Steps for generating related queries.

4.1.2. Step 2: filtering robot and abnormal visits

Most commercial search engine servers receive many abnormal and robot/crawler visits daily. We need to filter these queries in the data preparation phase in order to get a cleaner data sample for further analysis. To recognize such abnormal visits, we rely on the following three heuristics (a sketch of these filters follows this list):

(i) One user with too many sessions. If one user searches too many times in one day, their visits may not be considered normal, since a normal user would not send too many queries to a search engine. After sampling from our user track log, we found that if the number of queries exceeds 30 for the same user, then these visits may be deemed abnormal. After sampling from these filtered results and analysis by our editors, all the sampled query sessions that exceeded this threshold were found to be indeed robot visits.

(ii) One user with too many repeated queries. If a user submits too many identical queries, we may consider his/her visits abnormal, since a normal user would not squander so much time on such meaningless repetition. In our implementation, we calculate the total number of queries for a user and the total number of distinct queries for that user; if the total queries are more than two times the total distinct queries, we identify the user's visits as abnormal and prune such sessions.

(iii) One user with too many close visits. By close visits, we mean a very short time interval between two consecutive visits or sessions. If a user's close visits are too numerous in one day, we also consider that user's visits abnormal. Generally, if a user inputs a valid search keyword, they need some time to view the result pages; for most machine-generated visits, by contrast, the interval between two consecutive visits is too short.

Notice that there are also many other abnormal visits that cannot be detected solely by using the above heuristics, and some normal visits could be falsely detected as abnormal. But according to our editors' sample inspection and comparison, the incorrectly detected visits were very rare, while some undetected attacking visits remain. Using the above three heuristics, the abnormal visits were removed from our database to a large extent. Many other rules could also be designed for pruning abnormal visits.
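A minimal sketch of these three filters is given below. The 30-query limit and the 2x distinct-query ratio are from the text; the paper gives no numeric thresholds for heuristic (iii), so `MIN_INTERVAL_SECONDS` and `MAX_CLOSE_VISITS` are placeholder assumptions.

```python
MAX_DAILY_QUERIES = 30    # heuristic (i): threshold reported in the paper
REPEAT_RATIO = 2          # heuristic (ii): total vs. distinct queries
MIN_INTERVAL_SECONDS = 5  # heuristic (iii): assumed value, not from the paper
MAX_CLOSE_VISITS = 10     # heuristic (iii): assumed value, not from the paper

def is_abnormal_user(queries, timestamps):
    """Flag one user's daily traffic as robot/abnormal.

    `queries` and `timestamps` are parallel, time-ordered lists covering
    one user and one day.
    """
    # (i) too many queries in one day
    if len(queries) > MAX_DAILY_QUERIES:
        return True
    # (ii) too many repeated (non-distinct) queries
    if len(queries) > REPEAT_RATIO * len(set(queries)):
        return True
    # (iii) too many consecutive visits separated by a very short interval
    close = sum(1 for a, b in zip(timestamps, timestamps[1:])
                if b - a < MIN_INTERVAL_SECONDS)
    return close > MAX_CLOSE_VISITS
```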

4.1.3. Step 3: obtaining the consecutive queries and frequency table

After filtering abnormal visits, we extract what we define as consecutive visits (visits that are not separated by any intermediate visits). For consecutive visits, we define a valid consecutive interval between $t_1$ and $t_2$. If the time between two visits falls outside this valid consecutive interval, that is, if the interval is too short or too long, we no longer consider them consecutive visits. We set the lower limit $t_1$ here, again, to avoid some abnormal visits, since a normal visit requires at least some time to explore the returned pages. A series of consecutive visits by the same user within the valid time interval forms one user session. Notice that one specific user may have several sessions in a day, each possibly targeting a different topic. Since these query sessions are temporally divided, we do not consider them to be in the same session, although they may be generated by the same user. Moreover, in order to avoid having too many consecutive visits in one session, and for processing simplicity, we set an upper limit U on the number of queries stored in one session. At the same time, we use our cleaned query sessions to record the search frequency ($SF_i$) of each distinct query ($q_i$) and sort these queries in descending order of $SF_i$. This frequency table is later used for generating content-based related query recommendations, as in Ref. [1].

4.1.4. Step 4: obtaining the consecutive-based related query set (ConsecutiveSet)

Indexing related queries. After logging the consecutive queries in the users' search sessions, we extract related queries offline. For each related query pair, we calculate the consecutive-relation $\mu_k(q_j)$ and store it in a related query table that maps each pair of queries $q_k$ and $q_j$ to their membership $\mu_k(q_j)$. This calculation is performed in an incremental manner as new sessions are logged, by accumulating the membership $\mu_k(q_j)$ whenever the related queries occur in a new session, and storing the memberships in a database indexed by the key values k and j. To capture the symmetry of the related queries, the entries for (k, j) are identical to those for (j, k).

Filtering top queries. In practice, we have also found that exploiting the consecutive and content-based relations between queries is not sufficient, because of the exceptional spamming effect of extremely popular queries. For example, the query "Film download" may be the top (most frequent) search query regardless of the topic being searched, and it may spam the accumulated query relation memberships. If too many people submit query B after searching with query A, we may either consider query B related to A, or consider query B one of the very popular queries (from now on referred to as top queries). Therefore, before extracting the related queries, we prune all the top queries for the given day to reduce their spamming effect. For this purpose, we take each day's top N (20 in our experiments) queries and remove their related pairs from the cumulative memberships. A sketch of this step is given below.
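The following sketch shows the incremental, symmetric accumulation of the membership table and the daily top-N pruning. The in-memory dictionary stands in for the database table keyed on (k, j) that the paper describes; the function names are illustrative.

```python
def update_related_table(table, session_scores):
    """Fold one session's elementary scores into the cumulative table.

    Entries are stored symmetrically, so table[(a, b)] == table[(b, a)].
    `table` should be a defaultdict(float) or a database-backed mapping.
    """
    for (qk, qj), score in session_scores.items():
        table[(qk, qj)] += score
        table[(qj, qk)] += score
    return table

def prune_top_queries(table, daily_counts, n=20):
    """Drop all pairs involving the day's N most frequent queries
    (N = 20 in the paper) to curb their spamming effect."""
    top = {q for q, _ in sorted(daily_counts.items(),
                                key=lambda kv: kv[1], reverse=True)[:n]}
    return {pair: s for pair, s in table.items()
            if pair[0] not in top and pair[1] not in top}
```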


4.1.5. Step 5: obtaining the content-based related query set (ContentSet)

For the content-based similarity calculation, if we compute the similarity for every pair of queries (without repetition) in the frequency table, the complexity will be $O(n^2)$, which entails a high computational cost. In order to reduce this cost, we made the following simplification and adjustment: we set a visit frequency threshold V for processing a specific day, and after collecting the visit frequencies of all queries in the day's sessions, we only calculate the similarity for the queries whose visit frequency is higher than V.

We use SF·IDF for weighting each term in the query. By definition, IDF is used to differentiate the importance of different search terms in the searchable document collection: the IDF value of common terms which occur in many documents is lower than that of terms which occur in only a few documents. Thus, using the IDF value, we can tell which terms are more important in the query. For example, consider the query "Beijing Real-Estate", parsed into the two terms "Beijing" and "Real-Estate". Without the IDF value, one may get too many related queries such as "Beijing University", "Beijing Map", "Beijing Transportation", etc., ranked higher, since these are frequently submitted queries; yet these queries may not be closely related to the original query "Beijing Real-Estate". By using the IDF value to differentiate the two terms, we emphasize the related queries that contain the term with the highest IDF value, in this case "Real-Estate": for instance, the queries "Shanghai Real Estate", "Nanjing Real Estate", etc. Therefore, the IDF value plays a more important role than SF in ranking the content-based similarities, and helps in getting related queries that are unbiased by the top searched queries.

4.1.6. Step 6: merging the ConsecutiveSet and ContentSet

In steps (4) and (5), we obtain two different sets, named ConsecutiveSet and ContentSet. To get the final related query set, we merge them using the formula developed in Section 3.3. Through the above steps, we can get the daily related query table. Because of the sparsity of typical query logs, we must accumulate enough query logs to obtain the related queries. For one user's search pattern in one day, it may seem casual or random that after searching with the query "Zidane", they then type the query "World Soccer Cup". But if, within weeks or months, many people submit the query "World Soccer Cup" after searching with the query "Zidane", the query pair "Zidane" and "World Soccer Cup" must have some correlation. Certainly, the larger the data set, the lower the bias or random error.

4.2. Online recommendations

The above process is conducted offline. After getting the final related query table, we perform further optimization, which includes pruning pornographic or objectionable queries, to generate the applicable data set. Note that we do this pruning in order to avoid giving unintended offensive recommendations to regular users, and not as a censoring measure. We finally recommend the top K related queries to the user. The real-time recommendation part is implemented online. The online recommendations are rather efficient, since we use a hash table to retrieve the related queries and then present them to the user.

4.3. Adapting to evolving related queries

In order to adapt to new queries while gradually forgetting old queries, and in the process also avoiding a potential overflow, we use a limited active sliding window for storing and processing the query logs, spanning an active period of 100 days. We keep both $\mu_k^{\mathrm{overall}}(q_j)$, which reflects the cumulative consecutive-relation score from the query logs during the most recent active period, and $\mu_k^{\mathrm{oneday}}(q_j)$, which reflects the sub-cumulative score from the query logs received on each day of that period. Every time a related query pair falls outside of the active window period (for instance, more than 100 days ago), we deduct its single-day


contribution to the membership value from the total accumulated value. At the same time, the contributions of the most recent/current queries keep getting accumulated. In this way, the cumulative version of $\mu_k(q_j)$ always reflects the related queries of the most recent period, thus adapting to the dynamic nature of user searches. This process is shown in Fig. 5. Subtracting the outdated data's influence fulfills the need to adapt to the newest trends while forgetting the outdated ones. As shown in Fig. 5, we collect and store 100 days' worth of logs and processing results, and merge these 100 days' consecutive-relations into one final result table. From then on, the program updates the final results automatically by folding in only the newest day's sub-cumulative relation values and subtracting the oldest day's relation values, as follows (using the notation $\mu(k, j)_p = \mu_k(j)$ during period p):

$$\mu(k, j)_{\mathrm{new}} = \mu(k, j)_{\mathrm{original}} + \mu(k, j)_{\mathrm{newest\;day}} - \mu(k, j)_{\mathrm{oldest\;day}} \qquad (8)$$

where $\mu(k, j)_{\mathrm{new}}$ is the new total consecutive-relation between $q_j$ and $q_k$ after the update, $\mu(k, j)_{\mathrm{original}}$ is the consecutive-relation between $q_j$ and $q_k$ before the update, and $\mu(k, j)_{\mathrm{newest\;day}}$ and $\mu(k, j)_{\mathrm{oldest\;day}}$ are the consecutive-relations between $q_j$ and $q_k$ for the newest day and the oldest day, respectively. This process can be illustrated with the following simple example.

Cumulative result before updating:

Original query (q_k) | Related query (q_j) | Consecutive-relation μ(k, j)_original
q1 | q2 | 10,000
q1 | q3 | 8,000
q1 | q4 | 6,000
q1 | q5 | 512

Sub-cumulative result of D1 (newest day: to be added):

Original query (q_k) | Related query (q_j) | Consecutive-relation μ(k, j)_newest day
q1 | q2 | 512
q1 | q3 | 256

Sub-cumulative result of D100 (oldest day: to be deleted):

Original query (q_k) | Related query (q_j) | Consecutive-relation μ(k, j)_oldest day
q1 | q5 | 512
q1 | q3 | 128

Cumulative result after updating:

Original query (q_k) | Related query (q_j) | Consecutive-relation μ(k, j)_new
q1 | q2 | 10,512
q1 | q3 | 8,128
q1 | q4 | 6,000
q1 | q5 | 0

In this example, $\mu(q_1, q_2)_{\mathrm{new}} = \mu(q_1, q_2)_{\mathrm{original}} + \mu(q_1, q_2)_{\mathrm{newest\;day}} - \mu(q_1, q_2)_{\mathrm{oldest\;day}} = 10{,}000 + 512 - 0 = 10{,}512$, and $\mu(q_1, q_3)_{\mathrm{new}} = \mu(q_1, q_3)_{\mathrm{original}} + \mu(q_1, q_3)_{\mathrm{newest\;day}} - \mu(q_1, q_3)_{\mathrm{oldest\;day}} = 8{,}000 + 256 - 128 = 8{,}128$, etc. This updating process is essentially a sliding window mechanism with a size of 100 days and 1-day sliding increments. Each day, as new data becomes available, it is merged into the final result incrementally. Subtracting the outdated data's influence gradually adapts the model to the newest search trends while forgetting the outdated ones.

Fig. 5. Dynamic update mechanism of the query mining results.
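A minimal sketch of this 100-day sliding window is given below, assuming each day's sub-cumulative relations arrive as a dictionary keyed by query pair; the class and method names are illustrative, not the paper's.

```python
from collections import defaultdict, deque

WINDOW_DAYS = 100  # active period used in the paper

class SlidingRelationWindow:
    """Cumulative consecutive-relations over the most recent WINDOW_DAYS
    days of query logs, updated per Eq. (8)."""

    def __init__(self):
        self.days = deque()              # per-day sub-cumulative tables
        self.total = defaultdict(float)  # mu(k, j) over the active window

    def add_day(self, day_table):
        """Fold in the newest day; evict the oldest once the window is full."""
        self.days.append(day_table)
        for pair, score in day_table.items():
            self.total[pair] += score
        if len(self.days) > WINDOW_DAYS:
            oldest = self.days.popleft()
            for pair, score in oldest.items():
                self.total[pair] -= score  # forget the outdated trend

# Worked example from the text: mu(q1, q2) = 10,000 + 512 - 0 = 10,512
```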


5. Experimental results

Table 2 lists a sample of our related search query examples, where we have translated the Chinese queries into English. For more examples, please visit iask.com, where the query recommendations are displayed at the bottom of the main frame. Below, we discuss how we evaluated the query recommendation results. We use the traditional precision and coverage measures from information retrieval. Moreover, we used two evaluation methods to obtain both quantitative and qualitative validation.

Table 2. Query recommendation samples

Query | Recommendations
great wall | great wall computer; great wall broad-band; great wall motors; great wall pictures; ten-thousand-mile great wall; southern-great-wall cup; great wall securities; great wall cooperations; great wall lubricant; Badaling great wall
lilac | lilac download; lilac lyrics; lilac FLASH; lilac guitar music; story of lilac; lilac MP3; Tanglei lilac; lilac songs; lilac blossom; "lilac"
jeans | jeans beauty; jeans picture; beauty jeans; tight jeans; jeans brands; LEVIS jeans; tight jeans beauty pictures; tight jeans beauty; LEE jeans; jeans beauty pictures
Fragrance Hill | Fragrance Hill park; Fragrance Hill red autumnal leaves; Fragrance Hill Hotel; Beijing Fragrance Hill; Fragrance Hill plant arboretum; Honolulu; Fragrance Hill maps; Beijing Fragrance Hill Park; Fragrance Hill weight loss; Fragrance Hill sage
shopping guide | LIFE STYLE; Beijing shopping guide; Shanghai shopping guide; LIFE STYLE newspaper; online shopping; OLAY shopping guide; Hongkong shopping guide; shopping; home appliances; air-conditioners
super market | super market management; carrefour supermarket; supermarketers; wal-mart supermarket; supermarket promotion plans; supermarket goods shelves; Quanzhou information supermarket; Fresh Flowers supermarket; supermarket management policies; Watsons supermarket

5.1. Anonymous session coverage evaluation method and results

In information retrieval, coverage is defined as the ratio of the number of relevant records retrieved to the total number of relevant records in the collection. It is usually expressed as a percentage:

$$\mathrm{coverage}\ (\%) = \frac{A}{A + B} \times 100 \qquad (9)$$

where A is the number of relevant records that are retrieved and B is the number of relevant records that are not retrieved. In our query recommendation case, coverage is defined as the percentage of correct recommendations relative to the true query set: A is the number of correct recommendations, and B is the number of relevant queries not recommended. For example, suppose that we have a new user session, not processed during our training phase, with the following queries:

NewSession = (query1, query2, query3, query4, query5, query6).

After the user inputs query1, if all the remaining queries are given by our recommendation system as related queries, we consider the coverage to be 100%; if we only recommend query2 and query3, the coverage is only 40% (2 of the 5 remaining queries), etc.

For this evaluation, we randomly chose 200 testing sessions (forming a total of 473 queries), each containing two or more queries (see Table 5 for some real session samples). These sessions were extracted from 2 days' query logs from April 2005, so they have no overlap with the query logs used for training, which were those of September, October, and November 2004. We ran our test with a maximum of K = 10 recommendations per query, and counted the number of recommendations that matched the remaining queries of each session. Table 3 shows the testing results.

Table 3. Anonymous session coverage results

Session length | Number of sessions | n1 | n2 | Average coverage (%)
2-query session | 147 | 35 | 29 | 21.8
3-query session | 37 | 23 | 19 | 28.4
4-query session | 14 | 5 | 9 | 16.7
6-query session | 2 | 2 | 0 | 10
Total average coverage (200 sessions/473 queries) | | | | 22.3

In Table 3, n1 is the number of remaining queries in a session that fall into our recommendation set for the first query of that session, while n2 is the number of remaining queries (including the first query) that fall into our recommendation set for the second query of that session. From Table 3, we can see that the coverage is not very high (but this is typical of all query recommendation systems). If we increased the maximum number of recommendations, to say the 50 top-ranked recommendations, this coverage would increase; however, this would not be practical. In just the same way that a typical search engine user only has the patience to view the first one or two pages of the search results, he/she would only have enough patience to scan a few top-ranked query recommendations.

These observations lead us to consider the following limitations of the evaluation measurements. First, as discussed above, the precision and coverage vary with the number of query recommendations. Second, these query recommendations are not exactly "personalized" query recommendations, but rather "collaborative" recommendations, also


known as social filtering. That is, our methods take into account the "cumulative" effects of a community of searchers, and there is no personalization for one specific user: for all users, we recommend the same recommendation set as long as they input the same query. Besides, in the above example, it may not be fair to consider the coverage of the 6-query session to be 40% just because the query recommendation set for the first query contains only query2 and query3: if the user accepted one of our previous recommendations (query2 or query3), then the query recommendation set for query2 or query3 may contain query4, query5, or query6. In that case, the effective coverage would be perceived to be higher than the one computed in our experiments. Therefore, we need to improve our evaluation methods toward testing in a real search scenario, by taking into account the recommendations during the entire span of a search session. Measuring actual users' click-through rates on the related queries would be more objective, but it suffers from the same problem of not accounting for the accumulating effects.

5.2. Editors rating evaluation method and results

Search engine users' subjective perceptions may be more likely to indicate the success or failure of a search engine. For this reason, we also used editors' ratings to evaluate our query recommendations. In this method, we record our editors' likelihood of clicking on the related queries: we let our editors rate their likelihood of clicking one of the query recommendations from 0 to 5. If all three editors rate their likelihood of clicking one of the query recommendations as 5, this indicates that during practical search circumstances, all of them would click one of the query recommendations, and we can thus consider our recommendations to have high coverage. We should note that the quality of a query recommendation is sometimes subjective and dependent on a user's domain knowledge of a specific field: some closely related queries may seem totally unrelated to other users who have no such knowledge. This is the reason why we chose three independent editors.

The precision is also listed in the right column of Table 4. The general definition of precision is:

$$\mathrm{precision}\ (\%) = \frac{A}{A + C} \times 100 \qquad (10)$$

where A is the number of relevant records retrieved and C is the number of irrelevant records retrieved. For precision, we also rely on our editors' judgement, again on a scale from 0 to 5: when an editor rates a query's recommendations as 5, we consider that all 10 query recommendations are closely related to the original query, while 0 means that none of the recommendations are related to the original query.

To evaluate our method, we tested 100 queries. These queries were selected by our editors according to several criteria, which include having a high visit frequency and covering as many different fields as possible. Using these queries, we generated the queries' recommendations. For each query, we asked our editors to rate the recommended query results from 5 to 0, where 5 means very good and 0 means nothing. For precision, we asked them to review whether the related queries are really related to the original query; for coverage, we asked them whether they would click on the query recommendations in a real search scenario. Table 4 lists the average rating results from these evaluations. The numbers in Table 4 mark how many of the 100 queries' recommendation sets received each rating. For example, in the second column, which is for coverage, 35 queries were rated as very good, 33 as good, 13 as marginally good, etc.; in the third column, which is for precision, 29 were rated by the editors as very good, 42 as good, etc. In Table 5, we list several sample sessions from real queries received by the search engine (iask).

Table 4. Editor evaluation results

Rating | Coverage (among 100) | Precision (among 100)
Very good (5) | 35 | 29
Good (4) | 33 | 42
Marginally good (3) | 13 | 11
Bad (2) | 9 | 8
Very bad (1) | 7 | 7
Nothing (0) | 3 | 3
Average rating | 3.71 | 3.69
S.D. | 1.37 | 1.32

Table 5. Practical session samples

Session | query1 | query2 | query3
1 | posting pictures | beauty posting pics | Taiwan beauty posting pics
2 | beauty | beauty pictures |
3 | Zhang ling | Mian yang |
4 | gratitude | heart with gratitude |
5 | glaucoma | glaucoma symptom |
6 | computer Assets Evaluation | Assets Evaluator |
7 | Translation | Translation Company |
8 | Jian Li | Time Flows Like Water |
9 | Raising dogs | Dogs meat |
10 | Download | Free music download | HeBei Self-taught Test papers

From Table 4, we can see that among our editors' evaluations of coverage and precision, about 70% are very good or good. We also ran some earlier experiments before we combined the two similarity methods and before we filtered the robots' visits, all of which showed less satisfactory results; in general, about half of the recommendations were rated as bad by our editors in those earlier tests.

6. Conclusions

In this paper, we presented a simple and intuitive solution to social filtering on a search engine, which has already been implemented with success on iask.com. We combine two methods to extract relations between submitted queries and obtain recommendations. Our association-type consecutive-relation between queries that occur in the same session is computed in a simple and efficient incremental manner from the query logs. Moreover, these weights are accumulated incrementally with each new session to obtain a global, adaptive, and scalable model, so our method adapts to the dynamic searching behavior of the users. Compared to most work, which uses click-through data, our work considers a user's consecutive query behavior alone, without click-through data; this behavior alone provides abundant information for generating query recommendations. Our work also benefits from combining the query content-based similarity with the cumulative search sessions' consecutive-relations to generate well-rounded query recommendations.

Acknowledgment

This work is partially supported by a National Science Foundation CAREER Award IIS-0133948 to Olfa Nasraoui.


References

[1] R. Baeza-Yates, C. Hurtado, M. Mendoza, Query recommendation using query logs in search engines, in: Proceedings of the International Workshop on Clustering Information over the Web (ClustWeb, in conjunction with EDBT), Crete, Greece, March 2004.
[2] D. Beeferman, A. Berger, Agglomerative clustering of a search engine query log, in: KDD '00: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, 2000, pp. 407–416.
[3] M. Chau, X. Fang, O.R.L. Sheng, Analysis of the query logs of a web site search engine: research articles, J. Am. Soc. Inform. Sci. Technol. 56 (13) (2005) 1363–1376.
[4] S. Chien, N. Immorlica, Semantic similarity between search engine queries using temporal correlation, in: WWW '05: Proceedings of the 14th International Conference on World Wide Web, ACM Press, New York, NY, USA, 2005, pp. 2–11.
[5] H. Cui, J.-R. Wen, J.-Y. Nie, W.-Y. Ma, Probabilistic query expansion using query logs, in: WWW '02: Proceedings of the 11th International Conference on World Wide Web, ACM Press, New York, NY, USA, 2002, pp. 325–332.
[6] H. Cui, J.-R. Wen, J.-Y. Nie, W.-Y. Ma, Query expansion by mining user logs, IEEE Trans. Knowl. Data Eng. 15 (4) (2003) 829–839.
[7] B.D. Davison, D.G. Deschenes, D.B. Lewanda, Finding relevant web-site queries, in: Poster Proceedings of the Twelfth International World Wide Web Conference, Budapest, Hungary, May 2003.
[8] G.W. Flake, E.J. Glover, S. Lawrence, C.L. Giles, Extracting query modifications from nonlinear SVMs, in: WWW '02: Proceedings of the 11th International Conference on World Wide Web, ACM Press, New York, NY, USA, 2002, pp. 317–324.
[9] B.M. Fonseca, P.B. Golgher, E.S. de Moura, N. Ziviani, Using association rules to discover search engines related queries, in: LA-WEB '03: Proceedings of the First Conference on Latin American Web Congress, IEEE Computer Society, Washington, DC, USA, 2003, p. 66.
[10] B.J. Jansen, A. Spink, J. Bateman, T. Saracevic, Real life information retrieval: a study of user queries on the web, SIGIR Forum 32 (1) (1998) 5–17.
[11] T. Joachims, Optimizing search engines using click-through data, in: KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, 2002, pp. 133–142.
[12] D. Kraft, J. Chen, M. Martin-Bautista, M. Vila, Textual information retrieval with user profiles using fuzzy clustering and inferencing, in: P. Szczepaniak, J. Segovia, J. Kacprzyk, L. Zadeh (Eds.), Intelligent Exploration of the Web, Physica-Verlag, 2002.
[13] R. Kraft, J. Zien, Mining anchor text for query refinement, in: WWW '04: Proceedings of the 13th International Conference on World Wide Web, ACM Press, New York, NY, USA, 2004, pp. 666–674.
[14] T. Lau, E. Horvitz, Patterns of search: analyzing and modeling web query refinement, in: UM '99: Proceedings of the Seventh International Conference on User Modeling, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1999, pp. 119–128.
[15] C. Ling, J. Gao, H. Zhang, W. Qian, Mining generalized query patterns from web logs, in: HICSS '01: Proceedings of the 34th Annual Hawaii International Conference on System Sciences (HICSS-34), vol. 5, IEEE Computer Society, Washington, DC, USA, 2001, p. 5020.
[16] Y. Miyata, T. Furuhashi, Y. Uchikawa, Query expansion using fuzzy abductive inference for creative thinking support system, in: IEEE Conference on Fuzzy Systems, vol. 1, May 1998, pp. 189–193.
[17] Y. Ohura, K. Takahashi, I. Pramudiono, M. Kitsuregawa, Experiments on query expansion for internet yellow page services using web log mining, in: VLDB 2002: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, August 20–23, 2002, pp. 1008–1018.
[18] S. Rieh, H. Xie, Patterns and sequences of multiple query reformulations in web searching: a preliminary study, in: Proceedings of the 64th ASIST Annual Meeting, vol. 38, 2001, pp. 246–255.
[19] E.H. Ruspini, A new approach to clustering, Inform. Control 15 (1) (1969) 22–32.
[20] C. Silverstein, H. Marais, M. Henzinger, M. Moricz, Analysis of a very large web search engine query log, SIGIR Forum 33 (1) (1999) 6–12.
[21] M. Tajima, T. Takagi, Query expansion using conceptual fuzzy sets for search engine, in: IEEE Conference on Fuzzy Systems, 2001, pp. 1303–1308.
[22] J.-R. Wen, J.-Y. Nie, H.-J. Zhang, Clustering user queries of a search engine, in: WWW '01: Proceedings of the 10th International Conference on the World Wide Web, ACM Press, New York, NY, USA, 2001, pp. 162–168.
[23] J.-R. Wen, J.-Y. Nie, H.-J. Zhang, Query clustering using user logs, ACM Trans. Inform. Syst. 20 (1) (2002) 59–81.