Map-Based Range Query Processing for Geographic Web Search Systems

R. Lee1, H. Shiina1, T. Tezuka1, Y. Yokota1, H. Takakura2, Y.J. Kwon3, and Y. Kambayashi1

1 Department of Social Informatics, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, Japan
{ryong, shiina, yahiko}@db.soc.i.kyoto-u.ac.jp
2 Academic Center for Computing and Media Studies, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, Japan
[email protected]
3 Department of Telecommunication and Information Engineering, Hankuk Aviation University, Dukyang, Kyounggi-Do, Korea
[email protected]

Abstract. In order to utilize geographic web information for digital city applications, we have been developing a geographic web search system, KyotoSEARCH. When users retrieve geographic information on the web, specifying a geographic location is an essential function. However, most current web search systems do not utilize location information sufficiently well to identify the user's intentions; available methods employ just keywords (for location names) and limited map functions. Furthermore, most map interfaces are used to select a predetermined geographic-hierarchy level or to point to a specific location (sometimes specifying a radius). In this paper, we introduce two-dimensional range query processing for geographic web search, where users are able to specify a geographic area freely on a map interface. In order to handle such range queries rapidly and efficiently, we adopt geometric operations to retrieve the proper web pages. Without optimization techniques, however, the recall and precision of the search results become very low. The major problems come from erroneous extension of the computed geographic area due to i) identical names for different geographic objects/locations (geowords), ii) redundant geographic hierarchy information, and iii) the existence of unimportant geowords. By resolving these problems, we can improve range query processing for geographic data. To that end, we propose an effective geographic scope optimization method that supports geometric operations in geographic web search; experiments conducted on an implemented system are described.

P. van den Besselaar and S. Koizumi (Eds.): Digital Cities 2003, LNCS 3081, pp. 274-283, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

With the explosive adoption of the Web throughout the globe, we should be able to get useful and practical geographic information over the Web. This demands the ability to search for useful geographic information and locate it on a map. In order to make this rather complex job easier, we have been developing the geographic web search system called KyotoSEARCH [6, 7]. The major purpose of the system is to help users find comprehensive geographic web information through a user interface integrated with a map (its name comes from the targeted area, Kyoto City in Japan). In the system, a map interface is used to specify a geographic area for localizing web searches. For a practical system, it is important to construct i) easy location specification methods that can express the search intentions of the user and ii) geographic web index methods that find the proper pages for a given geographic query. Most users want to get the best results with the least effort; this calls for servers that provide effective lightweight indices, which are better than keyword indices, and clever search methods with reasonable computation costs. In order to satisfy these requirements, we introduce a two-dimensional range query processing method based on geographically indexed web pages. For geographic web indexing, we use an MBR (Minimum Bounding Rectangle) to represent the 'geographic scope' of a page, computed from the geo-referential words appearing in its content. Geographically indexed web pages can be managed with any of the well-known two-dimensional index methods, such as the R-Tree [1].

We have implemented the above system in a server-client model; the server hosts the geographic web index function, while the client provides a map-based search interface to the users. With the proposed index and map-based interface, the user can make and restrict a query by specifying a geographic area; the answer is returned by the server. Furthermore, search results can be manipulated by geometric operations (such as the 'Contain' relationship) between the user-drawn rectangular query and the MBR corresponding to a web page. Since the geographic scope of a web page is calculated as an MBR from the location names in the page content, we have to narrow the geographic scope of a web page to the smallest possible region.
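To make the notion of a page's geographic scope concrete, the following is a minimal sketch of how an unoptimized MBR could be computed as the union of the MBRs of the geowords found in a page. The `MBR` type, the `GAZETTEER` lookup table, and all coordinates are hypothetical and only serve the illustration; they are not taken from the implemented system.

```python
from typing import Iterable, NamedTuple


class MBR(NamedTuple):
    """Axis-aligned rectangle in (longitude, latitude) degrees."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def union_mbr(rects: Iterable[MBR]) -> MBR:
    """Smallest rectangle that encloses every input rectangle."""
    rects = list(rects)
    if not rects:
        raise ValueError("a page needs at least one geoword to have a scope")
    return MBR(
        min(r.min_lon for r in rects),
        min(r.min_lat for r in rects),
        max(r.max_lon for r in rects),
        max(r.max_lat for r in rects),
    )


# Hypothetical lookup table: geoword -> MBR (coordinates are made up).
GAZETTEER = {
    "YoshidaHonmachi": MBR(135.778, 35.024, 135.784, 35.029),
    "Kyoto Station": MBR(135.756, 34.984, 135.761, 34.987),
}


def page_scope(geowords: Iterable[str]) -> MBR:
    """Geographic scope of a page: union of the MBRs of its known geowords."""
    return union_mbr(GAZETTEER[w] for w in geowords if w in GAZETTEER)


scope = page_scope(["YoshidaHonmachi", "Kyoto Station"])
```

Because the union only ever grows, a single misplaced geoword is enough to inflate the scope, which is why the scope has to be reduced before the 'Contain' operation becomes useful.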
It is clear that recall is greatly degraded if unwarranted areas are assigned to a page when the 'Contain' operation is used. This problem can be solved by using the 'Intersection' operation or query correction. Unfortunately, this degrades the precision of the search results. When a user draws a rectangular query, in most cases they assume the 'Contain' operation. Therefore, we focused on reducing the geographic scope of pages to support the use of the 'Contain' operation. The major contribution of the paper is a method for optimizing the MBR of a given web page. For accurate computation of a page's MBR, we suggest some strategies and evaluate them in an experiment. Section 2 discusses related work. Section 3 describes the basic problems that arise when web pages are used as geographic information sources, focusing on the characteristics of geographic location names, and gives an overview of our approach. Optimization strategies are introduced in Section 4 and the experiment is reported in Section 5.

2 Related Work

In order to search web pages with a map interface, we should consider how to associate the pages with a location on the map and what efficient and accurate methods are available to answer geographic search queries. Some related studies have attempted to compute the geographic scope of web pages or sites. The intrinsic problem comes from the lack of geography in the web: that is, the location of a web page itself is independent of real-world locations. In order to resolve this and similar problems, there are two approaches to indexing the pages:

– Using Internal Information [2, 9, 8]: A web page or a site has its own geographic area information, such as addresses, phone numbers, etc. While using internal information yields relatively accurate results, few pages include such information. In order to apply this approach, we must prepare lists of geowords and find all location names in each page.
– Using External Information [3, 4]: Another approach utilizes web links. Internal information can be inaccurate, as in the example of the 'NY Times' site introduced by [4]. The geographic scope of 'NY Times' should be larger than just the New York area if we consider the geographic distribution of users accessing the site. However, limiting the extent to which the external links are followed is a major problem.

A rather impractical solution is to depend on manual tagging, as seen in Yahoo!'s Geographic Category and its registered pages. Given the above techniques for computing geographic scope, the representation method is an important issue, since the selected method greatly affects the search costs and the accuracy of search results. In fact, the representation methods for the geographic scopes of web pages and for user queries should be considered together. Each can be represented in two ways: geometrically or by keywords. Geometric representation, commonly used in GIS research, can reduce search costs more effectively than keyword matching, which is adopted in most web search systems.
If the indexing and query methods differ, we need to convert one to the other. For example, when the index associates web pages with positions on the map of Kyoto, a keyword such as 'Kyoto Station' issued by a user to specify a location needs to be converted into appropriate points, polygons, or rectangles by using a lookup table that transforms keywords into geometric figures. The query and the index can then be compared geometrically to generate suitable search results.

Table 1. Comparison of Approaches to Computing Geographic Scope

Comparison        Our Method   Kokono [10]     GeoLink [5]        Localness [8]
Query             Rectangle    Circle          Point/Keywords     Keywords
Index             MBR          Polygon         Point              MBR
Index Cost        Low          High            Middle             Low
Used Info.        Internal     Internal        External           Internal
Search Operation  Geometric    Distance only   Keywords Matching  Keywords Matching
Search Cost       Low          Middle          High               Middle
Search Results    Logical      Logical         Semantic           Semantic

As shown in Table 1, there are several types of geometric representation of web pages: point, polygon, and MBR. Each has different database management and search costs. When we use MBRs for indexing, the search costs will be lower than with polygons or points. There are also various types of geometric search operations: distance, include, overlap, and so on. Though we do not discuss use cases of the specific geometric operations, these are useful tools on maps compared to the conventional keyword-based search operation. In this paper, we focus on the geometric comparison of users' queries and geometrically indexed web pages. This approach ensures search results that are logical with respect to map semantics, with very fast comparison.

Fig. 1. Range Query Processing based on Geographic Web Index

3 Our Approach and Contribution

We adopted the approach of using 'Internal Information', since our geographic area of interest is a local area (Kyoto City, Japan). That is, we determine the geographic scope of a web page from the geographic location names in the page's contents [2, 9, 8]. Fig. 1 shows the whole process performed between a user client and the server system.

1. First, the user makes a query by drawing a rectangular region on a map interface and selects the desired geometric operation, either 'Contain' or 'Intersection' (left-side buttons in the interface). The client system then submits the query to the server system.
2. Next, the range query, which is expressed in the coordinates of the user interface, is translated into the coordinate format used in the geographic web index (latitude/longitude in this case).
3. Finally, the selected geometric operation is applied to retrieve web pages. A 'Contain' search sometimes returns no result; this occurs when the actually related web pages correspond to an area larger than the user query. In order to get such partially overlapping pages on maps, alternative methods like 'Intersection' or 'Extension of User Query Range' can be used.

To address the problems created by the characteristics of geowords, we represent each page as an MBR on the server side of Fig. 1 and manage these MBRs by using an R-Tree [1] index (other two-dimensional index methods are also possible). This geographic web index can be used to retrieve web pages with geometric operations as follows:

Contain (Qa, Pa): the query area (Qa) specified by the user geometrically covers the area (Pa) of a web page.
Intersection (Qa, Pa): the query area (Qa) has a non-empty common area with the area (Pa) of a web page.

Geometric operations on web searches are discussed in other studies. There are several types: i) zoom-based range specification (available in most popular portal sites such as Yahoo!), and ii) a circle based on a center point with a diameter [10]. In addition, we propose a two-dimensional rectangular query, which provides significant flexibility to the user.

Problems: As already noted, using geowords (names of geographic locations/objects) raises several serious problems:

Object name wrongly identified as geoword: A tough problem is that many objects have names that are possible geowords. In geographic scope computation based on Internal Information, the corresponding geographic area can erroneously become huge if all the homonymic locations are included. We must select one of the candidate locations that have the same name. This problem must be solved in order to get an appropriate geographic area corresponding to a page. Our solution is based on the heuristic that a homonym can be resolved by its geographic distances from the other location names included in the same page. A homonymic geoword is mapped to the one location that minimizes the sum of the distances from the other co-occurring geowords in the same page.
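The distance heuristic for homonyms can be sketched as follows. This is a simplified illustration under stated assumptions, not the implemented algorithm: locations are reduced to points in a planar coordinate system, only geowords with a single candidate are used as reference points, and the names `disambiguate` and `candidates` as well as all coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in some planar projection, e.g. km


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def disambiguate(candidates: Dict[str, List[Point]]) -> Dict[str, Point]:
    """Pick one location per geoword found in a page.

    Geowords with exactly one candidate act as anchors. Each ambiguous
    geoword is mapped to the candidate whose summed distance to all
    anchors is smallest.
    """
    anchors = [pts[0] for pts in candidates.values() if len(pts) == 1]
    chosen: Dict[str, Point] = {}
    for word, pts in candidates.items():
        if len(pts) == 1 or not anchors:
            # Unique geoword, or nothing to compare against:
            # keep the first candidate.
            chosen[word] = pts[0]
        else:
            chosen[word] = min(
                pts, key=lambda p: sum(distance(p, a) for a in anchors)
            )
    return chosen


# Two temples share one name; the page also mentions a unique geoword.
page = {
    "Hokyoji": [(0.0, 0.0)],
    "Daijiin": [(1.0, 1.0), (8.0, 6.0)],
}
resolved = disambiguate(page)
```

Here `resolved["Daijiin"]` is the candidate at `(1.0, 1.0)`, since it lies closer to the unique geoword. When a page contains no unique geoword, the sketch simply keeps the first candidate, which leaves the ambiguity unresolved.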
Redundant geographic information: A page can have geographic information at different levels. For example, if a postal address is provided, the page will contain 'city', 'ward', and 'town' names; town names are the most informative in determining the geographic scope, so 'city' and 'ward' names are unnecessary except when needed to differentiate towns with the same name.

Unimportant geowords: A web page can contain many valid geowords that are unimportant with regard to the user's search goal. Examples are the pages of tour guides and the branch offices of companies. Given that the title and anchor text parts are much more important than the others, we should focus on the level of the HTML tags in selecting the true geoword.

Figure 2 shows one example of how shared names can degrade search performance. In this case, adjacent wards have temples with the same name, 'Daijiin'. A web page that contains 'Daijiin' as a geoword can trigger unwarranted expansion of the MBR. The true MBR is defined by 'Hokyoji' and 'Daijiin' (Sakyo ward). Resolving this problem yields the true MBR, shown as the small rectangle in the center of Fig. 2.


We implemented a testbed system using 2 million web pages that we crawled and the R-Tree function of the PostgreSQL database. In tests of the developed system, we found that most users drew a rectangle intending the 'Contain' operation.
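The two geometric operations used by the testbed can be illustrated with a small sketch. It replaces the R-Tree lookup with a linear scan over an in-memory dictionary, so it only shows the predicates themselves; the `Rect` type, the `search` helper, and the page identifiers are assumptions made for this example.

```python
from typing import Callable, Dict, List, NamedTuple


class Rect(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def contain(qa: Rect, pa: Rect) -> bool:
    """Contain(Qa, Pa): the query area covers the page area entirely."""
    return (qa.min_x <= pa.min_x and qa.min_y <= pa.min_y
            and qa.max_x >= pa.max_x and qa.max_y >= pa.max_y)


def intersection(qa: Rect, pa: Rect) -> bool:
    """Intersection(Qa, Pa): the two areas overlap.

    Strict comparisons are used, so rectangles that merely touch
    along an edge are not reported.
    """
    return (qa.min_x < pa.max_x and pa.min_x < qa.max_x
            and qa.min_y < pa.max_y and pa.min_y < qa.max_y)


def search(index: Dict[str, Rect], qa: Rect,
           op: Callable[[Rect, Rect], bool] = contain) -> List[str]:
    """Linear scan standing in for an R-Tree range query."""
    return sorted(page for page, pa in index.items() if op(qa, pa))


# Hypothetical index: page identifier -> MBR of the page.
index = {
    "page-a": Rect(1, 1, 2, 2),   # small, accurate scope
    "page-b": Rect(1, 1, 6, 6),   # scope inflated by a stray geoword
    "page-c": Rect(7, 7, 8, 8),   # unrelated area
}
query = Rect(0, 0, 5, 5)
contained = search(index, query, contain)
overlapping = search(index, query, intersection)
```

With this data, `contained` holds only `page-a`, while `overlapping` also returns `page-b`. A page whose MBR is inflated is therefore lost under 'Contain' and recovered only by the less precise 'Intersection'.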

Fig. 2. Example of MBR for a Geographic Web Page

4 Optimizing Geographic Scope

In this section, we describe the problems of geographic scope optimization in detail and propose a comprehensive solution. To compute effective MBRs, we need to consider two types of geowords, distinguishing addresses from the names of geographic objects such as buildings. Since many addresses are only partially written, we need to infer the full address from partial information. As the source data, we used map data for Kyoto City from (c)ZENRIN (footnote 6).

Footnote 6. (c)Zenrin-TOWNII: a map data set by a Japanese company.

– Addresses: In order to extract addresses from a page, we must take into account that an address can be written in various forms with respect to the geographic hierarchy. For example, the full address 'Kyoto-prefecture/Kyoto-city/Sakyo-ward/YoshidaHonmachi' is used to specify 'YoshidaHonmachi' (a town). However, there are other forms that omit the city name, such as 'Sakyo-ward/YoshidaHonmachi', 'YoshidaHonmachi', etc. These various forms should be normalized into the 'YoshidaHonmachi' location. To solve these problems for the other levels of 'ward' and 'city' as well, we built an address index of three types and use them to recognize the geographic hierarchy of each address by maximum text matching in page contents. For this matching process, 141,379 such overlapping address forms were generated from the 4,966 town names and 11 wards of Kyoto City. In our data, there are 30,063 MBRs for town-level addresses, but only 3,658 distinct names. The average ambiguity of these distinct addresses is about 8.21. (This is relatively high, since we simply count every address that includes a given town name. Nevertheless, we do face this kind of ambiguity for addresses in Kyoto City.)
– Geographic Objects: Other geowords such as buildings also have corresponding MBRs. We have 120,286 geographic objects (excluding personal names), among which 4,916 names correspond to 12,273 objects. The average ambiguity is about 2.49. These data are also managed so that all corresponding MBRs are kept as candidates.

4.1 Heuristic Optimization Strategies

We now propose the following heuristic optimization strategies to deal with the geoword problems:

S1) Focus on Title+Anchor: Title and/or anchor texts are assigned significant importance in web page processing. Accordingly, geowords in these parts are taken as candidates. We distinguish geowords appearing in the titles of pages from geowords appearing in the page contents. It is better to give more attention to the title part, as discussed in our previous work [11].

S2) Reduction of Ambiguity: Often a web page will have one or more ambiguous geowords. Since accurate selection is difficult in practice, our heuristic method is to utilize the unique geowords, if any, on the same web page to resolve this ambiguity.

S3) Reduction of Redundant Hierarchy Info: Generally, a page contains redundant geographic hierarchy information, which should be excluded when computing the MBR of the page. As described above, consider two geowords where one is a town name, the other is a ward name, and the town is included in the ward.
In such cases, the ward name is redundant and should be removed from consideration. However, there are other cases where the ward name cannot be removed. For example, assume that:

1. A town name T1 is included only in ward W1 (T1 is unique).
2. A town name T2 is included in both wards W1 and W2 (T2 is ambiguous).

A web page can then exhibit the following cases (Gp is the set of geowords appearing in a page):

Case 1. Gp = {T1, W1}: T1 is not ambiguous and W1 is redundant. Thus, W1 is removed.
Case 2. Gp = {T2, W1, W2}: T2 is ambiguous, and W1 and W2 are not redundant. Thus, W1 and W2 cannot be removed without other information.
Case 3. Gp = {T2, W1}: T2 is not ambiguous and W1 is redundant, since T2 becomes unique when only W1 appears. Thus, W1 can be removed.
Case 4. Gp = {T1, T2, W1, W2}: W1 and W2 cannot be removed directly, as in Case 2. However, we can resolve this problem logically by selecting the candidate of T2 nearest to T1 by minimum distance. T2's ambiguity is then resolved: T1 and T2 are together uniquely included in W1. Finally, W1 and W2 are removed.
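The four cases can be reproduced with a short sketch. It is one possible reading of strategy S3 rather than the implemented procedure: the `TOWNS` table, its coordinates, and the rule that a ward is dropped once every town it could qualify has been resolved are assumptions introduced for this example.

```python
import math
from typing import Dict, Optional, Set, Tuple

# Hypothetical gazetteer: town name -> {ward: (x, y) of the town in that ward}.
TOWNS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "T1": {"W1": (0.0, 0.0)},
    "T2": {"W1": (1.0, 0.0), "W2": (9.0, 0.0)},
}


def reduce_hierarchy(gp: Set[str]) -> Tuple[Dict[str, Optional[str]], Set[str]]:
    """Return (town -> resolved ward or None, geowords kept for the MBR)."""
    towns = {g for g in gp if g in TOWNS}
    wards = gp - towns
    resolved: Dict[str, Optional[str]] = {}
    for t in towns:
        options = set(TOWNS[t])
        if len(options) > 1 and options & wards:
            options &= wards  # co-occurring ward names narrow the choice
        resolved[t] = next(iter(options)) if len(options) == 1 else None

    # Case 4: towns that are already resolved act as anchors.
    anchors = [TOWNS[t][w] for t, w in resolved.items() if w is not None]
    if anchors:
        for t in towns:
            if resolved[t] is None:
                options = (set(TOWNS[t]) & wards) or set(TOWNS[t])

                def cost(w: str, town: str = t) -> float:
                    return sum(math.dist(TOWNS[town][w], a) for a in anchors)

                resolved[t] = min(sorted(options), key=cost)

    # A ward name is redundant once every town it could qualify is resolved.
    redundant = {w for t in towns if resolved[t] is not None for w in TOWNS[t]}
    return resolved, gp - (redundant & wards)


case1 = reduce_hierarchy({"T1", "W1"})
case2 = reduce_hierarchy({"T2", "W1", "W2"})
case3 = reduce_hierarchy({"T2", "W1"})
case4 = reduce_hierarchy({"T1", "T2", "W1", "W2"})
```

In Case 2 the function leaves T2 unresolved and keeps both ward names, whereas in Cases 1, 3, and 4 only the town names remain for the MBR computation.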

5 Experiments

This section describes the experiments conducted on actual web pages to show the effectiveness of the proposed algorithm. We created a set of web pages by following the links shown on a local information site about Kyoto City (footnote 7). As web data targeting the Kyoto City area (in Japan), we crawled 6,754 web pages that had the string 'kyoto' in their URLs. From the crawled pages, we used as the experimental data set only the 3,075 pages (about 45%) that included town and ward names inside Kyoto City. In order to show the relative effect of the proposed algorithm, we compared the size of PureMBR (representing the unoptimized geographic scope) to the results gained by applying the optimization strategies described in Sect. 4.1.

Fig. 3. Experimental Results of Optimizing MBRs

Footnote 7. Kyoto City Information: http://www.city.kyoto.jp/koho/ind_h.htm

Distribution of PureMBR Area Size: Each page in the experimental data set was parsed to find geowords and their locations in (i) Title+Anchor and (ii) all page contents. For (ii), PureMBR was computed by referencing a lookup table that transforms geowords into MBRs, without any optimization method. The cumulative distribution of MBR size is shown in Fig. 3; less than 40% of the pages were within an area of 1 km², and more than 50% of the areas were larger than 8 km². This means that half of the experimental data set will not appear in the search results of a range query smaller than 8 km². We also tested the combination of all optimization methods in the LastMBR algorithm. As shown in Fig. 3, more than 70% of the pages have MBRs smaller than 1 km². With the proposed area reduction strategies, at least 30% of the pages, previously hidden, appear in the results. Without these reductions, we would need to perform the 'Intersection' operation or extend the query size to about 50 km² in order to get those hidden 30% of the data, and these alternative query operations would result in low precision.
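The kind of cumulative figure reported above can be obtained with a computation like the following sketch. It assumes MBRs stored as latitude/longitude rectangles and uses an equirectangular approximation that is adequate only for small areas; the constant `KM_PER_DEGREE`, the function names, and the sample areas are illustrative and are not the experimental data.

```python
import math
from typing import Iterable

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude in km


def mbr_area_km2(min_lon: float, min_lat: float,
                 max_lon: float, max_lat: float) -> float:
    """Approximate area of a small lat/lon rectangle in square kilometres."""
    mid_lat = math.radians((min_lat + max_lat) / 2)
    width = (max_lon - min_lon) * KM_PER_DEGREE * math.cos(mid_lat)
    height = (max_lat - min_lat) * KM_PER_DEGREE
    return width * height


def fraction_within(areas: Iterable[float], limit_km2: float) -> float:
    """Share of pages whose MBR area does not exceed limit_km2."""
    areas = list(areas)
    if not areas:
        raise ValueError("no pages to evaluate")
    return sum(a <= limit_km2 for a in areas) / len(areas)


# Made-up MBR areas (km^2) for five pages.
sample_areas = [0.2, 0.8, 3.0, 12.0, 40.0]
share_1km2 = fraction_within(sample_areas, 1.0)
share_8km2 = fraction_within(sample_areas, 8.0)

# A 0.01 x 0.01 degree rectangle near latitude 35 covers roughly 1 km^2.
cell = mbr_area_km2(135.750, 34.995, 135.760, 35.005)
```

Evaluating `fraction_within` over a range of limits for both the unoptimized and the optimized MBRs gives two cumulative curves of the kind compared in Fig. 3.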

6 Conclusion

This paper described the problems raised by using geowords for extracting geographic information from web pages and proposed the utilization of conventional spatial indexing methods for geographic web searches. We also showed an implemented system that allows users to perform geometric operations to retrieve geographic web pages. Three heuristics were presented to overcome the problems and the results of experiments were shown to demonstrate the effectiveness of the heuristics.

Acknowledgements

This research is supported by the Informatics Research Center for Development of Knowledge Society Infrastructure (COE program of the Ministry of Education, Culture, Sports, Science and Technology, Japan) and by the 'Universal Design in Digital City' Project in CREST of JST (Japan Science and Technology Corporation). We also thank one of our international collaborative members, the Internet Information Retrieval Center of Hankuk Aviation University, for their cooperation.

References

1. A. Guttman, “R-Trees: A Dynamic Index Structure for Spatial Searching,” In Proceedings of ACM SIGMOD, pp. 47-57, 1984.
2. M. Arikawa, K. Okamura, “Spatial Media Fusion Project,” In Proceedings of Kyoto International Conference on Digital Libraries: Research and Practice, pp. 75-82, Nov. 2000.
3. O. Buyukkokten, J. Cho, H. Garcia-Molina, L. Gravano, and N. Shivakumar, “Exploiting geographical location information of web pages,” In Proceedings of the ACM SIGMOD Workshop on the Web and Databases, WebDB, 1999.
4. J. Ding, L. Gravano, and N. Shivakumar, “Computing Geographical Scopes of Web Resources,” VLDB2000, pp. 545-556, 2000.
5. K. Hiramatsu and T. Ishida, “An Augmented Web Space for Digital Cities,” IEEE/IPSJ Symposium on Applications and the Internet (SAINT-01), pp. 105-112, 2001.
6. R. Lee, H. Takakura, and Y. Kambayashi, “Visual Query Processing for GIS with Web Contents,” The 6th IFIP Working Conference on Visual Database Systems, pp. 171-185, May 29-31, 2002.
7. R. Lee, Y. Tezuka, N. Yamada, H. Takakura, and Y. Kambayashi, “KyotoSEARCH: A Concept-based Geographic Web Search Engine,” In Proceedings of 2002 IRC International Conference on Internet Information Retrieval, pp. 119-126, Koyang, Korea, Nov. 2002.
8. C. Matsumoto, Q. Ma, and K. Tanaka, “Web Information Retrieval Based on the Localness Degree,” In Proceedings of the 13th International Conference on Database and Expert System Applications 2002 (DEXA '02), pp. 172-181, 2002.
9. K.S. McCurley, “Geospatial Mapping and Navigation of the Web,” WWW10, 2001.
10. S. Yokoji, K. Takahashi, N. Miura, and K. Shima, “Location Oriented Information Collection, Structuring and Retrieval,” IPSJ Journal Vol. 41, No. 7, pp. 1987-1998, July 2000.
11. N. Yamada, R. Lee, H. Takakura, and Y. Kambayashi, “Classification of Web Pages with Geographic Scope and Level of Details for Mobile Cache Management,” The 2nd Int. Workshop on Web Geographical Information Systems, IEEE CS Press, Singapore, Dec. 2002.