A Framework of Faceted Search for Unstructured Documents Using Wiki Disambiguation

Duc Binh Dang1, Hong Son Nguyen2, Thanh Binh Nguyen3, and Trong Hai Duong1

1 School of Computer Science and Engineering, International University, VNU-HCMC, Ho Chi Minh City, Vietnam [email protected], [email protected]
2 Faculty of Information Technology, Le Quy Don University, Hanoi, Vietnam [email protected]
3 International Institute for Applied Systems Analysis (IIASA), Schlossplatz 1, 2361, Laxenburg, Austria [email protected]

Abstract. According to the literature, search engines face two challenges: lexical ambiguity and search result filtering. Moreover, faceted search is usually applied to structured data and rarely used for unstructured documents. To address these problems, we propose an effective method to build a faceted search for unstructured documents that uses wiki disambiguation data to build the semantic search space; the search returns the most relevant results with collaborative filtering. The faceted search can also solve the lexical ambiguity problem in current search engines.

Keywords: Faceted search · Semantic search · Lexical ambiguity · Disambiguation · Wiki

1 Introduction

The major problem is the lack of a specification of semantic heterogeneity and ambiguity. Suppose we look for the apple fruit by entering the word apple in well-known search engines such as Google, Yahoo! and Bing. We cannot easily find a page that mentions the apple fruit; most of the top results are about Apple Inc., an American multinational corporation. Similarly, searching with the keyword java on current search engines returns mostly results that mention Java as a programming language. However, Java is not only a programming language: it is also the world's most populous island, located in Indonesia, a breed of chicken originating in the United States, and a brand of Russian cigarette. We only get Java as an island when using the query "Java Island", and as a chicken breed once "java chicken" is entered. Most search engines, such as Google, Yahoo! or Bing, are search engines rather than knowledge engines. They are good at returning a small number of relevant documents from a tremendous source of webpages on the Internet, but they still experience the lexical ambiguity issue, that is, the presence of two or more possible meanings within a single word. In addition, faceted search is usually applied to structured data and rarely used for unstructured documents.

© Springer International Publishing Switzerland 2015. M. Núñez et al. (Eds.): ICCCI 2015, Part II, LNCS 9330, pp. 502–511, 2015. DOI: 10.1007/978-3-319-24306-1_49

Dynamic queries are defined as the interactive user control of visual query parameters that generates a rapid animated visual display of database search results [2, 5-7, 13-16]. The authors emphasize an interface with outstanding speed and interactivity, in contrast to methods that use a database query language such as SQL. This is one of the outstanding early works in faceted search and a catalyst for interest in it. Ahlberg and Shneiderman built Film Finder to explore a movie database [1]. Its graphical design contains many interface elements, and parametric search is included in a faceted information space. However, the results that Film Finder returns to users are not proactive, and users are still able to select unsatisfactory combinations. Later on, Shneiderman and his colleagues addressed this problem with query previews. Query previews prevent wasted steps by eliminating zero-hit queries; that is, the parametric search is replaced by faceted navigation. In general, this helps users get an overview of the selected documents. In [12], both view-based search and query previews have the same limitation: they do not support faceted search. The mSpace project [13] is described as an interaction design to support user-determined adaptable content, and it describes three techniques that support the interaction: preview cues, dimensional sorting and spatial context.

According to the aforementioned studies, there are two problems in today's search engines:
• Lexical ambiguity: can the search engine return the correct meaning of a keyword when the search query provides only limited information?
• Search results filtering: can the search engine offer a better result filtering mechanism, such as collaborative filtering or content-based filtering?

In addition, faceted search is usually applied to structured data and rarely to unstructured documents. To address these problems, our approach is to build a faceted search for unstructured documents that uses wiki disambiguation data to build the semantic search space; the search returns the most relevant results with collaborative filtering. The faceted search can also solve the lexical ambiguity problem in current search engines.

The remainder of this paper is organized as follows. Section 2 describes the methodology, including disambiguation reference building and semantic search space building using Wiki's disambiguation. Section 3 presents the experiment and the evaluation of its results, and Section 4 reports the conclusions.

2 Methodology

2.1 Disambiguation Reference Building Using Wikipedia

Wikipedia is a free, open Internet encyclopedia. It is one of the most popular websites and constitutes the Internet's largest and most popular general reference work [11]. In Wikipedia, a page (article) is the basic entry, which describes an entity or event. The page of an entity or event contains hyperlinks that guide users to other related pages.


Disambiguation in Wikipedia (see footnote 1) is the process of resolving the conflicts that arise when a single term is ambiguous, that is, when it refers to more than one topic covered by Wikipedia. For example, the word Mercury can refer to a chemical element, a planet, a Roman god, and many other things. Disambiguation is required whenever, for a given word or phrase that a reader might search for, there is more than one existing Wikipedia article to which that word or phrase might be expected to lead. In this situation, there must be a way for the reader to navigate quickly from the page that first appears to any of the other possible desired articles.

The content of disambiguation pages is created by a huge community of Internet users, which makes the pages very easy to understand. The facets and sub-facets are legible entities; common facets are People, Places, Company, Computing or Music, which makes disambiguation data suitable for generic users. In our database, all disambiguation entities are extracted from the articles in the Wikipedia data; each disambiguation entity contains a list of all existing Wikipedia articles for the given word. For example, Java is an island, a programming language, and an animal. Each meaning is categorized into facets and sub-facets. In Figure 1, the facets and sub-facets of the Java disambiguation are:
• Facets: Places, Animal, Computing, Consumables, Fictional characters, Music, People, Transportation, Other uses.
• Sub-facets: Indonesia, United States, and Other are sub-facets of the facet Places.

To obtain the Wikipedia disambiguation data, we download the Wikipedia dump files, which contain the Wikipedia contents, and load them into a MySQL database server; the total size is more than 150 gigabytes.
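The facet and sub-facet hierarchy of one disambiguation entity, as listed for Java above, can be pictured as a nested mapping from facet to sub-facet to article titles. The following is only a minimal sketch: the dictionary layout, the helper name `meanings`, and the sample article titles are assumptions made for illustration and do not reproduce the authors' database schema.

```python
# Hypothetical in-memory layout of one disambiguation entity.
# Facet and sub-facet names follow the Java example in the text; the
# article titles are illustrative placeholders, not a dump of the real page.
JAVA_ENTITY = {
    "term": "Java",
    "facets": {
        "Places": {"Indonesia": ["Java (island)"], "United States": [], "Other": []},
        # The key None holds meanings filed directly under a facet.
        "Animal": {None: ["Java chicken"]},
        "Computing": {None: ["Java (programming language)"]},
        "Consumables": {None: ["Java (cigarette)"]},
    },
}


def meanings(entity, facet, sub_facet=None):
    """Return the article titles under a facet, optionally under one sub-facet."""
    groups = entity["facets"].get(facet, {})
    if sub_facet is not None:
        return list(groups.get(sub_facet, []))
    # No sub-facet requested: flatten every group of the facet.
    return [title for titles in groups.values() for title in titles]


print(meanings(JAVA_ENTITY, "Places", "Indonesia"))
print(meanings(JAVA_ENTITY, "Computing"))
```

Keeping the sub-facet level explicit, even when a facet has no sub-facets, lets the same lookup serve both levels of the facet graph.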

Fig. 1. Sample facet structure of Java

1 https://en.wikipedia.org/wiki/Wikipedia:Disambiguation


After importing Wikipedia, all the disambiguation pages are extracted from the database based on the template of Wiki Disambiguation; for every disambiguation entity, an algorithm is applied to get the features of the disambiguation documents. The whole process is shown in Figures 2 and 3:
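The template-based extraction can be sketched roughly as follows, assuming each page is available as raw wikitext and that a disambiguation page carries a template such as `{{disambiguation}}`. The list of template names, the regular expressions, the catch-all facet for entries that precede the first heading, and the function names are all assumptions for illustration; the paper does not specify the authors' extraction algorithm.

```python
import re

# Assumed template names that mark a disambiguation page.
DISAMBIG_RE = re.compile(r"\{\{\s*(disambiguation|disambig|dab)\b[^}]*\}\}", re.IGNORECASE)
# "==Facet==" (level 2) or "===Sub-facet===" (level 3) headings.
HEADING_RE = re.compile(r"^(={2,3})\s*(.+?)\s*\1\s*$")
# First wiki link of a bullet line, without any "|label" or "#section" part.
LINK_RE = re.compile(r"^\*+\s*.*?\[\[([^\]|#]+)")


def is_disambiguation(wikitext):
    """True if the page text contains one of the assumed templates."""
    return DISAMBIG_RE.search(wikitext) is not None


def extract_facets(wikitext):
    """Map (facet, sub_facet) to the list of article titles linked beneath it."""
    facets = {}
    # Assumption: entries before the first heading go to a catch-all facet.
    facet, sub_facet = "Other uses", None
    for raw in wikitext.splitlines():
        line = raw.strip()
        heading = HEADING_RE.match(line)
        if heading:
            if len(heading.group(1)) == 2:
                facet, sub_facet = heading.group(2), None
            else:
                sub_facet = heading.group(2)
            continue
        link = LINK_RE.match(line)
        if link:
            facets.setdefault((facet, sub_facet), []).append(link.group(1).strip())
    return facets


sample = """'''Java''' may refer to:
==Places==
===Indonesia===
* [[Java]], an island
==Computing==
* [[Java (programming language)]]
{{disambiguation}}"""

if is_disambiguation(sample):
    for key, titles in extract_facets(sample).items():
        print(key, titles)
```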

Fig. 2. Data preparation overview

Fig. 3. Document features extraction process

The document features are used as search data. There are two main steps to construct the feature vector for each concept. In the first step, the tf/idf vector space model is used to construct a generalization of the documents. In the second step, for each leaf concept, the feature vector is calculated as the feature vector of the set of documents associated with the concept. For each non-leaf concept, the feature vector is calculated by taking into consideration the contributions from the documents directly associated with the concept and from its direct sub-concepts. The details of these two steps are presented in [3]. The result of this step is the disambiguation reference model shown in Figure 4.
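The two steps can be illustrated with a small sketch. The exact weighting and aggregation formulas are given in [3] and are not reproduced here; this sketch assumes a plain tf-idf weight, a leaf vector equal to the average of its document vectors, and a non-leaf vector equal to the unweighted average of its own document centroid and its direct sub-concept vectors. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {doc_id: list of tokens} -> {doc_id: {term: tf-idf weight}}."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {
            t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()
        }
    return vectors


def mean_vector(vectors):
    """Component-wise average of sparse vectors; an empty list gives {}."""
    if not vectors:
        return {}
    total = Counter()
    for v in vectors:
        total.update(v)  # adds the weights term by term
    return {t: w / len(vectors) for t, w in total.items()}


def concept_vector(concept, doc_vectors):
    """concept: {"docs": [doc ids], "children": [sub-concepts]} (assumed layout)."""
    own = [doc_vectors[d] for d in concept.get("docs", [])]
    children = [concept_vector(c, doc_vectors) for c in concept.get("children", [])]
    if not children:
        # Leaf concept: only its associated documents contribute.
        return mean_vector(own)
    # Non-leaf concept: direct documents (if any) plus direct sub-concepts.
    parts = children + ([mean_vector(own)] if own else [])
    return mean_vector(parts)


docs = {
    "d1": ["java", "island", "indonesia"],
    "d2": ["java", "language", "compiler"],
}
vecs = tfidf_vectors(docs)
tree = {"docs": [], "children": [{"docs": ["d1"]}, {"docs": ["d2"]}]}
root = concept_vector(tree, vecs)
print(sorted(root.items()))
```

In this toy corpus the term java occurs in every document, so its idf and therefore its weight is zero, which is the behaviour that makes tf-idf useful for separating the meanings of an ambiguous word.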

2.2 Semantic Search Space Building Using Disambiguation Reference Model

Fig. 4. A Part of Disambiguation Reference Model

This semantic space is constructed by organizing documents in the disambiguation reference model (see Figure 4). Each document is represented by a feature vector. The similarity between a document's feature vector and a reference concept's feature vector is taken as the relevance weight with which the document belongs to the reference concept. The similarity degree is used to decide which concept the document belongs to. The framework of the algorithm is depicted in Figure 5 and is described in [4].
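The assignment rule can be sketched as below. The paper does not name the similarity measure at this point, so cosine similarity over sparse term vectors is an assumption here, as are the function names and the toy vectors; the procedure actually used is the one described in [4].

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign(doc_vector, concept_vectors):
    """Return the best-matching concept and the relevance weight per concept."""
    weights = {name: cosine(doc_vector, vec) for name, vec in concept_vectors.items()}
    best = max(weights, key=weights.get)
    return best, weights


# Toy concept vectors standing in for the disambiguation reference model.
concepts = {
    "Places": {"island": 1.0, "indonesia": 0.8},
    "Computing": {"language": 1.0, "compiler": 0.7},
}
doc = {"island": 0.5, "volcano": 0.5}
best, weights = assign(doc, concepts)
print(best, weights)
```

Keeping the full weight map, rather than only the winning concept, leaves room to show a document under several facets with different weights.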

Fig. 5. Semantic Search Space Building Process

3 Experiment

3.1 Search Implementation

We illustrate our proposed system through a search scenario. Figure 6 shows the interface of our search program.

Fig. 6. Search user interface

The user interface of the faceted search consists of three UI components.
1) The search input field (top): the user enters a query here. A change event is triggered on the input; once the query is provided, the program starts looking for results.
2) The search results section (bottom left): shows all the relevant documents to the user. Each document is shown as a node. Double-clicking a node opens the link to the corresponding document.
3) The facet graph (bottom right): shows the facets and sub-facets of the tree hierarchy (which is the disambiguation reference). Besides the results in the result graph, the facet graph helps the user reach documents quickly:
a) The tree gives the user an overview of the returned results and of what the current results are related to.
b) When the most heavily weighted vertex in the result graph does not match the user's intention, the user can use the facet graph to refine the results interactively.

3.2 Evaluation

Six tests are executed, and the search program behaves differently for different profiles: in the search results, the sizes of the vertices for specific documents vary between users (Tom, Richard and Patricia).

• Accuracy evaluation

[Figure: bar chart of Recall and Precision, from 0% to 100%, for the queries java coffee, obama president, alien film, apple company, joker music, and python computing]

Fig. 7. Measure search result using Precision & Recall
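Precision and recall, the two measures plotted in Figure 7, follow their standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. A minimal sketch is given below; the function name and the document identifiers are made up for illustration and are not the experiment's data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: every relevant document is returned (recall 1.0),
# but half of the returned documents are irrelevant (precision 0.5).
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2"])
print(f"precision={p:.2f} recall={r:.2f}")
```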

In Figure 7, the effectiveness of the results returned by our program is evaluated. In most cases, all the relevant documents are retrieved, but the accuracy is not very good: irrelevant documents are often shown to the user, which lowers precision.

• User's satisfaction evaluation

We now look at the importance of vertex size; normally the biggest vertex draws the user's attention.

Table 1. Extracted facets/documents compared to original data

Query              Top weighted vertex              Satisfied
java               Java (programming language)      Yes
obama president    Barack Obama                     Yes
alien film         Alien Sun                        No
apple company      Apple Inc.                       Yes
joker music        Joker Phillips                   No
python computing   Python (programming language)    Yes


• Filtering evaluation

Because of problems in matching words, there are cases in which the user cannot get the results of interest (Table 1); in these situations, the user can refine the search results by selecting a facet or sub-facets in the facet graph.

Fig. 8. Search results for Java coffee are filtered by Consumables facet
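Refining by a facet, as in the Java coffee example of Figure 8, amounts to keeping only the results attached to the selected facet or sub-facet and ordering them by weight. The sketch below assumes each result is a record with `title`, `facet`, `sub_facet` and `weight` fields; these field names, the function `refine`, and the sample results are hypothetical.

```python
def refine(results, facet=None, sub_facet=None):
    """Keep results matching the selected facet/sub-facet, heaviest first."""
    kept = [
        r for r in results
        if (facet is None or r["facet"] == facet)
        and (sub_facet is None or r["sub_facet"] == sub_facet)
    ]
    return sorted(kept, key=lambda r: r["weight"], reverse=True)


# Made-up results for the query "java coffee" (step 1: enter the query).
results = [
    {"title": "Java (programming language)", "facet": "Computing", "sub_facet": None, "weight": 0.6},
    {"title": "Java coffee", "facet": "Consumables", "sub_facet": None, "weight": 0.4},
    {"title": "Java", "facet": "Places", "sub_facet": "Indonesia", "weight": 0.3},
]

# Step 2: the user clicks the Consumables facet in the facet graph.
for r in refine(results, facet="Consumables"):
    print(r["title"], r["weight"])
```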

In a common flow, the user is able to select the correct information after two steps in our search program: the first is to enter the search query, and the second is to refine the search results based on a facet value in the facet graph.

• Efficiency evaluation

We now evaluate the efficiency of the search program. The time consumed by sample queries is measured as follows:

Table 2. Query time consumption

Query              Time consumed (milliseconds)
Java               318
java coffee        47
Alien              316
Obama president    29
joker batman       77

According to the time consumed for each query, a generic query (java, alien) takes around 300 milliseconds, whereas for a more specific query (java coffee), where users know what they are searching for, the waiting time is negligible. The reason is that the search program calculates the weight of each vertex, so the more documents there are in the search result, the more time is needed.
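Timings of this kind can be collected with a small wall-clock helper. The sketch below is not the authors' measurement code: `time_query` and `toy_search` are hypothetical, and the toy search only imitates the stated cost model, in which the work grows with the number of results to be weighted.

```python
import time


def time_query(search, query, repeats=5):
    """Return the best wall-clock time of search(query) in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search(query)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def toy_search(query):
    # Stand-in for the real search: a one-word (generic) query is assumed
    # to match many more documents, each of which needs a weight.
    n_results = 20000 if " " not in query else 200
    return [i * 0.5 for i in range(n_results)]


for q in ("java", "java coffee"):
    print(q, round(time_query(toy_search, q), 3), "ms")
```

Taking the minimum over several repetitions reduces the influence of unrelated system activity on the reported time.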

4 Conclusion

We proposed a novel framework of faceted search for unstructured documents that uses wiki disambiguation data. The wiki disambiguation data is used to learn a reference disambiguation model, from which the semantic search space is built. The model is also used for collaborative filtering of results. The faceted search can solve the lexical ambiguity problem in current search engines. In future work, we will apply business intelligence techniques [9, 10] to the visualization of and interaction with facets in order to facilitate filtering.

References

1. Ahlberg, C., Shneiderman, B.: Visual information seeking: tight coupling of dynamic query filters with starfield displays. In: Adelson, B., Dumais, S., Olson, J. (eds.) Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 1994), pp. 313–317. ACM, New York (1994)
2. Brunk, S., Heim, P.: tFacet: hierarchical faceted exploration of semantic data using well-known interaction concepts. In: Proceedings of DCI 2011. CEUR-WS.org, vol. 817, pp. 31–36 (2011)
3. Duong, T.H., Uddin, M.N., Li, D., Jo, G.S.: A collaborative ontology-based user profiles system. In: Nguyen, N.T., Kowalczyk, R., Chen, S.-M. (eds.) ICCCI 2009. LNCS, vol. 5796, pp. 540–552. Springer, Heidelberg (2009)
4. Duong, T.H., Uddin, M.N., Nguyen, C.D.: Personalized semantic search using ODP: a study case in academic domain. In: Murgante, B., Misra, S., Carlini, M., Torre, C.M., Nguyen, H.-Q., Taniar, D., Apduhan, B.O., Gervasi, O. (eds.) ICCSA 2013, Part V. LNCS, vol. 7975, pp. 607–619. Springer, Heidelberg (2013)
5. Heim, P., Ertl, T., Ziegler, J.: Facet graphs: complex semantic querying made easy. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010, Part I. LNCS, vol. 6088, pp. 288–302. Springer, Heidelberg (2010)
6. Heim, P., Ziegler, J., Lohmann, S.: gFacet: a browser for the web of data. In: Proceedings of the International Workshop on Interacting with Multimedia Content in the Social Semantic Web (IMC-SSW 2008), vol. 417, pp. 49–58. CEUR-WS (2008)
7. Heim, P., Ziegler, J.: Faceted visual exploration of semantic data. In: Ebert, A., Dix, A., Gershon, N.D., Pohl, M. (eds.) HCIV (INTERACT) 2009. LNCS, vol. 6431, pp. 58–75. Springer, Heidelberg (2011)
8. Hostetter, C.: Faceted searching with Apache Solr. ApacheCon US (2006)
9. Nguyen, T.B., Schoepp, W., Wagner, F.: GAINS-BI: business intelligent approach for greenhouse gas and air pollution interactions and synergies information system. In: iiWAS 2008, pp. 332–338 (2008)
10. Nguyen, T.B., Wagner, F., Schoepp, W.: Cloud intelligent services for calculating emissions and costs of air pollutants and greenhouse gases. In: Nguyen, N.T., Kim, C.-G., Janiak, A. (eds.) ACIIDS 2011, Part I. LNCS, vol. 6591, pp. 159–168. Springer, Heidelberg (2011)
11. OECD Internet Economy Outlook 2012, 1st edn. [ebook] OECD (2012)
12. Pollitt, A.S., Smith, M., Treglown, M., Braekevelt, P.: View-based searching systems: progress towards effective disintermediation. In: Proceedings of Online 1996 Conference, London, England (1996)
13. Schraefel, M.C., Karam, M., Zhao, S.: mSpace: interaction design for user-determined, adaptable domain exploration in hypermedia. In: AH 2003: Workshop on Adaptive Hypermedia and Adaptive Web Based Systems, Nottingham, UK, pp. 217–235 (2003)
14. Shneiderman, B.: Dynamic queries for visual information seeking. IEEE Software 11(6), 70–77 (1994)
15. Tunkelang, D.: Faceted Search. Morgan & Claypool Publishers (2009)
16. Wagner, A., Ladwig, G., Tran, T.: Browsing-oriented semantic faceted search. In: Hameurlain, A., Liddle, S.W., Schewe, K.-D., Zhou, X. (eds.) DEXA 2011, Part I. LNCS, vol. 6860, pp. 303–319. Springer, Heidelberg (2011)